How to Predict PM2.5 using MAIAC Aerosol Optical Depth Data

We'll show you how to preprocess HDF4 files and work with a MAIAC AOD data product to estimate air quality around the world!

Tammy Glazer
Guest writer
Christine Chung
Guest writer

How to predict PM2.5 using MAIAC Aerosol Optical Depth data

Welcome to the getting started notebook for the particulate track of the NASA Airathon: Predict Air Quality challenge! The goal of this tutorial is to provide some helpful pointers and starter code to ingest, preprocess, explore, and work with one of the atmospheric data products derived with data from the Moderate Resolution Imaging Spectroradiometer (MODIS) instrument onboard the Terra and Aqua satellites. The specific scientific parameter covered by this tutorial is called Aerosol Optical Depth (AOD). AOD is derived from an algorithm called Multi-Angle Implementation of Atmospheric Correction (MAIAC) and is made available in Hierarchical Data Format (HDF). If MODIS, MAIAC, AOD, PM2.5, or HDF are new concepts to you, then read on!

Before getting started, check out the Airathon competition homepage to learn more about the goal of this important challenge. In this tutorial, we'll:

Let's get started!

The Data

What is MAIAC?

Terra and Aqua are NASA scientific research satellites that orbit the Earth to study Earth's various natural systems. Terra passes from north to south across the equator in the morning, while Aqua passes from south to north over the equator in the afternoon. Their orbits are sun-synchronous, meaning that they pass the equator at the same local solar time on every orbit.

A sun-synchronous orbit allows consistent scientific observations with the angle between the Sun and the Earth's surface remaining relatively constant. Credit: NASA Earth Observatory

MODIS is the name of two remote sensing instruments administered by NASA aboard the Terra and Aqua satellites. MODIS has provided Earth observation data in 36 spectral frequencies of light at 250 meters to 1 km resolution since the launch of Terra in 1999. Because of its extensive coverage, MODIS enables researchers to monitor activities spanning from active fires, smoke, and dust storms to changes in land use. Together, the two MODIS instruments are able to capture the entire Earth's surface every one to two days.

MAIAC is an algorithm that combines pixel- and image-based processing to improve the accuracy of the aerosol, cloud, and atmospheric data retrieved from MODIS. The algorithm translates MODIS Level 1B measurements to a fixed grid at 1 km resolution in order to observe the same grid cell over time. This correction allows us to work with polar-orbiting observations as if they were geostationary and makes it easier to characterize surface backgrounds and conduct time series analyses. MAIAC provides a suite of aerosol and land surface reflectance products in HDF4 format.

This exercise will focus on the MCD19A2 Version 6 gridded Level 2 product. Level 2 products represent geophysical variables derived from raw instrument and payload data (e.g., Level 1B).

What is AOD?

MCD19A2 reports Aerosol Optical Depth (AOD) collected daily at 1 km resolution. AOD, also called Aerosol Optical Thickness (AOT), is a unitless measure of aerosols distributed within a column of air from Earth's surface to the top of the atmosphere. AOD has been found to be helpful for estimating the presence of harmful air pollutants including PM2.5, or particulate matter that's less than 2.5 micrometers (µm) in diameter. AOD and PM2.5 are a critical air quality measures because particulates can penetrate deep into the lungs, increasing the risk of health complications like lower respiratory infections or heart disease.

Size comparisons for particulate matter, which include a mixture of solid particles and liquid droplets found in the air. Credit: U.S. Environmental Protection Agency

While the MCD19A2 data product contains a variety of Science Dataset (SDS) layers, we will focus on blue band AOD at 0.47 µm. It's worth noting that MAIAC also uses this value and regional aerosol models to compute green band AOD at 0.55 µm. You are encouraged to explore and leverage all of the layers made available by this data product using the MAIAC Data User Guide.

Data loading

What's in our metadata?

First, let's organize our local repository using Cookiecutter Datascience and download the grid_metadata.csv, pm25_satellite_metadata.csv, submission_format.csv, and train_labels.csv files from the data download page to our data/raw folder.

$ tree data/
data/
├── external
├── interim
├── processed
└── raw
    ├── grid_metadata.csv
    ├── pm25_satellite_metadata.csv
    ├── submission_format.csv
    └── train_labels.csv

4 directories, 4 files

Next, let's import some handy tools for working with geospatial data and load the metadata to see what we have.

In [1]:
from pathlib import Path
import random
from typing import Dict, List, Union

from cloudpathlib import S3Path
import geopandas as gpd
import pandas as pd
import rasterio

DATA_PATH = Path.cwd().parent / "data"
RAW = DATA_PATH / "raw"
INTERIM = DATA_PATH / "interim"
In [2]:
pm_md = pd.read_csv(
    RAW / "pm25_satellite_metadata.csv",
    parse_dates=["time_start", "time_end"],
    index_col=0,
)

grid_md = pd.read_csv(RAW / "grid_metadata.csv", index_col=0)

Satellite metadata

Our particulate track satellite metadata file contains information about MAIAC and MISR data made available for model training and testing. Since we'll be focusing on preparing MAIAC training data, let's create and inspect a subset.

In [3]:
pm_md.columns
Out[3]:
Index(['time_start', 'time_end', 'product', 'location', 'split', 'us_url',
       'eu_url', 'as_url', 'cksum', 'granule_size'],
      dtype='object')
In [4]:
pm_md["split"].value_counts()
Out[4]:
train    5048
test     2673
Name: split, dtype: int64
In [5]:
pm_md["product"].value_counts()
Out[5]:
maiac    6704
misr     1017
Name: product, dtype: int64
In [6]:
maiac_md = pm_md[(pm_md["product"] == "maiac") & (pm_md["split"] == "train")].copy()
maiac_md.shape
Out[6]:
(4260, 10)
In [7]:
maiac_md.head(3)
Out[7]:
time_start time_end product location split us_url eu_url as_url cksum granule_size
granule_id
20180201T191000_maiac_la_0.hdf 2018-02-01 17:25:00+00:00 2018-02-01 19:10:00+00:00 maiac la train s3://drivendata-competition-airathon-public-us... s3://drivendata-competition-airathon-public-eu... s3://drivendata-competition-airathon-public-as... 911405771 10446736
20180202T195000_maiac_la_0.hdf 2018-02-02 18:05:00+00:00 2018-02-02 19:50:00+00:00 maiac la train s3://drivendata-competition-airathon-public-us... s3://drivendata-competition-airathon-public-eu... s3://drivendata-competition-airathon-public-as... 2244451908 11090180
20180203T203000_maiac_la_0.hdf 2018-02-03 17:10:00+00:00 2018-02-03 20:30:00+00:00 maiac la train s3://drivendata-competition-airathon-public-us... s3://drivendata-competition-airathon-public-eu... s3://drivendata-competition-airathon-public-as... 3799527997 12468482

As MODIS sweeps along an orbit, it collects data that is sectioned into pieces called granules. Each MAIAC granule contains 13 SDS layers and has a size of 1200 x 1200 pixels representing 1200 x 1200 km. Each granule is identified by a unique granule_id with the following naming convention: {time_end}_{product}_{location}_{number}.hdf, where time_end is formatted as YYYYMMDDHHmmss. When locations and times have more than one associated file, this is denoted by number. For additional information about each of these fields, check out the PM2.5 data download instructions.

We can see that there are 4,260 MAIAC files to use for training. Each file corresponds with one of the three competition locations: Los Angeles South Coast Air Basin, Delhi, or Taipei. Let's take a look at the distribution of granules by location.

In [8]:
maiac_md.location.value_counts()
Out[8]:
tpe    2130
la     1065
dl     1065
Name: location, dtype: int64

Taipei has twice as many associated files as Los Angeles or Delhi. This is to be expected, because two separate orbit tracks bisect Taipei, i.e., one track covers half of Taipei and another covers the other half.

We can also take a look at the time distribution in our training data.

In [9]:
maiac_md.time_end.dt.year.value_counts()
Out[9]:
2020    1464
2019    1460
2018    1336
Name: time_end, dtype: int64
In [10]:
maiac_md.time_end.dt.month.value_counts().sort_index()
Out[10]:
1     248
2     340
3     372
4     360
5     372
6     360
7     372
8     372
9     360
10    372
11    360
12    372
Name: time_end, dtype: int64

We have a roughly balanced distribution of files across years (2018-2020) and months, with the fewest granules from the month of January.

In [11]:
(maiac_md.time_end - maiac_md.time_start).describe()
Out[11]:
count                         4260
mean     0 days 01:27:16.549295774
std      0 days 00:55:30.276180728
min                0 days 00:05:00
25%                0 days 00:10:00
50%                0 days 01:45:00
75%                0 days 01:45:00
max                0 days 14:20:00
dtype: object

A single granule contains data that was captured within a time window spanning from 5 minutes to 14+ hours, with an average timeframe of around 1.5 hours.

Grid metadata

Next, let's take a look our grid cell metadata.

In [12]:
grid_md.head(3)
Out[12]:
location tz wkt
grid_id
1X116 Taipei Asia/Taipei POLYGON ((121.5257644471362 24.97766123020391,...
1Z2W7 Delhi Asia/Calcutta POLYGON ((77.30453178416276 28.54664454217707,...
3S31A Los Angeles (SoCAB) Etc/GMT+8 POLYGON ((-117.9338248256995 33.79558357488509...
In [13]:
grid_md.shape
Out[13]:
(54, 3)
In [14]:
grid_md.location.value_counts()
Out[14]:
Delhi                  33
Los Angeles (SoCAB)    14
Taipei                  7
Name: location, dtype: int64

Each grid cell is identified by a 5-character alphanumeric grid_id that uniquely identifies a 5 x 5 km grid cell and is associated with an observation in our train or test set. This file also contains important location and timezone (tz) information that can be used to localize dates, along with the polygonal coordinates of the grid cell.

We can see that there is metadata available for 54 grid cells.

How do I load a MAIAC HDF4 file?

Our MAIAC data is stored in a HDF4 file format. At its lowest level, Hierarchial Data Format (HDF) is a set of file formats (HDF4 and HDF5) designed to store and organize large amounts of scientific data. HDF4 is the older version of the format, but is still actively supported by a non-profit corporation called The HDF Group. This file format supports different data models, including multidimensional arrays, raster images, and tables. A single HDF file can hold a mix of related objects that can be accessed as a group or as individual objects.

In Python, we can read in HDF4 files using one of three libraries:

  1. GDAL
  2. rioxarray
  3. PyHDF

For this example, we'll be using PyHDF and code adapted from a script on the HDFEOS zoo. The SD module of the pyhdf package supports scientific datasets. Let's import this module along with some helpful tools to get started.

In [15]:
import os
import re

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from pyhdf.SD import SD, SDC, SDS
import pyproj

mpl.rcParams["figure.dpi"] = 100

The MAIAC metadata file contains the locations of each MAIAC HDF4 file on s3. We can use the S3Path class from the cloudpathlib package to generate a path to a sample data file from Los Angeles.

In [16]:
la_file = maiac_md[maiac_md.location == "la"].iloc[0]
la_file_path = S3Path(la_file.us_url)
la_file_path
Out[16]:
S3Path('s3://drivendata-competition-airathon-public-us/pm25/train/maiac/2018/20180201T191000_maiac_la_0.hdf')

To read this file, we can instantiate an SD object using a string representation of our path along with a file opening mode of SDC.READ. Note that calling fspath downloads the file locally to a temporary directory.

In [17]:
hdf = SD(la_file_path.fspath, SDC.READ)
hdf.info()
Out[17]:
(13, 8)
In [18]:
Path(la_file_path.fspath).unlink()

Our hdf object contains 13 datasets or bands and 8 attributes. We are most interested in the dataset related to blue band AOD at 0.47 µm (Optical_Depth_047).

We can list the 13 available datasets by calling the datasets method on hdf.

In [19]:
for dataset, metadata in hdf.datasets().items():
    dimensions, shape, _, _ = metadata
    print(f"{dataset}\n    Dimensions: {dimensions}\n    Shape: {shape}")
Optical_Depth_047
    Dimensions: ('Orbits:grid1km', 'YDim:grid1km', 'XDim:grid1km')
    Shape: (4, 1200, 1200)
Optical_Depth_055
    Dimensions: ('Orbits:grid1km', 'YDim:grid1km', 'XDim:grid1km')
    Shape: (4, 1200, 1200)
AOD_Uncertainty
    Dimensions: ('Orbits:grid1km', 'YDim:grid1km', 'XDim:grid1km')
    Shape: (4, 1200, 1200)
FineModeFraction
    Dimensions: ('Orbits:grid1km', 'YDim:grid1km', 'XDim:grid1km')
    Shape: (4, 1200, 1200)
Column_WV
    Dimensions: ('Orbits:grid1km', 'YDim:grid1km', 'XDim:grid1km')
    Shape: (4, 1200, 1200)
AOD_QA
    Dimensions: ('Orbits:grid1km', 'YDim:grid1km', 'XDim:grid1km')
    Shape: (4, 1200, 1200)
AOD_MODEL
    Dimensions: ('Orbits:grid1km', 'YDim:grid1km', 'XDim:grid1km')
    Shape: (4, 1200, 1200)
Injection_Height
    Dimensions: ('Orbits:grid1km', 'YDim:grid1km', 'XDim:grid1km')
    Shape: (4, 1200, 1200)
cosSZA
    Dimensions: ('Orbits:grid5km', 'YDim:grid5km', 'XDim:grid5km')
    Shape: (4, 240, 240)
cosVZA
    Dimensions: ('Orbits:grid5km', 'YDim:grid5km', 'XDim:grid5km')
    Shape: (4, 240, 240)
RelAZ
    Dimensions: ('Orbits:grid5km', 'YDim:grid5km', 'XDim:grid5km')
    Shape: (4, 240, 240)
Scattering_Angle
    Dimensions: ('Orbits:grid5km', 'YDim:grid5km', 'XDim:grid5km')
    Shape: (4, 240, 240)
Glint_Angle
    Dimensions: ('Orbits:grid5km', 'YDim:grid5km', 'XDim:grid5km')
    Shape: (4, 240, 240)

Dataset attributes

To load the Optical_Depth_047 dataset, we'll use the select method on our hdf object.

In [20]:
blue_band_AOD = hdf.select("Optical_Depth_047")
In [21]:
name, num_dim, shape, types, num_attr = blue_band_AOD.info()

print(
    f"""Dataset name: {name}
Number of dimensions: {num_dim}
Shape: {shape}
Data type: {types}
Number of attributes: {num_attr}"""
)
Dataset name: Optical_Depth_047
Number of dimensions: 3
Shape: [4, 1200, 1200]
Data type: 22
Number of attributes: 6

According to our metadata, the blue band AOD data contains 3 dimensions. The first dimension represents the number of orbit overpasses or timestamps contained in the file (four in Los Angeles), while the second and third dimensions represent the height and width of a single band, i.e., 1200 x 1200 pixels. There are more overpasses at the poles (up to 30) and fewer at the equator (1-2). At high latitudes, only the first 16 orbits with the largest coverage are selected for processing per day. A data type of 22 represents a signed 16-bit integer according to the SDC constants.

To access the raw data, simply call get on the AOD blue band dataset object.

In [22]:
blue_band_AOD.get()
Out[22]:
array([[[-28672, -28672, -28672, ..., -28672, -28672, -28672],
        [-28672, -28672, -28672, ..., -28672, -28672, -28672],
        [-28672, -28672, -28672, ..., -28672, -28672, -28672],
        ...,
        [-28672, -28672, -28672, ..., -28672, -28672, -28672],
        [-28672, -28672, -28672, ..., -28672, -28672, -28672],
        [-28672, -28672, -28672, ..., -28672, -28672, -28672]],

       [[-28672, -28672, -28672, ..., -28672, -28672, -28672],
        [-28672, -28672, -28672, ..., -28672, -28672, -28672],
        [-28672, -28672, -28672, ..., -28672, -28672, -28672],
        ...,
        [    21,     21,     21, ..., -28672, -28672, -28672],
        [    22,     21,     21, ..., -28672, -28672, -28672],
        [    24,     22,     21, ..., -28672, -28672, -28672]],

       [[-28672, -28672, -28672, ..., -28672, -28672, -28672],
        [-28672, -28672, -28672, ..., -28672, -28672, -28672],
        [-28672, -28672, -28672, ..., -28672, -28672, -28672],
        ...,
        [    73,     65,     53, ..., -28672, -28672, -28672],
        [    62,     57,     51, ..., -28672, -28672, -28672],
        [    56,     48,     41, ..., -28672, -28672, -28672]],

       [[-28672, -28672, -28672, ..., -28672, -28672, -28672],
        [-28672, -28672, -28672, ..., -28672, -28672, -28672],
        [-28672, -28672, -28672, ..., -28672, -28672, -28672],
        ...,
        [-28672, -28672, -28672, ..., -28672, -28672, -28672],
        [-28672, -28672, -28672, ..., -28672, -28672, -28672],
        [-28672, -28672, -28672, ..., -28672, -28672, -28672]]],
      dtype=int16)

In order to work with the data, we'll need to calibrate the raw values by applying a set of accompanying attributes, including the following:

  • scale factor
  • offset
  • missing data value
  • valid value range

Let's save these values for later.

In [23]:
calibration_dict = blue_band_AOD.attributes()
calibration_dict
Out[23]:
{'long_name': 'AOD at 0.47 micron',
 'scale_factor': 0.001,
 'add_offset': 0.0,
 'unit': 'none',
 '_FillValue': -28672,
 'valid_range': [-100, 5000]}

HDF file attributes

We'll also need to save a few attributes from the file's metadata to align and reproject our data into a usable representation. More specifically, we'll need:

  • upper left coordinate of the granule
  • lower right coordinate of the granule
  • projection
  • projection parameters

All of this information can be parsed from an attribute on hdf called StructMetadata.0.

In [24]:
list(hdf.attributes().keys())
Out[24]:
['HDFEOSVersion',
 'StructMetadata.0',
 'Orbit_amount',
 'Orbit_time_stamp',
 'CoreMetadata.0',
 'ArchiveMetadata.0',
 'identifier_product_doi',
 'identifier_product_doi_authority']
In [25]:
raw_attr = hdf.attributes()["StructMetadata.0"]
print(raw_attr)
GROUP=SwathStructure
END_GROUP=SwathStructure
GROUP=GridStructure
	GROUP=GRID_1
		GridName="grid1km"
		XDim=1200
		YDim=1200
		UpperLeftPointMtrs=(-11119505.196667,4447802.078667)
		LowerRightMtrs=(-10007554.677000,3335851.559000)
		Projection=GCTP_SNSOID
		ProjParams=(6371007.181000,0,0,0,0,0,0,0,0,0,0,0,0)
		SphereCode=-1
		GridOrigin=HDFE_GD_UL
		GROUP=Dimension
			OBJECT=Dimension_1
				DimensionName="Orbits"
				Size=4
			END_OBJECT=Dimension_1
		END_GROUP=Dimension
		GROUP=DataField
			OBJECT=DataField_1
				DataFieldName="Optical_Depth_047"
				DataType=DFNT_INT16
				DimList=("Orbits","YDim","XDim")
			END_OBJECT=DataField_1
			OBJECT=DataField_2
				DataFieldName="Optical_Depth_055"
				DataType=DFNT_INT16
				DimList=("Orbits","YDim","XDim")
			END_OBJECT=DataField_2
			OBJECT=DataField_3
				DataFieldName="AOD_Uncertainty"
				DataType=DFNT_INT16
				DimList=("Orbits","YDim","XDim")
			END_OBJECT=DataField_3
			OBJECT=DataField_4
				DataFieldName="FineModeFraction"
				DataType=DFNT_INT16
				DimList=("Orbits","YDim","XDim")
			END_OBJECT=DataField_4
			OBJECT=DataField_5
				DataFieldName="Column_WV"
				DataType=DFNT_INT16
				DimList=("Orbits","YDim","XDim")
			END_OBJECT=DataField_5
			OBJECT=DataField_6
				DataFieldName="AOD_QA"
				DataType=DFNT_UINT16
				DimList=("Orbits","YDim","XDim")
			END_OBJECT=DataField_6
			OBJECT=DataField_7
				DataFieldName="AOD_MODEL"
				DataType=DFNT_UINT8
				DimList=("Orbits","YDim","XDim")
			END_OBJECT=DataField_7
			OBJECT=DataField_8
				DataFieldName="Injection_Height"
				DataType=DFNT_FLOAT32
				DimList=("Orbits","YDim","XDim")
			END_OBJECT=DataField_8
		END_GROUP=DataField
		GROUP=MergedFields
		END_GROUP=MergedFields
	END_GROUP=GRID_1
	GROUP=GRID_2
		GridName="grid5km"
		XDim=240
		YDim=240
		UpperLeftPointMtrs=(-11119505.196667,4447802.078667)
		LowerRightMtrs=(-10007554.677000,3335851.559000)
		Projection=GCTP_SNSOID
		ProjParams=(6371007.181000,0,0,0,0,0,0,0,0,0,0,0,0)
		SphereCode=-1
		GridOrigin=HDFE_GD_UL
		GROUP=Dimension
			OBJECT=Dimension_1
				DimensionName="Orbits"
				Size=4
			END_OBJECT=Dimension_1
		END_GROUP=Dimension
		GROUP=DataField
			OBJECT=DataField_1
				DataFieldName="cosSZA"
				DataType=DFNT_INT16
				DimList=("Orbits","YDim","XDim")
			END_OBJECT=DataField_1
			OBJECT=DataField_2
				DataFieldName="cosVZA"
				DataType=DFNT_INT16
				DimList=("Orbits","YDim","XDim")
			END_OBJECT=DataField_2
			OBJECT=DataField_3
				DataFieldName="RelAZ"
				DataType=DFNT_INT16
				DimList=("Orbits","YDim","XDim")
			END_OBJECT=DataField_3
			OBJECT=DataField_4
				DataFieldName="Scattering_Angle"
				DataType=DFNT_INT16
				DimList=("Orbits","YDim","XDim")
			END_OBJECT=DataField_4
			OBJECT=DataField_5
				DataFieldName="Glint_Angle"
				DataType=DFNT_INT16
				DimList=("Orbits","YDim","XDim")
			END_OBJECT=DataField_5
		END_GROUP=DataField
		GROUP=MergedFields
		END_GROUP=MergedFields
	END_GROUP=GRID_2
END_GROUP=GridStructure
GROUP=PointStructure
END_GROUP=PointStructure
END

As we saw above, Optical_Depth_47 is available at 1 km resolution, so we'll use the first group of metadata (i.e., GridName="grid1km") and convert the relevant fields into a dictionary.

In [26]:
# Construct grid metadata from text blob
group_1 = raw_attr.split("END_GROUP=GRID_1")[0]
hdf_metadata = dict([x.split("=") for x in group_1.split() if "=" in x])

# Parse expressions still wrapped in apostrophes
for key, val in hdf_metadata.items():
    try:
        hdf_metadata[key] = eval(val)
    except:
        pass

hdf_metadata
Out[26]:
{'GROUP': 'MergedFields',
 'END_GROUP': 'MergedFields',
 'GridName': 'grid1km',
 'XDim': 1200,
 'YDim': 1200,
 'UpperLeftPointMtrs': (-11119505.196667, 4447802.078667),
 'LowerRightMtrs': (-10007554.677, 3335851.559),
 'Projection': 'GCTP_SNSOID',
 'ProjParams': (6371007.181, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
 'SphereCode': -1,
 'GridOrigin': 'HDFE_GD_UL',
 'OBJECT': 'DataField_8',
 'DimensionName': 'Orbits',
 'Size': 4,
 'END_OBJECT': 'DataField_8',
 'DataFieldName': 'Injection_Height',
 'DataType': 'DFNT_FLOAT32',
 'DimList': ('Orbits', 'YDim', 'XDim')}

Now that our metadata is parsed, we can save the attributes we'll need to align and reproject our data.

In [27]:
# Note that coordinates are provided in meters
alignment_dict = {
    "upper_left": hdf_metadata["UpperLeftPointMtrs"],
    "lower_right": hdf_metadata["LowerRightMtrs"],
    "crs": hdf_metadata["Projection"],
    "crs_params": hdf_metadata["ProjParams"],
}
alignment_dict
Out[27]:
{'upper_left': (-11119505.196667, 4447802.078667),
 'lower_right': (-10007554.677, 3335851.559),
 'crs': 'GCTP_SNSOID',
 'crs_params': (6371007.181, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)}

Data processing

Now that we have our calibration_dict, alignment_dict, and raw data (blue_band_AOD), we have everything we need to calibrate and map AOD values to the competition grid cells.

How do I calibrate AOD data?

To calibrate our data, we'll loop through each orbit and apply a set of corrections based on the information contained in the calibration_dict:

  1. remove values that fall outside of the valid_range (-100 to 5000)
  2. remove missing data (indicated by a _FillValue of -28672)
  3. remove any offset (none in this case) and multiply by the scale_factor (0.0001)
  4. cast the data to a numpy masked array to handle missing entries

Note: Values outside of the valid range can't be used because they fall outside of the scope of the MODIS sensor, and the fill value donates low quality readings. This can happen for a number of reasons, including cloud cover or reflectance from snow or water. In the following code we'll consider both invalid and convert them to nans.

In [28]:
# Loop over orbits to apply the attributes
def calibrate_data(dataset: SDS, shape: List[int], calibration_dict: Dict):
    """Given a MAIAC dataset and calibration parameters, return a masked
    array of calibrated data.

    Args:
        dataset (SDS): dataset in SDS format (e.g. blue band AOD).
        shape (List[int]): dataset shape as a list of [orbits, height, width].
        calibration_dict (Dict): dictionary containing, at a minimum,
            `valid_range` (list or tuple), `_FillValue` (int or float),
            `add_offset` (float), and `scale_factor` (float).

    Returns:
        corrected_AOD (np.ma.MaskedArray): masked array of calibrated data
            with a fill value of nan.
    """
    corrected_AOD = np.ma.empty(shape, dtype=np.double)
    for orbit in range(shape[0]):
        data = dataset[orbit, :, :].astype(np.double)
        invalid_condition = (
            (data < calibration_dict["valid_range"][0])
            | (data > calibration_dict["valid_range"][1])
            | (data == calibration_dict["_FillValue"])
        )
        data[invalid_condition] = np.nan
        data = (data - calibration_dict["add_offset"]) * calibration_dict["scale_factor"]
        data = np.ma.masked_array(data, np.isnan(data))
        corrected_AOD[orbit, ::] = data
    corrected_AOD.fill_value = np.nan
    return corrected_AOD
In [29]:
corrected_AOD = calibrate_data(blue_band_AOD, shape, calibration_dict)
corrected_AOD
Out[29]:
masked_array(
  data=[[[--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --],
         ...,
         [--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --]],

        [[--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --],
         ...,
         [0.021, 0.021, 0.021, ..., --, --, --],
         [0.022, 0.021, 0.021, ..., --, --, --],
         [0.024, 0.022, 0.021, ..., --, --, --]],

        [[--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --],
         ...,
         [0.073, 0.065, 0.053, ..., --, --, --],
         [0.062, 0.057, 0.051000000000000004, ..., --, --, --],
         [0.056, 0.048, 0.041, ..., --, --, --]],

        [[--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --],
         ...,
         [--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --],
         [--, --, --, ..., --, --, --]]],
  mask=[[[ True,  True,  True, ...,  True,  True,  True],
         [ True,  True,  True, ...,  True,  True,  True],
         [ True,  True,  True, ...,  True,  True,  True],
         ...,
         [ True,  True,  True, ...,  True,  True,  True],
         [ True,  True,  True, ...,  True,  True,  True],
         [ True,  True,  True, ...,  True,  True,  True]],

        [[ True,  True,  True, ...,  True,  True,  True],
         [ True,  True,  True, ...,  True,  True,  True],
         [ True,  True,  True, ...,  True,  True,  True],
         ...,
         [False, False, False, ...,  True,  True,  True],
         [False, False, False, ...,  True,  True,  True],
         [False, False, False, ...,  True,  True,  True]],

        [[ True,  True,  True, ...,  True,  True,  True],
         [ True,  True,  True, ...,  True,  True,  True],
         [ True,  True,  True, ...,  True,  True,  True],
         ...,
         [False, False, False, ...,  True,  True,  True],
         [False, False, False, ...,  True,  True,  True],
         [False, False, False, ...,  True,  True,  True]],

        [[ True,  True,  True, ...,  True,  True,  True],
         [ True,  True,  True, ...,  True,  True,  True],
         [ True,  True,  True, ...,  True,  True,  True],
         ...,
         [ True,  True,  True, ...,  True,  True,  True],
         [ True,  True,  True, ...,  True,  True,  True],
         [ True,  True,  True, ...,  True,  True,  True]]],
  fill_value=nan)
In [30]:
pd.DataFrame(corrected_AOD.ravel(), columns=["AOD"]).describe()
Out[30]:
AOD
count 1.203568e+06
mean 9.204344e-02
std 6.629422e-02
min 0.000000e+00
25% 4.500000e-02
50% 7.200000e-02
75% 1.220000e-01
max 5.940000e-01

How do I align AOD data with coordinates?

Using our alignment_dict, we can now align each corrected AOD pixel with grid coordinates. From there, we will be able to determine which AOD readings overlap with our training grid cells.

We will follow four steps to map values in a 2-dimensional matrix to 3-dimensional "world" coordinates. This process is known as georeferencing:

  1. identify the original coordinate reference system (CRS) of the dataset
  2. define reference points on our grid
  3. perform any necessary reprojections of these points
  4. connect points to AOD values

We can choose to use one of two georeferencing approaches: vector- or raster-based. You may choose to use one or the other based on computational performance, memory usage, and/or library preferences.

Vector data is composed of discrete geometric locations, or vertices, that define the shape of a spatial object (e.g. points, lines, or polygons). Each feature can carry multiple attributes about a location. Geospatial vector data can be used with the convenient geopandas.GeoDataFrame class.

Raster data is stored as a grid of values that are rendered on a map as pixels, where each pixel value represents an area on the Earth's surface. Raster data is useful for representing continuous surfaces at high levels of detail and for performing efficient geometric computations. Rasters are commonly saved as GeoTIFFs, which contain geographic metadata.

Vector-based approach

Step 1: A CRS defines the mathematical approach used to transform real locations on the Earth into a flattened map. The CRS of our data is the sinusoidal projection according to our alignment_dict (GCTP_SNSOID). As an equal-area map projection, the sinusoidal projection preserves area while distorting shapes.

The sinusoidal projection represents the poles as points, but the meridians and contents are distorted. Credit: Wikimedia Commons

Step 2: Next, using the coordinates of the granule's upper_left and lower_right coordinates, we'll create a rectangular mesh grid of 1200 x 1200 evenly spaced points in the same projection as these coordinates.

In [31]:
shape
Out[31]:
[4, 1200, 1200]
In [32]:
alignment_dict
Out[32]:
{'upper_left': (-11119505.196667, 4447802.078667),
 'lower_right': (-10007554.677, 3335851.559),
 'crs': 'GCTP_SNSOID',
 'crs_params': (6371007.181, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)}
In [33]:
def create_meshgrid(alignment_dict: Dict, shape: List[int]):
    """Given an image shape, create a meshgrid of points
    between bounding coordinates.

    Args:
        alignment_dict (Dict): dictionary containing, at a minimum,
            `upper_left` (tuple), `lower_right` (tuple), `crs` (str),
            and `crs_params` (tuple).
        shape (List[int]): dataset shape as a list of
            [orbits, height, width].

    Returns:
        xv (np.array): x (longitude) coordinates.
        yv (np.array): y (latitude) coordinates.
    """
    # Determine grid bounds using two coordinates
    x0, y0 = alignment_dict["upper_left"]
    x1, y1 = alignment_dict["lower_right"]

    # Interpolate points between corners, inclusive of bounds
    x = np.linspace(x0, x1, shape[2], endpoint=True)
    y = np.linspace(y0, y1, shape[1], endpoint=True)

    # Return two 2D arrays representing X & Y coordinates of all points
    xv, yv = np.meshgrid(x, y)
    return xv, yv
In [34]:
xv, yv = create_meshgrid(alignment_dict, shape)

Step 3: To determine which AOD readings overlap with our training grid cells, we'll want to work in a consistent CRS. The CRS of the competition grid cells is the latitude-longitude projection called WGS 84. WGS 84 is the most common CRS in the world and is a standard among GPS datasets. Let's reproject our new grid coordinates from the sinusoidal projection to WGS 84.

Pyproj's Proj, CRS, and transformer classes make this task relatively straightforward.

In [35]:
from pyproj import CRS, Proj
In [36]:
# Source: https://spatialreference.org/ref/sr-org/modis-sinusoidal/proj4js/
sinu_crs = Proj(
    f"+proj=sinu +R={alignment_dict['crs_params'][0]} +nadgrids=@null +wktext"
).crs
wgs84_crs = CRS.from_epsg("4326")
In [37]:
def transform_arrays(
    xv: Union[np.array, float], yv: Union[np.array, float], crs_from: CRS, crs_to: CRS
):
    """Transform points or arrays from one CRS to another CRS.

    Args:
        xv (np.array or float): x (longitude) coordinates or value.
        yv (np.array or float): y (latitude) coordinates or value.
        crs_from (CRS): source coordinate reference system.
        crs_to (CRS): destination coordinate reference system.

    Returns:
        lon, lat (tuple): x coordinate(s), y coordinate(s)
    """
    transformer = pyproj.Transformer.from_crs(
        crs_from,
        crs_to,
        always_xy=True,
    )
    lon, lat = transformer.transform(xv, yv)
    return lon, lat
In [38]:
# Project sinu grid onto wgs84 grid
lon, lat = transform_arrays(xv, yv, sinu_crs, wgs84_crs)

Our grid now contains a point in the WGS 84 CRS (i.e., latitude and longitude) for each calibrated AOD value.

Step 4: Finally, we can connect each AOD value with a specific latitude and longitude by flattening our data into a GeoDataFrame of non-missing values. Each row represents a pixel value at a specific time and location. A MAIAC granule can contain up to 1200 x 1200 x N rows, where N represents the number of orbits. Our helper function, convert_array_to_df, will allow us to repeat this process for any data. Notice that we've added an additional keyword argument, total_bounds. Later on, this will allow us to optimize our operations by filtering out irrelevant points. For now, let's take a look at the entire dataset.

In [39]:
def convert_array_to_df(
    corrected_arr: np.ma.MaskedArray,
    lat: np.ndarray,
    lon: np.ndarray,
    granule_id: str,
    crs: CRS,
    total_bounds: np.ndarray = None,
):
    """Align data values with latitude and longitude coordinates
    and return a GeoDataFrame.

    Args:
        corrected_arr (np.ma.MaskedArray): data values for each pixel.
        lat (np.ndarray): latitude for each pixel.
        lon (np.ndarray): longitude for each pixel.
        granule_id (str): granule name.
        crs (CRS): coordinate reference system
        total_bounds (np.ndarray, optional): If provided,
            will filter out points that fall outside of these bounds.
            Composed of xmin, ymin, xmax, ymax.
    """
    lats = lat.ravel()
    lons = lon.ravel()
    n_orbits = len(corrected_arr)
    size = lats.size
    values = {
        "value": np.concatenate([d.data.ravel() for d in corrected_arr]),
        "lat": np.tile(lats, n_orbits),
        "lon": np.tile(lons, n_orbits),
        "orbit": np.arange(n_orbits).repeat(size),
        "granule_id": [granule_id] * size * n_orbits,
    }

    df = pd.DataFrame(values).dropna()
    if total_bounds is not None:
        x_min, y_min, x_max, y_max = total_bounds
        df = df[df.lon.between(x_min, x_max) & df.lat.between(y_min, y_max)]

    gdf = gpd.GeoDataFrame(df)
    gdf["geometry"] = gpd.points_from_xy(gdf.lon, gdf.lat)
    gdf.crs = crs
    return gdf[["granule_id", "orbit", "geometry", "value"]].reset_index(drop=True)
In [40]:
gdf = convert_array_to_df(corrected_AOD, lat, lon, la_file_path.stem, wgs84_crs)
gdf.head(3)
Out[40]:
granule_id orbit geometry value
0 20180201T191000_maiac_la_0 0 POINT (-110.79078 35.36280) 0.110
1 20180201T191000_maiac_la_0 0 POINT (-110.78956 35.35446) 0.076
2 20180201T191000_maiac_la_0 0 POINT (-110.28947 35.28774) 0.112
In [41]:
gdf.shape
Out[41]:
(1203568, 4)

Notice that our dataframe contains over one million values. Let's take a look at the data contained in each orbit using matplotlib.

In [42]:
def plot_gdf(gdf: gpd.GeoDataFrame, separate_bands: bool = True):
    """Plot the Point objects contained in a GeoDataFrame.
    Option to overlay bands.

    Args:
        gdf (gpd.GeoDataFrame): GeoDataFrame with, at a minimum,
            columns for `orbit`, `geometry`, and `value`.
        separate_bands (bool): Plot each band on its own axis.
            Defaults to True.

    Displays a matplotlib scatterplot.
    """
    if separate_bands:
        num_orbits = gdf.orbit.max() + 1
        f, axes = plt.subplots(1, num_orbits, figsize=(20, 5), sharex=True, sharey=True)
        for i, ax in enumerate(axes):
            gdf_orbit = gdf[gdf.orbit == i]
            img = ax.scatter(
                x=gdf_orbit.geometry.x,
                y=gdf_orbit.geometry.y,
                c=gdf_orbit.value,
                s=0.1,
                alpha=1,
                cmap="RdYlBu_r",
            )
            ax.set_title(f"Band {i + 1}", fontsize=12)
    else:
        f, ax = plt.subplots(1, 1, figsize=(4, 4))
        img = ax.scatter(
            x=gdf.geometry.x,
            y=gdf.geometry.y,
            c=gdf.value,
            s=0.15,
            alpha=1,
            cmap="RdYlBu_r",
        )
    f.colorbar(img)
    plt.suptitle("Blue Band AOD", fontsize=12)
In [43]:
# Plot each orbit individually
plot_gdf(gdf, separate_bands=True)

Raster-based approach

Step 1: Again, according to our alignment_dict, the CRS of our data is the sinusoidal projection.

Step 2: For this approach, we only need the upper_left and lower_right coordinates as reference points. We do not need to create a mesh grid.

Step 3: Let's reproject our reference points to WGS 84.

In [44]:
# Identify grid bounds in the sinusoidal projection
x0_s, y0_s = alignment_dict["upper_left"]
x1_s, y1_s = alignment_dict["lower_right"]

# Reproject coordinates bounds to WGS 84
(x0_w, x1_w), (y0_w, y1_w) = transform_arrays(
    [x0_s, x1_s], [y0_s, y1_s], sinu_crs, wgs84_crs
)

Step 4: We can define a matrix called an affine transformation that will allow us to transform image coordinates to the sinusoidal projection. An affine transformation is a named tuple with elements a, b, c, d, e, f, which correspond to the elements in the matrix equation below, in which a pixel's image coordinates are x, y and its world coordinates are x', y'.

The only parameters we'll need are the bounds and resolution of our data file. We'll use rasterio.transform.from_bounds helper function to compute the matrix product of a translation and a scaling, which is the underlying formula to calculate this transformation matrix.

In [45]:
def compute_affine_transform(x0, x1, y0, y1, height: int, width: int):
    """Given bounding coordinates and resolution, compute
    the affine transformation.

    Args:
        x0 (float): left coordinate.
        x1 (float): right coordinate.
        y0 (float): top coordinate.
        y1 (float): bottom coordinate.
        height (int): image height.
        width (int): image width.

    Returns:
        affine_transform (Affine): Affine transformation.
    """
    affine_transform = rasterio.transform.from_bounds(
        west=x0, south=y1, east=x1, north=y0, height=height, width=width
    )
    return affine_transform

To save out a reprojected raster, we'll need to compute the affine transformation in the source CRS (sinusoidal projection) as well as the target CRS (WGS 84) using our reprojected bounds.

In [46]:
# Compute the sinusoidal affine transformation
transform_sinu = compute_affine_transform(x0_s, x1_s, y0_s, y1_s, shape[1], shape[2])
transform_sinu
Out[46]:
Affine(926.6254330558345, 0.0, -11119505.196667,
       0.0, -926.6254330558334, 4447802.078667)
In [47]:
# Compute WGS 84 affine transform
transform_wgs84 = compute_affine_transform(x0_w, x1_w, y0_w, y1_w, shape[1], shape[2])
transform_wgs84
Out[47]:
Affine(0.022181400393890355, 0.0, -130.54072891464722,
       0.0, -0.008333333332587473, 39.99999999641088)

Putting this all together, we can reproject our corrected_AOD values to WGS 84 and write out a GeoTIFF file that holds our values, data type, missing data value, shape, and the coordinates of each point described by transform_wgs84.

In [48]:
from rasterio.transform import Affine
from rasterio.warp import reproject, Resampling
In [49]:
def reproject_data(
    data: np.ma.MaskedArray,
    src_crs: CRS,
    dst_crs: CRS,
    src_transform: Affine,
    dst_transform: Affine,
):
    """Reproject a numpy array from one CRS to another, assuming
    a fill value of nan.

    Args:
        data (np.ma.MaskedArray): data as a masked array.
        src_crs (CRS): original coordinate reference system.
        dst_crs (CRS): destination coordinate reference system.
        src_transform (Affine): source affine transformation.
        dst_transform (Affine): destination affine transformation.

    Returns:
        dst_data (np.ma.MaskedArray): reprojected data as a masked array.
    """
    # Initialize array containing nans
    dst_arr = np.empty(data.shape, dtype=np.double)
    dst_arr[:] = np.nan

    # Reproject corrected AOD to WGS 84
    dst_data, dst_transform = reproject(
        source=data.data,
        destination=dst_arr,
        src_transform=src_transform,
        src_crs=src_crs,
        src_nodata=np.nan,
        dst_transform=dst_transform,
        dst_crs=dst_crs,
        dst_nodata=np.nan,
        resampling=Resampling.bilinear,
    )
    dst_data = np.ma.masked_array(dst_data, np.isnan(dst_data))
    dst_data.fill_value = np.nan
    return dst_data
In [50]:
# Reproject data to WGS 84
reprojected_data = reproject_data(
    corrected_AOD, sinu_crs, wgs84_crs, transform_sinu, transform_wgs84
)
In [51]:
# We can save our new tif to the "interim" folder
with INTERIM / la_file_path.name as filepath:
    with rasterio.open(
        fp=filepath,
        mode="w+",
        driver="GTiff",
        count=reprojected_data.shape[0],
        height=reprojected_data.shape[1],
        width=reprojected_data.shape[2],
        dtype=str(reprojected_data.dtype),
        crs=wgs84_crs,
        transform=transform_wgs84,
        nodata=reprojected_data.fill_value,
    ) as dst:
        dst.write(reprojected_data)

Note: while it may be computationally advantageous to take the raster-based approach to data alignment and reprojection, it is possible to lose valuable information during resampling. For instance, our reprojected raster file does not contain the same level of detail as our gdf. You'll want to experiment with different fill values, reprojections, and resampling methods if you decide to follow a raster-based approach.

Create and save feature data

We now have everything we need to iterate through our training files, calibrate and align AOD values, mask the images to the extent of our grid cell bounds, and save out our new training features. To demonstrate, let's prepare AOD data captured over the training grid cells in Los Angeles.

In [52]:
# Identify LA granule filepaths and grid cells
maiac_md = maiac_md[maiac_md.location == "la"]
la_fp = list(maiac_md.us_url)
la_gc = grid_md[grid_md.location == "Los Angeles (SoCAB)"].copy()

# Load training labels
train_labels = pd.read_csv(RAW / "train_labels.csv", parse_dates=["datetime"])
train_labels.rename(columns={"value": "pm25"}, inplace=True)

# Confirm all LA grid cells are in our training labels
assert la_gc.index.isin(train_labels.grid_id).all()

len(la_gc)
Out[52]:
14

Our training data contains 14 unique grid cells in Los Angeles. Using geopandas, we can transform our grid cell metadata into a GeoDataFrame where each observation represents a Polygon geometry.

In [53]:
la_polys = gpd.GeoSeries.from_wkt(la_gc.wkt, crs=wgs84_crs)  # used for WGS 84
la_polys.name = "geometry"
la_polys_gdf = gpd.GeoDataFrame(la_polys)

la_polys_gdf.plot(figsize=(3, 3)).set_title(
    "Training grid cell locations: Los Angeles", fontsize=9
);

We can use the spatial join sjoin function from geopandas to identify all AOD values that intersect these grid cells. But first, let's do some optimization. Remember that our gdf contains over one million points. Because the granule covers a much larger swath of land than just the South Coast Air Basin, the vast majority of those points will be irrelevant. We can use the total_bounds attribute of our grid cell GeoDataFrame as a bounding box. Here, we filter out points outside of the bounding box with the cx GeoDataFrame method. In our convert_array_to_df helper function, we just remove points that are not between our minimum and maximum latitudes and longitudes.

By removing these points, we dramatically speed up the spatial join! We're now ready to use a subset of our previously generated gdf to identify points that intersect with grid cells in Los Angeles and compute the average blue band AOD per cell.

In [54]:
xmin, ymin, xmax, ymax = la_polys_gdf.total_bounds
gpd.sjoin(la_polys_gdf, gdf.cx[xmin:xmax, ymin:ymax], how="inner").groupby("grid_id")[
    "value"
].mean()
Out[54]:
grid_id
3S31A    0.083632
A2FBI    0.081053
DHO4M    0.142227
DJN0F    0.077973
E5P9N    0.102154
FRITQ    0.123091
H96P6    0.132739
PG3MI    0.131395
QJHW4    0.132235
VBLD0    0.134880
WT52R    0.128528
X5DKW    0.162390
ZP1FZ    0.093143
ZZ8JF    0.083133
Name: value, dtype: float64

It's time to pull everything together! As a reminder, we've already created the following helper functions:

  • calibrate_data
  • create_meshgrid
  • transform_arrays
  • convert_array_to_df
  • compute_affine_transform

We'll add helper functions to create_calibration_dict and create_alignment_dict.

In [55]:
def create_calibration_dict(data: SDS):
    """Define calibration dictionary given a SDS dataset,
    which contains:
        - name
        - scale factor
        - offset
        - unit
        - fill value
        - valid range

    Args:
        data (SDS): dataset in the SDS format.

    Returns:
        calibration_dict (Dict): dict of calibration parameters.
    """
    return data.attributes()


def create_alignment_dict(hdf: SD):
    """Define alignment dictionary given a SD data file,
    which contains:
        - upper left coordinates
        - lower right coordinates
        - coordinate reference system (CRS)
        - CRS parameters

    Args:
        hdf (SD): hdf data object

    Returns:
        alignment_dict (Dict): dict of alignment parameters.
    """
    group_1 = hdf.attributes()["StructMetadata.0"].split("END_GROUP=GRID_1")[0]
    hdf_metadata = dict([x.split("=") for x in group_1.split() if "=" in x])
    alignment_dict = {
        "upper_left": eval(hdf_metadata["UpperLeftPointMtrs"]),
        "lower_right": eval(hdf_metadata["LowerRightMtrs"]),
        "crs": hdf_metadata["Projection"],
        "crs_params": hdf_metadata["ProjParams"],
    }
    return alignment_dict

As an example, we'll write a GeoDataframe that contains AOD values for all points in our grid cells for which we have available blue band AOD data. To speed things up, we parallelize this process using pqdm. Because we have granules that cover a range of times and locations, we expect the number of images per grid cell to vary.

Keep in mind, this represents just one of many approaches to using the satellite data to estimate PM2.5.

For example, you may choose to save out rasters cropped to the extent of the competition grid cells, compute a set of zonal statistics, or may want to include values from other bands like Optical_Depth_055 as additional image layers.

In [56]:
from pqdm.processes import pqdm


def preprocess_maiac_data(
    granule_path: str,
    grid_cell_gdf: gpd.GeoDataFrame,
    dataset_name: str,
    total_bounds: np.ndarray = None,
):
    """
    Given a granule s3 path and competition grid cells,
    create a GDF of each intersecting point and the accompanying
    dataset value (e.g. blue band AOD).

    Args:
        granule_path (str): a path to a granule on s3.
        grid_cell_gdf (gpd.GeoDataFrame): GeoDataFrame that contains,
            at a minimum, a `grid_id` and `geometry` column of Polygons.
        dataset_name (str): specific dataset name (e.g. "Optical_Depth_047").
        total_bounds (np.ndarray, optional): If provided, will filter out points that fall
            outside of these bounds. Composed of xmin, ymin, xmax, ymax.
    Returns:
        GeoDataFrame that contains Points and associated values.
    """
    # Load blue band AOD data
    s3_path = S3Path(granule_path)
    hdf = SD(s3_path.fspath, SDC.READ)
    aod = hdf.select(dataset_name)
    shape = aod.info()[2]

    # Calibrate and align data
    calibration_dict = aod.attributes()
    alignment_dict = create_alignment_dict(hdf)
    corrected_AOD = calibrate_data(aod, shape, calibration_dict)
    xv, yv = create_meshgrid(alignment_dict, shape)
    lon, lat = transform_arrays(xv, yv, sinu_crs, wgs84_crs)

    # Save values that align with granules
    granule_gdf = convert_array_to_df(
        corrected_AOD, lat, lon, s3_path.name, wgs84_crs, grid_cell_gdf.total_bounds
    )
    df = gpd.sjoin(grid_cell_gdf, granule_gdf, how="inner")

    # Clean up files
    Path(s3_path.fspath).unlink()
    hdf.end()
    return df.drop(columns="index_right").reset_index()


def preprocess_aod_47(granule_paths, grid_cell_gdf, n_jobs=2):
    """
    Given a set of granule s3 paths and competition grid cells,
    parallelizes creation of GDFs containing AOD 0.47 µm values and points.

    Args:
        granule_paths (List): list of paths on s3.
        grid_cell_gdf (gpd.GeoDataFrame): GeoDataFrame that contains,
            at a minimum, a `grid_id` and `geometry` column of Polygons.
        n_jobs (int, Optional): The number of parallel processes. Defaults to 2.

    Returns:
        GeoDataFrame that contains Points and associated values for all granules.
    """
    args = ((gp, grid_cell_gdf, "Optical_Depth_047") for gp in granule_paths)
    results = pqdm(args, preprocess_maiac_data, n_jobs=n_jobs, argument_type="args")
    return pd.concat(results)
In [ ]:
la_train_gdf = preprocess_aod_47(la_fp, la_polys_gdf)
In [58]:
la_train_gdf.head()
Out[58]:
grid_id geometry granule_id orbit value
0 3S31A POLYGON ((-117.93382 33.79558, -117.93382 33.8... 20180201T191000_maiac_la_0.hdf 1 0.067
1 3S31A POLYGON ((-117.93382 33.79558, -117.93382 33.8... 20180201T191000_maiac_la_0.hdf 1 0.074
2 3S31A POLYGON ((-117.93382 33.79558, -117.93382 33.8... 20180201T191000_maiac_la_0.hdf 2 0.086
3 3S31A POLYGON ((-117.93382 33.79558, -117.93382 33.8... 20180201T191000_maiac_la_0.hdf 2 0.098
4 3S31A POLYGON ((-117.93382 33.79558, -117.93382 33.8... 20180201T191000_maiac_la_0.hdf 1 0.070
In [59]:
la_train_gdf.shape
Out[59]:
(380815, 5)

la_train_gdf describes 380,815 blue band AOD values that overlap with the competition training grid cells in Los Angeles!

Simple benchmark model

Using la_train_df, let's prepare an extremely simple benchmark solution for the Airathon particulate track. Keep in mind that we don't expect this model to generalize well to Taipei or Delhi since it has not seen imagery from these geographies before.

In [60]:
def calculate_features(
    feature_df: gpd.GeoDataFrame,
    label_df: pd.DataFrame,
    stage: str,
):
    """Given processed AOD data and training labels (train) or
    submission format (test), return a feature GeoDataFrame that contains
    features for mean, max, and min AOD.

    Args:
        feature_df (gpd.GeoDataFrame): GeoDataFrame that contains
            Points and associated values.
        label_df (pd.DataFrame): training labels (train) or
            submission format (test).
        stage (str): "train" or "test".

    Returns:
        full_data (gpd.GeoDataFrame): GeoDataFrame that contains `mean_aod`,
            `max_aod`, and `min_aod` for each grid cell and datetime.
    """
    # Add `day` column to `feature_df` and `label_df`
    feature_df["datetime"] = pd.to_datetime(
        feature_df.granule_id.str.split("_", expand=True)[0],
        format="%Y%m%dT%H:%M:%S",
        utc=True,
    )
    feature_df["day"] = feature_df.datetime.dt.date
    label_df["day"] = label_df.datetime.dt.date

    # Calculate average AOD per day/grid cell for which we have feature data
    avg_aod_day = feature_df.groupby(["day", "grid_id"]).agg(
        {"value": ["mean", "min", "max"]}
    )
    avg_aod_day.columns = ["mean_aod", "min_aod", "max_aod"]
    avg_aod_day = avg_aod_day.reset_index()

    # Join labels/submission format with feature data
    how = "inner" if stage == "train" else "left"
    full_data = pd.merge(
        label_df,
        avg_aod_day,
        how=how,
        left_on=["day", "grid_id"],
        right_on=["day", "grid_id"],
    )
    return full_data
In [61]:
full_data = calculate_features(la_train_gdf, train_labels, stage="train")
full_data.head(3)
Out[61]:
datetime grid_id pm25 day mean_aod min_aod max_aod
0 2018-02-01 08:00:00+00:00 3S31A 11.4 2018-02-01 0.083632 0.063 0.107
1 2018-02-01 08:00:00+00:00 A2FBI 17.0 2018-02-01 0.081053 0.059 0.106
2 2018-02-01 08:00:00+00:00 DJN0F 11.1 2018-02-01 0.077973 0.036 0.133

We do not want to overstate our model's performance by overfitting to one or more dates or locations. To be sure that our method is sufficiently generalizable, let's set aside a portion of the data to validate the model. To ensure that we are not using data collected in the future to estimate prior PM2.5, we can subset the training data into training and validation sets using the datetime field.

In [62]:
# 2020 data will be held out for validation
train = full_data[full_data.datetime.dt.year <= 2019].copy()
test = full_data[full_data.datetime.dt.year > 2019].copy()

In scikit-learn, we can set up a random forest with a few lines of code.

In [63]:
from sklearn.ensemble import RandomForestRegressor
In [64]:
# Train model on train set
model = RandomForestRegressor()
model.fit(train[["mean_aod", "min_aod", "max_aod"]], train.pm25)

# Compute R2 using our holdout set
model.score(test[["mean_aod", "min_aod", "max_aod"]], test.pm25)
Out[64]:
0.005468857115181414
In [65]:
# Refit model on entire training set
model.fit(full_data[["mean_aod", "min_aod", "max_aod"]], full_data.pm25)
Out[65]:
RandomForestRegressor()

Finally, let's make predictions for the competition test set and save them to the submission format. To keep things simple, we'll impute missing features using the average, minimum, and maximum AOD values from the training data.

In [ ]:
# Identify test granule s3 paths
test_md = pm_md[(pm_md["product"] == "maiac") & (pm_md["split"] == "test")]

# Identify test grid cells
submission_format = pd.read_csv(RAW / "submission_format.csv", parse_dates=["datetime"])
test_gc = grid_md[grid_md.index.isin(submission_format.grid_id)]

# Process test data for each location
locations = test_gc.location.unique()
loc_map = {"Delhi": "dl", "Los Angeles (SoCAB)": "la", "Taipei": "tpe"}
loc_dfs = []

for loc in locations:
    print(loc)
    these_fp = list(test_md[test_md.location == loc_map[loc]].us_url)

    # Create grid cell GeoDataFrame
    these_grid_cells = test_gc[test_gc.location == loc]
    these_polys = gpd.GeoSeries.from_wkt(these_grid_cells.wkt, crs=wgs84_crs)
    these_polys.name = "geometry"
    this_polys_gdf = gpd.GeoDataFrame(these_polys)

    # Preprocess AOD for test granules
    loc_dfs.append(preprocess_aod_47(these_fp, this_polys_gdf))

test_df = pd.concat(loc_dfs)
In [67]:
test_df.head(3)
Out[67]:
grid_id geometry granule_id orbit value
0 VR4WG POLYGON ((121.52576 25.05906, 121.52576 25.099... 20170107T032000_maiac_tpe_0.hdf 1 0.594
1 VR4WG POLYGON ((121.52576 25.05906, 121.52576 25.099... 20170107T032000_maiac_tpe_0.hdf 1 0.594
0 1X116 POLYGON ((121.52576 24.97766, 121.52576 25.018... 20170110T021000_maiac_tpe_0.hdf 2 0.604
In [68]:
# Prepare AOD features
submission_df = calculate_features(test_df, submission_format, stage="test")

# Impute missing features using training set mean/max/min
submission_df.mean_aod.fillna(la_train_gdf.value.mean(), inplace=True)
submission_df.min_aod.fillna(la_train_gdf.value.min(), inplace=True)
submission_df.max_aod.fillna(la_train_gdf.value.max(), inplace=True)
submission_df.drop(columns=["day"], inplace=True)

# Make predictions using AOD features
submission_df["preds"] = model.predict(submission_df[["mean_aod", "min_aod", "max_aod"]])
submission = submission_df[["datetime", "grid_id", "preds"]].copy()
submission.rename(columns={"preds": "value"}, inplace=True)

# Ensure submission indices match submission format
submission_format.set_index(["datetime", "grid_id"], inplace=True)
submission.set_index(["datetime", "grid_id"], inplace=True)
assert submission_format.index.equals(submission.index)
In [69]:
submission.head(3)
Out[69]:
value
datetime grid_id
2017-01-07 16:00:00+00:00 1X116 11.954145
9Q6TA 11.954145
KW43U 11.954145
In [70]:
submission.describe()
Out[70]:
value
count 13504.000000
mean 18.695326
std 12.437289
min 3.130802
25% 11.954145
50% 12.788576
75% 18.310239
max 78.187873

Finally, we will save our submission locally.

In [71]:
# Save submission in the correct format
final_submission = pd.read_csv(RAW / "submission_format.csv")
final_submission["value"] = submission.reset_index().value
final_submission.to_csv((INTERIM / "submission.csv"), index=False)

We can now head to the competition submissions page and upload our submission to obtain our model's R2.

We should see an R2 of -0.0655 on the leaderboard. Lots of room for improvement! This model was only trained on data from Los Angeles and has very simple features. Now it's up to you to make improvements and increase this score. We hope this tutorial provides a helpful framework for loading and exploring the data, correcting and aligning values, and working with geospatial vector and raster data.

Head over to the challenge homepage to get started. We can't wait to see what you build!

Banner image, artist's rendering of Aqua, courtesy of NASA

Stay updated

Join our newsletter or follow us for the latest on our social impact projects, data science competitions and open source work.

There was a problem. Please try again.
Subscribe successful!
Protected by reCAPTCHA. The Google Privacy Policy and Terms of Service apply.

Latest posts

All posts

winners

Meet the winners of the AI for Advancing Instruction Challenge

Learn how the winners of the AIAI challenge leveraged multimodal classroom data to identify instructional activities and classroom discourse content.

case studies

Automating wildlife monitoring with Zamba & Zamba Cloud

DrivenData partnered with conservation researchers to create Zamba, an open-source machine learning solution that helps wildlife researchers process camera trap footage, reducing months of manual review to hours of automated analysis.

community

Community Spotlight: Paola Ruiz, Néstor González, Daniel Crovo

The Community Spotlight features fantastic members from our DrivenData community. Three members of the IGCPHARMA team, Paola Ruiz, Néstor González, and Daniel Crovo talk to us about data science, drug discovery, diverse databases and more!

community

Community Spotlight: Kirill Brodt

The Community Spotlight features fantastic members from our DrivenData community. Kirill Brodt, a researcher in computer graphics at the University of Montreal, talks animation, pose estimation, and data science challenges.

case studies

Jump-starting data infrastructure and in-house data expertise

DrivenData designed and built a data warehouse to centralize, organize, and visualize data across CodePath's operations. Our team also provided technical hiring assistance to find the right talent to carry the work forward.

case studies

A production application to support survivors of human trafficking

DrivenData developed Freedom Lifemap, a digital tool designed to support survivors of human trafficking on their journey toward reintegration and independence.

insights

Life beyond the leaderboard

What happens to winning solutions after a machine learning competition?

insights

(Tech) Infrastructure Week for the Nonprofit Sector

Reflections on how to build data and AI infrastructure in the social sector that serves the needs of nonprofits and their beneficiaries.

winners

Meet the winners of Phase 2 of the PREPARE Challenge

Learn about how winners detected cognitive decline using speech recordings and social determinants of health survey data

insights

AI sauce on everything: Reflections on ASU+GSV 2025

Data, evaltuation, product iteration, and public goods: reflections on the ASU+GSV Summit 2025.

resources

Open-source packages for using speech data in ML

Overview of key open-source packages for extracting features from voice data to support ML applications

tutorial

Getting started with LLMs: a benchmark for the 'What's Up, Docs?' challenge

An introduction to using large language models via the benchmark to a document summarization challenge.

winners

Meet the Winners of the Goodnight Moon, Hello Early Literacy Screening Challenge

Learn about the results and winning methods from the early literacy screening challenge.

case studies

Crowdsourcing solutions for AI-assisted early literacy screening

DrivenData ran a machine learning competition to develop models for scoring audio recordings from literacy screener exercises completed by students in kindergarten through 3rd grade.

resources

Where to find a data job for a good cause

Finding data jobs for good causes can be difficult. Learn strategies, job lists, and tips to find organizations with open positions working on causes you care about.

winners

Meet the Winners of the Youth Mental Health Narratives Challenge

Learn about the winning solutions from the Youth Mental Health Challenge Automated Abstraction and Novel Variables Tracks

winners

Meet the winners of the Forecast and Final Prize Stages of the Water Supply Forecast Rodeo

Learn about the winners and winning solutions from the final stages of the Water Supply Forecast Rodeo.

insights

10 takeaways from 10 years of data science for social good

This year DrivenData celebrates our 10th birthday! We've spent the past decade working to use data science and AI for social good. Here are some lessons we've learned along the way.

case studies

Mapping agricultural trends in Yemen during crisis

DrivenData partnered with The World Bank to use machine learning and remote sensing data to track agricultural changes across Yemen from 2019-2024, providing critical insights for food security planning in a conflict-affected region.

case studies

Making higher education data more accessible

DrivenData partnered with Science for America to develop scipeds, an open source Python library and interactive data visualization platform designed to simplify the analysis of U.S. higher education data from IPEDS and to illuminate trends and disparities in STEM education.

Work with us to build a better world

Learn more about how our team is bringing the transformative power of data science and AI to organizations tackling the world's biggest challenges.