by
Katie Wetstone
How to Predict Harmful Algal Blooms Using LightGBM and Satellite Imagery¶
Welcome to the benchmark notebook for the Tick Tick Bloom: Harmful Algal Bloom Detection Challenge! If you are just getting starting, we recommended reading the competition homepage first. The goal of this benchmark is to:
- Demonstrate how to explore and work with the data
- Provide a basic framework for building a model
- Demonstrate how to package your work correctly for submission
You can either expand on and improve this benchmark, or start with something completely different! Let's get started.
Inland water bodies like lakes and reservoirs provide critical drinking water and recreation for communities, and habitats for marine life. A significant challenge that water quality managers face is the formation of harmful algal blooms (HABs) such as cyanobacteria. HABs produce toxins that are poisonous to humans and their pets, and threaten marine ecosystems by blocking sunlight and oxygen. Manual water sampling, or “in situ” sampling, is generally used to monitor cyanobacteria in inland water bodies. In situ sampling is accurate, but time intensive and difficult to perform continuously.
Your goal in this challenge is to use satellite imagery to detect and classify the severity of cyanobacteria blooms in small, inland water bodies like reservoirs. Ultimately, better awareness of algal blooms helps keep both the human and marine life that relies on these water bodies safe and healthy.
In this post, we'll cover:¶
- Exploring the data
- Pulling and processing feature data from Sentinel-2 L2A and Landsat Level-2 using the Planetary Computer
- Building a tree-based model using LightGBM
- Generating a properly formatted submission
%load_ext lab_black
%load_ext autoreload
%autoreload 2
import cv2
from datetime import timedelta
import matplotlib.pyplot as plt
import numpy as np
import odc.stac
import pandas as pd
from pathlib import Path
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from tqdm import tqdm
%matplotlib inline
First, we'll download all of the competition data to a local folder.
The contents of the data folder are:
.
├── metadata.csv
├── submission_format.csv
└── train_labels.csv
DATA_DIR = Path.cwd().parent.resolve() / "data/final/public"
assert DATA_DIR.exists()
Explore the data¶
Labels in this competition are based on “in situ” samples that were collected manually and then analyzed for cyanobacteria density. Each measurement is a unique combination of date and location (latitude and longitude).
metadata.csv
¶
metadata.csv
includes all of the sample measurements in the data. It has the following columns:
uid
(str): unique ID for each row. Each row is a unique combination of date and location (latitude and longitude).date
(pd.datetime): date when the sample was collected, in the format YYYY-MM-DDlatitude
(float): latitude of the location where the sample was collectedlongitude
(float): longitude of the location where the sample was collectedregion
(str): region of the US. This will be used in scoring to calculate region-specific RMSEs. Final score will be the average across the four US regions.split
(str): indicates whether the row is part of the train set or the test set. Metadata is provided for all points in the train and test sets.
The geographic points in the train and test sets are distinct. This means that none of the test set points are also in the train set, so your model's performance will be measured on unseen locations.
The main feature data for this competition is satellite imagery from Sentinel-2 and Landsat. Participants will access all feature data through external, publicly available APIs. Relevant imagery can be identified using the location and date of each sample from the metadata.
For a more detailed description of the competition data, see the Problem Description page.
metadata = pd.read_csv(DATA_DIR / "metadata.csv")
metadata.head()
metadata.split.value_counts(dropna=False)
Location
Let's take a look at the distribution of samples by location.
import geopandas as gpd
from shapely.geometry import Point
# load the default geopandas base map file to plot points on
world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
fig, ax = plt.subplots(figsize=(9, 4))
# map the training data
base = world[world.name == "United States of America"].plot(
edgecolor="gray", color="ghostwhite", figsize=(9, 4), alpha=0.3, ax=ax
)
train_meta = metadata[metadata["split"] == "train"]
geometry = [Point(xy) for xy in zip(train_meta["longitude"], train_meta["latitude"])]
gdf = gpd.GeoDataFrame(train_meta, geometry=geometry)
gdf.plot(ax=base, marker=".", markersize=3, color="blue", label="Train", alpha=0.6)
# map the test data
test_meta = metadata[metadata["split"] == "test"]
geometry = [Point(xy) for xy in zip(test_meta["longitude"], test_meta["latitude"])]
gdf = gpd.GeoDataFrame(test_meta, geometry=geometry)
gdf.plot(ax=base, marker=".", markersize=3, color="orange", label="Test", alpha=0.6)
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.xlim([-125, -65])
plt.ylim([25, 50])
plt.legend(loc=4, markerscale=3)
We can see that the sampling points are distributed all throughout the continental United States.
Date
What date range are the samples from?
# convert date to pd.datetime
metadata.date = pd.to_datetime(metadata.date)
# what is the date range?
metadata.groupby("split").agg(min_date=("date", min), max_date=("date", max))
# what years are in the data?
pd.crosstab(metadata.date.dt.year, metadata.split).plot(kind="bar")
plt.ylabel("Number of samples")
plt.xlabel("Year")
plt.title("Distribution of years in the data")
# what seasons are the data points from?
metadata["season"] = (
metadata.date.dt.month.replace([12, 1, 2], "winter")
.replace([3, 4, 5], "spring")
.replace([6, 7, 8], "summer")
.replace([9, 10, 11], "fall")
)
metadata.season.value_counts()
Most of the data is from summer. Harmful algal blooms are more likely to be dangerous during the summer because more individuals are taking advantage of water bodies like lakes for recreation.
# where is data from for each season?
fig, axes = plt.subplots(2, 2, figsize=(10, 5))
for season, ax in zip(metadata.season.unique(), axes.flatten()):
base = world[world.name == "United States of America"].plot(
edgecolor="gray", color="ghostwhite", alpha=0.3, ax=ax
)
sub = metadata[metadata.season == season]
geometry = [Point(xy) for xy in zip(sub["longitude"], sub["latitude"])]
gdf = gpd.GeoDataFrame(sub, geometry=geometry)
gdf.plot(ax=base, marker=".", markersize=2.5)
ax.set_xlim([-125, -66])
ax.set_ylim([25, 50])
ax.set_title(f"{season.capitalize()} data points")
ax.axis("off")
train_labels.csv
¶
Let's look at the labels for the training data.
train_labels = pd.read_csv(DATA_DIR / "train_labels.csv")
train_labels.head()
train_labels.shape
We have one row per in situ sample. Each row is a unique combination of date and location (latitude + longitude). There are columns for:
uid
(str): unique ID for each row. Theuid
maps each row intrain_labels.csv
tometadata.csv
region
(str): US region in which the sample was taken. Scores are calculated separately for each of these regions, and then averaged. See the Problem Description page for details.severity
(int): severity level based on the cyanobacteria density. This is what you'll be predicting.density
(float): raw measurement of total cyanobacteria density in cells per milliliter (mL)
Note that the target value for participants to predict a severity level, NOT raw cell density in cells per mL. The severity levels are:
severity |
Density range (cells per mL) |
---|---|
1 | <20,000 |
2 | 20,000 - <100,000 |
3 | 100,000 - <1,000,000 |
4 | 1,000,000 - <10,000,000 |
5 | ≥10,000,000 |
We can find the matching metadata for each training label by merging on uid
. For example:
train_labels_and_metadata = train_labels.merge(
metadata, how="left", left_on="uid", right_on="uid", validate="1:1"
)
Cyanobacteria values
What are the cyanobacteria severity values like?
severity_counts = (
train_labels.replace(
{
"severity": {
1: "1 (<20,000)",
2: "2 (20,000-100,000)",
3: "3 (100,000 - 1,000,000)",
4: "4 (1,00,000 - 10,000,000)",
5: "5 (>10,000,00)",
}
}
)
.severity.value_counts()
.sort_index(ascending=False)
)
plt.barh(severity_counts.index, severity_counts.values)
plt.xlabel("Number of samples")
plt.ylabel("Severity (range in cells/mL)")
plt.title("Train labels severity level counts")
train_labels.density.describe()
We can see that the density is highly skewed to the right, with a long tail of very high values.
There are also some rows where the density of cyanobacteria is 0.
(train_labels.density == 0).sum()
submission_format.csv
¶
Now let's look at the submission format. To create your own submission, download the submission format and replace the severity
column with your own prediction values. To be scored correctly, the order of your predictions must match the order of uid
s in the submission format.
submission_format = pd.read_csv(DATA_DIR / "submission_format.csv", index_col=0)
submission_format.head()
submission_format.shape
Like the training data, each row is a unique combination of location and date. The columns in submission_format.csv
are:
uid
(str): unique ID for each row. Theuid
maps each row intrain_labels.csv
tometadata.csv
region
(str): US region in which the sample was taken. Scores are calculated separately for each of these regions, and then averaged. See the Problem Description page for details.severity
(int): placeholder for severity level based on the cyanobacteria density - all values are 0. This is the column that you will replace with your own predictions to create a submission. Participants should submit predictions for severity level, NOT for the raw cell density value in cells per mL.
Process feature data¶
Feature data is not provided through the competition site. Instead, participants will access all feature data through external, publicly available APIs. Relevant imagery can be identified using the location and date of each sample, listed in metadata.csv
.
In this section, we'll walk through the process of pulling in matching satellite imagery using Microsoft's Planetary Computer. We'll only use Sentinel-2 L2A and Landsat Level-2 satellite imagery. Sentinel-2 L1C and Landsat Level-1 imagery are also allowed. For links to more guides on how to pull feature data, see the Problem Description page. There are a number of optional sources in addition to satellite imagery that you can also experiment with!
The general steps we'll use to pull satellite data are:
Establish a connection to the Planetary Computer's STAC API using the
planetary_computer
andpystac_client
Python packages.Query the STAC API for scenes that capture our in situ labels. For each sample, we'll search for imagery that includes the sample's location (latitude and longitude) around the date the sample was taken. In this benchmark, we'll use only Sentinel-2 L2A and Landsat Level-2 data.
Select one image for each sample. We'll use Sentinel-2 data wherever it is available, because it is higher resolution. We'll have to use Landsat for data before roughly 2016, because Sentinel-2 was not available yet.
Convert the image to a 1-dimensional list of features that can be input into our tree model
We'll first walk through these steps for just one example measurement. Then, we'll refactor the process and run it on all of our metadata.
# Establish a connection to the STAC API
import planetary_computer as pc
from pystac_client import Client
catalog = Client.open(
"https://planetarycomputer.microsoft.com/api/stac/v1", modifier=pc.sign_inplace
)
Walk through pulling feature data for one sample¶
Query the STAC API¶
example_row = metadata[metadata.uid == "garm"].iloc[0]
example_row
We can search (41.98006, -110.65734) in Google maps to see example where this sample is from. Turns out it's from Lake Viva Naughton in Wyoming!
Location range: We'll search an area with 50,000 meters on our either side of our sample point (100,000 m x 100,000 m), to make sure we're pulling in all relevant imagery. This is just a starting point, and you can improve on how to best search for the correct location in the Planetary Computer.
import geopy.distance as distance
# get our bounding box to search latitude and longitude coordinates
def get_bounding_box(latitude, longitude, meter_buffer=50000):
"""
Given a latitude, longitude, and buffer in meters, returns a bounding
box around the point with the buffer on the left, right, top, and bottom.
Returns a list of [minx, miny, maxx, maxy]
"""
distance_search = distance.distance(meters=meter_buffer)
# calculate the lat/long bounds based on ground distance
# bearings are cardinal directions to move (south, west, north, and east)
min_lat = distance_search.destination((latitude, longitude), bearing=180)[0]
min_long = distance_search.destination((latitude, longitude), bearing=270)[1]
max_lat = distance_search.destination((latitude, longitude), bearing=0)[0]
max_long = distance_search.destination((latitude, longitude), bearing=90)[1]
return [min_long, min_lat, max_long, max_lat]
bbox = get_bounding_box(example_row.latitude, example_row.longitude, meter_buffer=50000)
bbox
Time range: We want our feature data to be as close to the time of the sample as possible, because in algal blooms in small water bodies form and move very rapidly. Remember, you cannot use any data collected after the date of the sample.
Imagery taken with roughly 10 days of the sample will generally still be an accurate representation of environmental conditions at the time of the sample. For some data points you may not be able to get data within 10 days, and may have to use earlier data. We'll search the fifteen days up to the sample time, including the sample date.
# get our date range to search, and format correctly for query
def get_date_range(date, time_buffer_days=15):
"""Get a date range to search for in the planetary computer based
on a sample's date. The time range will include the sample date
and time_buffer_days days prior
Returns a string"""
datetime_format = "%Y-%m-%dT"
range_start = pd.to_datetime(date) - timedelta(days=time_buffer_days)
date_range = f"{range_start.strftime(datetime_format)}/{pd.to_datetime(date).strftime(datetime_format)}"
return date_range
date_range = get_date_range(example_row.date)
date_range
# search the planetary computer sentinel-l2a and landsat level-2 collections
search = catalog.search(
collections=["sentinel-2-l2a", "landsat-c2-l2"], bbox=bbox, datetime=date_range
)
# see how many items were returned
items = [item for item in search.get_all_items()]
len(items)
Select one image¶
The planetary computer returned 46 different items! Let's look at some of the details of the items that were found to match our label.
Remember that our example measurement was taken on 2021-09-27 at coordinates (41.98006, -110.65734). Because we used a bounding box around the sample to search, the Planetary Computer returned all items that contain any part of that bounding box. This means we still have to double check whether each item actually contains our sample point.
# get details of all of the items returned
item_details = pd.DataFrame(
[
{
"datetime": item.datetime.strftime("%Y-%m-%d"),
"platform": item.properties["platform"],
"min_long": item.bbox[0],
"max_long": item.bbox[2],
"min_lat": item.bbox[1],
"max_lat": item.bbox[3],
"bbox": item.bbox,
"item_obj": item,
}
for item in items
]
)
# check which rows actually contain the sample location
item_details["contains_sample_point"] = (
(item_details.min_lat < example_row.latitude)
& (item_details.max_lat > example_row.latitude)
& (item_details.min_long < example_row.longitude)
& (item_details.max_long > example_row.longitude)
)
print(
f"Filtering from {len(item_details)} returned to {item_details.contains_sample_point.sum()} items that contain the sample location"
)
item_details = item_details[item_details["contains_sample_point"]]
item_details[["datetime", "platform", "contains_sample_point", "bbox"]].sort_values(
by="datetime"
)
To keep things simple in this benchmark, we'll just choose one to input into our benchmark model. Note that in your solution, you could find a way to incorporate multiple images!
We'll narrow to one image in two steps:
If any Sentinel imagery is available, filter to only Sentinel imagery. Sentinel-2 is higher resolution than Landsat, which is extremely helpful for blooms in small water bodies. In this case, two images are from Sentinel and contain the actual sample location.
Select the item that is the closest time wise to the sampling date. This gives us a Sentinel-2A item that was captured on 10/20/2022 - two days before our sample was collected on 10/22.
This is a very simple way to choose the best image. You may want to explore additional strategies like selecting an image with less cloud cover obscuring the Earth's surface (as in this tutorial).
# 1 - filter to sentinel
item_details[item_details.platform.str.contains("Sentinel")]
# 2 - take closest by date
best_item = (
item_details[item_details.platform.str.contains("Sentinel")]
.sort_values(by="datetime", ascending=False)
.iloc[0]
)
best_item
This gives us a Sentinel-2A item that was captured on 9/24/2021 -- three days before our sample was collected on 9/27/2021.
item = best_item.item_obj
This item comes with a number of "assets". Let's see what these are.
# What assets are available?
for asset_key, asset in item.assets.items():
print(f"{asset_key:<25} - {asset.title}")
We have visible bands (red, green, and blue), as well as a number of other spectral ranges and a few algorithmic bands. The Sentinel-2 mission guide has more details about what these bands are and how to use them!
A few of the algorithmic bands that may be useful are:
Scene classification (SCL): The scene classification band sorts pixels into categories including water, high cloud probability, medium cloud probability, and vegetation. Water pixels could be used to calculate the size of a given water body, which impacts the behavior of blooms. Vegetation can indicate non-toxic marine life like sea grass that sometimes resembles cyanobacteria.
Cloud masks (CLM): Sentinel-2’s cloud bands can be used to filter out cloudy pixels. Pixels that are obscured by clouds are likely not helpful for algal bloom detection because the actual water surface is not captured.
We won't explore these in the benchmark, but you can go wild in your solution!
Visualizing Sentinel-2 imagery¶
Let's see what the imagery actually looks like.
import rioxarray
from IPython.display import Image
from PIL import Image as PILImage
# see the whole image
img = Image(url=item.assets["rendered_preview"].href, width=500)
Image(url=item.assets["rendered_preview"].href, width=500)
That's a LOT of area, and all we care about is right around our sampling point. Let's get a closer to look to make sure we're in the right neighborhood.
def crop_sentinel_image(item, bounding_box):
"""
Given a STAC item from Sentinel-2 and a bounding box tuple in the format
(minx, miny, maxx, maxy), return a cropped portion of the item's visual
imagery in the bounding box.
Returns the image as a numpy array with dimensions (color band, height, width)
"""
(minx, miny, maxx, maxy) = bounding_box
image = rioxarray.open_rasterio(pc.sign(item.assets["visual"].href)).rio.clip_box(
minx=minx,
miny=miny,
maxx=maxx,
maxy=maxy,
crs="EPSG:4326",
)
return image.to_numpy()
# get a smaller geographic bounding box
minx, miny, maxx, maxy = get_bounding_box(
example_row.latitude, example_row.longitude, meter_buffer=3000
)
# get the zoomed in image array
bbox = (minx, miny, maxx, maxy)
zoomed_img_array = crop_sentinel_image(item, bbox)
zoomed_img_array[0]
# we have to transpose some of the dimensions to plot
# matplotlib expects channels in a certain order
plt.imshow(np.transpose(zoomed_img_array, axes=[1, 2, 0]))
To make sure this imagery looks right, we can search for the sample in Google maps with (latitude, longitude)
-- in this case, we'll search (41.98006, -110.65734). That's a match!
landsat_item = (
item_details[item_details.platform.str.contains("landsat")]
.sample(n=1, random_state=3)
.iloc[0]
)
landsat_item
def crop_landsat_image(item, bounding_box):
"""
Given a STAC item from Landsat and a bounding box tuple in the format
(minx, miny, maxx, maxy), return a cropped portion of the item's visual
imagery in the bounding box.
Returns the image as a numpy array with dimensions (color band, height, width)
"""
(minx, miny, maxx, maxy) = bounding_box
image = odc.stac.stac_load(
[pc.sign(item)], bands=["red", "green", "blue"], bbox=[minx, miny, maxx, maxy]
).isel(time=0)
image_array = image[["red", "green", "blue"]].to_array().to_numpy()
# normalize to 0 - 255 values
image_array = cv2.normalize(image_array, None, 0, 255, cv2.NORM_MINMAX)
return image_array
item = landsat_item.item_obj
# we'll use the same cropped area as above
landsat_image_array = crop_landsat_image(item, bbox)
landsat_image_array[0]
plt.imshow(np.transpose(landsat_image_array, axes=[1, 2, 0]))
Note that unlike Sentinel-2 imagery, Landsat imagery is not originally returned with image values scaled to 0-255. Our function above scales the pixel values with cv2.normalize(image_array, None, 0, 255, cv2.NORM_MINMAX)
so that it is more comparable to our Sentinel-2 imagery, and we can input both of them as features into our model. You may want to explore other methods of converting Sentinel and Landsat imagery to comparable scales to make sure that no information is lost when re-scaling.
For an example, let's look at the Landsat image for the item shown above before it is rescaled to 0-255.
# load image but don't convert to numpy or rescale
image = odc.stac.stac_load(
[pc.sign(item)], bands=["red", "green", "blue"], bbox=bbox
).isel(time=0)
image_array = image[["red", "green", "blue"]].to_array()
# values are not scaled 0 - 255 when first returned
image_array[0]
# image appears differently without rescaling
fig, ax = plt.subplots(figsize=(5, 5))
image_array.plot.imshow(robust=True, ax=ax)