
How to Predict Harmful Algal Blooms Using LightGBM and Satellite Imagery


by Katie Wetstone


Welcome to the benchmark notebook for the Tick Tick Bloom: Harmful Algal Bloom Detection Challenge! If you are just getting started, we recommend reading the competition homepage first. The goal of this benchmark is to:

  1. Demonstrate how to explore and work with the data
  2. Provide a basic framework for building a model
  3. Demonstrate how to package your work correctly for submission

You can either expand on and improve this benchmark, or start with something completely different! Let's get started.


Inland water bodies like lakes and reservoirs provide critical drinking water and recreation for communities, and habitats for marine life. A significant challenge that water quality managers face is the formation of harmful algal blooms (HABs) such as cyanobacteria. HABs produce toxins that are poisonous to humans and their pets, and threaten marine ecosystems by blocking sunlight and oxygen. Manual water sampling, or “in situ” sampling, is generally used to monitor cyanobacteria in inland water bodies. In situ sampling is accurate, but time intensive and difficult to perform continuously.

Your goal in this challenge is to use satellite imagery to detect and classify the severity of cyanobacteria blooms in small, inland water bodies like reservoirs. Ultimately, better awareness of algal blooms helps keep both the human and marine life that relies on these water bodies safe and healthy.

In this post, we'll walk through each of these steps in turn.

In [1]:
%load_ext lab_black
%load_ext autoreload
%autoreload 2
In [2]:
import cv2
from datetime import timedelta
import matplotlib.pyplot as plt
import numpy as np
import odc.stac
import pandas as pd
from pathlib import Path
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from tqdm import tqdm

%matplotlib inline

First, we'll download all of the competition data to a local folder.

The contents of the data folder are:

.
├── metadata.csv
├── submission_format.csv
└── train_labels.csv
In [3]:
DATA_DIR = Path.cwd().parent.resolve() / "data/final/public"
assert DATA_DIR.exists()

Explore the data

Labels in this competition are based on “in situ” samples that were collected manually and then analyzed for cyanobacteria density. Each measurement is a unique combination of date and location (latitude and longitude).

metadata.csv

metadata.csv includes all of the sample measurements in the data. It has the following columns:

  • uid (str): unique ID for each row. Each row is a unique combination of date and location (latitude and longitude).

  • date (pd.datetime): date when the sample was collected, in the format YYYY-MM-DD

  • latitude (float): latitude of the location where the sample was collected

  • longitude (float): longitude of the location where the sample was collected

  • region (str): region of the US. This will be used in scoring to calculate region-specific RMSEs. Final score will be the average across the four US regions.

  • split (str): indicates whether the row is part of the train set or the test set. Metadata is provided for all points in the train and test sets.

The geographic points in the train and test sets are distinct. This means that none of the test set points are also in the train set, so your model's performance will be measured on unseen locations.

The main feature data for this competition is satellite imagery from Sentinel-2 and Landsat. Participants will access all feature data through external, publicly available APIs. Relevant imagery can be identified using the location and date of each sample from the metadata.

For a more detailed description of the competition data, see the Problem Description page.

In [4]:
metadata = pd.read_csv(DATA_DIR / "metadata.csv")
metadata.head()
Out[4]:
uid latitude longitude date split
0 aabm 39.080319 -86.430867 2018-05-14 train
1 aabn 36.559700 -121.510000 2016-08-31 test
2 aacd 35.875083 -78.878434 2020-11-19 train
3 aaee 35.487000 -79.062133 2016-08-24 train
4 aaff 38.049471 -99.827001 2019-07-23 train
In [5]:
metadata.split.value_counts(dropna=False)
Out[5]:
train    17060
test      6510
Name: split, dtype: int64
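
As a quick sanity check of the claim that train and test locations are distinct (a small sketch, not part of the original benchmark), we can confirm that no exact (latitude, longitude) pair appears in both splits. Nearby-but-not-identical points would need a distance-based check instead.

train_pts = metadata[metadata.split == "train"]
test_pts = metadata[metadata.split == "test"]

# coordinate pairs that appear in both the train and test splits -- we expect this to print 0
train_locs = set(zip(train_pts.latitude, train_pts.longitude))
test_locs = set(zip(test_pts.latitude, test_pts.longitude))
print(len(train_locs & test_locs))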

Location

Let's take a look at the distribution of samples by location.

In [6]:
import geopandas as gpd
from shapely.geometry import Point

# load the default geopandas base map file to plot points on
world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
In [7]:
fig, ax = plt.subplots(figsize=(9, 4))

# map the training data
base = world[world.name == "United States of America"].plot(
    edgecolor="gray", color="ghostwhite", figsize=(9, 4), alpha=0.3, ax=ax
)
train_meta = metadata[metadata["split"] == "train"]
geometry = [Point(xy) for xy in zip(train_meta["longitude"], train_meta["latitude"])]
gdf = gpd.GeoDataFrame(train_meta, geometry=geometry)
gdf.plot(ax=base, marker=".", markersize=3, color="blue", label="Train", alpha=0.6)

# map the test data
test_meta = metadata[metadata["split"] == "test"]
geometry = [Point(xy) for xy in zip(test_meta["longitude"], test_meta["latitude"])]
gdf = gpd.GeoDataFrame(test_meta, geometry=geometry)
gdf.plot(ax=base, marker=".", markersize=3, color="orange", label="Test", alpha=0.6)

plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.xlim([-125, -65])
plt.ylim([25, 50])
plt.legend(loc=4, markerscale=3)
Out[7]:
<matplotlib.legend.Legend at 0x7f7ed7562c10>

We can see that the sampling points are distributed all throughout the continental United States.

Date

What date range are the samples from?

In [8]:
# convert date to pd.datetime
metadata.date = pd.to_datetime(metadata.date)

# what is the date range?
metadata.groupby("split").agg(min_date=("date", min), max_date=("date", max))
Out[8]:
min_date max_date
split
test 2013-01-08 2021-12-29
train 2013-01-04 2021-12-14
In [9]:
# what years are in the data?
pd.crosstab(metadata.date.dt.year, metadata.split).plot(kind="bar")
plt.ylabel("Number of samples")
plt.xlabel("Year")
plt.title("Distribution of years in the data")
Out[9]:
Text(0.5, 1.0, 'Distribution of years in the data')
In [10]:
# what seasons are the data points from?
metadata["season"] = (
    metadata.date.dt.month.replace([12, 1, 2], "winter")
    .replace([3, 4, 5], "spring")
    .replace([6, 7, 8], "summer")
    .replace([9, 10, 11], "fall")
)
metadata.season.value_counts()
Out[10]:
summer    10813
spring     5045
fall       4758
winter     2954
Name: season, dtype: int64

Most of the data is from summer. Harmful algal blooms are more likely to be dangerous during the summer because more individuals are taking advantage of water bodies like lakes for recreation.

In [11]:
# where is data from for each season?
fig, axes = plt.subplots(2, 2, figsize=(10, 5))

for season, ax in zip(metadata.season.unique(), axes.flatten()):
    base = world[world.name == "United States of America"].plot(
        edgecolor="gray", color="ghostwhite", alpha=0.3, ax=ax
    )

    sub = metadata[metadata.season == season]
    geometry = [Point(xy) for xy in zip(sub["longitude"], sub["latitude"])]
    gdf = gpd.GeoDataFrame(sub, geometry=geometry)
    gdf.plot(ax=base, marker=".", markersize=2.5)
    ax.set_xlim([-125, -66])
    ax.set_ylim([25, 50])
    ax.set_title(f"{season.capitalize()} data points")
    ax.axis("off")

train_labels.csv

Let's look at the labels for the training data.

In [12]:
train_labels = pd.read_csv(DATA_DIR / "train_labels.csv")
train_labels.head()
Out[12]:
uid region severity density
0 aabm midwest 1 585.0
1 aacd south 1 290.0
2 aaee south 1 1614.0
3 aaff midwest 3 111825.0
4 aafl midwest 4 2017313.0
In [13]:
train_labels.shape
Out[13]:
(17060, 4)

We have one row per in situ sample. Each row is a unique combination of date and location (latitude + longitude). There are columns for:

  • uid (str): unique ID for each row. The uid maps each row in train_labels.csv to metadata.csv

  • region (str): US region in which the sample was taken. Scores are calculated separately for each of these regions, and then averaged. See the Problem Description page for details.

  • severity (int): severity level based on the cyanobacteria density. This is what you'll be predicting.

  • density (float): raw measurement of total cyanobacteria density in cells per milliliter (mL)

Note that the target value for participants to predict is a severity level, NOT the raw cell density in cells per mL. The severity levels are:

severity Density range (cells per mL)
1 <20,000
2 20,000 - <100,000
3 100,000 - <1,000,000
4 1,000,000 - <10,000,000
5 ≥10,000,000
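
For reference, here is one way the density-to-severity mapping in the table above could be written in code (a sketch, not part of the original benchmark; bin edges follow the table, with each upper edge exclusive):

def density_to_severity(density):
    # bin edges follow the severity table; right=False makes each upper edge exclusive
    bins = [0, 20_000, 100_000, 1_000_000, 10_000_000, np.inf]
    return pd.cut(density, bins=bins, labels=[1, 2, 3, 4, 5], right=False).astype(int)

# spot check: this should reproduce the severity column from the density column
density_to_severity(train_labels.density).head()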

We can find the matching metadata for each training label by merging on uid. For example:

train_labels_and_metadata = train_labels.merge(
    metadata, how="left", left_on="uid", right_on="uid", validate="1:1"
)

Cyanobacteria values

What are the cyanobacteria severity values like?

In [14]:
severity_counts = (
    train_labels.replace(
        {
            "severity": {
                1: "1 (<20,000)",
                2: "2 (20,000-100,000)",
                3: "3 (100,000 - 1,000,000)",
                4: "4 (1,00,000 - 10,000,000)",
                5: "5 (>10,000,00)",
            }
        }
    )
    .severity.value_counts()
    .sort_index(ascending=False)
)
plt.barh(severity_counts.index, severity_counts.values)
plt.xlabel("Number of samples")
plt.ylabel("Severity (range in cells/mL)")
plt.title("Train labels severity level counts")
Out[14]:
Text(0.5, 1.0, 'Train labels severity level counts')
In [15]:
train_labels.density.describe()
Out[15]:
count    1.706000e+04
mean     1.074537e+06
std      6.836693e+06
min      0.000000e+00
25%      4.066000e+03
50%      3.270975e+04
75%      4.849192e+05
max      8.046675e+08
Name: density, dtype: float64

We can see that the density is highly skewed to the right, with a long tail of very high values.

There are also some rows where the density of cyanobacteria is 0.

In [16]:
(train_labels.density == 0).sum()
Out[16]:
91
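
Because of that skew, it can be easier to see the shape of the distribution on a log scale (a quick sketch, not part of the original benchmark; np.log1p is used so the zero-density rows are handled):

plt.hist(np.log1p(train_labels.density), bins=50)
plt.xlabel("log(1 + density)")
plt.ylabel("Number of samples")
plt.title("Distribution of cyanobacteria density (log scale)")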

submission_format.csv

Now let's look at the submission format. To create your own submission, download the submission format and replace the severity column with your own prediction values. To be scored correctly, the order of your predictions must match the order of uids in the submission format.

In [17]:
submission_format = pd.read_csv(DATA_DIR / "submission_format.csv", index_col=0)
submission_format.head()
Out[17]:
region severity
uid
aabn west 1
aair west 1
aajw northeast 1
aalr midwest 1
aalw west 1
In [18]:
submission_format.shape
Out[18]:
(6510, 2)

Like the training data, each row is a unique combination of location and date. The columns in submission_format.csv are:

  • uid (str): unique ID for each row. The uid maps each row in submission_format.csv to metadata.csv

  • region (str): US region in which the sample was taken. Scores are calculated separately for each of these regions, and then averaged. See the Problem Description page for details.

  • severity (int): placeholder for severity level based on the cyanobacteria density - all values are 0. This is the column that you will replace with your own predictions to create a submission. Participants should submit predictions for severity level, NOT for the raw cell density value in cells per mL.
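
To illustrate the mechanics of creating a submission (a minimal sketch, not the benchmark's actual submission code; your_predictions is a hypothetical array of severity values aligned with the uid order of the submission format):

submission = submission_format.copy()
# your_predictions is a hypothetical array with one severity value per row,
# in the same order as the rows of submission_format
submission["severity"] = your_predictions
submission.to_csv("submission.csv")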

Process feature data

Feature data is not provided through the competition site. Instead, participants will access all feature data through external, publicly available APIs. Relevant imagery can be identified using the location and date of each sample, listed in metadata.csv.

In this section, we'll walk through the process of pulling in matching satellite imagery using Microsoft's Planetary Computer. We'll only use Sentinel-2 L2A and Landsat Level-2 satellite imagery. Sentinel-2 L1C and Landsat Level-1 imagery are also allowed. For links to more guides on how to pull feature data, see the Problem Description page. There are a number of optional sources in addition to satellite imagery that you can also experiment with!

The general steps we'll use to pull satellite data are:

  1. Establish a connection to the Planetary Computer's STAC API using the planetary_computer and pystac_client Python packages.

  2. Query the STAC API for scenes that capture our in situ labels. For each sample, we'll search for imagery that includes the sample's location (latitude and longitude) around the date the sample was taken. In this benchmark, we'll use only Sentinel-2 L2A and Landsat Level-2 data.

  3. Select one image for each sample. We'll use Sentinel-2 data wherever it is available, because it is higher resolution. We'll have to use Landsat for data before roughly 2016, because Sentinel-2 was not available yet.

  4. Convert the image to a 1-dimensional list of features that can be input into our tree model

We'll first walk through these steps for just one example measurement. Then, we'll refactor the process and run it on all of our metadata.
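
Step 4 isn't shown until later in the post, but as a rough illustration of the idea, a cropped image can be flattened into a short feature vector by summarizing each color band (a sketch with a hypothetical helper, assuming the (band, height, width) numpy arrays produced by the cropping functions defined below):

def image_to_features(image_array):
    # image_array: numpy array with dimensions (band, height, width)
    # summarize each band with a few statistics to get a 1-dimensional feature vector
    features = []
    for band in image_array:
        features += [band.mean(), band.std(), np.median(band)]
    return np.array(features)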

In [19]:
# Establish a connection to the STAC API
import planetary_computer as pc
from pystac_client import Client

catalog = Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1", modifier=pc.sign_inplace
)

Walk through pulling feature data for one sample

Query the STAC API

In [20]:
example_row = metadata[metadata.uid == "garm"].iloc[0]
example_row
Out[20]:
uid                         garm
latitude                41.98006
longitude             -110.65734
date         2021-09-27 00:00:00
split                      train
season                      fall
Name: 5461, dtype: object

We can search (41.98006, -110.65734) in Google maps to see where this sample is from. Turns out it's from Lake Viva Naughton in Wyoming!

Location range: We'll search an area extending 50,000 meters on either side of our sample point (100,000 m x 100,000 m), to make sure we're pulling in all relevant imagery. This is just a starting point, and you can improve on how best to search for the correct location in the Planetary Computer.

In [21]:
import geopy.distance as distance
In [22]:
# get our bounding box to search latitude and longitude coordinates
def get_bounding_box(latitude, longitude, meter_buffer=50000):
    """
    Given a latitude, longitude, and buffer in meters, returns a bounding
    box around the point with the buffer on the left, right, top, and bottom.

    Returns a list of [minx, miny, maxx, maxy]
    """
    distance_search = distance.distance(meters=meter_buffer)

    # calculate the lat/long bounds based on ground distance
    # bearings are cardinal directions to move (south, west, north, and east)
    min_lat = distance_search.destination((latitude, longitude), bearing=180)[0]
    min_long = distance_search.destination((latitude, longitude), bearing=270)[1]
    max_lat = distance_search.destination((latitude, longitude), bearing=0)[0]
    max_long = distance_search.destination((latitude, longitude), bearing=90)[1]

    return [min_long, min_lat, max_long, max_lat]


bbox = get_bounding_box(example_row.latitude, example_row.longitude, meter_buffer=50000)
bbox
Out[22]:
[-111.26063646639783,
 41.52988747516146,
 -110.05404353360218,
 42.43019710235757]

Time range: We want our feature data to be as close to the time of the sample as possible, because algal blooms in small water bodies form and move very rapidly. Remember, you cannot use any data collected after the date of the sample.

Imagery taken within roughly 10 days of the sample will generally still be an accurate representation of environmental conditions at the time of the sample. For some data points you may not be able to get imagery within 10 days, and may have to use earlier data. We'll search the fifteen days up to and including the sample date.

In [23]:
# get our date range to search, and format correctly for query
def get_date_range(date, time_buffer_days=15):
    """Get a date range to search for in the planetary computer based
    on a sample's date. The time range will include the sample date
    and time_buffer_days days prior

    Returns a string"""
    datetime_format = "%Y-%m-%dT"
    range_start = pd.to_datetime(date) - timedelta(days=time_buffer_days)
    date_range = f"{range_start.strftime(datetime_format)}/{pd.to_datetime(date).strftime(datetime_format)}"

    return date_range


date_range = get_date_range(example_row.date)
date_range
Out[23]:
'2021-09-12T/2021-09-27T'
In [24]:
# search the planetary computer sentinel-l2a and landsat level-2 collections
search = catalog.search(
    collections=["sentinel-2-l2a", "landsat-c2-l2"], bbox=bbox, datetime=date_range
)

# see how many items were returned
items = [item for item in search.get_all_items()]
len(items)
Out[24]:
46

Select one image

The Planetary Computer returned 46 different items! Let's look at some of the details of the items that were found to match our label.

Remember that our example measurement was taken on 2021-09-27 at coordinates (41.98006, -110.65734). Because we used a bounding box around the sample to search, the Planetary Computer returned all items that contain any part of that bounding box. This means we still have to double check whether each item actually contains our sample point.

In [25]:
# get details of all of the items returned
item_details = pd.DataFrame(
    [
        {
            "datetime": item.datetime.strftime("%Y-%m-%d"),
            "platform": item.properties["platform"],
            "min_long": item.bbox[0],
            "max_long": item.bbox[2],
            "min_lat": item.bbox[1],
            "max_lat": item.bbox[3],
            "bbox": item.bbox,
            "item_obj": item,
        }
        for item in items
    ]
)

# check which rows actually contain the sample location
item_details["contains_sample_point"] = (
    (item_details.min_lat < example_row.latitude)
    & (item_details.max_lat > example_row.latitude)
    & (item_details.min_long < example_row.longitude)
    & (item_details.max_long > example_row.longitude)
)

print(
    f"Filtering from {len(item_details)} returned to {item_details.contains_sample_point.sum()} items that contain the sample location"
)

item_details = item_details[item_details["contains_sample_point"]]
item_details[["datetime", "platform", "contains_sample_point", "bbox"]].sort_values(
    by="datetime"
)
Filtering from 46 returned to 7 items that contain the sample location
Out[25]:
datetime platform contains_sample_point bbox
44 2021-09-12 landsat-8 True [-111.34023745, 40.67702495, -108.51528663, 42...
34 2021-09-14 Sentinel-2A True [-111.000244, 41.45624189757688, -109.665115, ...
21 2021-09-19 Sentinel-2B True [-111.000244, 41.45624189757688, -109.665115, ...
26 2021-09-19 landsat-8 True [-112.92522743, 40.67353498, -110.06644659, 42...
18 2021-09-20 landsat-7 True [-111.60018779, 40.80284495, -108.59831681, 42...
8 2021-09-24 Sentinel-2A True [-111.000244, 41.4562419, -109.665115, 42.4526...
5 2021-09-27 landsat-7 True [-113.18385774, 40.78803499, -110.15150674, 42...

To keep things simple in this benchmark, we'll just choose one to input into our benchmark model. Note that in your solution, you could find a way to incorporate multiple images!

We'll narrow to one image in two steps:

  1. If any Sentinel imagery is available, filter to only Sentinel imagery. Sentinel-2 is higher resolution than Landsat, which is extremely helpful for blooms in small water bodies. In this case, three images are from Sentinel and contain the actual sample location.

  2. Select the item that is closest in time to the sampling date.

This is a very simple way to choose the best image. You may want to explore additional strategies like selecting an image with less cloud cover obscuring the Earth's surface (as in this tutorial).
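
For example, Sentinel-2 and Landsat items on the Planetary Computer report a scene-level eo:cloud_cover property, so one rough way to prefer clearer scenes is to sort by it (a sketch, not part of the benchmark; it assumes the property is present on every returned item):

# add the reported scene-level cloud cover and sort candidates by it (clearest first)
cloud_cover = item_details.item_obj.apply(lambda i: i.properties.get("eo:cloud_cover"))
item_details.assign(cloud_cover=cloud_cover).sort_values(by="cloud_cover")[
    ["datetime", "platform", "cloud_cover"]
]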

In [26]:
# 1 - filter to sentinel
item_details[item_details.platform.str.contains("Sentinel")]
Out[26]:
datetime platform min_long max_long min_lat max_lat bbox item_obj contains_sample_point
8 2021-09-24 Sentinel-2A -111.000244 -109.665115 41.456242 42.452691 [-111.000244, 41.4562419, -109.665115, 42.4526... <Item id=S2A_MSIL2A_20210924T181101_R084_T12TW... True
21 2021-09-19 Sentinel-2B -111.000244 -109.665115 41.456242 42.452691 [-111.000244, 41.45624189757688, -109.665115, ... <Item id=S2B_MSIL2A_20210919T180919_R084_T12TW... True
34 2021-09-14 Sentinel-2A -111.000244 -109.665115 41.456242 42.452691 [-111.000244, 41.45624189757688, -109.665115, ... <Item id=S2A_MSIL2A_20210914T180951_R084_T12TW... True
In [27]:
# 2 - take closest by date
best_item = (
    item_details[item_details.platform.str.contains("Sentinel")]
    .sort_values(by="datetime", ascending=False)
    .iloc[0]
)
best_item
Out[27]:
datetime                                                        2021-09-24
platform                                                       Sentinel-2A
min_long                                                       -111.000244
max_long                                                       -109.665115
min_lat                                                          41.456242
max_lat                                                          42.452691
bbox                     [-111.000244, 41.4562419, -109.665115, 42.4526...
item_obj                 <Item id=S2A_MSIL2A_20210924T181101_R084_T12TW...
contains_sample_point                                                 True
Name: 8, dtype: object

This gives us a Sentinel-2A item that was captured on 9/24/2021 -- three days before our sample was collected on 9/27/2021.

In [28]:
item = best_item.item_obj

This item comes with a number of "assets". Let's see what these are.

In [29]:
# What assets are available?
for asset_key, asset in item.assets.items():
    print(f"{asset_key:<25} - {asset.title}")
AOT                       - Aerosol optical thickness (AOT)
B01                       - Band 1 - Coastal aerosol - 60m
B02                       - Band 2 - Blue - 10m
B03                       - Band 3 - Green - 10m
B04                       - Band 4 - Red - 10m
B05                       - Band 5 - Vegetation red edge 1 - 20m
B06                       - Band 6 - Vegetation red edge 2 - 20m
B07                       - Band 7 - Vegetation red edge 3 - 20m
B08                       - Band 8 - NIR - 10m
B09                       - Band 9 - Water vapor - 60m
B11                       - Band 11 - SWIR (1.6) - 20m
B12                       - Band 12 - SWIR (2.2) - 20m
B8A                       - Band 8A - Vegetation red edge 4 - 20m
SCL                       - Scene classfication map (SCL)
WVP                       - Water vapour (WVP)
visual                    - True color image
preview                   - Thumbnail
safe-manifest             - SAFE manifest
granule-metadata          - Granule metadata
inspire-metadata          - INSPIRE metadata
product-metadata          - Product metadata
datastrip-metadata        - Datastrip metadata
tilejson                  - TileJSON with default rendering
rendered_preview          - Rendered preview

We have visible bands (red, green, and blue), as well as a number of other spectral ranges and a few algorithmic bands. The Sentinel-2 mission guide has more details about what these bands are and how to use them!

A few of the algorithmic bands that may be useful are:

  • Scene classification (SCL): The scene classification band sorts pixels into categories including water, high cloud probability, medium cloud probability, and vegetation. Water pixels could be used to calculate the size of a given water body, which impacts the behavior of blooms. Vegetation can indicate non-toxic marine life like sea grass that sometimes resembles cyanobacteria.

  • Cloud masks (CLM): Sentinel-2’s cloud bands can be used to filter out cloudy pixels. Pixels that are obscured by clouds are likely not helpful for algal bloom detection because the actual water surface is not captured.

We won't explore these in the benchmark, but you can go wild in your solution!
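
As a starting point, here is a rough sketch (not part of the benchmark) of counting water pixels near the sample using the SCL asset; it assumes class value 6, which is water in the Sentinel-2 scene classification, and crops to a small area around the sample point:

import rioxarray

# crop the scene classification (SCL) band to a small area around the sample point
minx, miny, maxx, maxy = get_bounding_box(
    example_row.latitude, example_row.longitude, meter_buffer=3000
)
scl = rioxarray.open_rasterio(pc.sign(item.assets["SCL"].href)).rio.clip_box(
    minx=minx, miny=miny, maxx=maxx, maxy=maxy, crs="EPSG:4326"
)

# class 6 is water in the Sentinel-2 scene classification
water_pixel_count = int((scl.to_numpy() == 6).sum())
water_pixel_count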

Visualizing Sentinel-2 imagery

Let's see what the imagery actually looks like.

In [30]:
import rioxarray
from IPython.display import Image
from PIL import Image as PILImage
In [31]:
# see the whole image
Image(url=item.assets["rendered_preview"].href, width=500)
Out[31]:

That's a LOT of area, and all we care about is right around our sampling point. Let's get a closer look to make sure we're in the right neighborhood.

In [32]:
def crop_sentinel_image(item, bounding_box):
    """
    Given a STAC item from Sentinel-2 and a bounding box tuple in the format
    (minx, miny, maxx, maxy), return a cropped portion of the item's visual
    imagery in the bounding box.

    Returns the image as a numpy array with dimensions (color band, height, width)
    """
    (minx, miny, maxx, maxy) = bounding_box

    image = rioxarray.open_rasterio(pc.sign(item.assets["visual"].href)).rio.clip_box(
        minx=minx,
        miny=miny,
        maxx=maxx,
        maxy=maxy,
        crs="EPSG:4326",
    )

    return image.to_numpy()
In [33]:
# get a smaller geographic bounding box
minx, miny, maxx, maxy = get_bounding_box(
    example_row.latitude, example_row.longitude, meter_buffer=3000
)

# get the zoomed in image array
bbox = (minx, miny, maxx, maxy)
zoomed_img_array = crop_sentinel_image(item, bbox)

zoomed_img_array[0]
Out[33]:
array([[214, 224, 213, ..., 124, 109, 103],
       [215, 220, 201, ..., 114,  95,  94],
       [217, 222, 187, ..., 106,  95,  94],
       ...,
       [223, 185, 193, ..., 112, 118, 120],
       [181, 140, 190, ..., 110, 111, 117],
       [134, 125, 185, ..., 107, 107, 111]], dtype=uint8)
In [34]:
# we have to transpose some of the dimensions to plot
# matplotlib expects channels in a certain order
plt.imshow(np.transpose(zoomed_img_array, axes=[1, 2, 0]))
Out[34]:
<matplotlib.image.AxesImage at 0x7f7ebc07f760>

To make sure this imagery looks right, we can search for the sample in Google maps with (latitude, longitude) -- in this case, we'll search (41.98006, -110.65734). That's a match!

Visualizing Landsat imagery

STAC items from Sentinel and from Landsat are slightly different, so we'll have to use slightly different methods to pull imagery based on its satellite source.

Let's look at one of the Landsat items returned and see how to visualize it. We'll use the odc-stac package.

In [35]:
landsat_item = (
    item_details[item_details.platform.str.contains("landsat")]
    .sample(n=1, random_state=3)
    .iloc[0]
)
landsat_item
Out[35]:
datetime                                                        2021-09-12
platform                                                         landsat-8
min_long                                                       -111.340237
max_long                                                       -108.515287
min_lat                                                          40.677025
max_lat                                                          42.817595
bbox                     [-111.34023745, 40.67702495, -108.51528663, 42...
item_obj                         <Item id=LC08_L2SP_037031_20210912_02_T1>
contains_sample_point                                                 True
Name: 44, dtype: object
In [36]:
def crop_landsat_image(item, bounding_box):
    """
    Given a STAC item from Landsat and a bounding box tuple in the format
    (minx, miny, maxx, maxy), return a cropped portion of the item's visual
    imagery in the bounding box.

    Returns the image as a numpy array with dimensions (color band, height, width)
    """
    (minx, miny, maxx, maxy) = bounding_box

    image = odc.stac.stac_load(
        [pc.sign(item)], bands=["red", "green", "blue"], bbox=[minx, miny, maxx, maxy]
    ).isel(time=0)
    image_array = image[["red", "green", "blue"]].to_array().to_numpy()

    # normalize to 0 - 255 values
    image_array = cv2.normalize(image_array, None, 0, 255, cv2.NORM_MINMAX)

    return image_array
In [37]:
item = landsat_item.item_obj

# we'll use the same cropped area as above
landsat_image_array = crop_landsat_image(item, bbox)
landsat_image_array[0]
Out[37]:
array([[104, 108,  94, ...,  85,  86,  88],
       [106, 110,  95, ...,  86,  89,  87],
       [106, 105,  94, ...,  89,  89,  85],
       ...,
       [106, 100,  96, ...,  90,  89,  90],
       [ 95,  93,  92, ...,  89,  88,  88],
       [113, 105, 103, ...,  88,  87,  89]], dtype=uint16)
In [38]:
plt.imshow(np.transpose(landsat_image_array, axes=[1, 2, 0]))
Out[38]:
<matplotlib.image.AxesImage at 0x7f7edb1bad60>

Note that unlike Sentinel-2 imagery, Landsat imagery is not originally returned with image values scaled to 0-255. Our function above scales the pixel values with cv2.normalize(image_array, None, 0, 255, cv2.NORM_MINMAX) so that it is more comparable to our Sentinel-2 imagery, and we can input both of them as features into our model. You may want to explore other methods of converting Sentinel and Landsat imagery to comparable scales to make sure that no information is lost when re-scaling.

As an example, let's look at the Landsat image for the item shown above before it is rescaled to 0-255.

In [39]:
# load image but don't convert to numpy or rescale
image = odc.stac.stac_load(
    [pc.sign(item)], bands=["red", "green", "blue"], bbox=bbox
).isel(time=0)
image_array = image[["red", "green", "blue"]].to_array()

# values are not scaled 0 - 255 when first returned
image_array[0]
Out[39]:
<xarray.DataArray (y: 202, x: 202)>
array([[12933, 13488, 11722, ..., 10544, 10770, 10988],
       [13164, 13673, 11810, ..., 10660, 11079, 10844],
       [13250, 13085, 11760, ..., 11094, 11137, 10551],
       ...,
       [13158, 12414, 11900, ..., 11263, 11087, 11157],
       [11882, 11630, 11493, ..., 11082, 10912, 10931],
       [14101, 13137, 12880, ..., 10945, 10871, 11113]], dtype=uint16)
Coordinates:
  * y            (y) float64 4.651e+06 4.651e+06 ... 4.645e+06 4.645e+06
  * x            (x) float64 5.254e+05 5.254e+05 ... 5.314e+05 5.314e+05
    spatial_ref  int32 32612
    time         datetime64[ns] 2021-09-12T18:01:53.979983
    variable     <U3 'red'
In [40]:
# image appears differently without rescaling
fig, ax = plt.subplots(figsize=(5, 5))
image_array.plot.imshow(robust=True, ax=ax)
Out[40]:
<matplotlib.image.AxesImage at 0x7f7ebc599d30>
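
One alternative to min-max normalization is to convert the Landsat digital numbers to approximate surface reflectance using the Collection 2 Level-2 scale factors (multiply by 0.0000275 and add -0.2, per the USGS documentation), then clip and rescale to 0-255. A sketch is below; the exact factors should be verified against the product metadata rather than taken from this benchmark.

# convert Landsat Collection 2 Level-2 digital numbers to approximate surface reflectance
reflectance = image_array.to_numpy() * 0.0000275 - 0.2

# clip to the valid reflectance range, then rescale to 0-255 for comparison
# with Sentinel-2's visual asset
reflectance = np.clip(reflectance, 0, 1)
landsat_rescaled = (reflectance * 255).astype(np.uint8)
plt.imshow(np.transpose(landsat_rescaled, axes=[1, 2, 0]))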