Bringing small water bodies into view: Sentinel-2 satellite monitoring of harmful algal blooms (HABs)

CyFi enhances modern HAB monitoring programs by extending their reach and informing field-based components.

Emily Dorne
Lead Data Scientist

Blooms as seen from Sentinel-2: a view of smaller water bodies, some with HABs.

For years, harmful algal bloom (HAB) monitoring has lived with a simple but frustrating truth: our tools work well, just not at the spatial and temporal scales we need. Large lakes and coastal systems can be monitored with in-water sensors, such as Imaging FlowCytobots (IFCBs), and with remote sensing tools such as the EPA’s CyANWeb. Unfortunately, ~98% of U.S. lakes are too small for effective HAB detection using CyAN because of Sentinel-3’s 300 m spatial resolution.

As a result, HAB monitoring for small lakes, reservoirs, rivers, and inland freshwater systems relies on labor-intensive field sampling programs and fixed-sensor systems, each with limitations:

  • Precision measurements obtained from field monitoring programs are limited by geographic coverage, sampling intervals, access to water bodies, volunteer or paid staff participation and coordination, and lab resources.
  • Precision measurements from fixed sensor systems are limited by trade-offs between system costs, pinpoint measurement, and the number of sampling sensors needed for representative areal sampling.

In short, there is a gap in the tools available to state and regional HAB monitoring programs: the ability to monitor the nation’s hundreds of thousands of smaller water bodies (<250 acres) across large regions at high spatial and temporal resolution.

A new tool, CyFi—short for Cyanobacteria Finder—was developed to close that gap. It’s a machine-learning model packaged as an open-source command-line tool that uses Sentinel-2 satellite 10-30m resolution data to identify HABs in inland water bodies of less than ~250 acres (100 hectares).

The initial models that led to CyFi are the outcome of an open machine learning competition, hosted by DrivenData and funded by NASA, with collaboration from NOAA, EPA, USGS, DOD's Defense Innovation Unit, Berkeley AI Research, and Microsoft AI for Earth. CyFi was developed through multiple phases of research and development.

CyFi development

This blog presents the story of how CyFi enhances modern HAB monitoring approaches, extending the reach of HAB monitoring systems while complementing and strengthening all other components.

The problem: The smaller water bodies at risk for HABs are the hardest to monitor

The difficulty starts with scale—the number of water bodies, their geographic distribution over large areas, and their smaller size.

In situ sampling: Pinpoint accuracy and precision, but challenging to scale

Field teams have always been the backbone of freshwater HAB monitoring. They take water samples, ship them to labs, and get back precise cyanobacteria cell counts and toxin measurements. Similarly, fixed sensor systems installed in water bodies offer precision and continuous monitoring. But precision comes at a cost. One or more of these limitations apply:

  • Sensor equipment is too expensive to deploy for hundreds or thousands of lakes
  • Manual field sampling is too resource-intensive for broad area, frequent, and routine surveillance
  • Pinpoint samples do not accurately represent the areal extent of HAB

In situ sampling is irreplaceable for confirming toxicity and regulatory thresholds. But it can’t be everywhere at once.

Satellite monitoring: Frequent and large area coverage, but with limitations on resolution and precision

Remote sensing promises an answer—continuous, passive, wide-area imagery. But inland lakes and water bodies smaller than ~250 acres represent a major limitation: the satellites that can detect cyanobacteria aren’t sharp enough to resolve smaller water bodies.

  • Sentinel-3 has the spectral bands needed to detect cyanobacteria-specific pigments, such as phycocyanin. Problem: 300–500 m resolution completely washes out most small lakes and rivers.

  • Sentinel-2 has the spatial resolution (10–30 m) needed to clearly see small water bodies. Problem: Its broad spectral bands don’t isolate cyanobacteria, making direct detection unreliable.
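
The resolution trade-off is easy to see with some rough arithmetic. The sketch below estimates how many square pixels fit inside a 250-acre lake at each sensor's nominal pixel size. The helper name `pixels_covering` is hypothetical, and the calculation ignores lake shape, so treat the numbers as order-of-magnitude estimates only.

```python
ACRE_IN_M2 = 4046.8564224  # square metres in one international acre


def pixels_covering(area_acres: float, pixel_size_m: float) -> float:
    """Approximate number of square pixels that fit in a given surface area."""
    area_m2 = area_acres * ACRE_IN_M2
    return area_m2 / (pixel_size_m ** 2)


# Compare a 250-acre lake at Sentinel-3 (300 m) and Sentinel-2 (10 m) pixel sizes.
for label, pixel_size_m in [("Sentinel-3 (300 m)", 300), ("Sentinel-2 (10 m)", 10)]:
    print(f"{label}: ~{pixels_covering(250, pixel_size_m):,.0f} pixels")
```

At 300 m, a lake at the upper end of the "small" range spans only about 11 pixels, and in practice some of those would straddle the shoreline and mix land with water. At 10 m, the same lake spans roughly 10,000 pixels.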

Satellite imagery tradeoffs: a visual comparison of Sentinel-3 resolution (left) and Sentinel-2 resolution (right).

Monitoring small water bodies at scale became a classic “pick two” dilemma. But for comprehensive and responsive HAB monitoring, you need all three.

Approach     Scalable?   Works on small lakes?   Cyanobacteria-specific?
In situ      No          Yes                     Yes
Sentinel-3   Yes         No                      Yes
Sentinel-2   Yes         Yes                     No

Potential HAB monitoring breakthrough: Machine learning with Sentinel-2 satellite data

Instead of relying on direct spectral detection of cyanobacteria, AI models can infer bloom conditions from combinations of Sentinel-2’s broad-band features. A well-trained model effectively translates Sentinel-2 imagery into an estimate of cyanobacteria density—even though the satellite itself lacks cyanobacteria-specific bands.

This is the core insight behind CyFi.

What CyFi does

CyFi is an open-source Python package that:

  1. Takes simple input—location and date.
  2. Downloads the relevant Sentinel-2 imagery.
  3. Generates cyanobacteria cell density estimates.
  4. Assigns a severity level using WHO guidelines (or custom cutoffs).
  5. Outputs both data and optional overlays on satellite imagery.
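
Step 4 can be sketched in a few lines. The cutoff values below (20,000 and 100,000 cells/mL) are assumptions chosen for illustration, and `severity` is a hypothetical helper rather than part of CyFi's API. Check the CyFi documentation for the thresholds it actually applies.

```python
# Assumed cutoffs in cells/mL: each pair is (exclusive upper bound, label).
# Illustrative values only; confirm against the CyFi documentation.
DEFAULT_CUTOFFS = [(20_000, "low"), (100_000, "moderate")]


def severity(density_cells_per_ml, cutoffs=DEFAULT_CUTOFFS, top_label="high"):
    """Map an estimated cell density to a severity label.

    `cutoffs` must be sorted by ascending upper bound. Densities at or above
    the last bound receive `top_label`.
    """
    for upper_bound, label in cutoffs:
        if density_cells_per_ml < upper_bound:
            return label
    return top_label


print(severity(5_000))    # below the first cutoff
print(severity(250_000))  # above every cutoff

# A program with its own advisory levels can pass custom cutoffs instead.
print(severity(30_000, cutoffs=[(10_000, "watch")], top_label="advisory"))
```

Keeping the cutoffs as data rather than hard-coded branches is what makes the "custom cutoffs" option cheap to support.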

The data behind CyFi: National-scale training and testing

CyFi is grounded in one of the largest curated datasets of in-situ cyanobacteria measurements from across the United States:

  • ~9,000 observations in the training set
  • ~4,000 observations in the test set
  • Carefully vetted data from 14 organizations including state agencies, federal programs, and scientific monitoring networks
  • Published publicly via NASA’s SeaBASS archive

This breadth makes the model robust across diverse water types and seasonal conditions.

Map of training and test set locations

While CyFi was developed from both national and state datasets, we recognize that results may vary when applied to local contexts. We recommend that each monitoring program perform validation testing using its own data. The DrivenData team is available to help any team test and onboard CyFi.

How CyFi compares to CyAN

A natural benchmark is the Cyanobacteria Index (CI) from CyAN, which is built on Sentinel-3.

Across 756 paired observations from U.S. lakes large enough for Sentinel-3 to capture, CyFi's presence/absence accuracy for bloom detection was 72% and CyAN’s was 66%. In other words, CyFi performs on par with the leading Sentinel-3-based method while covering 10x more water bodies, thanks to Sentinel-2’s higher resolution. For context, there are more than 400,000 lakes under 250 acres in the contiguous U.S. (CONUS), and these account for 98% of CONUS lakes.
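
For readers who want to run a similar comparison, presence/absence accuracy is the share of paired observations where the estimate and the ground measurement fall on the same side of a bloom threshold. The sketch below is a minimal illustration. The function name, the 20,000 cells/mL threshold, and the sample values are all hypothetical and are not taken from the benchmark described above.

```python
def presence_absence_accuracy(measured, estimated, bloom_threshold):
    """Fraction of pairs where measurement and estimate agree on bloom presence.

    A value counts as a bloom when it is at or above `bloom_threshold`.
    """
    if len(measured) != len(estimated):
        raise ValueError("measured and estimated must have the same length")
    if not measured:
        raise ValueError("at least one paired observation is required")
    agreements = sum(
        (m >= bloom_threshold) == (e >= bloom_threshold)
        for m, e in zip(measured, estimated)
    )
    return agreements / len(measured)


# Made-up densities in cells/mL, for illustration only.
measured = [1_000, 50_000, 300_000, 5_000]
estimated = [2_500, 15_000, 120_000, 8_000]
accuracy = presence_absence_accuracy(measured, estimated, bloom_threshold=20_000)
print(f"{accuracy:.0%}")
```

In this toy data, the second pair disagrees (a measured bloom that the estimate misses), so three of the four pairs match.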

CyFi and CyAN accuracy comparison: CyFi performs at least as well as Sentinel-3-based tools but has 10x greater coverage of lakes. Accuracy was evaluated on a dataset of 756 ground measurements from across the U.S.

Where CyFi fits within the typical HAB monitoring ecosystem

CyFi doesn't replace manual sampling, Sentinel-3 algorithms or on-site sensors. Instead, it complements each of them by filling crucial observational gaps and helping improve the overall system.

1. Manual sampling + CyFi

Use CyFi for statewide or regional passive coverage. Use sampling for confirmation, toxin analysis, and regulatory decisions. CyFi helps teams decide where to sample—increasing impact without increasing budgets.
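
One simple way to turn estimates into a sampling plan is to rank water bodies by estimated density and send crews to as many as the budget allows. This is a sketch under that assumption. The function `pick_sampling_sites` and the lake names are hypothetical, and a real program would likely also weigh factors such as public access, drinking water intakes, and travel time.

```python
def pick_sampling_sites(estimates, max_sites):
    """Return the names of the `max_sites` water bodies with the highest estimates.

    `estimates` maps a water body name to its estimated density in cells/mL.
    """
    ranked = sorted(estimates.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:max_sites]]


# Made-up estimates, for illustration only.
estimates = {"Lake A": 15_000, "Lake B": 240_000, "Lake C": 60_000}
print(pick_sampling_sites(estimates, max_sites=2))
```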

2. Sentinel-3 remote sensing + CyFi

Sentinel-3 provides useful, accurate coverage for large lakes, oceans, and coastal areas where phycocyanin detection is possible. CyFi extends this insight to small inland and freshwater bodies where Sentinel-3 cannot resolve features. Used together, they provide tiered coverage across freshwater lakes, rivers, and ponds of all sizes.

3. Sensors and buoys + CyFi

In situ instruments capture detailed dynamics in high-priority locations. CyFi identifies hotspots and seasonal patterns across the broader landscape. It becomes a wider net that helps managers identify gaps in current in situ systems and confirm where high-intensity monitoring is worthwhile and cost-effective.

Adding CyFi to an existing HAB monitoring system can provide weekly HAB estimates for every lake in a region

Because CyFi produces structured predictions, it integrates cleanly into dashboards, alerts, and workflows—turning satellite data into actionable information for water managers, public health officials, and environmental agencies.
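
As an example of what "structured" buys you, a weekly dashboard view can be built by rolling individual predictions up to the peak estimate per lake per ISO week. The record layout below (lake identifier, date, density) is an assumption made for this sketch and is not CyFi's actual output schema.

```python
from datetime import date


def weekly_peaks(rows):
    """Return the peak estimated density per (lake_id, ISO year, ISO week).

    `rows` is an iterable of (lake_id, observation_date, density) tuples.
    """
    peaks = {}
    for lake_id, observation_date, density in rows:
        iso_year, iso_week, _weekday = observation_date.isocalendar()
        key = (lake_id, iso_year, iso_week)
        peaks[key] = max(density, peaks.get(key, density))
    return peaks


# Made-up predictions, for illustration only.
rows = [
    ("lake_a", date(2024, 7, 1), 12_000),
    ("lake_a", date(2024, 7, 4), 85_000),
    ("lake_a", date(2024, 7, 9), 30_000),
]
for key, peak in sorted(weekly_peaks(rows).items()):
    print(key, peak)
```

Taking the weekly maximum is a conservative choice for alerting. A median or the latest value may suit trend charts better.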

With CyFi integrated into an in situ monitoring system, agencies can now operate with broad, repeatable, near-real-time situational awareness that complements existing regimes and directs attention where it's needed, with these potential incremental benefits:

  • More informed water body and beach closures from area estimates of HAB
  • Faster and more comprehensive drinking water advisories by increasing the frequency of sampling over a larger area
  • More targeted and strategic sampling plans for in situ monitoring, with potential to further optimize placement of sensors and better direct manual field teams

The result is more confidence in identifying the places that are and aren’t blooming.

How to add remote sensing to your HAB monitoring system

CyFi is a relatively new offering for the HAB monitoring stack. We’re looking for partnerships to strengthen and verify CyFi’s contribution. Two high-impact options are ready to pursue immediately:

1. Ground-truthing and training CyFi with new field samples

If your organization collects cyanobacteria samples, we can compare CyFi predictions against your data to evaluate performance and establish baseline accuracy. For your implementation, whether in or outside the U.S., we recommend retraining CyFi on your data for highest accuracy.
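
Because cell densities span several orders of magnitude, one reasonable way to summarize agreement with field samples is an error measured on a log scale. The sketch below computes mean absolute error on log10(density + 1). This metric and the helper `log_mae` are illustrative choices, not necessarily how CyFi itself is evaluated.

```python
import math


def log_mae(measured, estimated):
    """Mean absolute error between log10(density + 1) of paired values.

    Adding 1 before taking the log keeps zero densities well defined.
    """
    if len(measured) != len(estimated):
        raise ValueError("measured and estimated must have the same length")
    if not measured:
        raise ValueError("at least one paired observation is required")
    errors = [
        abs(math.log10(m + 1) - math.log10(e + 1))
        for m, e in zip(measured, estimated)
    ]
    return sum(errors) / len(errors)


# Made-up field measurements and estimates in cells/mL, for illustration only.
field = [1_200, 45_000, 310_000]
predicted = [3_000, 38_000, 150_000]
print(round(log_mae(field, predicted), 3))
```

A value of 1.0 on this scale means the estimates are off by about a factor of ten on average.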

2. Integrating CyFi outputs into monitoring dashboards or workflows

If you manage a large region and need process automation, we can help you integrate CyFi for automated, large-area weekly monitoring that feeds directly into your dashboards and decision-support systems.

DrivenData is looking for partners to validate and test advanced HAB monitoring systems

The Python package is just a building block. We want to collaborate with freshwater monitoring programs to further validate, improve, and deploy CyFi. CyFi is open source and freely available for exploration and experimentation.

We invite you to reach out to partner with us for validation, dashboard integration, a customized CyFi HAB monitoring deployment, or any manner of technical support from the team that built CyFi.
