
Random walk of the penguins


We're excited to feature a guest post by Dr. Grant Humphries. Dr Humphries is a seabird biologist by training and has been working with machine learning algorithms for about a decade. He is the lead web developer for the MAPPPD project, and currently runs a small data science company (Blackbawks data science) in the highlands of Scotland.


Antarctic penguins are some of the most well-known and charismatic megafauna on the planet. They have been popularised by their unusual behaviour and attitudes, which have been captured by wildlife videographers and scientists. Most people are probably familiar with them thanks to the soothing tones of Morgan Freeman's voice in "March of the Penguins", or the catchy music from "Happy Feet". Despite all the attention, there is still a lot we do not know about the four Antarctic penguin species: Adélie, Chinstrap, Gentoo, and Emperor penguins.

A penguin

The goal of this data science competition was to reach out to the data science community to build a model that predicts penguin populations. The prediction of changes in penguin populations is important for the long-term management of these species, and important for understanding the entire ecosystem. The data we provided comes from a long-term (nearly 30 years) effort to monitor penguin populations, as well as other long-term datasets published by penguin researchers around the world. All the data used for the competition are freely available at MAPPPD.

A lone human looks over a sea of penguins

The winning submissions were incredibly interesting, and helped to validate some known issues around population modelling of Antarctic penguins.

Not only did the data science competition demonstrate the use of algorithms not commonly applied in ecological studies, but it has also provided us with a suite of models that can be incorporated into ensemble model forecasts for the Antarctic conservation community.

Though the winning approaches differed, a common theme emerged: the best predictors of abundance in the near future come from the recent past, whether that be the previous year's abundance, the trend over recent years, or, in one case, the approximate growth rate over time. In one sense, this is not surprising, since the abundance of animals in a closed population is strongly autocorrelated. However, we were surprised at how difficult it was to find environmental covariates that improved model performance. Ecologists generally think animal abundances are controlled by environmental factors, since environmental conditions play a large role in where a species can be found. However, the link between those environmental factors that we can measure at large scales and abundance appears quite weak. There are many reasons why this might be.

One major reason is that our ability to measure environmental conditions over large areas is limited to what satellites can observe, and fine-scale phenomena, such as precipitation, may be more important than anything apparent from satellites. A second reason is that animal abundance is the net result of a multitude of factors that influenced both survival and reproduction in the past, so it is difficult to reverse engineer from the time series which factors played the largest role in the number of penguins arriving to breed each year. While ecologists have long struggled to understand these 'noisy' time series, this data science competition allowed us to explore a massive number of alternative prediction approaches unbiased by our own "prior expectations" of how the system should work.

The results of the competition clarify several key issues for modeling time series of penguin abundance: 1) environmental drivers that can be measured at large spatial scales explain little year-to-year variation in populations, though extreme events (e.g., a severe El Niño or unusual sea ice dynamics) might be detectable as a brief 'blip' in abundance; 2) in the short term (up to five years or so), the best predictions of population change are actually quite straightforward and can be easily integrated into the management of marine resources. Moving forward, we advise greater caution in interpreting short-term variations uncovered by regular monitoring of populations, since significant unexplained variation can mask long-term signals of change.

Competition leaderboard

The benchmark set by DrivenData scored an 8.1101% error based on the Adjusted Mean Absolute Percent Error (AMAPE). This was higher than we anticipated, but it gave us a good place to start as submissions came through. The first-place submission nearly halved the AMAPE value, and we were excited to see that many competitors managed to beat the benchmark. Over 600 models were submitted, and we would like to greatly thank all those who participated. The top five modelers submitted reports, and we feature those below. From those, we selected two who provided particularly interesting biological/ecological evidence for their approach.
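For readers unfamiliar with the metric, here is a simplified sketch of a percentage-error score of this kind. It is a plain MAPE restricted to observed counts; the competition's actual adjustment, which accounts for the accuracy attached to each count, is defined on the competition page, so this stand-in is only illustrative.

```python
import numpy as np

def simple_mape(y_true, y_pred):
    """Plain MAPE over observed counts only -- a simplified stand-in for the
    competition's Adjusted MAPE (this version ignores the accuracy-based
    adjustment defined on the competition page)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = ~np.isnan(y_true)  # score only site/year/species cells with an observed count
    errors = np.abs(y_pred[mask] - y_true[mask]) / np.maximum(y_true[mask], 1.0)
    return 100.0 * errors.mean()  # reported as a percentage, e.g. 8.1101
```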

Meet the winners

Ambarish Ganguly

Place: 1st

Ambarish's background: I am an IT consultant with around 18 years of industry experience. For the last 10 years, I have been consulting with utility (electricity, gas and water) companies around the globe in the USA, Canada and Australia, focusing on energy trading, customer care and billing, and home services. I am a self-confessed MOOC addict ☺, enrolling myself in more MOOCs than I can ever complete.

Method overview: The solution is divided into two major parts:
- Imputation of the data
- Model building

Imputation of the data: Before running the models, imputation was done for each site and penguin type combination, in the following order (a Python sketch follows this list):
- Stine interpolation for the R models and linear interpolation for the Python models
- Last Observation Carried Forward (R models only)
- Next Observation Carried Backward (R and Python models)
- Replace by zero (R and Python models)
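A minimal sketch of that imputation chain, assuming each site and penguin type combination is a pandas Series of counts indexed by year (linear interpolation stands in for both the Stine and linear steps):

```python
import pandas as pd

def impute_counts(series: pd.Series) -> pd.Series:
    """Impute one site x penguin-type count series (indexed by year).

    Mirrors the order described above: interpolate between observations,
    carry the last observation forward, carry the next observation backward,
    and finally replace anything still missing with zero.
    """
    out = series.interpolate(method="linear")  # Stine in R, linear in Python
    out = out.ffill()                          # Last Observation Carried Forward
    out = out.bfill()                          # Next Observation Carried Backward
    return out.fillna(0.0)                     # replace remaining gaps with zero
```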

Model building: Five models were built for each site and penguin type combination. For each of the 648 combinations of site and penguin type, the following models were built:
- XGBoost (Python)
- RandomForest (Python)
- ARIMA (R)
- ETS (R)
- Prophet (R)

The forecasts from these five models were then averaged for each of the 648 combinations, as sketched below.
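A minimal sketch of the Python side of that per-combination ensemble, assuming lagged counts (or similar features) have already been assembled into X_train/X_future for one site and penguin type combination; the ARIMA, ETS, and Prophet members were fit in R and their forecasts would be averaged in the same way.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

def ensemble_forecast(X_train, y_train, X_future):
    """Average the Python-side members (XGBoost + random forest) for one
    site x penguin-type combination; hyperparameters here are illustrative."""
    preds = []
    for model in (XGBRegressor(n_estimators=200), RandomForestRegressor(n_estimators=200)):
        model.fit(X_train, y_train)
        preds.append(model.predict(X_future))
    return np.mean(preds, axis=0)  # simple unweighted average across models
```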

Ambarish model diagram

Thomas Bolton

Place: 2nd + panel selection winner

Background: My name is Tom, and I'm currently a physics PhD student at the University of Oxford. My research is in physical oceanography, in particular Gulf Stream turbulence. My day-to-day work involves statistical analysis of observational datasets and model output, as well as some theoretical approaches. In my spare time I'm a keen cyclist.

Method: My method involved a mix of persistence, auto-regressive models, linear regressions, and exponential models. As the challenge involved multiple time series, the particular method applied depended on my judgement after manual inspection of the data. Tailoring the approach to each time series individually proved more successful than applying a single technique across all of them. The behaviour of the time series varied massively between sites and species. One time series might show a linear increase due to increased food availability, while a neighbouring site might exhibit an exponential decrease due to humans entering the region; with so many potential factors affecting the inter-annual variability of nest counts, the challenge was particularly difficult.
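The per-series choices were made by manual inspection in the actual solution, but a minimal sketch of the kinds of forecasters described above (persistence, linear, exponential, and a simple AR(1)) might look like this; the function and its arguments are illustrative, not Tom's code.

```python
import numpy as np

def forecast_series(y, horizon, method="persistence"):
    """Illustrative forecasters for one nest-count series y (1-D array ordered by year)."""
    years = np.arange(len(y))
    if method == "persistence":                      # repeat the last observed count
        return np.full(horizon, y[-1], dtype=float)
    if method == "linear":                           # extrapolate a straight-line trend
        slope, intercept = np.polyfit(years, y, 1)
        return intercept + slope * (len(y) - 1 + np.arange(1, horizon + 1))
    if method == "exponential":                      # straight-line fit in log space
        slope, intercept = np.polyfit(years, np.log(np.maximum(y, 1.0)), 1)
        return np.exp(intercept + slope * (len(y) - 1 + np.arange(1, horizon + 1)))
    if method == "ar1":                              # simple AR(1) via lag-1 regression
        slope, intercept = np.polyfit(y[:-1], y[1:], 1)
        preds, last = [], y[-1]
        for _ in range(horizon):
            last = intercept + slope * last
            preds.append(last)
        return np.array(preds)
    raise ValueError(f"unknown method: {method}")
```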

Aharon Ravia

Place: 3rd

Background: I am Aharon Ravia, 32 years old, from Israel. I am currently a PhD candidate in the neurobiology department at the Weizmann Institute of Science, studying computational models of human olfaction (the sense of smell). Throughout my academic and professional career, I have studied math and neuroscience and have worked as a data scientist and as a data science team manager. I am also a huge fan of penguins; I have watched them in Australia and New Zealand, but have not yet been to Antarctica.

Method: My first entry to the competition was late, less than two weeks before the deadline. Once I saw the amount of data, which was quite sparse, I decided there was not much time to go beyond simple methods. My experience with time series data had taught me that the best estimate of the next step's change is zero, so I predicted the last known count for each location for the years not revealed. Astonishingly, this approach alone did quite well and landed in 16th-20th place on the leaderboard. This taught me a few things. First, it was significantly better than the linear fit solution and many other submissions, so much so that I ruled out (almost completely) the possibility that it would not hold up on the private leaderboard set. In addition, the AMAPE estimate I had on my training set was so far from the result on the leaderboard set that I realized I had to treat my validation results with caution.

I then wanted to combine three different types of models. The first is an estimate of the mean, given by the last observation (it could also be an average of the last N observations). The second is the recent trend, i.e. changes over recent years. The third is the trend over the whole time series. This approach was partially adapted from Facebook's Prophet blog post. I also wanted to learn from the public leaderboard set (at the risk of overfitting) whether there was an additional general trend of increase or decrease in those years. Eventually, I combined the constant estimate with a minor general trend and a linear fit that was computed differently for each year.
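A minimal sketch of that three-part blend for a single count series; the weights and the recent-year window are hypothetical placeholders, not the values used in the actual submission.

```python
import numpy as np

def combined_forecast(y, horizon, w_const=0.8, w_recent=0.15, w_long=0.05, recent_window=5):
    """Blend the three components described above for one count series y:
    the last observation, the recent trend, and the whole-series trend."""
    steps = np.arange(1, horizon + 1)

    const = np.full(horizon, y[-1], dtype=float)                    # last known count

    recent = y[-recent_window:]                                     # trend over recent years
    r_slope, r_int = np.polyfit(np.arange(len(recent)), recent, 1)
    recent_trend = r_int + r_slope * (len(recent) - 1 + steps)

    l_slope, l_int = np.polyfit(np.arange(len(y)), y, 1)            # whole-series trend
    long_trend = l_int + l_slope * (len(y) - 1 + steps)

    return w_const * const + w_recent * recent_trend + w_long * long_trend
```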

Oleg Panichev

Place: 4th

Background: For the last seven years I have been working in the field of mathematical analysis of biomedical signals. The topics that interest me most are how the human brain works and artificial intelligence. I believe that machine learning and artificial intelligence are among the key technologies of our future. For the last three years I have been working on a Ph.D. thesis devoted to the prediction of epileptic seizures based on EEG analysis. Currently, I am working as a research engineer in the R&D lab at Ciklum.

Method: My approach is based on the Benchmark: Simple Linear Models post, which is a great starting point for this problem. The first thing I did was replace the linear regression with XGBoost and do some minor parameter tuning. After that, I looked at the preprocessing and decided to improve the function for replacing NaNs in the dataset. That did not give any improvement on its own, but it allowed me to combine different models into an ensemble. A submission generated by ensembling previous submissions scored 5.2132 on the public leaderboard and 4.9605 on the private leaderboard.
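A rough sketch of the submission-level ensembling step, assuming each previous submission is a CSV in the competition's wide format (one row per site and species, one column per forecast year); the file paths and column layout are assumptions.

```python
import pandas as pd

def ensemble_submissions(paths, out_path="ensemble_submission.csv"):
    """Average several previous submission files cell by cell.
    Assumes all files share the same index column and forecast-year columns."""
    frames = [pd.read_csv(p, index_col=0) for p in paths]
    ensemble = sum(frames) / len(frames)  # simple unweighted mean of predictions
    ensemble.to_csv(out_path)
    return ensemble
```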

Benjamin Carrión

Place: 5th + panel selection winner

Background: My name is Benjamin (Ben) Carrión. I hold a bachelor's degree in hydraulic engineering and a Master of Science in coastal engineering. I work mostly as a numerical modeller of coastal hydrodynamics and morphodynamics for a port and coastal consulting company (PRDW). As a result, I code and deal with data regularly. Formally, I'm quite new to the data science world, but I wanted to see how far I could make it in the competition with my current background.

Method: The key element to realize is that, at least in this particular case of seasonal breeding in a rather extreme ecosystem, what you really want to estimate is the annual rate of change of populations rather than the actual populations.

My thought was that populations are in a delicate equilibrium between competitors, prey, and predators, which can be written down as a system of differential equations in which the rate of change between breeding seasons (birth rate minus death rate) depends in some complex way on the other species' populations.

This differential system can be solved, and the levels of the different populations will change until an equilibrium is reached. However, given the seasonal character of Antarctica, the "time step for the integration" of these equations, so to speak, is fixed, and may be too large to reach a smooth equilibrium. Instead, the population levels oscillate seemingly at random, but around a certain equilibrium value.

This might seem far-fetched, but it is (somewhat) what happens with lemming populations in Canada. The idea of "suicidal lemmings" arises from the large variations in their population from season to season. In the particular case of lemmings, what happens is that their food (plants) grows at a slower rate than the lemmings are born. So high natality increases the lemming population, which depletes their food supply, which doesn't grow back as fast, so many lemmings die. This is a very rough sketch, but it illustrates the idea. Specifically, it shows that the overall rate of change seems to depend largely on the previous year's population.

Now, for the case of the penguins, it is hard to see any trend at all in the nest counts at individual sites, given the patchiness of the dataset. Some sites have only a few observations, or even just one. Moreover, fewer observations were available for years deeper in the past. The dataset is far from homogeneous, and any method to fill it in might introduce large errors. It also wasn't clear that certain penguin types favoured larger or smaller groups: you could find a site of a certain penguin type with a few dozen individuals and another site of the same type with a few thousand, even at nearby locations. In that regard, location didn't seem very relevant either.

However, if the mean count per penguin type is considered, trends are more visible: large variations around a rather stable average value. This is my main hypothesis. Now, the actual rates are still quite noisy, but they correlate well with the mean population per site per penguin type of the previous year. Correlation with other environmental variables, such as ice extent or water/air temperature, was not clearly seen, since these parameters also show rather noisy behaviour (over a small mean trend: both temperatures and ice extent rising).

The train of thought was to forecast a mean rate per penguin type and apply it at each site, starting from the last valid measurement. In fact, if you use a rate of change of 1.0, i.e. no change from the last valid value, you already get a good estimate of the future years' populations. This indicates that the (mean) rates are rather close to 1.0, and what you should aim for is a method to estimate the deviation from that value for each forecasted year.
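A minimal sketch of that idea, assuming a long-format table with columns site_id, penguin_type, year, and count (the column names are illustrative); with the default rate of 1.0 it reduces to the no-change baseline described above.

```python
import pandas as pd

def forecast_by_rate(df, forecast_years, rate_by_type=None):
    """Apply a mean annual rate of change per penguin type to each site's
    last valid count."""
    last = (df.dropna(subset=["count"])
              .sort_values("year")
              .groupby(["site_id", "penguin_type"], as_index=False)
              .last())                                  # last valid count and its year
    rows = []
    for _, row in last.iterrows():
        rate = 1.0 if rate_by_type is None else rate_by_type.get(row["penguin_type"], 1.0)
        for year in forecast_years:
            steps = year - row["year"]                  # years since the last observation
            rows.append({"site_id": row["site_id"],
                         "penguin_type": row["penguin_type"],
                         "year": year,
                         "count": row["count"] * rate ** steps})
    return pd.DataFrame(rows)
```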

Benjamin method graphs


Thanks to all the participants and to our winners! Special thanks to NASA for supporting the MAPPPD project and this data science competition. Thanks also go to Dr Heather Lynch, who obtained the original funding for the MAPPPD project and offered input on this post, and to Ron Naveen, the President of Oceanites, who has dedicated much time and effort to MAPPPD and Antarctic conservation. We look forward to working with the winners as we further analyse the model output.
