Organizations run AI competitions for a variety of reasons. They want to engage the expertise of a global community. They want to push the limits of available methods for their needs. They want to explore innovative approaches and surface new ideas. They want to benchmark the level of performance that can be achieved with their data.
At the end of a competition, these organizations get a few things:
- Winning solutions, consisting of research code in a GitHub repository, often shared openly for ongoing learning and development
- An understanding of the most effective approaches and measurement of what's possible
- Recognition of top teams and their accomplishments, for instance through the leaderboard and winners announcement
- A well-structured use case for ongoing reference, shared on the challenge website and in materials like benchmark tutorials, and sometimes accompanied by open data

But what happens next?
This is the part that is often unseen by the very community of participants who help power the solutions. For the most part, competitions are research and development. They aim to accelerate the process of developing solutions at a fraction of the cost. Like any R&D project, the answer is sometimes that nothing else happens. Often, though, the results are carried forward beyond the competition.
This part can take time to play out. At this point we've been running competitions for over a decade, so we wanted to take the chance to look back at some of the things we've seen.
In this post, we'll share a few examples of what happens after competitions end. Let's take a look!

Model deployment and use
This is the biggest category. In many cases, the winning solutions from competitions feed directly into models that are deployed and used.
In some of these cases, competition partners carry forward the winning solutions themselves. For example:
In the Where's Whale-do? competition, winning solutions used deep learning approaches from facial recognition tasks (particularly ArcFace and EfficientNet) to help the Bureau of Ocean Energy Management and NOAA Fisheries monitor endangered populations of beluga whales by matching overhead photos with known individuals.
The challenge partners at Wild Me (now part of Conservation X Labs) not only worked with the winning solution to deploy the technology through their Wildbook Flukebook platform for monitoring belugas, but also found success extending the approach to other types of wildlife like dolphins and lions.
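To make the matching step concrete, here is a minimal sketch of photo-ID by embedding similarity: images of known individuals are encoded into vectors, and a new sighting is matched to its nearest neighbor. It uses an off-the-shelf ImageNet EfficientNet and placeholder images purely for illustration; the winning solutions fine-tuned their backbones with ArcFace-style metric learning, which isn't shown here.

```python
# Minimal sketch of photo-ID matching with embeddings + cosine similarity.
# The model and images below are stand-ins, not the winning solution.
import torch
from PIL import Image
from torchvision import models
import torchvision.transforms as T

# ImageNet-pretrained EfficientNet as a generic feature extractor
# (assumption: a competition-grade model would be fine-tuned with a metric-learning loss).
backbone = models.efficientnet_b0(weights="IMAGENET1K_V1")
backbone.classifier = torch.nn.Identity()  # keep pooled features only
backbone.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(img: Image.Image) -> torch.Tensor:
    x = preprocess(img).unsqueeze(0)
    return torch.nn.functional.normalize(backbone(x), dim=1)  # unit-length embedding

# Gallery of known individuals (solid-color placeholders stand in for cropped photos)
gallery = {wid: embed(Image.new("RGB", (224, 224), color))
           for wid, color in [("beluga_001", "gray"), ("beluga_002", "white")]}

query = embed(Image.new("RGB", (224, 224), "lightgray"))  # new sighting photo
scores = {wid: float(query @ emb.T) for wid, emb in gallery.items()}
print(max(scores, key=scores.get), scores)  # best-matching known individual
```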

In the MagNet competition, top solutions used Convolutional Neural Networks (CNNs) with other approaches to forecast disturbances in Earth's geomagnetic field from geomagnetic storms. These disturbances can wreak havoc on key infrastructure systems like GPS, satellite communication, and electric power transmission. An ensemble of the top solutions was able to push the state-of-the-art on unseen data, reducing error by 30% compared with the National Centers for Environmental Information (NCEI) benchmark model.
After the challenge, the research team at NOAA and NCEI worked with one of the winners to implement an ensemble of the top two models, incorporating it into NOAA's High Definition Geomagnetic Model (HDGM) and making the predictions publicly available in real time. The challenge and results were also published in the AGU Space Weather journal, where the paper was among the top 10% of viewed articles published in 2023.
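As a rough illustration of the ensembling idea, averaging two models' forecasts and comparing error against a benchmark looks something like this. The numbers are synthetic; this is not the actual HDGM pipeline.

```python
# Illustrative sketch: ensemble two forecast models by averaging their
# predictions and compare RMSE against a benchmark model (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(size=1000)                    # observed signal (synthetic stand-in)
pred_model_1 = y_true + rng.normal(scale=0.30, size=1000)
pred_model_2 = y_true + rng.normal(scale=0.35, size=1000)
pred_benchmark = y_true + rng.normal(scale=0.50, size=1000)

def rmse(y, yhat):
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

ensemble = (pred_model_1 + pred_model_2) / 2      # simple mean of the top two models

print("benchmark RMSE:", rmse(y_true, pred_benchmark))
print("ensemble  RMSE:", rmse(y_true, ensemble))
print("error reduction: {:.0%}".format(1 - rmse(y_true, ensemble) / rmse(y_true, pred_benchmark)))
```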

In the Pose Bowl competition, winning solutions explored ways to implement object detection algorithms on limited hardware for use in space. As part of the setup, code submissions were run on resources that simulated the constraints of the production environment on a NASA R5 spacecraft. The competition was part of a broader effort to provide safe, low-cost methods of assessing and potentially repairing damaged ships in space using an inspector spacecraft. Following the competition, NASA took the 2nd place object-detection solution to new heights, successfully integrating it onto a CubeSat, where it runs in real time.
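For a flavor of what "limited hardware" means in practice, here is a hedged sketch that times a lightweight off-the-shelf detector on a small CPU budget. The model, thread count, and image size are illustrative stand-ins, not the actual flight constraints or the winning solution.

```python
# Rough sketch of checking whether a lightweight detector fits a constrained,
# CPU-only budget. Everything here is a stand-in for the real competition setup.
import time
import torch
from torchvision.models.detection import ssdlite320_mobilenet_v3_large

torch.set_num_threads(2)                      # pretend we only have a couple of cores
model = ssdlite320_mobilenet_v3_large(weights="DEFAULT").eval()

image = [torch.rand(3, 320, 320)]             # one synthetic frame

with torch.no_grad():
    model(image)                              # warm-up
    start = time.perf_counter()
    for _ in range(10):
        detections = model(image)
    latency = (time.perf_counter() - start) / 10

print(f"mean latency per frame: {latency * 1000:.1f} ms")
params_mb = sum(p.numel() for p in model.parameters()) * 4 / 1e6
print(f"model weights: ~{params_mb:.0f} MB (float32)")
```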

In other cases, our technical team works with partners to carry forward solutions so they are ready for use. For example:
In the Pri-matrix Factorization challenge, the best solution was able to identify wildlife in camera trap footage with over 95% accuracy. Following the competition, DrivenData worked with the winner and partners at the Max Planck Institute for Evolutionary Anthropology, the Wild Chimpanzee Foundation, and WILDLABS to simplify and adapt the top model into an open source Python package and no-code web application for monitoring wildlife. The application now has over 350 users and has processed more than a million images and videos.
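The general pattern behind this kind of tool (sample frames from a clip, score them with an image classifier, aggregate to a clip-level label) can be sketched in a few lines. This is a generic illustration with a stand-in ImageNet model and a hypothetical video path, not the package's actual API.

```python
# Generic sketch: sample frames from a camera trap clip and aggregate
# per-frame classifier scores into a clip-level prediction.
import cv2                      # pip install opencv-python
import numpy as np
import torch
from torchvision import models, transforms

model = models.resnet18(weights="IMAGENET1K_V1").eval()   # stand-in classifier
preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def classify_clip(path: str, n_frames: int = 8) -> int:
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    scores = []
    for idx in np.linspace(0, max(total - 1, 0), n_frames, dtype=int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if not ok:
            continue
        x = preprocess(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)).unsqueeze(0)
        with torch.no_grad():
            scores.append(model(x).softmax(dim=1))
    cap.release()
    if not scores:
        return -1                                           # no readable frames
    return int(torch.cat(scores).mean(dim=0).argmax())      # clip-level prediction

print(classify_clip("camera_trap_clip.mp4"))                # hypothetical file path
```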

In the Tick Tick Bloom challenge, winning solutions used satellite imagery to detect dangerous concentrations of cyanobacteria, a type of harmful algal bloom, in small inland lakes and reservoirs. After the challenge, our team worked with NASA and public health users at state agencies to develop CyFi (Cyanobacteria Finder), an open source Python package and command line tool that streamlines the winning algorithms so they can run on new data. The resulting tool makes it easy for water quality managers to take advantage of state-of-the-art machine learning, achieves better results than existing tools while providing 10x more coverage, and was published in SciPy Proceedings in 2024.
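At a high level, the approach boils down to predicting a severity category for a sampling point from satellite-derived features. The sketch below uses synthetic features and a generic scikit-learn model to illustrate that framing; it is not CyFi's actual code or feature set.

```python
# Illustrative sketch: predict a cyanobacteria severity category from
# satellite-derived features for a sampling point (all data synthetic).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
# Hypothetical features per sample, e.g. band reflectances/indices around the point
X = rng.normal(size=(500, 6))
# Hypothetical severity labels: 0 = low, 1 = moderate, 2 = high
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0.5).astype(int) \
    + (X[:, 0] > 1.5).astype(int)

model = GradientBoostingClassifier().fit(X[:400], y[:400])
print("held-out accuracy:", model.score(X[400:], y[400:]))

new_point_features = rng.normal(size=(1, 6))     # features for a new lake sample
print("predicted severity category:", model.predict(new_point_features)[0])
```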

Proof of concept for investment
Sometimes, the results of a competition serve as a demonstration of what's possible with current technology, even if the solutions themselves are not the ones getting directly implemented. This demonstration is then used to make the case for further investment in developing new solutions.
This is a pattern we've seen come up again and again. For example:
In the N+1 Fish, N+2 Fish challenge, winning solutions used computer vision techniques to count, measure, and identify fish caught by fishing vessels. The goal of these algorithms is to reduce the burden of human video review and help sustainable fisheries monitor their catch.
The results of the competition fed into OpenEM, an open source library for electronic monitoring. Above all, the challenge successfully showed that computer vision approaches were ready to help with accurate electronic monitoring - not in 10 years, but today. This led to increased investment in the space by companies and by our partner The Nature Conservancy, including through work with AWS, among others.

In the Open Cities AI Challenge, winning solutions worked with drone imagery to map building footprints in support of disaster risk management efforts in urban settings across Africa. Afterwards, the challenge training data was combined with other data to develop an open-source deep learning model to extract building footprints from satellite imagery. The project, called Ramp, partnered with the World Health Organization (WHO) to deploy the model in support of emergency response scenarios.
The participatory data approach from the challenge (for instance, through initiatives like OpenStreetMap and OpenAerialMap) also served as an example for building out other labeled data in a range of projects, such as those in the World Bank's digital public works initiative.
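The core technical task in this example is semantic segmentation: predict a building/not-building mask for each image tile, then polygonize the mask into footprints. Here is a minimal sketch with an untrained stand-in U-Net (assuming the segmentation_models_pytorch package; the Ramp project's real model and training setup are not shown).

```python
# Minimal sketch of building footprint segmentation on an image tile,
# using an untrained U-Net as a stand-in for a trained footprint model.
import torch
import segmentation_models_pytorch as smp   # pip install segmentation-models-pytorch

# Assumption: real use would load trained weights rather than a fresh model
model = smp.Unet(encoder_name="resnet34", encoder_weights=None,
                 in_channels=3, classes=1).eval()

tile = torch.rand(1, 3, 512, 512)           # synthetic aerial/drone image tile

with torch.no_grad():
    logits = model(tile)                    # (1, 1, 512, 512)
    mask = (logits.sigmoid() > 0.5).squeeze().numpy()

print("building pixels in tile:", int(mask.sum()))
# In practice the binary mask would be polygonized (e.g., with rasterio/shapely)
# into vector building footprints.
```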

Benchmark data for assessing and demonstrating tools
For many challenges, considerable effort goes into bringing together a well-annotated dataset, clearly structured use case, and effective metrics for evaluating solutions. The resulting competition package is useful beyond the immediate solutions developed during the challenge. It often serves as an ongoing reference to benchmark and demonstrate development efforts in the broader ML/AI ecosystem.
For instance, the Differential Privacy Temporal Map Challenge featured a series of three challenge sprints, each with a set of data and metrics designed to test algorithmic approaches for maximizing the utility of a dataset while protecting individual privacy. The challenge partners at the National Institute of Standards and Technology (NIST) and Knexus Research built two of these sprints into SDNist, a set of benchmark data and evaluation metrics for deidentified data generators. The Python package includes a report tool for summarizing the performance of a generated deidentified dataset across utility and privacy metrics.
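As a toy illustration of the kind of utility check such a benchmark runs, the snippet below compares simple one-way marginals between an original dataset and a deidentified copy using total variation distance. It is a simplified stand-in, not SDNist's actual metrics or report tool.

```python
# Toy utility check: compare category frequencies (1-way marginals) between
# an original dataset and a deidentified copy (all data synthetic).
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
original = pd.DataFrame({
    "sector": rng.choice(["A", "B", "C"], size=1000, p=[0.5, 0.3, 0.2]),
    "employed": rng.choice(["yes", "no"], size=1000, p=[0.7, 0.3]),
})
# A hypothetical deidentified dataset produced by some generator
deidentified = original.sample(frac=1.0, replace=True, random_state=0).reset_index(drop=True)
deidentified.loc[deidentified.sample(frac=0.1, random_state=1).index, "sector"] = "A"

def marginal_tvd(col: str) -> float:
    """Total variation distance between the two columns' category frequencies."""
    p = original[col].value_counts(normalize=True)
    q = deidentified[col].value_counts(normalize=True)
    cats = p.index.union(q.index)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in cats)

for col in original.columns:
    print(f"{col}: TVD = {marginal_tvd(col):.3f}   (0 = identical marginals)")
```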

As another example, several different challenges have been included as benchmark datasets in TorchGeo, a PyTorch domain library for geospatial data. The Wind-dependent Variables dataset was an early demonstration example as the library was getting started, alongside benchmarking work by the NASA IMPACT team, which tested and eventually hosted the winning CNN model on its website to estimate hurricane intensity. More recently, the dataset has been used to evaluate TorchGeo's pretrained models, which are available as starting points for many different geospatial tasks. The package also features datasets from The BioMassters and Cloud Cover Detection competitions.
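For readers who want to poke at these datasets, TorchGeo exposes them as standard PyTorch datasets. The sketch below is written from memory and hedged: class names, arguments, and download behavior should be checked against the current TorchGeo documentation.

```python
# Hedged sketch of loading a competition dataset through TorchGeo.
# Class/argument names are assumptions to verify against the TorchGeo docs.
from torch.utils.data import DataLoader
from torchgeo.datasets import TropicalCyclone   # Wind-dependent Variables data

dataset = TropicalCyclone(root="data/cyclone", split="train", download=True)
loader = DataLoader(dataset, batch_size=8, shuffle=True,
                    collate_fn=lambda batch: batch)   # keep samples as dicts

sample = dataset[0]
print(sample.keys())        # typically an image tensor plus a wind-speed label
```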

Research advancement
In many cases, competitions represent an opportunity to push forward state-of-the-art research and accelerate what can be achieved and learned in a short amount of time. The results are often well-suited for research publication and supporting ongoing advancement after the challenge ends.
For example, following the Hateful Memes challenge, the research team at Meta AI published the results in the Proceedings of Machine Learning Research and discussed them at NeurIPS 2020. Challenge participants also published their individual approaches. The winning approach provided a relatively early demonstration of the effectiveness of using multimodal transformers for tasks involving text and images.
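The basic multimodal framing (encode the image, encode the overlaid text, fuse, classify) can be sketched with off-the-shelf components. CLIP is used here purely as a convenient stand-in, and the classifier head is untrained; the winning entries ensembled multimodal transformers rather than this exact setup.

```python
# Sketch of simple multimodal fusion: encode a meme's image and text,
# concatenate the embeddings, and score with a small (untrained) head.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))                 # placeholder meme image
text = "example meme caption"                        # placeholder overlaid text

inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    img_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                     attention_mask=inputs["attention_mask"])

fused = torch.cat([img_emb, txt_emb], dim=1)         # simple late fusion
head = torch.nn.Linear(fused.shape[1], 2)            # untrained binary classifier head
print(head(fused).softmax(dim=1))                    # hateful vs. benign scores (untrained)
```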

Another example is the Genetic Engineering Attribution Challenge, a first-of-its-kind competition where participants built tools to help trace genetically engineered DNA back to its lab-of-origin. The winning solutions drastically outperformed previous models at identifying true labs-of-origin. An analysis of the results was published in Nature Communications in 2022, highlighting the impressive performance and techniques used to push forward the state-of-the-art in this domain.
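A toy version of the task framing: featurize plasmid DNA sequences as k-mer counts and train a classifier to predict the lab of origin. Everything below is synthetic and deliberately simple; the winning solutions used considerably more sophisticated models.

```python
# Toy lab-of-origin attribution: k-mer counts + a linear classifier
# on synthetic DNA sequences and hypothetical lab labels.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(3)
def random_seq(n=300):
    return "".join(rng.choice(list("ACGT"), size=n))

# Synthetic training data: DNA sequences labeled with a hypothetical lab ID
sequences = [random_seq() for _ in range(200)]
labs = rng.choice(["lab_A", "lab_B", "lab_C"], size=200)

model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(4, 4)),   # 4-mer counts
    LogisticRegression(max_iter=1000),
)
model.fit(sequences, labs)

print(model.predict([random_seq()]))           # predicted lab of origin
print(model.predict_proba([random_seq()])[0])  # class scores per lab
```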

Talent discovery
Competitions are often used not only to advance solutions, but also as a way to engage a wide talent pool with a problem and elevate strong performers. In some cases, this helps participants demonstrate their skills at their current organizations or in job interviews. In others, it has translated directly into being discovered for a new career.
For example, in the Pover-T Tests competition, participants built machine learning models to identify the strongest predictors of poverty using detailed household survey records. As part of the contest, the World Bank offered a bonus award to the top contributor from a developing country. The list included 84 qualifying countries from the World Bank's classification of low-income and lower-middle-income economies.
The winner was Aiven Solatorio, a data scientist from the Philippines, who finished 5th on the leaderboard among 2,300 participants and 6,000 submissions. Following the competition, Aiven was hired by the World Bank's development data group behind the challenge, where he has now been working for seven years.
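The modeling task itself follows a familiar tabular pattern: fit a classifier on household survey features and look at which features carry the most signal. Here is a simplified sketch with synthetic data; the real surveys have hundreds of coded fields.

```python
# Simplified sketch: predict a poverty indicator from household survey
# features and inspect feature importances (all data synthetic).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
survey = pd.DataFrame({
    "household_size": rng.integers(1, 10, size=1000),
    "rooms": rng.integers(1, 6, size=1000),
    "owns_phone": rng.integers(0, 2, size=1000),
    "years_education": rng.integers(0, 16, size=1000),
})
# Synthetic target loosely tied to a couple of the features
poor = (survey["years_education"] + survey["rooms"] * 2
        + rng.normal(scale=3, size=1000) < 10).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(survey, poor)

importances = pd.Series(model.feature_importances_, index=survey.columns)
print(importances.sort_values(ascending=False))   # strongest predictors first
```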

Conclusion
Competitions may end when the leaderboard locks, but their influence does not. Winning code may evolve into a production model, a published paper, or an open-source tool. Even when a solution isn’t deployed immediately, the knowledge gained, the data curated, and the people engaged all contribute to the greater momentum of progress.
It's worth noting that, while what happens after a competition can depend on the results, there are also a few ways that partners and participants can support ongoing development. For example:
- Organizing and documenting winning solutions so they are easy to understand, reproduce, and build upon
- Making a plan around the competition so that if it yields strong results, there is a path to carry them forward or write them up for research
- Incorporating realistic code constraints on submissions to help select for the best viable performance, if relevant
- Where possible, sharing the solutions and competition data openly for ongoing development and learning
The competition examples here represent a snapshot in time. The effects from competitions often take years to play out, and it takes effort to track them down. This is worth doing every once in a while, though, because ultimately this is one of the primary reasons to run competitions in the first place.
With a well-crafted challenge, a fantastic community, and a little luck, there are many ways solutions can take on life beyond the leaderboard.