
Linking nonprofit grants to organizations with machine learning

DrivenData built Orgmatch, a scalable and explainable entity resolution system to add value to information processed by a leading nonprofit data hub.

The organization

Candid provides data and insights about the nonprofit sector. Candid tracks the flow of funding through the network of nonprofit grantmakers and recipients, making the dynamics of the social sector more visible and transparent.

The challenge

Candid maintains a database of millions of nonprofit organizations and ingests approximately six million public records annually that document the grants between organizations. Accurately matching entities to funding flows underpins multiple Candid tools for searching, visualizing, and researching the nonprofit space.

These tools help nonprofits identify foundations that could support their work and enable researchers to track where philanthropic dollars go. However, public records inevitably contain variations and missing data in names, addresses, and other details, creating a high risk of duplicates and missed connections.

Candid had built a legacy system which used dozens of heuristic rules to automatically determine matches, but this system was straining under the scale of data as well as the variety of data quality issues observed in public records. Candid asked us to improve the accuracy of the matching process and provide more observability into model decisions.

The approach

In collaboration with Candid’s Technology and Data Science teams, we designed and built Orgmatch, a software tool that rapidly determines whether a given organization should be matched to an existing organization in Candid’s databases. We analyzed Candid’s dozens of legacy heuristics and distilled the most effective into a small subset. We then supplemented these rules with a machine learning model trained on human-labeled matches, so that Orgmatch can learn matching patterns that extend beyond simple rules and can improve as more data becomes available.
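To make the idea of combining a few rules with a learned score concrete, here is a minimal sketch. It is not Orgmatch's implementation: the `OrgRecord` fields, the tax-identifier rule, the two features, and the weights are all hypothetical, and the weights are hand-picked for illustration where a real system would learn them from human-labeled matches.

```python
import math
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List


@dataclass(frozen=True)
class OrgRecord:
    """Hypothetical record shape; the real schema is not described in the source."""
    name: str
    postal_code: str = ""
    tax_id: str = ""  # may be missing in public records


def normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def pair_features(a: OrgRecord, b: OrgRecord) -> List[float]:
    """Turn a pair of records into numeric features for a scoring model."""
    name_similarity = SequenceMatcher(
        None, normalize(a.name), normalize(b.name)
    ).ratio()
    same_postal_code = float(bool(a.postal_code) and a.postal_code == b.postal_code)
    return [name_similarity, same_postal_code]


# Illustrative values only; a trained model would supply these.
WEIGHTS = [8.0, 2.0]
BIAS = -7.0


def match_score(a: OrgRecord, b: OrgRecord) -> float:
    """Return a score in [0, 1]; higher means more likely the same organization."""
    # Hypothetical heuristic rule: identical, non-empty tax identifiers.
    if a.tax_id and a.tax_id == b.tax_id:
        return 1.0
    # Otherwise fall back to a logistic score over the pair features.
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, pair_features(a, b)))
    return 1.0 / (1.0 + math.exp(-z))
```

In this sketch the rule short-circuits the model when a strong identifier agrees, and the model handles the messier cases such as spelling variations in names.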

Figure: How Orgmatch processes documents to make matches. For each new document, Orgmatch asks whether it belongs to a known organization. If the match is confident, the document is added to that organization; if the match is uncertain, a human reviews it; if no matching organization is found, a new organization is created.

One important design consideration for Orgmatch is the cost of false positive matches. Incorrectly merging two organizations is a far worse outcome than creating a duplicate. Orgmatch accounts for this asymmetry by automating only high-confidence matches and flagging plausible but less certain matches for human review. Working with Candid’s subject matter experts, we tuned the decision policies to balance automation against human-in-the-loop checks, maximizing high-quality matches while minimizing false positives.
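A decision policy of this kind can be sketched as a pair of thresholds on a match score. The threshold values and names below are hypothetical and are not the ones used in Orgmatch; the sketch only shows how a high automation bar keeps false merges rare while routing borderline cases to reviewers.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    AUTO_MATCH = "auto_match"              # add the document to the matched organization
    HUMAN_REVIEW = "human_review"          # queue the candidate match for a reviewer
    NEW_ORGANIZATION = "new_organization"  # no acceptable candidate; create a new record


# Hypothetical thresholds; in practice these would be tuned with domain experts.
AUTO_MATCH_THRESHOLD = 0.95
REVIEW_THRESHOLD = 0.60


def decide(
    best_score: Optional[float],
    auto_threshold: float = AUTO_MATCH_THRESHOLD,
    review_threshold: float = REVIEW_THRESHOLD,
) -> Decision:
    """Map the best candidate's score (None if there are no candidates) to an action."""
    if review_threshold > auto_threshold:
        raise ValueError("review_threshold must not exceed auto_threshold")
    if best_score is None or best_score < review_threshold:
        return Decision.NEW_ORGANIZATION
    if best_score >= auto_threshold:
        return Decision.AUTO_MATCH
    return Decision.HUMAN_REVIEW
```

Raising `auto_threshold` sends more matches to human review and lowers the risk of a false merge; lowering it automates more decisions at the cost of more false positives.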

To observe, explain, and continuously improve this process, we built a database to store and search all Orgmatch decisions, including the documents compared and the heuristics or model used in each decision. We built a dashboard and visualization tooling to evaluate model performance and the volume of decisions. Finally, we trained Candid’s data science team to maintain Orgmatch, investigate matches, and keep improving the process.
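As an illustration of what a searchable decision log might look like, the sketch below uses Python's built-in `sqlite3` module. The table name, columns, and helper functions are assumptions made for this example; the source does not describe the actual database or schema.

```python
import json
import sqlite3
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional

# Hypothetical schema: one row per matching decision.
SCHEMA = """
CREATE TABLE IF NOT EXISTS decisions (
    id               INTEGER PRIMARY KEY,
    decided_at       TEXT NOT NULL,
    document_id      TEXT NOT NULL,
    candidate_org_id TEXT,
    method           TEXT NOT NULL,   -- e.g. the name of a rule, or 'model'
    score            REAL,
    outcome          TEXT NOT NULL,   -- e.g. 'auto_match', 'human_review'
    details          TEXT NOT NULL    -- JSON snapshot of the compared fields
)
"""


def open_log(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the decision log and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_decision(
    conn: sqlite3.Connection,
    document_id: str,
    candidate_org_id: Optional[str],
    method: str,
    score: Optional[float],
    outcome: str,
    details: Dict[str, Any],
) -> int:
    """Store one decision and return its row id."""
    cur = conn.execute(
        "INSERT INTO decisions (decided_at, document_id, candidate_org_id,"
        " method, score, outcome, details) VALUES (?, ?, ?, ?, ?, ?, ?)",
        (
            datetime.now(timezone.utc).isoformat(),
            document_id,
            candidate_org_id,
            method,
            score,
            outcome,
            json.dumps(details),
        ),
    )
    conn.commit()
    return cur.lastrowid


def decisions_by_outcome(conn: sqlite3.Connection, outcome: str) -> List[Dict[str, Any]]:
    """Return all decisions with the given outcome, oldest first."""
    rows = conn.execute(
        "SELECT document_id, candidate_org_id, method, score, details"
        " FROM decisions WHERE outcome = ? ORDER BY id",
        (outcome,),
    ).fetchall()
    return [
        {
            "document_id": doc_id,
            "candidate_org_id": org_id,
            "method": method,
            "score": score,
            "details": json.loads(details),
        }
        for doc_id, org_id, method, score, details in rows
    ]
```

Storing the method and a snapshot of the compared fields alongside each outcome is what lets a reviewer later ask why a particular match was made, and lets a dashboard count decisions by outcome or by method.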

The results

“DrivenData helped Candid significantly improve our entity matching algorithm, which has had a direct impact on the quality of data we provide to the social sector. Beyond the technical deliverables, they worked alongside our team to build lasting capacity in data science and AI, not just deliver a solution.”

Shane Ward, VP of Data, Candid

Orgmatch processes over six million documents per year and handles 98% of them automatically. By implementing and improving Orgmatch, we were able to remove hundreds of thousands of matches from human review, improving data quality and saving weeks of manual review time.

Orgmatch has improved the accuracy of Candid’s database of organizations and grants. The improved database helps funders and organizations spend and raise money in more efficient, accountable ways.
