
Solving the last-mile public data problem

Using "baked" data to transform public data repositories into analysis-ready resources

Chris Kucharczyk
Senior Data Scientist

What is the "last-mile" problem in public data?

Public data is intended to empower people and deliver social impact. Governments, agencies, and institutions publish datasets to increase transparency, inform the public, and enable research. But anyone who has ever tried to use these datasets knows that just because data is public doesn’t mean it’s accessible, usable, or useful.

Most public datasets come as sprawling collections of CSVs or spreadsheets with schemas that change over time, cryptic column names, and complex data dictionaries. Even if you’re reasonably technical, you can lose days just figuring out what’s what. If you’re not technical, you’re basically locked out from using the data at all.

That gap between data that is "published and available" and data that is "directly usable" without a significant amount of preprocessing is what we think of as the last-mile public data problem. It stops countless people, even technical ones, from turning data into useful information.

A schematic of the last-mile data problem: data moves from generation to organization to publication, with a gap before the final step of packaging and pre-processing.

How a "baked data" approach improves public data usability

To solve this last-mile public data problem, we took inspiration from the concept of "baked data," which Simon Willison defines as "bundling a read-only copy of your data alongside the code for your application, as part of the same deployment." He coined the phrase for a pattern he used when deploying web applications: instead of pointing the app at an external database over the network, you bundle a read-only copy of the data right alongside your code, so the app ships with the data it needs, tightly coupled and easy to deploy.

In our work, we’ve adapted this idea to public datasets. Instead of telling people:

"Here are 50 CSV files that you have to munge and preprocess yourself,"

we asked:

"What if we preprocess all of this data once, turn it into a clean, queryable database, and ship that as a single portable file?"

To do this, we chose DuckDB as our database. DuckDB is an in-process OLAP (online analytical processing) database, which makes it very fast for aggregation and portable across machines and environments. You can query tables directly with SQL or use bindings in Python (or many other languages). And crucially, it can package an entire analysis-ready dataset into a single file you can drop into a project, a notebook environment, or a small web app.
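
As a minimal sketch of what that looks like in practice (the file path and table name below are hypothetical, not the actual scipeds schema), you can open a baked DuckDB file read-only and query it directly:

```python
import duckdb

# Open the bundled database file read-only; no server process required
con = duckdb.connect("baked_data.duckdb", read_only=True)

# Run an analytical SQL query directly against the single file
df = con.sql(
    """
    SELECT year, COUNT(*) AS n_records
    FROM some_table
    GROUP BY year
    ORDER BY year
    """
).df()

con.close()
```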

An illustration of the difference between "traditional" and "baked" data deployments: traditional deployments treat data and application as separate, while a baked data approach bundles them together.

The result is a recipe for "baked public data": you start with the messy public-source data ingredients, you process, combine, and bake, and what you give users is a stable, ready-to-query database.

A schematic of the baking process for scipeds: structured but unorganized raw data files are pre-processed into a single DuckDB database file.

Case study: Turning IPEDS into a "baked" education dataset

We put this approach into practice in collaboration with Science for America, a solutions incubator that brings together scientists, technologists, and cross-sector partners to develop game-changing initiatives addressing critical societal challenges. Science for America wanted to answer questions about U.S. higher education STEM programs, e.g., how the share of women in STEM has changed over time or how different institutions have evolved.

The public data source we started with is IPEDS (the Integrated Postsecondary Education Data System, operated by the National Center for Education Statistics). IPEDS is a treasure trove of data collected annually from all higher education institutions that participate in federal student aid programs. But IPEDS data suffers from the same challenges discussed earlier. It is a collection of multiple files per year, accumulated over 30 years. Field codes and categorizations have changed over time. Older data dictionaries are only available as HTML files. On top of that, different reporting institutions classify STEM fields differently, creating overlapping taxonomies to reconcile.

To get started, we built a pipeline that pulls all of this data (CSVs, Excel spreadsheets, HTML data dictionaries, crosswalks for code changes) into a normalized relational database. We mapped evolving codes into consistent categories, built standard enumerations for degrees and fields, and stitched together institutional metadata so records can be joined to university characteristics.
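
To give a flavor of that normalization work, here is a hedged sketch (file names and columns are illustrative, not the real IPEDS layout) of applying a crosswalk so that an older year's field codes map onto a consistent taxonomy:

```python
import pandas as pd

# One year of raw completions data, using that year's field codes
completions = pd.read_csv("completions_2005.csv")

# Crosswalk from old codes to the standardized taxonomy
crosswalk = pd.read_csv("cip_crosswalk.csv")  # columns: old_code, new_code

# Re-key the data onto consistent codes so years can be stacked together
normalized = (
    completions.merge(crosswalk, left_on="field_code", right_on="old_code", how="left")
    .drop(columns=["field_code", "old_code"])
    .rename(columns={"new_code": "field_code"})
)
```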

At the end of that pipeline, we wrote everything into a .duckdb file. In total, there are about 22 million rows in a ~320 MB file, relational and ready for analytical queries. We version this file and release it alongside the code that generates it, so users can always access a specific snapshot of the data.
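
Continuing the sketch above, the final write step can be this simple (table names are again illustrative): DuckDB can read pandas DataFrames in scope by name, so persisting pipeline outputs into the file is one statement per table.

```python
import duckdb

con = duckdb.connect("scipeds.duckdb")  # creates the file if it doesn't exist

# `normalized` is the DataFrame produced by the pipeline step sketched above;
# DuckDB's replacement scan lets SQL refer to it directly by variable name
con.execute("CREATE TABLE completions AS SELECT * FROM normalized")
con.close()
```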

On top of the database, we then built the scipeds Python package. scipeds provides a higher-level query interface so you don’t have to write SQL directly. You define what you’re interested in (say, bachelor’s degrees in STEM, broken out by gender, using a specific STEM taxonomy) and the library generates the appropriate SQL behind the scenes. It knows about built-in taxonomies (such as NCSES or DHS definitions of STEM) and validates query parameters against enumerations while handling low-level schema details internally.
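
The pattern is easy to illustrate in miniature. The sketch below is not the actual scipeds API (see the [docs](https://docs.scipeds.org/) for that); it just shows the general shape: parameters are validated as enumerations, and the SQL is generated internally so callers never touch the schema.

```python
from enum import Enum

import duckdb

class AwardLevel(Enum):
    BACHELORS = "bachelors"
    MASTERS = "masters"

def degrees_by_gender(con: duckdb.DuckDBPyConnection, award: AwardLevel):
    """Aggregate degree counts by year and gender for one award level."""
    # Enum validation shields users from invalid or misspelled codes
    if not isinstance(award, AwardLevel):
        raise TypeError(f"award must be an AwardLevel, got {type(award)}")
    query = """
        SELECT year, gender, SUM(n_awards) AS n_awards
        FROM completions
        WHERE award_level = ?
        GROUP BY year, gender
        ORDER BY year
    """
    return con.execute(query, [award.value]).df()
```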

A screenshot of the scipeds homepage. scipeds is a reproducible, DuckDB-backed dataset and Python library developed by DrivenData and Science for America that standardizes and serves U.S. STEM education data derived from IPEDS. The scipeds Python package is documented [here](https://docs.scipeds.org/).

For someone working in Python, that means you can go from "I wonder how women in STEM have changed over time" to a usable time series and a plot in a few lines of code, without manually touching raw IPEDS files.

Making public data useful for the non-technical user

To fully solve the last-mile challenge we set for ourselves, we also had to address the reality that not everyone is going to write Python. So we took the same DuckDB file and dropped it behind a Plotly Dash app, deployed on a modest server: a couple of CPUs and a few gigabytes of RAM. The web server downloads the database at deploy time, and queries are processed server-side. Despite these modest specs, DuckDB’s speed means that filtering and aggregation remain responsive in real time.
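
A minimal version of that architecture (schema and dropdown options again hypothetical, and far simpler than the real scipeds app) fits in a few dozen lines of Dash:

```python
import duckdb
import plotly.express as px
from dash import Dash, Input, Output, dcc, html

# The database file is downloaded at deploy time; open it read-only
con = duckdb.connect("scipeds.duckdb", read_only=True)

app = Dash(__name__)
app.layout = html.Div([
    dcc.Dropdown(id="field", value="Engineering",
                 options=["Engineering", "Computer Science"]),
    dcc.Graph(id="trend"),
])

@app.callback(Output("trend", "figure"), Input("field", "value"))
def update_trend(field):
    # Each interaction runs a server-side DuckDB aggregation
    df = con.execute(
        """
        SELECT year, gender, SUM(n_awards) AS n_awards
        FROM completions
        WHERE field = ?
        GROUP BY year, gender
        ORDER BY year
        """,
        [field],
    ).df()
    return px.line(df, x="year", y="n_awards", color="gender")

if __name__ == "__main__":
    app.run()
```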

As a result, someone with no technical background can go to the scipeds site (https://www.scipeds.org/), pick an institution, pick a field like Engineering or Computer Science, choose a time range, and explore representation trends interactively. They don’t need to know SQL, Python, or anything about IPEDS data structures. They just see charts that answer the questions they care about.

scipeds exemplifies the advantages of a baked data architecture: one artifact can serve multiple access modes across multiple platforms seamlessly.

A screenshot of a Plotly Dash app visualization from [scipeds.org](https://www.scipeds.org) showing the relative proportion of men and women in STEM at different universities.

The Plotly Dash app on the scipeds site (https://www.scipeds.org/) allows immediate data exploration for non-technical users. We invite you to try it out and ask and answer questions about higher education for yourself!

Baked data is a reproducible approach that improves public data access and usability

Before this work, answering a seemingly simple question from IPEDS like "How has the number of women studying STEM changed over time in the U.S.?" meant digging through a pile of heterogeneous files and patching together lots of one-off logic. Now it’s a small, readable query on a single database or a visit to a webpage.

From a deployment perspective, things are simpler too. There’s no separate hosted database cluster to maintain. For most use cases, you just need to download the DuckDB file where your code or web app can reach it.

Because data and application are bundled together, our baked data approach delivers an additional reproducibility benefit. If someone uses scipeds for an analysis and wants to recall or reproduce it later, they can pull the same version of the package and the same underlying database. This works because the DuckDB file is generated during the release process and versioned alongside the code; in essence, every library release corresponds to a particular state of the data.
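
One lightweight way to make that pairing concrete (the URL below is a placeholder, not a real release artifact) is to pin the database download to a release and verify a published checksum:

```python
import hashlib
import urllib.request

def fetch_versioned_db(url: str, expected_sha256: str, dest: str = "scipeds.duckdb") -> str:
    """Download a release-pinned database file and verify its checksum."""
    urllib.request.urlretrieve(url, dest)
    with open(dest, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    if digest != expected_sha256:
        raise ValueError(f"checksum mismatch: expected {expected_sha256}, got {digest}")
    return dest

# e.g. fetch_versioned_db("https://example.com/releases/v1.2.0/scipeds.duckdb",
#                         "<sha256 published with the v1.2.0 release>")
```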

When the baked data pattern works, and when it doesn’t

Three things became obvious while building scipeds. We present these as lessons learned for future developers of baked data systems.

  1. Everything depends on the data pipeline.
    If you don’t invest fully in turning messy public data into a consistent, analysis-ready dataset, the recipe doesn’t work. Taking care of the enumerations, taxonomies, standardized concepts, and so forth (the work of data analysts, data engineers, or data scientists) is what bakes the data so users can answer their questions without constantly tripping over schema quirks.

  2. DuckDB is a powerful and portable tool.
    It gives data scientists a familiar SQL engine, Python users a clean API, and application developers a simple, portable artifact that is much faster for analytical queries than an OLTP engine like SQLite. With DuckDB you don’t need heavyweight infrastructure to get serious analytical performance from structured datasets.

  3. Accessibility must be layered to make the data truly accessible and useful.
    Some people are happy writing SQL. Some are comfortable in Python but not SQL. Some aren’t going to open a notebook, ever. If you want a public dataset to actually be used by your target audience, you need to meet them at their level, whether that's a database, a language wrapper, or a UI.

Finally, we want to acknowledge that this approach isn’t a silver bullet for all types of public data. It works best for structured data that’s small to mid-sized and relatively static. If you’re dealing with streaming data, petabyte-scale logs, or purely unstructured text, you’re in different territory. And you still have to solve the unglamorous problems of hosting and paying for whatever you put on the web.

If any of these more complex scenarios applies to your situation, reach out. We love a challenge!

How to achieve the gold standard for published public data: useful, usable, and used

Solving the last-mile problem for public data isn’t about publishing more instructions and data dictionaries; it’s about whether someone with a real question can get a real answer without drowning in data munging. The baked data approach that underlies scipeds is a simple, reproducible recipe for closing the last-mile gap in three steps:

  1. Do the heavy integration work up front,
  2. package the result as a portable analytical database, and
  3. give people multiple paths into it.

Public data has huge value, and the investment that organizations make into its creation is significant. When it is easily and directly usable, it achieves its full potential for social impact.

Dive deeper: Watch a detailed slide presentation about solving the last-mile public data problem using baked data by Senior Data Scientist Chris Kucharczyk here. For more information about the scipeds application, see this case study here.

How well does your published data address the last-mile problem? DrivenData offers free last-mile data problem evaluations. Get in touch here.

