Launching the K-12 AI Infrastructure Platform

Open Datasets, Benchmarks, and Challenges for Education AI

Today, we're launching a new platform in collaboration with the K-12 AI Infrastructure Program. The mission is simple: build AI that actually serves students and educators.

The platform is designed to support that mission in three key ways:

  • Gather and distribute core AI infrastructure comprising datasets and models.
  • Ground AI advances in learning science and student outcomes through curated benchmarks.
  • Build a collaborative community through challenges and community discussion forums.

To understand the platform's purpose, it's helpful to define what we mean by AI Infrastructure. The term may bring to mind datacenters, server racks, GPUs, or the massive underwater cables that form the backbone of the internet. It may bring to mind human capital, technical capacity, or governance, regulation, and privacy. Those are all important infrastructure components for AI, but they are not the layers we are focusing on to build AI that works for education.

This platform focuses on the necessary ingredients for AI model development: open datasets, models, and benchmarks. These are the digital public goods that facilitate the development and deployment of AI models. The current generation of LLMs is undeniably shaped by the data, tasks, and open-source releases that have gained the attention of AI researchers over the last 25 years. We want our AI models to be shaped by learning science, student outcomes, and the needs of classroom teachers and administrators. To make this happen, we need to build an AI development community around the datasets, models, and benchmarks that reflect these principles.

Come join us on the platform or read on to learn more about what we have built and where we are going.

Open Education Datasets for AI Development

AI models are only as good as the data they are trained on. To that end, we are gathering and distributing data that is AI-ready, openly available, and relevant to K-12 education. We're launching with nine open datasets you can explore on the platform, but that is just the beginning.

  • Bridge: Online math tutoring conversations annotated for the type of mistake and the strategy for correction.
  • DrawEduMath: Hand-drawn responses to math problems annotated with QA pairs about the students' work.
  • EssayJudge: Short argumentative essays describing math or science charts annotated for lexical quality.
  • FairytaleQA: Fairytale excerpts with QA pairs categorized into one of seven narrative elements.
  • Grade School Math (GSM8K): 8.5k multi-step, grade-school math word problems with solutions.
  • MRBench: Math tutoring dialogues with LLM responses annotated across eight pedagogical dimensions.
  • SciQ: 13.7k multiple-choice science questions with supporting text, response choices, and the correct answer.
  • SemEval: Short-answer science questions with graded responses.
  • TalkMoves: Classroom transcripts labeled for discursive teacher moves.

Beyond these, we expect to add many more datasets, both funded by our program and developed by other institutions. We are prioritizing open data, but will also support datasets that need gated access at one of three levels: an account is required, specific terms must be agreed to, or access must be requested and explicitly approved.

If you've got a dataset you think belongs on the platform, don't hesitate to reach out at k12-ai-infra@drivendata.org.

Education AI Benchmarks That Measure What Matters

When new models are released, frontier labs often tout progress on specific benchmarks: predefined tasks that measure model capabilities. The benchmarks they report tend to measure economically valuable work or software engineering. Despite the attention paid to how AI will change education, model performance on education-specific tasks is rarely discussed beyond scores on standardized tests, which do not measure how effective models are as tools for students or teachers.

Today, we're launching with a benchmark called "SAGE: Science Answer Grading & Evaluation". The task is to compare student answers against a reference answer and decide if the student answer is correct, partially correct, or contradictory. The SAGE benchmark is built on the SemEval - Joint Student Response Analysis dataset.

Diagram evaluating three student answers against a reference text about photosynthesis. The first detailed answer is graded "Correct", the second brief answer is "Partially Correct", and the third incorrect answer is "Contradictory".

You may think that current state-of-the-art models could do this task with ease, but this is another demonstration of the jagged frontier. The latest models get just over half of the classifications correct (comparable to how custom-trained models performed on this task in 2013).
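To make "half of the classifications correct" concrete, here is a minimal sketch of how a three-way grader could be scored by plain accuracy. It is an illustration only: the label strings, function names, and toy data are assumptions, and this post does not specify the benchmark's actual data format or scoring code.

```python
from collections import Counter

# Assumed label names, taken from the task description above; the real
# benchmark may encode them differently.
LABELS = ("correct", "partially_correct", "contradictory")


def accuracy(gold, predicted):
    """Fraction of student answers whose predicted grade matches the gold grade."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("need at least one example")
    unknown = (set(gold) | set(predicted)) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def confusion_counts(gold, predicted):
    """Count (gold, predicted) pairs to show which grades get mixed up."""
    return Counter(zip(gold, predicted))


# Hypothetical toy data, not drawn from the benchmark.
gold = ["correct", "partially_correct", "contradictory", "correct"]
predicted = ["correct", "correct", "contradictory", "partially_correct"]

print(accuracy(gold, predicted))  # 0.5
print(confusion_counts(gold, predicted)[("partially_correct", "correct")])  # 1
```

The confusion counts are often more informative than the headline number, since they show whether a model struggles mostly with the boundary between "correct" and "partially correct" or with spotting contradictions.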

Benchmarks are the metrics that help us define the tasks we want models to be good at, and they shape the direction of research and development efforts. We've got lots of plans for additional benchmarks, different evaluation structures (LLM-as-judge, verified evaluators, and Elo-style human judgment), and new ways of interacting with benchmarks. For example, on this benchmark, users will soon be able to submit their own approach, whether that is a fine-tuned model, a new prompt, or an agent harness, and see how it stacks up against the frontier labs.
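For readers unfamiliar with Elo-style evaluation, the idea is that human judges compare two model outputs at a time, and each comparison nudges the two models' ratings. The sketch below uses the standard Elo update; the K-factor of 32, the starting rating of 1000, and the model names are illustrative assumptions, not a description of how the platform will implement it.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, a_won, k=32.0):
    """Return updated (rating_a, rating_b) after one pairwise judgment."""
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    # Whatever A gains, B loses, so the total rating is conserved.
    return rating_a + delta, rating_b - delta


# Hypothetical models and judgments, each recorded as (preferred, other).
ratings = {"model_x": 1000.0, "model_y": 1000.0}
judgments = [("model_x", "model_y"), ("model_x", "model_y"), ("model_y", "model_x")]

for winner, loser in judgments:
    ratings[winner], ratings[loser] = update_elo(
        ratings[winner], ratings[loser], a_won=True
    )

print(ratings)
```

Because an upset against a higher-rated model moves the ratings more than an expected win, the final numbers depend on the order of judgments as well as their counts.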

Community for Education Experts and AI Developers

Most importantly, we can't make progress without a community that cares. We want to create a convergence point for two communities: engaging the K-12 education community in AI development, and interesting the AI development community in K-12 education. One way we will do that is through challenges incentivized with monetary prize pools. Alongside the platform itself, we're launching our first prize challenge today.

Trace the Ace challenge

In the challenge, called Trace the Ace, participants are asked to build AI models that parse and understand tutoring transcripts, then use that knowledge to predict whether a student will get the next quiz question correct. The transcripts come from our challenge partner, the National Tutoring Observatory, and are real human tutor transcripts. A successful model will be able to trace student knowledge and, in the end, help us learn more about what works in tutoring and why.
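To give a feel for the prediction target, here is a deliberately naive baseline: a Laplace-smoothed running accuracy that ignores the transcripts entirely and predicts from a student's earlier quiz results alone. The input format (a list of 0/1 outcomes per student) and the function name are hypothetical; the challenge's real data schema and evaluation metric are not described in this post.

```python
def predict_next_correct(history, prior_correct=1.0, prior_total=2.0):
    """Estimate the probability that the next answer is correct.

    history: earlier quiz outcomes for one student, 1 = correct, 0 = incorrect.
    The priors act as add-one (Laplace) smoothing, so a student with no
    history gets a neutral 0.5 rather than an undefined rate.
    """
    if any(outcome not in (0, 1) for outcome in history):
        raise ValueError("history must contain only 0 or 1")
    return (sum(history) + prior_correct) / (len(history) + prior_total)


# Hypothetical students, not drawn from the challenge data.
print(predict_next_correct([]))         # 0.5
print(predict_next_correct([1, 1, 0]))  # 0.6
```

A competitive entry would presumably need to beat this kind of baseline by using what actually happens in the tutoring session, which is the point of the challenge.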

There is $50,000 in prizes available. Winners will be evaluated not only quantitatively, on who makes the most accurate predictions, but also on the quality of their insights into the task. Solution write-ups are required to win a prize, and additional prizes are awarded to the teams that bring their write-ups through to publication.

If you are looking for ways to engage beyond the challenge, all public goods have their own category in the platform discussion forum. You can come there to talk about what you find in a dataset, ask questions about methods for a benchmark, or share tips on how to move up the leaderboard in a competition. Or, if you are moved to, start a discussion on another K-12 AI Infrastructure topic!

Join us

Come and explore the resources. Use the datasets and share your results. Participate in the benchmarks and challenges. Ask the hard questions in the discussion forum. We can't wait to see what you build. We're just getting started and are looking forward to shaping this platform together.

---

Thank you to Digital Promise, alongside the core partners in the K-12 AI Infrastructure Program: Learning Data Insights, DrivenData, the Massive Data Institute at Georgetown University, and Catalyst @ Penn GSE.

Thank you to the National Tutoring Observatory for their partnership in launching our first challenge on this new platform. And finally, thank you to all the researchers and organizations that have published open datasets that serve as the foundation of this work.
