Open Source Projects

DrivenData maintains a number of popular open source projects for the data science, machine learning, and software engineering communities. Check them out here!

Cookiecutter Data Science

A logical, reasonably standardized, and flexible project structure for doing and sharing data science work

Since starting DrivenData, we’ve seen a lot of data science in the wild. As the field develops, it’s becoming increasingly important to organize data science work so that it’s easy to reproduce and build upon.

Cookiecutter Data Science is a widely used project template that keeps data scientists organized and on track.

Deon: An Ethics Checklist for Data Scientists

A command line tool that allows you to easily add an ethics checklist to your data science projects

When there's a lot at stake, checklists make sure big questions don't slip through the cracks and tough conversations happen even (especially) in fast-moving environments. The goal of deon is to push that conversation forward and provide concrete, actionable reminders to the developers who have influence over how data science gets done.

One command jumpstarts the conversation all data teams should be having. Explore the checklist here!


cloudpathlib

pathlib-style classes for cloud storage services

Have you wished for a consistent and easy interface in Python to access files in cloud storage like S3 and Azure? cloudpathlib is an extensible Python library that provides pathlib.Path-style classes for dealing with files in various cloud storage services, with seamless local caching.

Our goal is to be the meringue of file management libraries: the subtle sweetness of pathlib working in harmony with the ethereal lightness of the cloud.
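To make the idea of "pathlib.Path-style classes with local caching" concrete, here is a minimal standard-library sketch. It is not cloudpathlib's API or implementation: ToyCloudPath, FAKE_BUCKET, and the downloads counter are hypothetical names invented for illustration, and the "cloud" is just an in-memory dictionary.

```python
import tempfile
from pathlib import Path, PurePosixPath

# Hypothetical stand-in for a remote bucket: maps object keys to bytes.
FAKE_BUCKET = {"reports/2024/summary.txt": b"hello from the cloud"}


class ToyCloudPath:
    """Toy pathlib-style handle for an object key, backed by a local file cache."""

    def __init__(self, key, cache_dir):
        self._key = PurePosixPath(key)
        self._cache_dir = Path(cache_dir)
        self.downloads = 0  # simulated downloads made through this handle

    def __truediv__(self, other):
        # Support the familiar `path / "child"` syntax from pathlib.
        return ToyCloudPath(self._key / other, self._cache_dir)

    @property
    def name(self):
        return self._key.name

    def exists(self):
        return str(self._key) in FAKE_BUCKET

    def _cached_file(self):
        # "Download" the object only if it is not already in the local cache.
        local = self._cache_dir.joinpath(*self._key.parts)
        if not local.exists():
            local.parent.mkdir(parents=True, exist_ok=True)
            local.write_bytes(FAKE_BUCKET[str(self._key)])
            self.downloads += 1
        return local

    def read_text(self):
        return self._cached_file().read_text()


with tempfile.TemporaryDirectory() as cache:
    path = ToyCloudPath("reports", cache) / "2024" / "summary.txt"
    found = path.exists()
    first = path.read_text()   # fetched from the fake bucket
    second = path.read_text()  # served from the local cache
```

The second read never touches the fake bucket, which is the behavior the caching description refers to. A real library also has to handle credentials, uploads, and cache invalidation, none of which this sketch attempts.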


nbautoexport

Making it easier to code review Jupyter notebooks, one script at a time

nbautoexport automatically exports Jupyter notebooks to other file formats (.py, .html, and more) every time you save in Jupyter. One great use case is keeping script versions of your notebooks alongside them, so reviewers can comment on plain code during code review.
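As a rough illustration of what a "script version" of a notebook is, the sketch below pulls the code cells out of a notebook's JSON using only the standard library. It is not how nbautoexport works internally; notebook_to_script and the sample notebook are hypothetical, and the only assumption about the file format is that cells carry a cell_type and a source that is either a string or a list of strings.

```python
import json

# A tiny hand-written notebook in the nbformat 4 JSON layout (sample data).
notebook_json = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Compute an area"]},
        {"cell_type": "code", "metadata": {}, "outputs": [], "execution_count": None,
         "source": ["import math\n", "radius = 2"]},
        {"cell_type": "code", "metadata": {}, "outputs": [], "execution_count": None,
         "source": "area = math.pi * radius ** 2"},
    ],
})


def notebook_to_script(raw):
    """Join the code cells of a notebook JSON string into one script string."""
    notebook = json.loads(raw)
    chunks = []
    for cell in notebook["cells"]:
        if cell["cell_type"] != "code":
            continue  # skip markdown and raw cells
        source = cell["source"]
        if isinstance(source, list):  # source may be stored as a list of lines
            source = "".join(source)
        chunks.append(source.rstrip())
    return "\n\n".join(chunks) + "\n"


script = notebook_to_script(notebook_json)
```

The resulting text diffs cleanly in a pull request, whereas the raw notebook JSON mixes code with outputs and metadata.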


erdantic

Entity relationship diagrams for Python data model classes like Pydantic

Looking for an easy, clean way to visualize your data model? erdantic is a simple tool for drawing entity relationship diagrams (ERDs) that show how data model classes are connected. Generate ERDs from models defined with multiple supported frameworks, such as Pydantic and dataclasses.

If you have data models in Python, this is a great way to illustrate your schema and add a visual reference to your documentation.
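The relationships an ERD shows can be discovered by inspecting field type annotations. The sketch below does this for standard-library dataclasses and returns the edges as tuples. It is an illustration of the concept only, not erdantic's API or algorithm: find_edges and the three sample models are hypothetical, and it looks just one level into generic types such as List[Address].

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import List, get_args, get_type_hints


@dataclass
class Address:
    city: str


@dataclass
class Customer:
    name: str
    address: Address


@dataclass
class Order:
    customer: Customer
    ship_to: List[Address]


def find_edges(model, seen=None):
    """Return (source, field, target) triples reachable from `model`."""
    seen = set() if seen is None else seen
    if model in seen:
        return []  # already expanded; avoids infinite loops on cycles
    seen.add(model)
    edges = []
    hints = get_type_hints(model)
    for field in fields(model):
        annotation = hints[field.name]
        # Check the annotation itself plus its type arguments, e.g. List[Address].
        for target in (annotation, *get_args(annotation)):
            if is_dataclass(target):
                edges.append((model.__name__, field.name, target.__name__))
                edges.extend(find_edges(target, seen))
    return edges


edges = find_edges(Order)
```

Each triple corresponds to one arrow in a diagram; rendering them is a separate step that this sketch leaves out.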


pandas-path

Path-style access for pandas

Love pathlib.Path? Love pandas? Wish it were easy to use pathlib methods on pandas Series? This package is for you.

Just one import adds a .path accessor to any pandas Series or Index so that you can use all of the methods on a Path object.
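The accessor idea, applying a Path attribute or method to every element at once, can be sketched without pandas. The example below is a standard-library toy, not the package's implementation: ToyPathAccessor and _to_plain are hypothetical names, and a plain list stands in for a pandas Series.

```python
from pathlib import PurePath, PurePosixPath


def _to_plain(value):
    """Convert path results back to strings; leave other values untouched."""
    return str(value) if isinstance(value, PurePath) else value


class ToyPathAccessor:
    """Applies PurePosixPath attributes and methods to every string in a list."""

    def __init__(self, values):
        self._paths = [PurePosixPath(v) for v in values]

    def __getattr__(self, name):
        if name.startswith("_"):
            raise AttributeError(name)
        # Raises AttributeError for names that PurePosixPath does not define.
        attribute = getattr(PurePosixPath, name)
        if callable(attribute):
            # Methods: return a function that calls the method on each element.
            def apply(*args, **kwargs):
                return [_to_plain(getattr(p, name)(*args, **kwargs))
                        for p in self._paths]
            return apply
        # Properties: evaluate on each element right away.
        return [_to_plain(getattr(p, name)) for p in self._paths]


files = ToyPathAccessor(["data/raw/train.csv", "data/raw/labels.json"])
names = files.name
parents = files.parent
renamed = files.with_suffix(".parquet")
```

With a real Series the results would come back as a Series rather than a list, which keeps them usable in the rest of a pandas workflow.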

Winning Models from DrivenData Competitions

Prize-winning algorithms from DrivenData’s competitions

DrivenData runs machine learning competitions to help non-profits, NGOs, governments, and other social impact organizations use data science in service of humanity. Part of our mission is to enable data scientists and mission-driven organizations to learn from the work done in these competitions. To this end, the code submitted by winners is released under an open source license for others to learn from, use, and adapt.

Check out how ML experts built their winning algorithms!

Project Zamba

Computer vision for wildlife research and conservation

At the end of 2017, data scientists from more than 90 countries around the world drew on more than 300,000 video clips in a competition to build the best machine learning models for identifying wildlife from camera trap footage. Following the competition, the top-performing submission was packaged into an open source software tool and made available for general use by researchers and conservationists.

Zamba is an open-source Python package that identifies 23 animals in video data.

Concept to Clinic

An AI-powered application for early lung cancer detection built for radiologists

In the Concept to Clinic challenge, hundreds of data scientists and engineers from around the world came together to build open source tools to fight the world’s deadliest cancer. The prototype developed during the live challenge period between August 2017 and January 2018 focused on helping clinicians flag, assess, and report concerning nodules from CT scans.

This open-source project is an end-to-end application that allows radiologists to better interact with state-of-the-art AI as part of their diagnostic process.