
Cookiecutter Data Science V2


by Peter Bull, Jay Qi, Chris Kucharczyk

The original Cookiecutter Data Science (CCDS) was published over 8 years ago. The goal was, as the tagline states, “a logical, reasonably standardized but flexible project structure for data science.” That version, now affectionately called V1, has been a workhorse for a long time, getting the job done for many projects while remaining mostly unchanged.

We’ve had over 7.6k stars, and—even more encouraging—2.4k forks. The conventions we advocated have been adopted in products, industry teams, academic research, data science portfolios, and domain-specific forks. The demand for consistent project structure has come up on Stack Overflow, Reddit, HackerNews, and more (and there's a reason why).

That said, in the past 5 years, a lot has changed in data science tooling and MLOps. CCDS V2 is designed to embrace these changes and look to the future. We’ll keep our Unix-like philosophy: pick a tool that does a single job well, then chain those tools together into a workflow. We want to be able to swap different tools in and out as they develop and mature.

Side-note: another change from the past five years is an increase in data science leaders and managers across engineering orgs. If you're more interested in the principles that drive the design of CCDS than the details, check out our 10 Rules of Reliable Data Science.

The vision for V2 is to give CCDS users more flexibility while still making a great template if you just put your hand over your eyes and hit enter. We’ll keep V1 around, and you’ll always be able to keep using it, but expect changes in V2.

Hello from our new, friendly, welcoming, definitely not an AI overlord cookie logo!

We've got a lot to cover in this release announcement.

What’s new

There are a number of new things that we are excited about in CCDS V2. These include:

  • A new CLI entrypoint, ccds. Owning the command line entrypoint gives us greater control over the cookiecutter process, which enables a lot of the features below.

  • Greater optionality: many feature requests boiled down to wanting to use tools other than the V1 defaults for a given task. As tools mature and grow in adoption, we want it to be easy for people to use the best ones in their project template.

  • Better documentation with more examples, clearer explanations of the choices and tools, and a more modern look and feel. Find the latest at https://cookiecutter-data-science.drivendata.org/ (the old documentation will redirect here shortly).

  • CCDS tests: V1 had no tests. We now run tests across OSes that actually go through the cookiecutter steps. This was a real journey in obscure OS errors, but it means that we can make changes in the future with confidence.

  • A vision for extensibility: enabling better automation, more community contributions, more customization, and more collaboration.

We mentioned that the template has a Unix-like philosophy: chain together the best tools for a task rather than trying to be an all-in-one solution. We’ll go through each task CCDS V2 provides tools for and talk about how they work.

Updates to core data tasks

Folder structure and organization

The core folder structure of CCDS has worked solidly for many different kinds of projects. In V2, there is one major change: src is renamed to {{ cookiecutter.module_name }}, which defaults to your project name (but can be further customized). This better reflects the common Python practice of having your top-level module be the project name.
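For reference, here is a trimmed sketch of the default layout (some files and folders are omitted for brevity, and the one-line descriptions are our shorthand):

    ├── Makefile            <- Commands like `make requirements` or `make lint`
    ├── data
    │   ├── external        <- Data from third party sources
    │   ├── interim         <- Intermediate data that has been transformed
    │   ├── processed       <- The final, canonical data sets for modeling
    │   └── raw             <- The original, immutable data dump
    ├── docs                <- Project documentation
    ├── models              <- Trained models, predictions, and summaries
    ├── notebooks           <- Jupyter notebooks
    ├── references          <- Data dictionaries, manuals, explanatory materials
    ├── reports             <- Generated analysis and figures
    ├── pyproject.toml      <- Project configuration
    └── {{ cookiecutter.module_name }}  <- Source code for this project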

Other than that change, here are a few reflections on how the default folder structure gets used in the wild.

  • For small datasets, people add data to source control. This works as long as your data is small enough that you don’t need Git LFS, and as long as you treat raw data as immutable and analysis as a DAG so you don’t have to worry about merge conflicts in your data.

  • For most projects, docs and references are left empty or deleted. However, we’ve seen docs be very useful as analysis projects mature into packages (as an example, see our CyFi project) and references be useful when code and data are handed off. Frankly, we’d encourage more teams to be proactive in adding documentation of the project code in docs and documentation on the data in references.

  • The models folder has been used to keep track of experiments, hyperparameter configurations, results, and model artifacts. Some projects manage this folder like the data folder and sync it to a canonical store (e.g., AWS S3) separately from source code. Some projects opt to remove it and use a separate experiment tracking tool.

Dependency management

V1 only supported pip out of the box. Now you can choose pip, conda, or Pipenv as the dependency management system for your project. We’ll set up a make create_environment command you can use to instantiate a new environment. Then you can run make requirements to install all of the packages after activating the environment. Other dependency management tools such as Poetry, PDM, and pip-tools may be supported in future versions depending on community interest.
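A typical first session looks something like the following; the activation step depends on the tool you chose (this sketch assumes conda and a made-up project name, my_project):

    make create_environment    # create a fresh, isolated environment for the project
    conda activate my_project  # or the equivalent activation for your chosen tool
    make requirements          # install the project's dependencies into it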

Data storage

V1 was designed to encourage data scientists to (1) separate their data from their codebase and (2) store their data on the cloud. We picked S3 since it worked for our workflows and we knew how to configure all of the pieces. We have now added support for Azure and GCS as well. For any of these tools, you will need to have the cloud provider’s CLI installed already.

That said, as a team we often use cloudpathlib (another DrivenData project) to interact with files in cloud storage directly with a persistent local cache. This way syncing happens behind the scenes and individual files are only cached locally as needed. This also obviates the need for an extra command line tool to manage syncing to the cloud.
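For example, here is a minimal sketch of that pattern; the bucket and file paths are made up:

    from cloudpathlib import CloudPath
    import pandas as pd

    # Points at the canonical copy in cloud storage; nothing is downloaded yet.
    raw = CloudPath("s3://my-bucket/data/raw/train.csv")  # hypothetical bucket

    # Opening the file downloads it to a local cache behind the scenes;
    # later reads reuse the cached copy instead of re-downloading.
    with raw.open("r") as f:
        df = pd.read_csv(f)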

Project documentation

As data science codebases live longer, code is often refactored into a package, and CCDS V2 makes this workflow seamless. One of the things packages need is good documentation. We chose MkDocs (instead of Sphinx) as the new default; it's what we happily use for most of our projects. We want people to write docs, and we want that to be as low-friction as possible so that it actually happens. MkDocs hits the sweet spot for our team, and we recommend it for others.

Additional functionality and CCDS improvements

CCDS documentation

We’ve given our own CCDS docs a refresh as well, with the goal of more comprehensive documentation. Enjoy!

Linting and formatting

The Black code formatter entered the Python ecosystem after CCDS V1 was created, and it's one of the tools our team most appreciates. Instead of manually fixing flake8 issues or arguing over stylistic considerations, projects can just run black and have their code formatted for them. Now, out of the box, projects have a make format command that runs black to format the code base, and a make lint command that runs flake8 and black --check to make sure the code conforms with standards. Ruff is also emerging as a great all-purpose formatter and linter for Python codebases and may be an option in later CCDS versions.
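The targets themselves are nothing exotic; they look roughly like this, where my_project stands in for your module name (recipe lines in a Makefile must be indented with tabs):

    ## Format source code with black
    format:
    	black my_project

    ## Lint using flake8 and black (use `make format` to fix issues)
    lint:
    	flake8 my_project
    	black --check my_project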

CCDS badge


Let people know what to expect with the CCDS badge instead of the footer text. Badges are delightful. Join the fun!

Markdown:

[![Uses the Cookiecutter Data Science project template](https://img.shields.io/badge/CCDS-Project%20template-328F97?logo=cookiecutter)](https://cookiecutter-data-science.drivendata.org/)

HTML:

<a target="_blank" href="https://cookiecutter-data-science.drivendata.org/">
    <img src="https://img.shields.io/badge/CCDS-Project%20template-328F97?logo=cookiecutter" alt="Uses the Cookiecutter Data Science project template">
</a>

Basic dependencies out of the box

CCDS V1 came with just a few minimal dependencies specified ahead of time. You can now choose a "basic" set of PyData packages (Pandas, NumPy, Jupyter, scikit-learn and more) out of the box and the dependencies will be added to the dependency management file of your choice. This will also enable easily adding core packages for other kinds of projects in particular domains in the future—imagine options like PyTorch, geospatial, or NLP.

Switch to pyproject.toml

pyproject.toml is the new standard file for Python project configuration, including packaging, building, and dev tool configs. Read more about that in this great blogpost, and you’ll see why CCDS V2 no longer comes with setup.py.
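As a rough, trimmed illustration of that consolidation, a single pyproject.toml can hold packaging metadata and tool configuration side by side (the values here are made up):

    [project]
    name = "my_project"
    version = "0.1.0"
    requires-python = ">=3.10"

    [build-system]
    requires = ["setuptools"]
    build-backend = "setuptools.build_meta"

    [tool.black]
    line-length = 99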

Run with pipx

We all know that Python environment management is tricky. pipx is a nice way to install and run system-level tools like CCDS in their own isolated environments. That way you don’t have to worry about which environment you’re in or whether CCDS is polluting your project environment.
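Getting started then looks something like this (see the docs for the current install instructions if anything has changed):

    pipx install cookiecutter-data-science  # installs ccds in its own isolated environment
    ccds                                    # start a new project from anywhere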

Tests for CCDS

One of the biggest lifts for V2 was an automated test suite. Testing environment creation and basic Makefile commands across different dependency managers and operating systems (Windows, macOS, and Ubuntu) was a complex engineering task, but the test suite gives us confidence that changes to CCDS work for users across a variety of platforms and tools.


What’s in progress

Beyond what we’re shipping with V2, we’ve got ambitions for what 2.1 will look like. We want to share some of what is in progress for future releases.

Continuous integration

CI providers have seen an enormous amount of churn since CCDS V1. Seriously. Travis CI was new on the block when this project started and is now basically fading out of existence. GitHub Actions is one of the most common CI systems for projects now, so we’re adding support to Cookiecutter Data Science. We anticipate having workflows for GitHub Actions that run linting and tests for your project.

Python boilerplate

CCDS V1 had rough boilerplate that was meant to be replaced as your project got built out. Some people found it confusing, so we have simplified the boilerplate and made it optional. We plan to make the project scaffold even better in the future: we’re thinking that scripts with good logging, project configuration, and CLI entrypoint practices will be an even better starting place. Plus, we’re going to add an option for tests and test configuration as well.
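To make that concrete, here is a minimal sketch of the kind of scaffold we mean, using only the standard library (the names are illustrative, not the final CCDS boilerplate):

    """Hypothetical example of a script with logging and a CLI entrypoint."""
    import argparse
    import logging

    logger = logging.getLogger(__name__)


    def main(input_path: str, output_path: str) -> None:
        logger.info("Processing %s -> %s", input_path, output_path)
        # ... the actual work goes here ...


    if __name__ == "__main__":
        logging.basicConfig(level=logging.INFO)
        parser = argparse.ArgumentParser(description="Process raw data.")
        parser.add_argument("input_path")
        parser.add_argument("output_path")
        args = parser.parse_args()
        main(args.input_path, args.output_path)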

CCDS-verse

One of the testaments to the success of CCDS is the multitude of forks. Right now, these forks can be hard to discover and gain traction. But, a lot of the forks can be incredibly useful for people working in a specific domain. We plan to add a section to the docs for the CCDS-verse. Add your public forks, pre-configs, and other customizations along with a description of the purpose. Make your process easily accessible and usable by other folks in your domain.

Customizable configuration

Lots of people have even more opinions about their tool stack and configuration. We plan to add an extensible post-generation hook that will copy over additional files from a location of your choice. This should make it easy to propagate your .editorconfig, linting/formatting configuration, .gitignore, and other configurations into every project you create.

Also, we realize that answering the same question at the command line every time you make a new project can be annoying. We plan to add support to reference a set of standard answers to the CCDS questions so you only get asked the unanswered ones interactively at the command line. This can make consistency within organizations even better.

Tighter Git integration

CCDS V1 was published at a time when Subversion and Mercurial still had substantial market share. These days, for better or worse, pretty much everything is Git. Even emerging alternatives such as jj have backends powered by Git. We’re planning tighter Git integration: CCDS will ask for a remote (e.g., the URL of a new project on GitHub) and then run git init, add the remote, and push the initial state all in one go.
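In other words, the several manual commands that today look roughly like the sequence below (the URL is a made-up placeholder) would happen in one step:

    git init
    git remote add origin https://github.com/you/your-project.git  # hypothetical URL
    git add .
    git commit -m "Initial commit from CCDS"
    git push -u origin main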

Additionally, this means that we’re also more interested in tooling like pre-commit hooks. For example, one neat thing you can do manually right now is run:

    echo -e '#!/bin/bash\nmake lint\n' > .git/hooks/pre-commit && chmod +x .git/hooks/pre-commit

This gives you a pre-commit hook that runs make lint before committing to your project. Edit .git/hooks/pre-commit to add more commands like make test if you have them. There are more complex solutions like pre-commit, but our team does not favor them.


What’s still missing

We’re not done yet. In our experience, there are places where we haven’t found the right tool for the job yet. These areas continue to present struggles to data science teams and practitioners. 

The "right" DAG/command-runner tool

We all agree: learning make is a recipe for frustration. In fact, even installing it on Windows machines can be a little tricky. The incantations are magical, debugging is hard, whitespace is sneaky, errors are inscrutable, and the Makefile language doesn’t share syntax with many other tools. That said, we still use it for the CCDS default for now.

Here’s why: as data scientists, we have two use cases for make, and neither needs the full breadth of features that a build tool like make is designed to have. The first is to keep track of the shell commands we run all the time. These commands can be long and vary across projects. We don’t want to remember them, so things like make create_environment and make requirements that just work across all our projects are a delight. The second is to provide a directed acyclic graph (DAG) for data pipelining and model building. If you use the filesystem as an intermediate data store, you can easily DAG-ify your data cleaning, feature extraction, model training, and evaluation, as in the sketch below.
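Here is a hypothetical sketch of that second use case, with made-up script and file names (recipe lines in a Makefile must be indented with tabs):

    data/interim/clean.csv: data/raw/survey.csv
    	python my_project/clean.py data/raw/survey.csv data/interim/clean.csv

    data/processed/features.csv: data/interim/clean.csv
    	python my_project/features.py data/interim/clean.csv data/processed/features.csv

    models/model.pkl: data/processed/features.csv
    	python my_project/train.py data/processed/features.csv models/model.pkl

Running make models/model.pkl rebuilds only the steps whose inputs have changed since the last run.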

We dream of a make replacement that is easily installed on all OSes, supports writing shell scripts without arcane syntax, can express step dependencies, and can consume variables both from the environment and as defined in the file. In our opinion, DAG tools designed to handle every kind of process a data scientist may want to run are overkill for most data science projects. For most projects, starting with local execution and a simple structure is good enough.

In this space, tools like just, or ones with the philosophy of the well-designed but abandoned bake, are nice alternatives, but they need to be bootstrapped onto any system where you want to use them. On many, many systems, make is just available.

This brings us to make compatibility on Windows systems, which has been a constant thorn in our sides. Our test suite actually runs on Windows with make. There are multiple ways to install make onto Windows systems. We've included a section in the docs explicitly to help folks with this.

It’s a shame there is no universally available command runner that works across Windows and POSIX shells.

The "right" data management tool

Currently, our recommendation is to exclude the data folder from source control and to manage syncing data between machines separately from code, artifacts, and assets. This works ok for diligent teams that (1) manage versions of data appropriately, (2) have good syncing processes and hygiene, and (3) keep data assets well-organized and documented. This diligence can be a burden and requires consistent processes and standards. The make sync_data_up and make sync_data_down commands provide at least a minimal interface for sharing data amongst team members and nearly all teams have a cloud-hosted object store they can use for these purposes.

Unfortunately, alternatives that automatically sync datasets to a cloud backend are too idiosyncratic, uncommon, or expensive to recommend for all teams. Git LFS can be convenient for projects managed in git repositories, but is often much more expensive than putting data on an object store. Some tools are designed specifically for these problems, but they generally have separate data management commands and require opting in to larger infrastructure. These options include DVC, Pachyderm, and Quilt.

Teams that primarily access hosted data or assets (e.g., through a database on a cloud provider or a data warehouse like Snowflake) have an easier task in that they can avoid any data syncing tools by always using the canonical data. For these teams, we recommend a data.py or db.py module in the project package that provides convenient data loading wrappers (and potentially querying and filtering) for end users who are working on analysis.
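Here is a minimal, hypothetical sketch of such a module; the engine URL, table, and column names are made up:

    """Hypothetical data.py: thin wrappers around the canonical hosted data."""
    import os

    import pandas as pd
    import sqlalchemy

    # Read the connection string from the environment, never from source control.
    engine = sqlalchemy.create_engine(os.environ["DATABASE_URL"])


    def load_orders() -> pd.DataFrame:
        """Return the canonical orders table with consistent parsing applied."""
        return pd.read_sql("SELECT * FROM orders", engine, parse_dates=["order_date"])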

The "right" cloud infrastructure / orchestration tool

Our goal is reproducible environments for individual data scientists on their machines. That said, the CCDS unified interface means it should be easy to make push-button data science environments. We assumed our standard process of spinning up a cloud VM, git clone, make create_environment, make requirements, and jupyter notebook notebooks would become easily automatable, and that in most organizations it would be push-button to get a data science environment with the right code, dependencies, and access to data. However, doing this consistently across cloud vendors is still not simple. Most cloud providers have locked-in environment management, data access, and hosted notebook solutions. It’s our view that most of these solutions do too little and too much at the same time.

The "right" experiment management tool

Another area that has seen a lot of changes since the release of CCDS V1 has been experiment tracking and management. Particularly for deep learning applications, it can be important to run multiple model fitting steps and track the training curves, evaluation metrics, and versioned artifacts.

A number of open source projects and paid tools have experiment tracking functionality, for example MLflow, Sacred, Weights & Biases, and more. From our understanding of the landscape, no one tool is ready yet to recommend as a default. We would want a tool that is open source, does not have vendor lock-in, stores experiment data and artifacts in standard file formats, integrates with existing code with minimal (or preferably no) changes, and has reporting tooling that can be self-hosted and/or point to an object store as the backend.

The "right" codebook tool

We’ve seen many real-world projects accumulate data in the data directory, much of which doesn’t naturally fit into the existing structures. For example, shapefiles and other supplementary input data, human-annotated versions of the raw data, or instructions for accessing datasets through an API.

Ideally, for these scenarios, and for documenting just your everyday tabular datasets, a good codebook tool would exist. It would make it easy to (1) track the datasets that exist, (2) add commentary on the data, and (3) give information about what is in each column of tabular data, not just capture metadata for data files. In a perfect world, the tool would have an attractive web UI and be backed by a simple, source-controllable, declarative format. We’ve experimented with building tools like this internally, but hope for a well-maintained open-source option.

The "right" environment management tool

The proliferation of Python packaging tools that also provide environment management (pyenv, Pipenv, conda), of newer build tools that also manage environments (Hatch, PDM), and of pip wrappers/substitutes that do the same (Rye, uv) shows how dissatisfied the community of Python users is with the existing options. Right now, we support virtualenv, conda, and Pipenv, but we'll likely add some of these new tools as they gain traction.

On this point, we strongly recommend a shell prompt customization tool that will tell you what Python environment you have active (for example, Starship). We also strongly recommend having zero extra packages installed in your base system Python environment. These two simple things will prevent an enormous amount of confusion and frustration.

The "right" project template management tool

Ok, maybe things get a little meta here, but there’s another missing "right" tool—the project template management tool. At the moment, Cookiecutter is not flexible enough for the goals of CCDS. It's too hard to provide branching logic chains, good command line help text, extensibility, and customization of command line output. This is one of the reasons we have our own command line program (even though the backend is still Cookiecutter).


Conclusion

Finally, we want to say a huge thank you to everyone who has used the project, recommended it to friends or colleagues, contributed changes, commented in issues, and overall helped improve the way that data science work gets done. There’s more to do, but with a community like this the journey is worth it.

Happy data sciencing!