Concept to Clinic tools highlight: Travis CI

Continuous integration is a best practice for software development. We think it's a best practice for data science, too.

Isaac Slavitt
Co-founder

The software development community has coalesced around Continuous Integration and Continuous Delivery/Deployment as best practices in most development contexts. Many newer developers, particularly at software-focused startups, have experienced this as simply the way business is done. It was not always thus.

The Dark Age of software development

In Ye Olden Days, in any sufficiently large organization, shipping a new version of the organization's software — even a change to one single line of code — was viewed as A Big Deal.

Teams would often set a cutoff date by which all of the pieces of an application were expected to interact correctly, at which point the new version of the software would be built and released. Often, in the lead-up to the cutoff date, different developers would be working on different pieces of the application without frequently merging their changes. Sometimes they would be using some form of version control. Rarely would there be a suite of automated tests to verify that the software was working properly.

There were people whose main job was to integrate all these pieces, build all of the software, and work out any problems that led to "bad builds."

Sound crazy?

In fairness, "releasing" in 2000 might have meant physically pressing the new, compiled release onto a CD and then mailing it out to users eligible to receive updates. Compare that to pushing out a new version of a centrally hosted Software as a Service (SaaS) web application.

Even so, this all worked out about as well as you would expect.

Run ALL the tests!

As testing and distributed version control became more established cultural norms in the development community, there was a concurrent push toward keeping the mainline branch "in the green" — that is, always passing the test suite. As early as 2000, these mores were sufficiently ingrained to be included in The Joel Test, a heuristic that lets outsiders get a quick feel for how functional a software team is.

Human error is among the most common causes of failure across all industries. Having tests that can run and check the software is great, but relying on humans to remember to run them every time is setting a team up for failure.

What most teams want is to automate this process using a system that runs all of the tests on all of the branches every time something changes. This way, everybody has a feel for the status of feature branches currently in development, developers do their best to integrate changes into the mainline as quickly as possible, and the master version of the software is always in a ready-to-ship posture.

That kind of system is known as Continuous Integration, or CI for short. It has been something of a game changer for quality assurance and change management.

Deploy early, deploy often

Here's a sketch of the typical change cycle if the team uses a git feature branch workflow (the corresponding git commands are sketched after the list):

  • A developer decides to make a change or implement a new feature.
  • She checks out a branch from the HEAD commit on the repository's master branch.
  • She makes her changes. (This is the hard part.)
  • If the master branch has changed in the meantime, she rebases onto the current HEAD to reconcile any discrepancies that may have arisen since she started working.
  • She then opens a pull request which proposes to merge her changes onto the master branch.
  • The CI system runs all of the tests, including any new tests the developer added to verify her contributions work as intended. If all the tests pass, the build is marked as passing. If any test fails, the build fails loudly.
  • In many organizations, if the build passes then another developer does a code review on this pull request, and may ask questions or suggest changes.
  • If the build is passing and the code review is satisfactory, the pull request is accepted and the branch version is now the master version.

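In git terms, that cycle looks roughly like this (the branch name and remote are illustrative):

# start from the current tip of master
git checkout master
git pull origin master
git checkout -b my-feature

# ... edit, commit, repeat (this is the hard part) ...

# reconcile with anything that landed on master in the meantime
git fetch origin
git rebase origin/master

# push the branch and open a pull request against master
git push origin my-feature
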
Once the team has a master branch that is trusted to be ready-to-ship at all times, the natural next step is to actually go ahead and ship it every time it changes! The same CI system that ran all of the tests can let us deploy that software using whatever automated process is already in place for deployment. (Deployment is automated, right???)

This practice is known as Continuous Deployment (or Continuous Delivery) or "CD" for short. And because most systems do both jobs, they're often referred to as CI/CD or just one of the two.
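
In Travis terms, deployment can hang off the same configuration file. Here's a minimal sketch of a deploy section (the deploy.sh script and the branch filter are placeholders, not this project's actual setup):

deploy:
  provider: script        # run a custom deployment script
  script: bash deploy.sh  # hypothetical script that ships the new version
  on:
    branch: master        # deploy only when master changes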

A brief aside: data science is software, people!

At DrivenData, we have been writing and speaking on the topic of reproducible data science for quite a while. We strongly believe that data folks should be adopting the hard-learned lessons of the software industry — and that we ignore them at our peril.

If you are interested in this, check out our Cookiecutter Data Science project, which deals with project organization, and our PyData 2016 talk, Data science is software.

How we used Travis CI with Docker Compose in the Concept to Clinic project

Every time a contributor opened or modified a pull request, we wanted the tests to run as described above. Additionally, in the interest of consistency, we wanted contributors to adopt a PEP8-friendly code style, so we made passing flake8 and pycodestyle checks a mandatory part of every build.

That's a pretty normal use case, but we had a couple of interesting challenges to consider for the CI/CD pipeline.

For one, our project is set up as several services running in Docker containers which are configured and linked together by Docker Compose. Any tool we chose had to be flexible enough to support Docker Compose.
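
As a rough sketch, a Compose file wires such services together like this (the service names match the checks shown below, but the build paths and settings here are our own illustration, not the project's actual local.yml):

version: '2'
services:
  interface:              # the web app users interact with
    build: ./interface
    depends_on:
      - prediction        # the interface calls out to the prediction API
  prediction:             # the machine learning service
    build: ./prediction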

Also, like many data-intensive projects, we had a large amount of data that was impractical to track using git, so we were using GitHub's Large File Storage (LFS) service, which provides a familiar pull/push workflow for the project's data. On top of moving data back and forth, some of the tests were computationally intensive, so the tool needed reasonable network speed and bandwidth limits as well as a bit of compute power, all without timing out or exceeding usage limits.
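
With Git LFS, large files are tracked by pattern and then moved around with familiar git-style verbs (the file pattern here is illustrative):

git lfs install          # set up the LFS filters once per machine
git lfs track "*.mhd"    # track large image files by pattern
git add .gitattributes   # the tracked patterns live in .gitattributes
git lfs pull             # fetch the actual large-file contents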

For our purposes, we chose Travis CI. Travis is a user-friendly and fully featured CI/CD platform that lets you define the entire CI/CD process in a single configuration file. Additionally, they provide their service for free to open source projects which we think is a great way to show appreciation for the OSS community!

Hurrah, a passing build!

Travis let us set this up quite easily: we install Docker and Docker Compose in the before_install section of our .travis.yml and specify settings for Git LFS to avoid pulling all of the data every time.

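Here's a minimal sketch of what that part of a .travis.yml can look like (the Compose version pin and the LFS setting are illustrative, not our exact configuration):

services:
  - docker

before_install:
  # fetch LFS objects on demand instead of pulling everything on clone
  - git config --global filter.lfs.smudge "git lfs smudge --skip -- %f"
  # install a newer Docker Compose than the preinstalled one
  - curl -L "https://github.com/docker/compose/releases/download/1.21.2/docker-compose-$(uname -s)-$(uname -m)" -o docker-compose
  - chmod +x docker-compose && sudo mv docker-compose /usr/local/bin/
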
Here's the script section of the configuration file, which shows all the steps that must return a successful exit code in order for the build to pass:

script:
- flake8 interface
- pycodestyle interface
- flake8 prediction
- pycodestyle prediction
- sh tests/test_docker.sh

The last line invokes the bash script that tells Docker Compose to run the tests; that script has lines like this:

docker-compose -f local.yml run prediction \
  coverage run --branch \
  --omit=/**/dist-packages/*,src/tests/*,/usr/local/bin/pytest \
  /usr/local/bin/pytest -rsx

This line runs the tests for the prediction service and also generates a code coverage report to show which lines of code the tests hit.

After every successful build, we also wanted to trigger a FOSSA license check and then notify the project's Gitter and our internal Slack about build status. Travis made it easy to add those webhooks, and to store the Slack notification webhook (along with other secret environment variables) as encrypted values, so that .travis.yml could remain public without disclosing anything sensitive.
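
For instance, the Slack notification goes into .travis.yml in encrypted form, roughly like this (the encrypted string and the Gitter URL are placeholders):

# encrypted with: travis encrypt "account:token" --add notifications.slack.rooms
notifications:
  slack:
    rooms:
      - secure: "AbC123...="              # placeholder for the encrypted value
  webhooks:
    - https://webhooks.gitter.im/e/...    # Gitter webhook with the token elided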

Continuous x ∀ x ∈ {integration,deployment}: an all-around good idea

We think CI/CD is a good development practice, and that Travis CI is a great tool for the job. We're grateful to the Travis team for supporting this competition, and for their extremely quick and helpful support whenever we had questions.

Check them out!
