So you want to harness the power of machine learning but need a place to start? We've got just the task for you: detecting heart disease! Heart disease is the number one cause of death worldwide, so if you're looking to use data science for good you've come to the right place. To learn how to prevent heart disease we must first learn to reliably detect it. That's where you––yes you––come in!

To join the competition, follow this link.

In our brand new warm up competition we're asking you to predict the presence or absence of heart disease given various data about a patient, including resting blood pressure, maximum heart rate, and EKG readings, as well as other information like age and sex. The data comes from the Statlog Heart dataset via the UCI Machine Learning repository. This is one of the smallest, least complex datasets on DrivenData, and a great place to dive into the world of data science competitions.

In this post, we'll walk through a very simple first pass model for predicting heart disease from patient data, showing you how to load the data, make some predictions, and then submit those predictions to the competition.

To get started, we import libraries for loading, manipulating, and visualizing the data.

In [1]:

%matplotlib inline

from pathlib import Path

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

In [2]:

DATA_DIR = Path("..", "data", "final", "public")

Loading the Data¶

Image and quote from the [Centers for Disease Control and Prevention](https://www.cdc.gov/heartdisease/facts.htm): As plaque builds up in the arteries of a person with heart disease, the inside of the arteries begins to narrow, which lessens or blocks the flow of blood. Plaques can also rupture (break open) and when they do a blood clot can form on the plaque, blocking the flow of blood.

On the data download page, we provide everything you need to get started:

Training Values: These are the features you'll use to train a model. There are 13 features in data, including resting blood pressure, maximum heart rate, and EKG readings, as well as other information like age and sex. Each patient is identified by a unique (random) patient_id, which you can use as an index.
Training Labels: These are the labels. Every patient_id in the training values data has a corresponding label in this file. A 0 indicates no heart disease present, whereas a 1 indicates the presence of heart disease.
Test Values: These are the features you'll use to make predictions after training a model. We don't give you the labels for these samples, it's up to you to generate probabilities of the presence or ansence of heart disease for these patient_ids!
Submission Format: This gives us the filenames and columns of our submission prediction, filled with all 0.5 as a baseline. Your submission to the leaderboard must be in this exact form (with different prediction values, of course) in order to be scored successfully!

Since this is a benchmark, we're only going to use a subset of the features in the dataset. It's up to you to take advantage of all the information!

In [3]:

# for training our model
train_values = pd.read_csv(DATA_DIR / "train_values.csv", index_col="patient_id")
train_labels = pd.read_csv(DATA_DIR / "train_labels.csv", index_col="patient_id")

Let's take a look at the head of our training features

In [4]:

train_values.head()

Out[4]:

	slope_of_peak_exercise_st_segment	thal	resting_blood_pressure	chest_pain_type	num_major_vessels	fasting_blood_sugar_gt_120_mg_per_dl	resting_ekg_results	serum_cholesterol_mg_per_dl	oldpeak_eq_st_depression	sex	age	max_heart_rate_achieved	exercise_induced_angina
patient_id
0z64un	1	normal	128	2	0	0	2	308	0.0	1	45	170	0
ryoo3j	2	normal	110	3	0	0	0	214	1.6	0	54	158	0
yt1s1x	1	normal	125	4	3	0	2	304	0.0	1	77	162	1
l2xjde	1	reversible_defect	152	4	0	0	0	223	0.0	1	40	181	0
oyt4ek	3	reversible_defect	178	1	0	0	2	270	4.2	1	59	145	0

In [5]:

train_values.dtypes

Out[5]:

slope_of_peak_exercise_st_segment         int64
thal                                     object
resting_blood_pressure                    int64
chest_pain_type                           int64
num_major_vessels                         int64
fasting_blood_sugar_gt_120_mg_per_dl      int64
resting_ekg_results                       int64
serum_cholesterol_mg_per_dl               int64
oldpeak_eq_st_depression                float64
sex                                       int64
age                                       int64
max_heart_rate_achieved                   int64
exercise_induced_angina                   int64
dtype: object

And the labels

In [6]:

train_labels.head()

Out[6]:

	heart_disease_present
patient_id
0z64un	0
ryoo3j	0
yt1s1x	1
l2xjde	1
oyt4ek	0

Explore the Data¶

In [7]:

train_labels.heart_disease_present.value_counts().plot.bar(
    title="Number with Heart Disease"
)

Out[7]:

<matplotlib.axes._subplots.AxesSubplot at 0x1143b33c8>

The data is relatively well-balanced, so we won't take any steps here to equalize the classes.

In [8]:

selected_features = ["age", "sex", "max_heart_rate_achieved", "resting_blood_pressure"]
train_values_subset = train_values[selected_features]

A quick look at the relationships between our features and labels

In [9]:

sns.pairplot(
    train_values.join(train_labels), hue="heart_disease_present", vars=selected_features
)

Out[9]:

<seaborn.axisgrid.PairGrid at 0x10f246978>

The Error Metric – LogLoss¶

The metric in this competition is logarithmic loss, or log loss, which uses the probabilities of class predictions and the true class labels to generate a number that is closer to zero for better models, and exactly zero for a perfect model.

You can see from the formula for log loss that highly confident (probability close to one) wrong answers will contribute more to the total log loss number. This property of log loss makes it more informative alternative to accuracy. Below we'll use the Scikit Learn implementation of log loss to evaluate our model before submitting to the leaderboard.

Build the Model¶

When it comes to classic first pass models, few can contend with logisitc regression. This linear model is fast to train, easy to understand, and typically does pretty well "out of the box".

Below we'll combine the Scikit Learn logistic regression model with a preprocessing tool using Pipeline and GridSearchCV––two of sklearn's tools for streamlining the process of model training and hyperparamter optimization. You may be new to machine learning, but there's no better time to start developing good habits, and using pipelines is a good habit that you won't regret learning.

Logisitc Regression¶

In [10]:

# for preprocessing the data
from sklearn.preprocessing import StandardScaler

# the model
from sklearn.linear_model import LogisticRegression

# for combining the preprocess with model training
from sklearn.pipeline import Pipeline

# for optimizing parameters of the pipeline
from sklearn.model_selection import GridSearchCV

In Pipelines you pass a list of "steps" in the form of tuples with a step name in the first field and the object associated to the step in the second. Pipeline then creates one "estimator" object that you can pass data into for training and predcition.

In [11]:

pipe = Pipeline(steps=[("scale", StandardScaler()), ("logistic", LogisticRegression())])
pipe

Out[11]:

Pipeline(memory=None,
     steps=[('scale', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logistic', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

Grid search allows you try out different parameters in the pipeline. Below, we try out different values for Scikit Learn's logistic regression "C" parameter as well as its regularization method.

To specify that these parameters are for the LogisticRegression part of the pipeline and not the StandardScaler part, the keys in our parameter grid (a python dictionary) take the form stepname__parametername. (Note the double underscore!)

The CV in GridSearchCV is another best practice to prevent overfitting your model.

In [12]:

param_grid = {
    "logistic__C": [0.0001, 0.001, 0.01, 1, 10],
    "logistic__penalty": ["l1", "l2"],
}
gs = GridSearchCV(estimator=pipe, param_grid=param_grid, cv=3)

With the parameter grid we've created and cross-validation, we're about to test 30 different models and take the best one!

In [13]:

gs.fit(train_values_subset, train_labels.heart_disease_present)

Out[13]:

GridSearchCV(cv=3, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('scale', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logistic', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'logistic__C': [0.0001, 0.001, 0.01, 1, 10], 'logistic__penalty': ['l1', 'l2']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

Let's take a look at the best parameters.

In [14]:

gs.best_params_

Out[14]:

{'logistic__C': 1, 'logistic__penalty': 'l2'}

And the in-sample log loss score. Notice that since log loss wants the class probilities we call predict_proba and not simply predict, which would return the predicted labels, not their probabilites.

In [15]:

from sklearn.metrics import log_loss

in_sample_preds = gs.predict_proba(train_values[selected_features])
log_loss(train_labels.heart_disease_present, in_sample_preds)

Out[15]:

0.546686009615344

Time to Predict and Submit¶

For the log loss, we'll be using the class probabilities, not the class predictions. This means we call the predict_proba method as above. Further, we'll take every row in the second column, which corresponds to the positive case of heart_disease_present.

Let's load up the data, process it, and see what we get on the leaderboard.

In [16]:

test_values = pd.read_csv(DATA_DIR / "test_values.csv", index_col="patient_id")

Select the subset of features we used to train the model.

In [17]:

test_values_subset = test_values[selected_features]

Make Predictions¶

Again, note that we take only the second column.

In [18]:

predictions = gs.predict_proba(test_values_subset)[:, 1]

Save Submission¶

We can use the column name and index from the submission format to ensure our predictions are in the form.

In [19]:

submission_format = pd.read_csv(
    DATA_DIR / "submission_format.csv", index_col="patient_id"
)

In [20]:

my_submission = pd.DataFrame(
    data=predictions, columns=submission_format.columns, index=submission_format.index
)

In [21]:

my_submission.head()

Out[21]:

	heart_disease_present
patient_id
olalu7	0.604407
z9n6mx	0.053553
5k4413	0.752436
mrg7q5	0.097412
uki4do	0.791249

In [22]:

my_submission.to_csv("submission.csv")

Check the head of the saved file

In [23]:

!head submission.csv

patient_id,heart_disease_present
olalu7,0.6044072540043635
z9n6mx,0.05355289481734789
5k4413,0.7524357630080116
mrg7q5,0.09741203030747234
uki4do,0.7912486789601635
kev1sk,0.08747047862618101
9n6let,0.5409852616595142
jxmtyg,0.6565676676376131
51s2ff,0.301744665671968

Submit to leaderboard¶

Woohoo! It's a start! And that's exactly what we intend with these benchmarks. We're sure you'll be able to top this model in no time, and we can't wait to see what you come up with. Happy importing!

Machine Learning With A Heart - Benchmark

Loading the Data¶

Explore the Data¶

The Error Metric – LogLoss¶

Build the Model¶

Logisitc Regression¶

Time to Predict and Submit¶

Make Predictions¶

Save Submission¶

Submit to leaderboard¶

Tags

Latest posts

Community Spotlight: Paola Ruiz, Néstor González, Daniel Crovo

Community Spotlight: Kirill Brodt

Jump-starting data infrastructure and in-house data expertise

A production application to support survivors of human trafficking

Life beyond the leaderboard

(Tech) Infrastructure Week for the Nonprofit Sector

Meet the winners of Phase 2 of the PREPARE Challenge

AI sauce on everything: Reflections on ASU+GSV 2025

Open-source packages for using speech data in ML

Getting started with LLMs: a benchmark for the 'What's Up, Docs?' challenge

Meet the Winners of the Goodnight Moon, Hello Early Literacy Screening Challenge

Crowdsourcing solutions for AI-assisted early literacy screening

Where to find a data job for a good cause

Meet the Winners of the Youth Mental Health Narratives Challenge

Meet the winners of the Forecast and Final Prize Stages of the Water Supply Forecast Rodeo

10 takeaways from 10 years of data science for social good

Mapping agricultural trends in Yemen during crisis

Making higher education data more accessible

Goodnight Moon, Hello Early Literacy Screening Benchmark

Building a private LLM sandbox for NCSC

Work with us to build a better world

Loading the Data¶

Explore the Data¶

The Error Metric – LogLoss¶

Build the Model¶

Logisitc Regression¶

Time to Predict and Submit¶

Make Predictions¶

Save Submission¶

Submit to leaderboard¶

Tags

Stay updated

Latest posts

Work with us to build a better world