VisioMel Challenge: Predicting Melanoma Relapse¶

Welcome to the VisioMel Challenge: Predicting Melanoma Relapse! In this benchmark blog post, we will show you how to load the data for the challenge, make a prediction, and package your materials for a submission. Along the way, we'll include some tips that might be useful as you improve on this benchmark model and help solve a critical problem in pathology and melanoma treatment. This challenge is the successor to our TissueNet challenge, which also used Whole Slide Images. You may find some of the approaches used by winners of that challenge useful in developing your modeling approaches for this one.

In [ ]:

from joblib import dump
from pathlib import Path
import random

from cloudpathlib import S3Path
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from PIL import Image
import pyvips
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

PIL will by default trigger a DecompressionBombError for images larger than 1GB to protect against DOS attacks. Several of our slide images are larger than 1GB, so we will remove the limit on maximum allowable size.

In [ ]:

Image.MAX_IMAGE_PIXELS = None

Loading the data¶

The data for this challenge are Whole Slide Images (WSIs), which are digitized versions of the slides pathologists examine to make a diagnosis and predict the probability of relapse. All the WSIs in this challenge are formatted as pyramidal tifs, which are images with several layers at different resolutions. For more information about WSIs and pyramidal tifs, see the Problem Description page and the Data Resources page.

Let's get started! First, let's import the metadata file for the training set (train_metadata.csv), which is available on the Data Download page.

Metadata¶

In [ ]:

DATA_DIR = Path(...)  # YOUR PATH HERE

In [ ]:

# import train_metadata
train_metadata = pd.read_csv(DATA_DIR / "train_metadata.csv", index_col="filename")
train_metadata

Out[ ]:

	age	sex	body_site	melanoma_history	breslow	ulceration	tif_cksum	tif_size	us_tif_url	eu_tif_url	as_tif_url
filename
1u4lhlqb.tif	[32:34[	2	thigh	YES	<0.8	NO	3028450373	747151312	s3://drivendata-competition-visiomel-public-us...	s3://drivendata-competition-visiomel-public-eu...	s3://drivendata-competition-visiomel-public-as...
rqumqnfp.tif	[46:48[	1	trunc	NO	[1 : 2[	NO	1294832049	591027450	s3://drivendata-competition-visiomel-public-us...	s3://drivendata-competition-visiomel-public-eu...	s3://drivendata-competition-visiomel-public-as...
bu5xt1xm.tif	[64:66[	2	face	NO	<0.8	NO	774102360	465947458	s3://drivendata-competition-visiomel-public-us...	s3://drivendata-competition-visiomel-public-eu...	s3://drivendata-competition-visiomel-public-as...
dibvu7wk.tif	[62:64[	2	forearm	NaN	[2 : 4[	YES	515827065	568174704	s3://drivendata-competition-visiomel-public-us...	s3://drivendata-competition-visiomel-public-eu...	s3://drivendata-competition-visiomel-public-as...
qsza4coh.tif	[90:92[	2	face	NO	[1 : 2[	NO	1541795099	1042691978	s3://drivendata-competition-visiomel-public-us...	s3://drivendata-competition-visiomel-public-eu...	s3://drivendata-competition-visiomel-public-as...
...	...	...	...	...	...	...	...	...	...	...	...
n7jd638y.tif	[68:70[	1	head/neck	NaN	[1 : 2[	NO	1949851255	136729436	s3://drivendata-competition-visiomel-public-us...	s3://drivendata-competition-visiomel-public-eu...	s3://drivendata-competition-visiomel-public-as...
8wg601oe.tif	[66:68[	1	head/neck	NaN	<0.8	NaN	1344207528	1603484314	s3://drivendata-competition-visiomel-public-us...	s3://drivendata-competition-visiomel-public-eu...	s3://drivendata-competition-visiomel-public-as...
1ejfd01b.tif	[80:82[	1	trunk	NaN	>=4	YES	4292211076	1423564688	s3://drivendata-competition-visiomel-public-us...	s3://drivendata-competition-visiomel-public-eu...	s3://drivendata-competition-visiomel-public-as...
2ztrjp20.tif	[42:44[	2	trunk	NaN	<0.8	NaN	2949914817	336489430	s3://drivendata-competition-visiomel-public-us...	s3://drivendata-competition-visiomel-public-eu...	s3://drivendata-competition-visiomel-public-as...
hpzk3x2s.tif	[56:58[	1	upper limb/shoulder	NaN	<0.8	NO	378439451	364897020	s3://drivendata-competition-visiomel-public-us...	s3://drivendata-competition-visiomel-public-eu...	s3://drivendata-competition-visiomel-public-as...

1342 rows × 11 columns

The train metadata file contains a clinical variables (including age, sex, body_site, melanoma_history, breslow, and ulceration), fields that help verify the integrity of your image data download (tif_cksum, tif_size), and three URLs at which you can find the corresponding image (us_tif_url, eu_tif_url, and as_tif_url).

The test set metadata and images and metadata are only accessible in the runtime container and are mounted at code_execution/data, so test_metadata.csv will not have the tif_cksum, tif_size, us_tif_url, eu_tif_url, and as_tif_url fields. It will also not have breslow or ulceration, but these fields may still be useful in your modeling approach.

Now, let's see what the labels for the training set look like. train_labels.csv is also available on the Data Download page.

In [ ]:

# import train_labels.csv
train_labels = pd.read_csv(DATA_DIR / "train_labels.csv", index_col="filename")
train_labels

Out[ ]:

	relapse
filename
1u4lhlqb.tif	0
rqumqnfp.tif	0
bu5xt1xm.tif	0
dibvu7wk.tif	0
qsza4coh.tif	0
...	...
n7jd638y.tif	0
8wg601oe.tif	0
1ejfd01b.tif	0
2ztrjp20.tif	0
hpzk3x2s.tif	0

1342 rows × 1 columns

The field relapse in train_labels.csv indicates whether the slide contains a melanoma that relapsed within 5 years of initial diagnosis, where 0 indicates no relapse and 1 indicates relapse.

In [ ]:

train_labels.relapse.value_counts(dropna=False)

Out[ ]:

0    1129
1     213
Name: relapse, dtype: int64

Now, let's look at the images.

Images¶

The simplest way to download the images is using the AWS CLI. For more details, see the data_download_instructions.txt file on the Data Download page. For the purposes of this benchmark, we will access the images from the US East S3 bucket using cloudpathlib.

All the WSIs are in pyramidal tif format, which we can use the Python library pyvips to visualize. The pyramidal tifs in this dataset have anywhere from 7-12 layers, which are referred to in the pyvips library as "pages". A lower page number corresponds to a higher resolution. Since the highest resolution layer of the image can be quite large, we will load a few lower-resolution pages from each image.

Extracting as much information as possible from the slides while working with their unwieldy file size is a large part of this challenge. For more tips on how to work with pyramidal tifs, see the Data Resources page. You may also find the winning solutions from our last challenge with WSIs useful. Let's see what the different pages of a single image look like.

In [ ]:

# select a set of images to plot
SEED = 42
NUM_IMAGES = 3

# we'll use the US url
selected_image_paths = train_metadata.sample(
    random_state=SEED, n=NUM_IMAGES
).us_tif_url.values

In [ ]:

def visualize_page(page_num, image_paths):
    fig, axes = plt.subplots(1, len(image_paths), figsize=(10, 10))
    fig.tight_layout()

    for n, image_s3_path in enumerate(image_paths):
        fname = S3Path(image_s3_path).name
        print(f"Downloading {fname}")
        local_file = S3Path(image_s3_path).fspath

        n_frames = Image.open(local_file).n_frames
        img = pyvips.Image.new_from_file(local_file, page=page_num).numpy()

        axes[n].set_title(f"{fname}, page={page_num}\n {img.shape}\n")
        axes[n].imshow(img)

In [ ]:

# let's look at a few high-resolution pages from our selected images
visualize_page(3, selected_image_paths)

Downloading cgsctzbm.tif
Downloading ljhglzot.tif
Downloading 9v0bl8q7.tif

In [ ]:

# let's look at low-resolution pages from the same images
visualize_page(8, selected_image_paths)

Downloading cgsctzbm.tif
Downloading ljhglzot.tif
Downloading 9v0bl8q7.tif

Note the difference in the dimensions (the numbers along the left and bottom edges) of the higher versus lower pages for these images. There is also variation in the shapes, sizes, and colors of these slides. Some images also have artifacts that are unrelated to the task of prediction, such as black marks from the slide holder or blurry regions. The Data Resources page provides some more information on the variations that your solution will have to take into account.

The organizers of the challenge, VisioMel, have also provided annotated versions of a few example images in annotations.csv and visualized_annotations.pdf, which are available on the Data Download page.

Training a model¶

Since the purpose of this blog post is to show how to access the data and prepare and submit your model files, we will be using a very simple random forest model that uses only a few of the clinical variables available to us in both the train and test metadata. A successful model will need to use the images.

Preprocessing¶

To start, let's do some feature engineering! We'll first take a look at the unique values for each variable in the metadata. These variables are defined on the Problem Description page.

In [ ]:

train_metadata.age.unique()

Out[ ]:

array(['[32:34[', '[46:48[', '[64:66[', '[62:64[', '[90:92[', '[74:76[',
       '[68:70[', '[54:56[', '[80:82[', '[72:74[', '[40:42[', '[58:60[',
       '[86:88[', '[60:62[', '[78:80[', '[66:68[', '[38:40[', '[28:30[',
       '[48:50[', '[70:72[', '[42:44[', '[44:46[', '[30:32[', '[52:54[',
       '[56:58[', '[76:78[', '[50:52[', '[82:84[', '[84:86[', '[26:28[',
       '[18:20[', '[34:36[', '[36:38[', '[92:94[', '[88:90[', '[24:26[',
       '[20:22[', '[94:96[', '[22:24[', '[98:100[', '[16:18['],
      dtype=object)

In [ ]:

train_metadata.sex.unique()

Out[ ]:

array([2, 1])

In [ ]:

train_metadata.body_site.unique()

Out[ ]:

array(['thigh', 'trunc', 'face', 'forearm', 'arm', 'leg', 'hand', nan,
       'foot', 'sole', 'finger', 'neck', 'toe', 'seat', 'scalp', 'nail',
       'trunk', 'lower limb/hip', 'hand/foot/nail', 'head/neck',
       'upper limb/shoulder'], dtype=object)

In [ ]:

train_metadata.melanoma_history.unique()

Out[ ]:

array(['YES', 'NO', nan], dtype=object)

So far, we've learned:

age is available to us in two-year intervals
melanoma_history is sometimes nan
body_site categories are not all at the same level of aggregation - for example, some slides are labeled as being from "hand/foot/nail" while some slides are labeled as from the hand, foot, and nail individually.

Let's do the following:

convert age to integer values
convert melanoma_history to dummy variables

For this first pass, we will not use body_site.

In [ ]:

# define an encoder for melanoma_history

enc = OneHotEncoder(drop="first", sparse_output=False)
enc.fit(np.array(train_metadata["melanoma_history"]).reshape(-1, 1))
enc.get_feature_names_out()

Out[ ]:

array(['x0_YES', 'x0_nan'], dtype=object)

In [ ]:

def preprocess_feats(df=train_metadata):
    feats = df.copy()
    # take the first age in the range and convert to integer
    feats["age_int"] = feats.age.str.slice(1, 3).astype(int)
    X = pd.concat(
        [
            feats[["age_int", "sex"]],
            pd.DataFrame(
                enc.transform(np.array(feats["melanoma_history"]).reshape(-1, 1)),
                columns=enc.get_feature_names_out(),
                index=feats.index,
            ),
        ],
        axis=1,
    )

    return X

In [ ]:

# preprocess features

X = preprocess_feats(train_metadata)
X

Out[ ]:

	age_int	sex	x0_YES	x0_nan
filename
1u4lhlqb.tif	32	2	1.0	0.0
rqumqnfp.tif	46	1	0.0	0.0
bu5xt1xm.tif	64	2	0.0	0.0
dibvu7wk.tif	62	2	0.0	1.0
qsza4coh.tif	90	2	0.0	0.0
...	...	...	...	...
n7jd638y.tif	68	1	0.0	1.0
8wg601oe.tif	66	1	0.0	1.0
1ejfd01b.tif	80	1	0.0	1.0
2ztrjp20.tif	42	2	0.0	1.0
hpzk3x2s.tif	56	1	0.0	1.0

1342 rows × 4 columns

In [ ]:

# take labels from train_labels.csv

y = train_labels.relapse

Model training and calibration¶

For a first pass at training, we will have a simple stratified 80-20 split between train and test. The stratified split helps ensure that the proportion of relapse cases are similar between our train and test set and that we are performing our model training on a representative subset of the data.

In [ ]:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED, stratify=y
)

In [ ]:

X_train

Out[ ]:

	age_int	sex	x0_YES	x0_nan
filename
f9oktt5o.tif	54	2	0.0	0.0
rdvuvi1x.tif	66	2	0.0	0.0
o37l6qb1.tif	58	1	0.0	1.0
2bg6adrf.tif	80	2	0.0	1.0
031viy0y.tif	28	2	0.0	0.0
...	...	...	...	...
mhcm6bc6.tif	44	2	0.0	1.0
t013dhcf.tif	58	1	0.0	0.0
t1bsrqwq.tif	74	1	0.0	1.0
jc3bw7c6.tif	32	2	0.0	1.0
m5xdtz4p.tif	58	2	0.0	1.0

1073 rows × 4 columns

Let's check the distribution of relapse cases between the two datasets to make sure that the stratification worked.

In [ ]:

y_train.value_counts(normalize=True)

Out[ ]:

0    0.841566
1    0.158434
Name: relapse, dtype: float64

In [ ]:

y_test.value_counts(normalize=True)

Out[ ]:

0    0.840149
1    0.159851
Name: relapse, dtype: float64

This looks like a good split, so let's go ahead and train our model! For this benchmark, we will be using a simple random forest classifier using scikit-learn.

We will also calibrate our output from the model to reflect probabilities. Calibration is important because certain models, including the random forest classifier we have chosen, output values that cannot directly be interpreted as probabilities. Calibrating is a way of making the outputs of the classifier map more closely to the actual probability distribution of the dataset. This is particularly important for the competition metric, log loss, which heavily penalizes confident but wrong predictions.

We will use another scikit-learn built-in method, CalibratedClassifierCV, to perform the calibration.

There are two methods for calibration to choose from in the CalibratedClassifierCV method, "sigmoid" and "isotonic." The sigmoid option uses Platt scaling, which uses a logistic regression model, and the isotonic option uses isotonic regression, which uses a weighted least-squares regression model. Since Platt scaling is simpler and the documentation suggests avoiding isotonic regresion for too few calibration samples, let's use the "sigmoid" option.

In [ ]:

# fit a calibrated random forest model

rf = RandomForestClassifier(random_state=SEED)
calibrated_rf = CalibratedClassifierCV(rf, method="sigmoid", cv=5)
calibrated_rf.fit(X_train, y_train)

Out[ ]:

CalibratedClassifierCV(cv=5, estimator=RandomForestClassifier(random_state=42))

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Now, let's see how well our model performs on our validation set!

Predicting on validation set¶

The metric for this competition is log loss, also known as cross-entropy loss. We are specifying eps=1e-16 here to match the scoring on the competition platform.

In [ ]:

def score(y_true, y_pred):
    return log_loss(y_true, y_pred, eps=1e-16)

In [ ]:

preds = calibrated_rf.predict_proba(X_test)[:, 1]
score(y_test, preds)

Out[ ]:

0.4378480226166271

Just to compare, let's see how well our model would have performed without calibration.

In [ ]:

# fit a random forest model and calculate score

rf = RandomForestClassifier(random_state=SEED)
rf.fit(X_train, y_train)
preds = rf.predict_proba(X_test)[:, 1]
score(y_test, preds)

Out[ ]:

1.4784049553641676

This is quite a difference! Keeping this in mind, let's train on our entire dataset.

Retraining on full training set¶

In [ ]:

# train on entire dataset

rf = RandomForestClassifier(random_state=SEED)
calibrated_rf = CalibratedClassifierCV(rf, method="sigmoid", cv=5)
calibrated_rf.fit(X, y)

Out[ ]:

CalibratedClassifierCV(cv=5, estimator=RandomForestClassifier(random_state=42))

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Now that we've trained our model, let's prepare our submission!

Code submission format¶

Since this is a code execution competition, will will submit our model weights and code rather than predictions. For more details, see the Code Submission Format page. Our first step will be to save out our encoder and the model weights for the calibrated random forest model we have just trained on the entire dataset. We will save these as .joblib files, though there are other formats you can use.

In [ ]:

Path("submission/assets").mkdir(exist_ok=True, parents=True)

# save out model to assets directory in submission folder
dump(calibrated_rf, "submission/assets/random_forest_model.joblib")

# save out encoder to assets directory as well
dump(enc, "submission/assets/history_encoder.joblib")

Out[ ]:

['submission/assets/history_encoder.joblib']

Your submission should include a main.py file that makes predictions for an unseen test set and outputs the predictions into a file called submission.csv that exactly matches the submission_format.csv available on the Data Download page. Your prediction function must output a single likelihood between 0.0 and 1.0 for each test set record.

Let's write out our main.py using the preprocessing and modeling assets we saved.

In [ ]:

%%writefile submission/main.py

"""Solution for VisioMel Challenge"""

from joblib import load
from pathlib import Path

from loguru import logger
import numpy as np
import pandas as pd


DATA_ROOT = Path("/code_execution/data/")


def preprocess_feats(df, enc):
    feats = df.copy()
    feats["age_int"] = feats.age.str.slice(1, 3).astype(int)
    X = pd.concat(
        [
            feats[["age_int", "sex"]],
            pd.DataFrame(
                enc.transform(np.array(feats["melanoma_history"]).reshape(-1, 1)),
                columns=enc.get_feature_names_out(),
                index=feats.index,
            ),
        ],
        axis=1,
    )
    return X


def main():
    # load sumission format
    submission_format = pd.read_csv(DATA_ROOT / "submission_format.csv", index_col=0)

    # load test_metadata
    test_metadata = pd.read_csv(DATA_ROOT / "test_metadata.csv", index_col=0)

    logger.info("Loading feature encoder and model")
    calibrated_rf = load("assets/random_forest_model.joblib")
    history_encoder = load("assets/history_encoder.joblib")

    logger.info("Preprocessing features")
    processed_features = preprocess_feats(test_metadata, history_encoder)

    logger.info("Checking test feature filenames are in the same order as the submission format")
    assert (processed_features.index == submission_format.index).all()

    logger.info("Checking test feature columns align with loaded model")
    assert (processed_features.columns == calibrated_rf.feature_names_in_).all()

    logger.info("Generating predictions")
    submission_format["relapse"] = calibrated_rf.predict_proba(processed_features)[:,1]

    # save as "submission.csv" in the root folder, where it is expected
    logger.info("Writing out submission.csv")
    submission_format.to_csv("submission.csv")


if __name__ == "__main__":
    main()

Writing submission/main.py

Now, it's time to zip up our main.py and all of our assets!

In [ ]:

# zip up submission
! cd submission; zip -r ../submission.zip ./*

  adding: assets/ (stored 0%)
  adding: assets/history_encoder.joblib (deflated 52%)
  adding: assets/random_forest_model.joblib (deflated 82%)
  adding: main.py (deflated 60%)

Before you finalize your code for submission, make sure you:

Include a main.py in the root directory of your submission zip. There can be extra files with more code that is called.
Include any model weights in your submission zip as there will be no network access.
Write out a submissions.csv to the root directory when inference is finished, matching the submission format exactly.
Log general information that will help you debug your submission.
Test your submission locally and using the smoke test functionality on the platform.
Consider ways to optimize your pipeline so that it runs end-to-end in under 8 hours.

And make sure you do not:

Read from locations other than your solution directory.
Use information from other images in the test set in making a prediction for a given tif file.
Print or log any information about the test metadata or test images, including specific data values and aggregations such as sums, means, or counts.

Making a submission¶

To submit your .zip file, head over to the Submissions page. You will see two options for uploading your file: "Normal submission" or "smoke test". A "smoke test" is a short test against a small number of images that allows you to make sure the format for your submission is correct and that it generates predictions within a time period that is reasonable. A "normal" submission contributes toward your score on the leaderboard, and must run in under 8 hours. Please refer to the guidance in the runtime repo for this challenge to see more details.

One more point we want to check before we submit our code is that all of the libraries it requires are in the runtime. You can see more guidance about making sure that your scripts can run in the runtime environment here.

Testing a submission locally¶

It is a good idea to test our submission locally before we try to submit to the platform. To test your submission locally, follow the steps from the runtime repo:

First, make sure you have set up the prerequisites.
Download the code execution development dataset and extract it to data.
Download the official competition Docker image using make pull
Save all of your submission files, including the required main.py script, in the submission_src folder of the runtime repository. Make sure any needed model weights and other assets are saved in submission_src as well.

Create a submission/submission.zip file containing your code and model assets:

$ make pack-submission 
mkdir -p submission/
cd submission_src; zip -r ../submission/submission.zip ./*
adding: solution.py (deflated 73%)

Launch an instance of the competition Docker image, and run the same inference process that will take place in the official runtime using make test-submission
Run scripts/score.py on your predictions:
```
python scripts/score.py
```

For this notebook, we will go directly to a smoke test first. You're allowed 3 smoke tests per day.

Making a smoke test submission¶

When you upload a submission, you will be able to see the progress of your submission under "Status of Code Execution Jobs." A successful submission's status will change from Pending, to Starting, to Running, to Completed. Once your submission is Running, you will be able to see your logs. If you suspect your submission will run longer than 8 hours or will not succeed, you can click "Cancel current code job" and it will not count against your submission limit.

Making a leaderboard submission¶

Once the smoke test is Completed successfully, we can move onto making our real submission! Instead of selecting "Smoke test", select "Normal."

When we navigate to the Submissions tab, we can see our score:

Not a bad start, but there's plenty of room for improvement! We are excited to see what you come up with. Head over to the community forum to chat with other participants. Good luck!

VisioMel Challenge: Predicting Melanoma Relapse - Benchmark

VisioMel Challenge: Predicting Melanoma Relapse¶

Loading the data¶

Metadata¶

Images¶

Training a model¶

Preprocessing¶

Model training and calibration¶

Predicting on validation set¶

Retraining on full training set¶

Code submission format¶

Making a submission¶

Testing a submission locally¶

Making a smoke test submission¶

Making a leaderboard submission¶

Tags

Latest posts

Meet the winners of the AI for Advancing Instruction Challenge

Automating wildlife monitoring with Zamba & Zamba Cloud

Community Spotlight: Paola Ruiz, Néstor González, Daniel Crovo

Community Spotlight: Kirill Brodt

Jump-starting data infrastructure and in-house data expertise

A production application to support survivors of human trafficking

Life beyond the leaderboard

(Tech) Infrastructure Week for the Nonprofit Sector

Meet the winners of Phase 2 of the PREPARE Challenge

AI sauce on everything: Reflections on ASU+GSV 2025

Open-source packages for using speech data in ML

Getting started with LLMs: a benchmark for the 'What's Up, Docs?' challenge

Meet the Winners of the Goodnight Moon, Hello Early Literacy Screening Challenge

Crowdsourcing solutions for AI-assisted early literacy screening

Where to find a data job for a good cause

Meet the Winners of the Youth Mental Health Narratives Challenge

Meet the winners of the Forecast and Final Prize Stages of the Water Supply Forecast Rodeo

10 takeaways from 10 years of data science for social good

Mapping agricultural trends in Yemen during crisis

Making higher education data more accessible

Work with us to build a better world

VisioMel Challenge: Predicting Melanoma Relapse¶

Loading the data¶

Metadata¶

Images¶

Training a model¶

Preprocessing¶

Model training and calibration¶

Predicting on validation set¶

Retraining on full training set¶

Code submission format¶

Making a submission¶

Testing a submission locally¶

Making a smoke test submission¶

Making a leaderboard submission¶

Tags

Stay updated

Latest posts

Work with us to build a better world