VisioMel Challenge: Predicting Melanoma Relapse - Benchmark

by Isha Shah

VisioMel Challenge: Predicting Melanoma Relapse

Welcome to the VisioMel Challenge: Predicting Melanoma Relapse! In this benchmark blog post, we will show you how to load the data for the challenge, make a prediction, and package your materials for a submission. Along the way, we'll include some tips that might be useful as you improve on this benchmark model and help solve a critical problem in pathology and melanoma treatment. This challenge is the successor to our TissueNet challenge, which also used Whole Slide Images. You may find some of the approaches used by winners of that challenge useful in developing your modeling approaches for this one.

In [1]:
from joblib import dump
from pathlib import Path
import random

from cloudpathlib import S3Path
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from PIL import Image
import pyvips
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

PIL will by default trigger a DecompressionBombError for images larger than 1GB to protect against DOS attacks. Several of our slide images are larger than 1GB, so we will remove the limit on maximum allowable size.

In [2]:

Loading the data

The data for this challenge are Whole Slide Images (WSIs), which are digitized versions of the slides pathologists examine to make a diagnosis and predict the probability of relapse. All the WSIs in this challenge are formatted as pyramidal tifs, which are images with several layers at different resolutions. For more information about WSIs and pyramidal tifs, see the Problem Description page and the Data Resources page.

Let's get started! First, let's import the metadata file for the training set (train_metadata.csv), which is available on the Data Download page.


In [3]:
In [4]:
# import train_metadata
train_metadata = pd.read_csv(DATA_DIR / "train_metadata.csv", index_col="filename")
age sex body_site melanoma_history breslow ulceration tif_cksum tif_size us_tif_url eu_tif_url as_tif_url
1u4lhlqb.tif [32:34[ 2 thigh YES <0.8 NO 3028450373 747151312 s3://drivendata-competition-visiomel-public-us... s3://drivendata-competition-visiomel-public-eu... s3://drivendata-competition-visiomel-public-as...
rqumqnfp.tif [46:48[ 1 trunc NO [1 : 2[ NO 1294832049 591027450 s3://drivendata-competition-visiomel-public-us... s3://drivendata-competition-visiomel-public-eu... s3://drivendata-competition-visiomel-public-as...
bu5xt1xm.tif [64:66[ 2 face NO <0.8 NO 774102360 465947458 s3://drivendata-competition-visiomel-public-us... s3://drivendata-competition-visiomel-public-eu... s3://drivendata-competition-visiomel-public-as...
dibvu7wk.tif [62:64[ 2 forearm NaN [2 : 4[ YES 515827065 568174704 s3://drivendata-competition-visiomel-public-us... s3://drivendata-competition-visiomel-public-eu... s3://drivendata-competition-visiomel-public-as...
qsza4coh.tif [90:92[ 2 face NO [1 : 2[ NO 1541795099 1042691978 s3://drivendata-competition-visiomel-public-us... s3://drivendata-competition-visiomel-public-eu... s3://drivendata-competition-visiomel-public-as...
... ... ... ... ... ... ... ... ... ... ... ...
n7jd638y.tif [68:70[ 1 head/neck NaN [1 : 2[ NO 1949851255 136729436 s3://drivendata-competition-visiomel-public-us... s3://drivendata-competition-visiomel-public-eu... s3://drivendata-competition-visiomel-public-as...
8wg601oe.tif [66:68[ 1 head/neck NaN <0.8 NaN 1344207528 1603484314 s3://drivendata-competition-visiomel-public-us... s3://drivendata-competition-visiomel-public-eu... s3://drivendata-competition-visiomel-public-as...
1ejfd01b.tif [80:82[ 1 trunk NaN >=4 YES 4292211076 1423564688 s3://drivendata-competition-visiomel-public-us... s3://drivendata-competition-visiomel-public-eu... s3://drivendata-competition-visiomel-public-as...
2ztrjp20.tif [42:44[ 2 trunk NaN <0.8 NaN 2949914817 336489430 s3://drivendata-competition-visiomel-public-us... s3://drivendata-competition-visiomel-public-eu... s3://drivendata-competition-visiomel-public-as...
hpzk3x2s.tif [56:58[ 1 upper limb/shoulder NaN <0.8 NO 378439451 364897020 s3://drivendata-competition-visiomel-public-us... s3://drivendata-competition-visiomel-public-eu... s3://drivendata-competition-visiomel-public-as...

1342 rows × 11 columns

The train metadata file contains a clinical variables (including age, sex, body_site, melanoma_history, breslow, and ulceration), fields that help verify the integrity of your image data download (tif_cksum, tif_size), and three URLs at which you can find the corresponding image (us_tif_url, eu_tif_url, and as_tif_url).

The test set metadata and images and metadata are only accessible in the runtime container and are mounted at code_execution/data, so test_metadata.csv will not have the tif_cksum, tif_size, us_tif_url, eu_tif_url, and as_tif_url fields. It will also not have breslow or ulceration, but these fields may still be useful in your modeling approach.

Now, let's see what the labels for the training set look like. train_labels.csv is also available on the Data Download page.

In [5]:
# import train_labels.csv
train_labels = pd.read_csv(DATA_DIR / "train_labels.csv", index_col="filename")
1u4lhlqb.tif 0
rqumqnfp.tif 0
bu5xt1xm.tif 0
dibvu7wk.tif 0
qsza4coh.tif 0
... ...
n7jd638y.tif 0
8wg601oe.tif 0
1ejfd01b.tif 0
2ztrjp20.tif 0
hpzk3x2s.tif 0

1342 rows × 1 columns

The field relapse in train_labels.csv indicates whether the slide contains a melanoma that relapsed within 5 years of initial diagnosis, where 0 indicates no relapse and 1 indicates relapse.

In [6]:
0    1129
1     213
Name: relapse, dtype: int64

Now, let's look at the images.


The simplest way to download the images is using the AWS CLI. For more details, see the data_download_instructions.txt file on the Data Download page. For the purposes of this benchmark, we will access the images from the US East S3 bucket using cloudpathlib.

All the WSIs are in pyramidal tif format, which we can use the Python library pyvips to visualize. The pyramidal tifs in this dataset have anywhere from 7-12 layers, which are referred to in the pyvips library as "pages". A lower page number corresponds to a higher resolution. Since the highest resolution layer of the image can be quite large, we will load a few lower-resolution pages from each image.

Extracting as much information as possible from the slides while working with their unwieldy file size is a large part of this challenge. For more tips on how to work with pyramidal tifs, see the Data Resources page. You may also find the winning solutions from our last challenge with WSIs useful. Let's see what the different pages of a single image look like.

In [7]:
# select a set of images to plot
SEED = 42

# we'll use the US url
selected_image_paths = train_metadata.sample(
    random_state=SEED, n=NUM_IMAGES
In [8]:
def visualize_page(page_num, image_paths):
    fig, axes = plt.subplots(1, len(image_paths), figsize=(10, 10))

    for n, image_s3_path in enumerate(image_paths):
        fname = S3Path(image_s3_path).name
        print(f"Downloading {fname}")
        local_file = S3Path(image_s3_path).fspath

        n_frames =
        img = pyvips.Image.new_from_file(local_file, page=page_num).numpy()

        axes[n].set_title(f"{fname}, page={page_num}\n {img.shape}\n")
In [9]:
# let's look at a few high-resolution pages from our selected images
visualize_page(3, selected_image_paths)
Downloading cgsctzbm.tif
Downloading ljhglzot.tif
Downloading 9v0bl8q7.tif
In [10]:
# let's look at low-resolution pages from the same images
visualize_page(8, selected_image_paths)
Downloading cgsctzbm.tif
Downloading ljhglzot.tif
Downloading 9v0bl8q7.tif

Note the difference in the dimensions (the numbers along the left and bottom edges) of the higher versus lower pages for these images. There is also variation in the shapes, sizes, and colors of these slides. Some images also have artifacts that are unrelated to the task of prediction, such as black marks from the slide holder or blurry regions. The Data Resources page provides some more information on the variations that your solution will have to take into account.

The organizers of the challenge, VisioMel, have also provided annotated versions of a few example images in annotations.csv and visualized_annotations.pdf, which are available on the Data Download page.

Training a model

Since the purpose of this blog post is to show how to access the data and prepare and submit your model files, we will be using a very simple random forest model that uses only a few of the clinical variables available to us in both the train and test metadata. A successful model will need to use the images.


To start, let's do some feature engineering! We'll first take a look at the unique values for each variable in the metadata. These variables are defined on the Problem Description page.

In [11]:
array(['[32:34[', '[46:48[', '[64:66[', '[62:64[', '[90:92[', '[74:76[',
       '[68:70[', '[54:56[', '[80:82[', '[72:74[', '[40:42[', '[58:60[',
       '[86:88[', '[60:62[', '[78:80[', '[66:68[', '[38:40[', '[28:30[',
       '[48:50[', '[70:72[', '[42:44[', '[44:46[', '[30:32[', '[52:54[',
       '[56:58[', '[76:78[', '[50:52[', '[82:84[', '[84:86[', '[26:28[',
       '[18:20[', '[34:36[', '[36:38[', '[92:94[', '[88:90[', '[24:26[',
       '[20:22[', '[94:96[', '[22:24[', '[98:100[', '[16:18['],
In [12]:
array([2, 1])
In [13]:
array(['thigh', 'trunc', 'face', 'forearm', 'arm', 'leg', 'hand', nan,
       'foot', 'sole', 'finger', 'neck', 'toe', 'seat', 'scalp', 'nail',
       'trunk', 'lower limb/hip', 'hand/foot/nail', 'head/neck',
       'upper limb/shoulder'], dtype=object)
In [14]:
array(['YES', 'NO', nan], dtype=object)

So far, we've learned:

  • age is available to us in two-year intervals
  • melanoma_history is sometimes nan
  • body_site categories are not all at the same level of aggregation - for example, some slides are labeled as being from "hand/foot/nail" while some slides are labeled as from the hand, foot, and nail individually.

Let's do the following:

  • convert age to integer values
  • convert melanoma_history to dummy variables

For this first pass, we will not use body_site.

In [15]:
# define an encoder for melanoma_history

enc = OneHotEncoder(drop="first", sparse_output=False)["melanoma_history"]).reshape(-1, 1))
array(['x0_YES', 'x0_nan'], dtype=object)
In [16]:
def preprocess_feats(df=train_metadata):
    feats = df.copy()
    # take the first age in the range and convert to integer
    feats["age_int"] = feats.age.str.slice(1, 3).astype(int)
    X = pd.concat(
            feats[["age_int", "sex"]],
                enc.transform(np.array(feats["melanoma_history"]).reshape(-1, 1)),

    return X
In [17]:
# preprocess features

X = preprocess_feats(train_metadata)
age_int sex x0_YES x0_nan
1u4lhlqb.tif 32 2 1.0 0.0
rqumqnfp.tif 46 1 0.0 0.0
bu5xt1xm.tif 64 2 0.0 0.0
dibvu7wk.tif 62 2 0.0 1.0
qsza4coh.tif 90 2 0.0 0.0
... ... ... ... ...
n7jd638y.tif 68 1 0.0 1.0
8wg601oe.tif 66 1 0.0 1.0
1ejfd01b.tif 80 1 0.0 1.0
2ztrjp20.tif 42 2 0.0 1.0
hpzk3x2s.tif 56 1 0.0 1.0

1342 rows × 4 columns

In [18]:
# take labels from train_labels.csv

y = train_labels.relapse

Model training and calibration

For a first pass at training, we will have a simple stratified 80-20 split between train and test. The stratified split helps ensure that the proportion of relapse cases are similar between our train and test set and that we are performing our model training on a representative subset of the data.

In [19]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED, stratify=y
In [20]:
age_int sex x0_YES x0_nan
f9oktt5o.tif 54 2 0.0 0.0
rdvuvi1x.tif 66 2 0.0 0.0
o37l6qb1.tif 58 1 0.0 1.0
2bg6adrf.tif 80 2 0.0 1.0
031viy0y.tif 28 2 0.0 0.0
... ... ... ... ...
mhcm6bc6.tif 44 2 0.0 1.0
t013dhcf.tif 58 1 0.0 0.0
t1bsrqwq.tif 74 1 0.0 1.0
jc3bw7c6.tif 32 2 0.0 1.0
m5xdtz4p.tif 58 2 0.0 1.0

1073 rows × 4 columns

Let's check the distribution of relapse cases between the two datasets to make sure that the stratification worked.

In [21]:
0    0.841566
1    0.158434
Name: relapse, dtype: float64
In [22]:
0    0.840149
1    0.159851
Name: relapse, dtype: float64

This looks like a good split, so let's go ahead and train our model! For this benchmark, we will be using a simple random forest classifier using scikit-learn.

We will also calibrate our output from the model to reflect probabilities. Calibration is important because certain models, including the random forest classifier we have chosen, output values that cannot directly be interpreted as probabilities. Calibrating is a way of making the outputs of the classifier map more closely to the actual probability distribution of the dataset. This is particularly important for the competition metric, log loss, which heavily penalizes confident but wrong predictions.

We will use another scikit-learn built-in method, CalibratedClassifierCV, to perform the calibration.

There are two methods for calibration to choose from in the CalibratedClassifierCV method, "sigmoid" and "isotonic." The sigmoid option uses Platt scaling, which uses a logistic regression model, and the isotonic option uses isotonic regression, which uses a weighted least-squares regression model. Since Platt scaling is simpler and the documentation suggests avoiding isotonic regresion for too few calibration samples, let's use the "sigmoid" option.

In [23]:
# fit a calibrated random forest model

rf = RandomForestClassifier(random_state=SEED)
calibrated_rf = CalibratedClassifierCV(rf, method="sigmoid", cv=5), y_train)
CalibratedClassifierCV(cv=5, estimator=RandomForestClassifier(random_state=42))
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with

Now, let's see how well our model performs on our validation set!

Predicting on validation set

The metric for this competition is log loss, also known as cross-entropy loss. We are specifying eps=1e-16 here to match the scoring on the competition platform.

In [24]:
def score(y_true, y_pred):
    return log_loss(y_true, y_pred, eps=1e-16)
In [25]:
preds = calibrated_rf.predict_proba(X_test)[:, 1]
score(y_test, preds)

Just to compare, let's see how well our model would have performed without calibration.

In [26]:
# fit a random forest model and calculate score

rf = RandomForestClassifier(random_state=SEED), y_train)
preds = rf.predict_proba(X_test)[:, 1]
score(y_test, preds)

This is quite a difference! Keeping this in mind, let's train on our entire dataset.

Retraining on full training set

In [27]:
# train on entire dataset

rf = RandomForestClassifier(random_state=SEED)
calibrated_rf = CalibratedClassifierCV(rf, method="sigmoid", cv=5), y)
CalibratedClassifierCV(cv=5, estimator=RandomForestClassifier(random_state=42))
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with

Now that we've trained our model, let's prepare our submission!

Code submission format

Since this is a code execution competition, will will submit our model weights and code rather than predictions. For more details, see the Code Submission Format page. Our first step will be to save out our encoder and the model weights for the calibrated random forest model we have just trained on the entire dataset. We will save these as .joblib files, though there are other formats you can use.

In [28]:
Path("submission/assets").mkdir(exist_ok=True, parents=True)

# save out model to assets directory in submission folder
dump(calibrated_rf, "submission/assets/random_forest_model.joblib")

# save out encoder to assets directory as well
dump(enc, "submission/assets/history_encoder.joblib")

Your submission should include a file that makes predictions for an unseen test set and outputs the predictions into a file called submission.csv that exactly matches the submission_format.csv available on the Data Download page. Your prediction function must output a single likelihood between 0.0 and 1.0 for each test set record.

Let's write out our using the preprocessing and modeling assets we saved.

In [29]:
%%writefile submission/

"""Solution for VisioMel Challenge"""

from joblib import load
from pathlib import Path

from loguru import logger
import numpy as np
import pandas as pd

DATA_ROOT = Path("/code_execution/data/")

def preprocess_feats(df, enc):
    feats = df.copy()
    feats["age_int"] = feats.age.str.slice(1, 3).astype(int)
    X = pd.concat(
            feats[["age_int", "sex"]],
                enc.transform(np.array(feats["melanoma_history"]).reshape(-1, 1)),
    return X
def main():
    # load sumission format
    submission_format = pd.read_csv(DATA_ROOT / "submission_format.csv", index_col=0)
    # load test_metadata
    test_metadata = pd.read_csv(DATA_ROOT / "test_metadata.csv", index_col=0)"Loading feature encoder and model")
    calibrated_rf = load("assets/random_forest_model.joblib")
    history_encoder = load("assets/history_encoder.joblib")"Preprocessing features")
    processed_features = preprocess_feats(test_metadata, history_encoder)
   "Checking test feature filenames are in the same order as the submission format")
    assert (processed_features.index == submission_format.index).all()"Checking test feature columns align with loaded model")
    assert (processed_features.columns == calibrated_rf.feature_names_in_).all()"Generating predictions")
    submission_format["relapse"] = calibrated_rf.predict_proba(processed_features)[:,1]

    # save as "submission.csv" in the root folder, where it is expected"Writing out submission.csv")

if __name__ == "__main__":
Writing submission/

Now, it's time to zip up our and all of our assets!

In [30]:
# zip up submission
! cd submission; zip -r ../ ./*
  adding: assets/ (stored 0%)
  adding: assets/history_encoder.joblib (deflated 52%)
  adding: assets/random_forest_model.joblib (deflated 82%)
  adding: (deflated 60%)

Before you finalize your code for submission, make sure you:

  • Include a in the root directory of your submission zip. There can be extra files with more code that is called.
  • Include any model weights in your submission zip as there will be no network access.
  • Write out a submissions.csv to the root directory when inference is finished, matching the submission format exactly.
  • Log general information that will help you debug your submission.
  • Test your submission locally and using the smoke test functionality on the platform.
  • Consider ways to optimize your pipeline so that it runs end-to-end in under 8 hours.

And make sure you do not:

  • Read from locations other than your solution directory.
  • Use information from other images in the test set in making a prediction for a given tif file.
  • Print or log any information about the test metadata or test images, including specific data values and aggregations such as sums, means, or counts.

Making a submission

To submit your .zip file, head over to the Submissions page. You will see two options for uploading your file: "Normal submission" or "smoke test". A "smoke test" is a short test against a small number of images that allows you to make sure the format for your submission is correct and that it generates predictions within a time period that is reasonable. A "normal" submission contributes toward your score on the leaderboard, and must run in under 8 hours. Please refer to the guidance in the runtime repo for this challenge to see more details.

One more point we want to check before we submit our code is that all of the libraries it requires are in the runtime. You can see more guidance about making sure that your scripts can run in the runtime environment here.

Testing a submission locally

It is a good idea to test our submission locally before we try to submit to the platform. To test your submission locally, follow the steps from the runtime repo:

  1. First, make sure you have set up the prerequisites.
  2. Download the code execution development dataset and extract it to data.
  3. Download the official competition Docker image using make pull
  4. Save all of your submission files, including the required script, in the submission_src folder of the runtime repository. Make sure any needed model weights and other assets are saved in submission_src as well.
  5. Create a submission/ file containing your code and model assets:
    $ make pack-submission 
    mkdir -p submission/
    cd submission_src; zip -r ../submission/ ./*
    adding: (deflated 73%)
  6. Launch an instance of the competition Docker image, and run the same inference process that will take place in the official runtime using make test-submission
  7. Run scripts/ on your predictions:
    python scripts/

For this notebook, we will go directly to a smoke test first. You're allowed 3 smoke tests per day.

Making a smoke test submission

When you upload a submission, you will be able to see the progress of your submission under "Status of Code Execution Jobs." A successful submission's status will change from Pending, to Starting, to Running, to Completed. Once your submission is Running, you will be able to see your logs. If you suspect your submission will run longer than 8 hours or will not succeed, you can click "Cancel current code job" and it will not count against your submission limit.

Making a leaderboard submission

Once the smoke test is Completed successfully, we can move onto making our real submission! Instead of selecting "Smoke test", select "Normal."

When we navigate to the Submissions tab, we can see our score:

Not a bad start, but there's plenty of room for improvement! We are excited to see what you come up with. Head over to the community forum to chat with other participants. Good luck!