by
Isha Shah
VisioMel Challenge: Predicting Melanoma Relapse¶
Welcome to the VisioMel Challenge: Predicting Melanoma Relapse! In this benchmark blog post, we will show you how to load the data for the challenge, make a prediction, and package your materials for a submission. Along the way, we'll include some tips that might be useful as you improve on this benchmark model and help solve a critical problem in pathology and melanoma treatment. This challenge is the successor to our TissueNet challenge, which also used Whole Slide Images. You may find some of the approaches used by winners of that challenge useful in developing your modeling approaches for this one.
from joblib import dump
from pathlib import Path
import random
from cloudpathlib import S3Path
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from PIL import Image
import pyvips
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
By default, PIL raises a DecompressionBombError for images above a certain pixel count to protect against decompression-bomb DOS attacks. Several of our slide images exceed this limit, so we will remove the cap on the maximum allowable size.
Image.MAX_IMAGE_PIXELS = None
Loading the data¶
The data for this challenge are Whole Slide Images (WSIs), which are digitized versions of the slides pathologists examine to make a diagnosis and predict the probability of relapse. All the WSIs in this challenge are formatted as pyramidal tifs, which are images with several layers at different resolutions. For more information about WSIs and pyramidal tifs, see the Problem Description page and the Data Resources page.
Let's get started! First, let's import the metadata file for the training set (train_metadata.csv
), which is available on the Data Download page.
Metadata¶
DATA_DIR = # YOUR PATH HERE
# import train_metadata
train_metadata = pd.read_csv(DATA_DIR / "train_metadata.csv", index_col="filename")
train_metadata
The train metadata file contains clinical variables (including age, sex, body_site, melanoma_history, breslow, and ulceration), fields that help verify the integrity of your image data download (tif_cksum and tif_size), and three URLs at which you can find the corresponding image (us_tif_url, eu_tif_url, and as_tif_url).
The test set images and metadata are only accessible in the runtime container, where they are mounted at code_execution/data, so test_metadata.csv will not have the tif_cksum, tif_size, us_tif_url, eu_tif_url, and as_tif_url fields. It will also not have breslow or ulceration; these two are only available for training, but they may still be useful in your modeling approach.
Now, let's see what the labels for the training set look like. train_labels.csv
is also available on the Data Download page.
# import train_labels.csv
train_labels = pd.read_csv(DATA_DIR / "train_labels.csv", index_col="filename")
train_labels
The field relapse
in train_labels.csv
indicates whether the slide contains a melanoma that relapsed within 5 years of initial diagnosis, where 0
indicates no relapse and 1
indicates relapse.
train_labels.relapse.value_counts(dropna=False)
Now, let's look at the images.
Images¶
The simplest way to download the images is using the AWS CLI. For more details, see the data_download_instructions.txt
file on the Data Download page. For the purposes of this benchmark, we will access the images from the US East S3 bucket using cloudpathlib.
All the WSIs are in pyramidal tif format, which we can visualize using the Python library pyvips. The pyramidal tifs in this dataset have anywhere from 7 to 12 layers, which are referred to in the pyvips library as "pages". A lower page number corresponds to a higher resolution. Since the highest-resolution layer of an image can be quite large, we will load a few lower-resolution pages from each image.
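For example, here is a small sketch (not part of the benchmark pipeline) showing how to list the pages of a single slide and their dimensions with pyvips; the local path is a hypothetical placeholder for a tif you have already downloaded.
# illustrative only: inspect the pages of one downloaded pyramidal tif
local_tif = "train_images/example.tif"  # hypothetical local path
n_pages = pyvips.Image.new_from_file(local_tif).get("n-pages")
for page in range(n_pages):
    page_img = pyvips.Image.new_from_file(local_tif, page=page)
    print(f"page {page}: {page_img.width} x {page_img.height} pixels")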
Extracting as much information as possible from the slides while working with their unwieldy file size is a large part of this challenge. For more tips on how to work with pyramidal tifs, see the Data Resources page. You may also find the winning solutions from our last challenge with WSIs useful. Let's see what the different pages of a single image look like.
# select a set of images to plot
SEED = 42
NUM_IMAGES = 3
# we'll use the US url
selected_image_paths = train_metadata.sample(
random_state=SEED, n=NUM_IMAGES
).us_tif_url.values
def visualize_page(page_num, image_paths):
    fig, axes = plt.subplots(1, len(image_paths), figsize=(10, 10))
    fig.tight_layout()
    for n, image_s3_path in enumerate(image_paths):
        fname = S3Path(image_s3_path).name
        print(f"Downloading {fname}")
        # cloudpathlib downloads the file to a local cache and returns the local path
        local_file = S3Path(image_s3_path).fspath
        # PIL reports the number of pages in the tif as n_frames (not used further here)
        n_frames = Image.open(local_file).n_frames
        # load the requested page with pyvips and convert it to a numpy array
        img = pyvips.Image.new_from_file(local_file, page=page_num).numpy()
        axes[n].set_title(f"{fname}, page={page_num}\n {img.shape}\n")
        axes[n].imshow(img)
# let's look at a few high-resolution pages from our selected images
visualize_page(3, selected_image_paths)
# let's look at low-resolution pages from the same images
visualize_page(8, selected_image_paths)
Note the difference in the dimensions (the numbers along the left and bottom edges) of the higher versus lower pages for these images. There is also variation in the shapes, sizes, and colors of these slides. Some images also have artifacts that are unrelated to the task of prediction, such as black marks from the slide holder or blurry regions. The Data Resources page provides some more information on the variations that your solution will have to take into account.
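As one illustrative idea (not the approach used in this benchmark), you could build a rough tissue mask from a low-resolution page so that later tiling or feature extraction can skip empty background and some artifacts. The page number and white threshold below are assumptions you would need to tune.
def rough_tissue_mask(local_tif_path, page=7, white_threshold=220):
    """Return a boolean array that is True where a low-resolution page looks like tissue."""
    # load a low-resolution page as a numpy array and keep the RGB channels
    page_img = pyvips.Image.new_from_file(local_tif_path, page=page).numpy()
    rgb = page_img[..., :3]
    # background on stained slides is close to white; treat darker pixels as tissue
    return rgb.mean(axis=-1) < white_threshold

# example usage on a hypothetical downloaded slide:
# mask = rough_tissue_mask("train_images/example.tif")
# print(f"tissue fraction: {mask.mean():.2f}")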
The organizers of the challenge, VisioMel, have also provided annotated versions of a few example images in annotations.csv
and visualized_annotations.pdf
, which are available on the Data Download page.
Training a model¶
Since the purpose of this blog post is to show how to access the data and prepare and submit your model files, we will be using a very simple random forest model that uses only a few of the clinical variables available to us in both the train and test metadata. A successful model will need to use the images.
Preprocessing¶
To start, let's do some feature engineering! We'll first take a look at the unique values for each variable in the metadata. These variables are defined on the Problem Description page.
train_metadata.age.unique()
train_metadata.sex.unique()
train_metadata.body_site.unique()
train_metadata.melanoma_history.unique()
So far, we've learned:
- age is available to us in two-year intervals
- melanoma_history is sometimes nan (see the quick count below)
- body_site categories are not all at the same level of aggregation - for example, some slides are labeled as being from "hand/foot/nail" while others are labeled as from the hand, foot, or nail individually.
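To see how often the history field is missing, we can count its values directly. This quick check is not part of the original benchmark, but it only uses the training metadata we already loaded.
# count melanoma_history values, including missing ones
train_metadata.melanoma_history.value_counts(dropna=False)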
Let's do the following:
- convert age to integer values
- convert melanoma_history to dummy variables

For this first pass, we will not use body_site.
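Although body_site is left out of the benchmark, here is a purely hypothetical sketch of one way it could be handled later: collapse the finer-grained categories into coarser groups before one-hot encoding them. The category strings below are assumptions; check train_metadata.body_site.unique() for the real values.
# hypothetical grouping of fine-grained body_site values into coarser categories
BODY_SITE_GROUPS = {
    "hand": "hand/foot/nail",
    "foot": "hand/foot/nail",
    "nail": "hand/foot/nail",
    # ...map the remaining categories to whatever grouping you choose...
}

def group_body_site(series):
    # keep any value without an explicit mapping (including nan) as-is
    return series.map(lambda v: BODY_SITE_GROUPS.get(v, v))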
# define an encoder for melanoma_history
enc = OneHotEncoder(drop="first", sparse_output=False)
enc.fit(np.array(train_metadata["melanoma_history"]).reshape(-1, 1))
enc.get_feature_names_out()
def preprocess_feats(df=train_metadata):
    feats = df.copy()
    # take the first age in the range and convert to integer
    feats["age_int"] = feats.age.str.slice(1, 3).astype(int)
    X = pd.concat(
        [
            feats[["age_int", "sex"]],
            pd.DataFrame(
                enc.transform(np.array(feats["melanoma_history"]).reshape(-1, 1)),
                columns=enc.get_feature_names_out(),
                index=feats.index,
            ),
        ],
        axis=1,
    )
    return X
# preprocess features
X = preprocess_feats(train_metadata)
X
# take labels from train_labels.csv
y = train_labels.relapse
Model training and calibration¶
For a first pass at training, we will use a simple stratified 80-20 split between train and test. The stratified split helps ensure that the proportion of relapse cases is similar between our train and test sets and that we are training our model on a representative subset of the data.
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=SEED, stratify=y
)
X_train
Let's check the distribution of relapse cases between the two datasets to make sure that the stratification worked.
y_train.value_counts(normalize=True)
y_test.value_counts(normalize=True)
This looks like a good split, so let's go ahead and train our model! For this benchmark, we will be using a simple random forest classifier from scikit-learn.
We will also calibrate our output from the model to reflect probabilities. Calibration is important because certain models, including the random forest classifier we have chosen, output values that cannot directly be interpreted as probabilities. Calibrating is a way of making the outputs of the classifier map more closely to the actual probability distribution of the dataset. This is particularly important for the competition metric, log loss, which heavily penalizes confident but wrong predictions.
We will use another scikit-learn built-in method, CalibratedClassifierCV
, to perform the calibration.
CalibratedClassifierCV supports two calibration methods, "sigmoid" and "isotonic." The sigmoid option uses Platt scaling, which fits a logistic regression model to the classifier's outputs; the isotonic option fits a non-parametric isotonic regression. Since Platt scaling is simpler and the documentation suggests avoiding isotonic regression when there are too few calibration samples, let's use the "sigmoid" option.
# fit a calibrated random forest model
rf = RandomForestClassifier(random_state=SEED)
calibrated_rf = CalibratedClassifierCV(rf, method="sigmoid", cv=5)
calibrated_rf.fit(X_train, y_train)
Now, let's see how well our model performs on our validation set!
Predicting on validation set¶
The metric for this competition is log loss, also known as cross-entropy loss. We are specifying eps=1e-16
here to match the scoring on the competition platform.
def score(y_true, y_pred):
    return log_loss(y_true, y_pred, eps=1e-16)
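As a quick illustration of how this metric behaves (not part of the benchmark pipeline), compare the loss for hedged predictions with the loss for confident but wrong ones:
# toy example: true labels [0, 1] with different predicted probabilities of relapse
print(score([0, 1], [0.4, 0.6]))    # uncertain predictions: moderate loss
print(score([0, 1], [0.99, 0.01]))  # confident but wrong predictions: very large loss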
preds = calibrated_rf.predict_proba(X_test)[:, 1]
score(y_test, preds)
Just to compare, let's see how well our model would have performed without calibration.
# fit a random forest model and calculate score
rf = RandomForestClassifier(random_state=SEED)
rf.fit(X_train, y_train)
preds = rf.predict_proba(X_test)[:, 1]
score(y_test, preds)
This is quite a difference! Keeping this in mind, let's train on our entire dataset.
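Before retraining, one optional way to see where that gap comes from is a reliability diagram comparing the calibrated and uncalibrated probabilities on the held-out split. This is a sketch using scikit-learn's calibration_curve, not part of the benchmark pipeline.
from sklearn.calibration import calibration_curve

# predicted probability of relapse from each model on the held-out split
model_preds = {
    "calibrated": calibrated_rf.predict_proba(X_test)[:, 1],
    "uncalibrated": rf.predict_proba(X_test)[:, 1],
}
fig, ax = plt.subplots()
for label, probs in model_preds.items():
    # observed relapse rate within each bin of predicted probability
    frac_relapse, mean_pred = calibration_curve(y_test, probs, n_bins=10)
    ax.plot(mean_pred, frac_relapse, marker="o", label=label)
ax.plot([0, 1], [0, 1], linestyle="--", label="perfectly calibrated")
ax.set_xlabel("mean predicted probability")
ax.set_ylabel("observed relapse rate")
ax.legend()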
Retraining on full training set¶
# train on entire dataset
rf = RandomForestClassifier(random_state=SEED)
calibrated_rf = CalibratedClassifierCV(rf, method="sigmoid", cv=5)
calibrated_rf.fit(X, y)
Now that we've trained our model, let's prepare our submission!
Code submission format¶
Since this is a code execution competition, we will submit our model weights and code rather than predictions. For more details, see the Code Submission Format page. Our first step will be to save out our encoder and the model weights for the calibrated random forest model we have just trained on the entire dataset. We will save these as .joblib
files, though there are other formats you can use.
Path("submission/assets").mkdir(exist_ok=True, parents=True)
# save out model to assets directory in submission folder
dump(calibrated_rf, "submission/assets/random_forest_model.joblib")
# save out encoder to assets directory as well
dump(enc, "submission/assets/history_encoder.joblib")
Your submission should include a main.py
file that makes predictions for an unseen test set and outputs the predictions into a file called submission.csv
that exactly matches the submission_format.csv
available on the Data Download page. Your prediction function must output a single likelihood between 0.0 and 1.0 for each test set record.
Let's write out our main.py
using the preprocessing and modeling assets we saved.
%%writefile submission/main.py
"""Solution for VisioMel Challenge"""
from joblib import load
from pathlib import Path
from loguru import logger
import numpy as np
import pandas as pd
DATA_ROOT = Path("/code_execution/data/")
def preprocess_feats(df, enc):
    feats = df.copy()
    feats["age_int"] = feats.age.str.slice(1, 3).astype(int)
    X = pd.concat(
        [
            feats[["age_int", "sex"]],
            pd.DataFrame(
                enc.transform(np.array(feats["melanoma_history"]).reshape(-1, 1)),
                columns=enc.get_feature_names_out(),
                index=feats.index,
            ),
        ],
        axis=1,
    )
    return X
def main():
    # load submission format
    submission_format = pd.read_csv(DATA_ROOT / "submission_format.csv", index_col=0)
    # load test_metadata
    test_metadata = pd.read_csv(DATA_ROOT / "test_metadata.csv", index_col=0)

    logger.info("Loading feature encoder and model")
    calibrated_rf = load("assets/random_forest_model.joblib")
    history_encoder = load("assets/history_encoder.joblib")

    logger.info("Preprocessing features")
    processed_features = preprocess_feats(test_metadata, history_encoder)

    logger.info("Checking test feature filenames are in the same order as the submission format")
    assert (processed_features.index == submission_format.index).all()

    logger.info("Checking test feature columns align with loaded model")
    assert (processed_features.columns == calibrated_rf.feature_names_in_).all()

    logger.info("Generating predictions")
    submission_format["relapse"] = calibrated_rf.predict_proba(processed_features)[:, 1]

    # save as "submission.csv" in the root folder, where it is expected
    logger.info("Writing out submission.csv")
    submission_format.to_csv("submission.csv")


if __name__ == "__main__":
    main()
Now, it's time to zip up our main.py
and all of our assets!
# zip up submission
! cd submission; zip -r ../submission.zip ./*
Before you finalize your code for submission, make sure you:
- Include a main.py in the root directory of your submission zip. There can be extra files with more code that is called from main.py.
- Include any model weights in your submission zip, as there will be no network access.
- Write out a submission.csv to the root directory when inference is finished, matching the submission format exactly (a minimal local check is sketched after these lists).
- Log general information that will help you debug your submission.
- Test your submission locally and using the smoke test functionality on the platform.
- Consider ways to optimize your pipeline so that it runs end-to-end in under 8 hours.
And make sure you do not:
- Read from locations other than your solution directory.
- Use information from other images in the test set in making a prediction for a given tif file.
- Print or log any information about the test metadata or test images, including specific data values and aggregations such as sums, means, or counts.
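As referenced above, here is a minimal sketch of a local check against the submission format. It assumes you have submission_format.csv from the Data Download page next to your generated submission.csv, and that the format uses a filename index and a relapse column as in train_labels.csv; it is illustrative, not required code.
# quick, optional sanity check of a locally generated submission
import pandas as pd

submission_format = pd.read_csv("submission_format.csv", index_col="filename")
submission = pd.read_csv("submission.csv", index_col="filename")

# same rows in the same order, and the same columns, as the submission format
assert (submission.index == submission_format.index).all()
assert list(submission.columns) == list(submission_format.columns)

# every prediction is a single likelihood between 0.0 and 1.0
assert submission["relapse"].between(0.0, 1.0).all()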
Making a submission¶
To submit your .zip
file, head over to the Submissions page. You will see two options for uploading your file: "Normal submission" or "Smoke test". A smoke test is a short test against a small number of images that lets you make sure your submission format is correct and that it generates predictions within a reasonable amount of time. A normal submission contributes toward your score on the leaderboard and must run in under 8 hours. Please refer to the guidance in the runtime repo for this challenge for more details.
One more point we want to check before we submit our code is that all of the libraries it requires are in the runtime. You can see more guidance about making sure that your scripts can run in the runtime environment here.
Testing a submission locally¶
It is a good idea to test our submission locally before we try to submit to the platform. To test your submission locally, follow the steps from the runtime repo:
- First, make sure you have set up the prerequisites.
- Download the code execution development dataset and extract it to data.
- Download the official competition Docker image using make pull.
- Save all of your submission files, including the required main.py script, in the submission_src folder of the runtime repository. Make sure any needed model weights and other assets are saved in submission_src as well.
- Create a submission/submission.zip file containing your code and model assets:
  $ make pack-submission
  mkdir -p submission/
  cd submission_src; zip -r ../submission/submission.zip ./*
    adding: solution.py (deflated 73%)
- Launch an instance of the competition Docker image, and run the same inference process that will take place in the official runtime, using make test-submission.
- Run scripts/score.py on your predictions: python scripts/score.py
For this notebook, we will go directly to a smoke test first. You're allowed 3 smoke tests per day.
Making a smoke test submission¶
When you upload a submission, you will be able to see the progress of your submission under "Status of Code Execution Jobs." A successful submission's status will change from Pending
, to Starting
, to Running
, to Completed
. Once your submission is Running
, you will be able to see your logs. If you suspect your submission will run longer than 8 hours or will not succeed, you can click "Cancel current code job" and it will not count against your submission limit.
Making a leaderboard submission¶
Once the smoke test is Completed
successfully, we can move onto making our real submission! Instead of selecting "Smoke test", select "Normal."
When we navigate to the Submissions tab, we can see our score:
Not a bad start, but there's plenty of room for improvement! We are excited to see what you come up with. Head over to the community forum to chat with other participants. Good luck!