Mars Spectrometry 2: Gas Chromatography - Benchmark

by Isha Shah, Jay Qi

Welcome to the benchmark notebook for Mars Spectrometry 2: Gas Chromatography! The problem description for this challenge can be found here. The purpose of this notebook is to provide an example of how to use the data to create a basic benchmark model and to demonstrate how to submit your work.

Did Mars ever have livable environmental conditions? This is one of the key questions in the field of planetary science. NASA missions like the Curiosity and Perseverance rovers carry a rich array of scientific instruments to collect data that can help answer this question, including instruments that gather and analyze rock and soil samples. Using a method of chemical analysis called gas chromatography-mass spectrometry (GCMS), the samples can be analyzed for their chemical makeup, which can indicate whether the environment's conditions could have been amenable to life.

In this challenge, your goal is to build a model to automatically analyze GCMS data to detect the presence of nine families of chemical compounds related to understanding Mars' potential for past habitability. This challenge is a sequel to our first Mars Spectrometry challenge.

The winning techniques from this competition may be used to help analyze data from Mars, and potentially even inform future designs for planetary mission instruments performing in-situ analysis. In other words, one day your model might literally be out-of-this-world!

In this notebook, we'll cover:

  • Exploratory data analysis
  • Preprocessing
  • Feature engineering
  • Training a logistic regression model and a convolutional neural network
  • Preparing a submission

Let's get started!

In [1]:
import itertools
from pathlib import Path
from pprint import pprint
import os
import shutil

from matplotlib import pyplot as plt, cm
import numpy as np
import pandas as pd
from pandas_path import path
from PIL import Image
from sklearn.dummy import DummyClassifier
from sklearn.preprocessing import minmax_scale
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.metrics import make_scorer, log_loss
from sklearn.model_selection import StratifiedKFold, cross_val_score
from tqdm import tqdm

import torch
import flash
from flash.image import ImageClassificationData, ImageClassifier
from flash.core.classification import ProbabilitiesOutput
from import DataLoader
from pytorch_lightning import seed_everything

pd.set_option("max_colwidth", 80)
RANDOM_SEED = 42  # For reproducibility
/Users/ishashah/opt/anaconda3/envs/competition-nasa-mars-gcms/lib/python3.8/site-packages/tqdm/ TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See
  from .autonotebook import tqdm as notebook_tqdm
/Users/ishashah/opt/anaconda3/envs/competition-nasa-mars-gcms/lib/python3.8/site-packages/torchvision/models/ UserWarning: Accessing the model URLs via the internal dictionary of the module is deprecated since 0.13 and will be removed in 0.15. Please access them via the appropriate Weights Enum instead.

Importing datasets

In this competition, there are three datasets:

  • the training set, for which both features and labels are provided;
  • the validation set, for which you will submit predictions, and your score will be displayed on the leaderboard while the competition is open
  • the test set, for which you will also submit predictions, and your score will be used in determining the final rankings

This competition also has two phases: an initial development phase for the first half and a final training phase for the second half. In the second phase, the labels for the validation set will be released so that the validation set can be used as additional training data.

We will therefore perform light exploratory data analysis on the data available to us at the beginning of the competition - the features and labels for the training set. metadata.csv contains information about each sample included in the dataset, including whether it's a part of training, validation, or testing dataset.

In [2]:
PROJ_ROOT = Path.cwd().parent
In [3]:
DATA_PATH = PROJ_ROOT / "data/final/public/"
metadata = pd.read_csv(DATA_PATH / "metadata.csv", index_col="sample_id")
split derivatized features_path features_md5_hash
S0000 train NaN train_features/S0000.csv 52ec6d6f8372500ab4e069b5fbdae6f9
S0001 train NaN train_features/S0001.csv 348f90baed8a8189bf0d4c7b9ed9f965
S0002 train 1.0 train_features/S0002.csv 4686ad9bc3716966f63b6ff83d1d8324
S0003 train NaN train_features/S0003.csv de6b53605c5887967dc3661a3a711c2b
S0004 train NaN train_features/S0004.csv fbfd90092d10d15a5d6919327ddde2ab
In [4]:
train_files = metadata[metadata["split"] == "train"]["features_path"].to_dict()
val_files = metadata[metadata["split"] == "val"]["features_path"].to_dict()
test_files = metadata[metadata["split"] == "test"]["features_path"].to_dict()

print("Number of training samples: ", len(train_files))
print("Number of validation samples: ", len(val_files))
print("Number of testing samples: ", len(test_files))
Number of training samples:  809
Number of validation samples:  312
Number of testing samples:  463

Exploratory Data Analysis

The data for this competition is composed of gas chromatography mass spectra from laboratory instruments at NASA's Goddard Space Flight Center that are operated by the Sample Analysis at Mars (SAM) science team. A full description of the fields included in the data can be found in the problem description. The data for this challenge will likely feel familiar to participants in the first Mars Spectrometry challenge, as it is also mass spectrometry data.

First, let's visualize our mass spectra to see how these fields—time, mass, and intensity—relate to each other. Participants from the first Mars Spectrometry challenge may notice that we don't have a temperature column in the dataset. See the Problem Description for more on this. In this challenge, we'll be using time the way that temperature was used in the first challenge.

In [5]:
# Select sample IDs for five samples
sample_ids_ls = metadata.index.values[0:5]

# Import datasets for EDA
sample_data_dict = {}

for sample_id in sample_ids_ls:
    sample_data_dict[sample_id] = pd.read_csv(DATA_PATH / train_files[sample_id])
In [6]:
def plot_spectrogram(sample_df, sample_lab):

    # For visual clarity, we will round these intensity values to the nearest whole number and average the intensity.
    sample_df["mass"] = sample_df["mass"].round()
    sample_df = (
        sample_df.groupby(["time", "mass"])["intensity"].aggregate("mean").reset_index()

    for m in sample_df["mass"].unique():
            sample_df[sample_df["mass"] == m]["time"],
            sample_df[sample_df["mass"] == m]["intensity"],

In [7]:
fig, ax = plt.subplots(1, 5, figsize=(15, 3), constrained_layout=True)

for i in range(0, 5):
    sample_lab = sample_ids_ls[i]
    sample_df = sample_data_dict[sample_lab]

    plt.subplot(1, 5, i + 1)
    plot_spectrogram(sample_df, sample_lab)

There are a few observations we can make from the above:

  • There is a diversity in the time range of experiments (some are as short as 7 minutes, while others extend to nearly 40), and the intensity ranges differ as well.
  • There are different m/z ion types that peak at different times. (Each colored line represents a distinct m/z value).
  • For some samples, there are other ion types that are consistently present at low abundance levels. These are likely background abundances that are irrelevant. (In general, a spectrogram is useful for the peaks in abundances it shows at different times).

There is one more feature we have available in metadata that might be useful to our prediction task—derivatized. As described in the "Preprocessing Samples" portion of the problem description, some experiments are run after the sample is treated with a derivatization agent. Let's take a quick look at this feature.

In [8]:
NaN    1023
1.0     561
Name: derivatized, dtype: int64

It seems that the derivatized is only ever 1 or NaN, indicating that we know which samples that are confirmed derivatized, but we do not know whether the remaining samples are definitely not derivatized. For additional information on how derivatization can affect a sample's mass spectrogram, see the problem description page.

Finally, let's also see what the distribution of compounds is in our training data.

In [9]:
train_labels = pd.read_csv(DATA_PATH / "train_labels.csv", index_col="sample_id")
target_cols = train_labels.columns
aromatic hydrocarbon carboxylic_acid nitrogen_bearing_compound chlorine_bearing_compound sulfur_bearing_compound alcohol other_oxygen_bearing_compound mineral
S0000 0 0 0 0 0 0 0 0 1
S0001 0 0 0 0 0 0 0 0 0
S0002 0 0 1 1 0 0 0 0 1
S0003 0 1 0 0 0 0 0 0 0
S0004 0 0 0 0 1 0 0 0 1
In [10]:
train_labels.aggregate("sum", axis=1).value_counts(normalize=True)
1    0.473424
0    0.374536
2    0.071693
3    0.046972
4    0.016069
8    0.013597
5    0.003708
dtype: float64
In [11]:
sumlabs = train_labels.aggregate("sum").sort_values()

plt.barh(sumlabs.index, sumlabs, align="center")
plt.xlabel("Count in training set")
plt.title("Compounds represented in training set")
Text(0.5, 1.0, 'Compounds represented in training set')

We can make a few observations here:

  1. A large share (about 37 percent) of our samples do not have a positive label for any of the classes, and
  2. For some classes, such as chlorine-bearing compounds, other oxygen-bearing compounds, alcohols, and sulfur-bearing compounds, we have very few samples (less than 50).


From the "Understanding GCMS Data" section of the project description page, we know that there are several preprocessing steps that will likely be helpful before working with the data. We'll demonstrate each step on a single sample to visualize the changes.

In [12]:
# Selecting a sample to demonstrate preprocessing steps

sample_df = pd.read_csv(
    DATA_PATH / metadata[metadata.split == "train"].features_path.iloc[0]
time mass intensity
0 0.042183 3.235397 115568.0
1 0.042183 4.151337 1196446.0
2 0.042183 5.198563 1259.0
3 0.042183 5.911224 300.0
4 0.042183 8.230881 15850.0
... ... ... ...
185783 7.004417 531.052124 117.0
185784 7.004417 531.805786 638.0
185785 7.004417 532.903076 562.0
185786 7.004417 533.651489 95.0
185787 7.004417 534.307983 66.0

185788 rows × 3 columns

Let's also calculate a few summary statistics on our dataset to have a sense of how these preprocessing steps will affect it.

In [13]:
# Calculate summary statistics for time and m/z values for training set

def get_time_mass_stats(fpath):

    df = pd.read_csv(DATA_PATH / fpath)

    time_min = df["time"].min()
    time_max = df["time"].max()
    time_range = time_max - time_min

    mass_min = df["mass"].min()
    mass_max = df["mass"].max()
    mass_range = mass_max - mass_min

    return time_min, time_max, time_range, mass_min, mass_max, mass_range
In [14]:
sample_paths_ls = metadata[metadata["split"] == "train"].features_path
training_stats_df = pd.DataFrame(sample_paths_ls)
training_stats_df.columns = ["fpath"]
In [15]:
) = zip(*training_stats_df["fpath"].progress_apply(get_time_mass_stats))
100%|██████████████████████████████████████████████████████████████████████████████| 809/809 [01:40<00:00,  8.04it/s]
In [16]:
time_min time_max time_range mass_min mass_max mass_range
count 809.000000 809.000000 809.000000 809.000000 809.000000 809.000000
mean 3.091072 22.295400 19.204328 29.389698 509.456048 480.066350
std 4.702052 12.054895 9.992512 19.193493 44.998271 56.512800
min 0.041917 5.704683 5.662500 1.125473 434.958679 389.767929
25% 0.042183 14.023700 12.495050 4.278992 434.975464 389.833603
50% 0.042450 18.502867 15.502350 45.128860 534.966431 489.844513
75% 3.000783 33.003900 27.307550 45.176620 534.968994 530.695374
max 16.001833 48.352867 48.310684 50.194916 649.962158 599.822677

1. Standardizing which m/z values to include

As we can see in the summary statistics above and described in the problem description page, there is some heterogeneity in the m/z values measured for each sample. The range of m/z values being measured is not the same for every sample, and that most m/z values are fractional. In this section, we will make some basic simplifying assumptions to standardize which m/z values to include in our analysis. We have picked a threshold primarily to reduce dimensionality and make it easier for us to illustrate how to engineer features and train a model quickly, rather than one that is informed by the data or by domain knowledge. We encourage participants to devise their own approaches to preprocessing the data. In this example, we will discard any m/z values above 350, and we will round the fractional m/z values to their nearest integer and average their abundance values.

Additionally, as discussed on the problem description page, samples may have a high level of ions for m/z=4.0, which corresponds to helium carrier gas. The helium is not relevant to the composition of the sample, so we will also discard it as simplification. A more sophisticated analysis could potentially use it to understand how other ions compare to it relatively.

In [17]:
def drop_frac_and_He(df):
    Rounds fractional m/z values, drops m/z values > 350, and drops carrier gas m/z

        df: a dataframe representing a single sample, containing m/z values

        The dataframe without fractional m/z and carrier gas m/z

    # rounds m/z fractional values
    df["rounded_mass"] = df["mass"].transform(round)

    # aggregates across rounded values
    df = df.groupby(["time", "rounded_mass"])["intensity"].aggregate("mean").reset_index()

    # drop m/z values greater than 350
    df = df[df["rounded_mass"] <= 350]

    # drop carrier gas
    df = df[df["rounded_mass"] != 4]

    return df
In [18]:
sample_df = drop_frac_and_He(sample_df)

2. Removing background ion presences

As discussed in the problem description, there will be background levels of ions in the measurements that aren't relevant to the analysis. While there are more complex and nuanced ways to remove the background noise, we will simply subtract the minimum abundance value. As mentioned in the project description, scientists may remove background noise more carefully. They may take an average of an area early in the experiment to subtract. Or, if the background noise varies over time, they may fit a function to it and subtract according to this function.

In [19]:
def remove_background_intensity(df):
    Subtracts minimum abundance value

        df: dataframe with 'mass' and 'intensity' columns

        dataframe with minimum abundance subtracted for all observations

    df["intensity_minsub"] = df.groupby(["rounded_mass"])["intensity"].transform(
        lambda x: (x - x.min())

    return df
In [20]:
# Intensity values before subtracting minimum
fig, ax = plt.subplots(1, 2, figsize=(15, 5))
fig.suptitle("Intensity values across temperature by m/z")
fig.supylabel("Abundance levels")

plt.subplot(1, 2, 1)
for m in sample_df["rounded_mass"].unique():
        sample_df[sample_df["rounded_mass"] == m]["time"],
        sample_df[sample_df["rounded_mass"] == m]["intensity"],

plt.title("Before subtracting minimum intensity")

# After subtracting minimum intensity value
sample_df = remove_background_intensity(sample_df)

plt.subplot(1, 2, 2)

for m in sample_df["rounded_mass"].unique():
        sample_df[sample_df["rounded_mass"] == m]["time"],
        sample_df[sample_df["rounded_mass"] == m]["intensity_minsub"],
plt.title("After subtracting minimum intensity")
Text(0.5, 1.0, 'After subtracting minimum intensity')

As we can see, simply subtracting the minimum intensity for this particular sample did not make a difference. There are likely better ways of accounting for noise in the data that are more sophisticated.

3. Convert abundances to relative abundances

Since the relative abundances of various ions are more informative than their absolute values, we normalize the abundance values from 0 to 1 within a single sample.

In [21]:
def scale_intensity(df):
    Scale abundance from 0-1 according to the min and max values across entire sample

        df: dataframe containing abundances and m/z

        dataframe with additional column of scaled abundances

    df["int_minsub_scaled"] = minmax_scale(df["intensity_minsub"].astype(float))

    return df

Putting it all together

Finally, let's revisit what our samples look like after all of the preprocessing steps have been completed.

In [22]:
# Preprocess function
def preprocess_sample(df):
    df = drop_frac_and_He(df)
    df = remove_background_intensity(df)
    df = scale_intensity(df)
    return df
In [23]:
fig, ax = plt.subplots(1, 5, figsize=(15, 3), constrained_layout=True)
fig.suptitle("Commercial samples")

for i in range(0, 5):
    sample_lab = sample_ids_ls[i]
    sample_df = sample_data_dict[sample_lab]
    sample_df = preprocess_sample(sample_df)

    plt.subplot(1, 5, i + 1)
    for m in sample_df["rounded_mass"].unique():
            sample_df[sample_df["rounded_mass"] == m]["time"],
            sample_df[sample_df["rounded_mass"] == m]["int_minsub_scaled"],


Feature engineering

When scientists analyze GCMS data, they look closely at the characteristics of peaks in ion abundances, such as the time at which they occur, the shape of the peaks including the height and area. For our analysis, we will engineer a simple set of features that try to capture some of these characteristics. We will discretize the overall time range into bins (since we do not have temperature information, time is the closest proxy), and calculate maximum relative intensity within that time bin for each m/z value. This is the same approach we used in our benchmark model for the previous challenge. Here, we will create 30-second time bins from 0 to 25 minutes. As with our selection of which m/z values to include, our choice of time bins and time range is meant to reduce dataset dimensionality in order to run this example quickly and without much computational expense, rather than being informed by the data or domain knowledge.

There are many ways to describe the shapes of these peaks with higher fidelity than the approach we demonstrate here. It will also be important to deal with complex cases such as peaks that overlap due to two different gas releases producing the same ions at around the same time. For inspiration, you may wish to consult the winners' code and write-ups from the previous challenge. We look forward to seeing what you all come up with in your own feature engineering.

In [24]:
# Create a series of time bins
timerange = pd.interval_range(start=0, end=25, freq=0.5)

# Make dataframe with rows that are combinations of all temperature bins and all m/z values
allcombs = list(itertools.product(timerange, [*range(0, 350)]))

allcombs_df = pd.DataFrame(allcombs, columns=["time_bin", "rounded_mass"])
time_bin rounded_mass
0 (0.0, 0.5] 0
1 (0.0, 0.5] 1
2 (0.0, 0.5] 2
3 (0.0, 0.5] 3
4 (0.0, 0.5] 4

We are expecting 350 m/z values * 50 time bins = 17,500 features to be generated per sample.

In [25]:
def int_per_timebin(df):

    Transforms dataset to take the preprocessed max abundance for each
    time range for each m/z value

        df: dataframe to transform

        transformed dataframe

    # Bin times
    df["time_bin"] = pd.cut(df["time"], bins=timerange)

    # Combine with a list of all time bin-m/z value combinations
    df = pd.merge(allcombs_df, df, on=["time_bin", "rounded_mass"], how="left")

    # Aggregate to time bin level to find max
    df = df.groupby(["time_bin", "rounded_mass"]).max("int_minsub_scaled").reset_index()

    # Fill in 0 for intensity values without information
    df = df.replace(np.nan, 0)

    # Reshape so each row is a single sample
    df = df.pivot_table(
        columns=["rounded_mass", "time_bin"], values=["int_minsub_scaled"]

    return df
In [26]:
# Assembling preprocessed and transformed training set

train_features_dict = {}
print("Total number of train files: ", len(train_files))

for i, (sample_id, filepath) in enumerate(tqdm(train_files.items())):

    # Load training sample
    temp = pd.read_csv(DATA_PATH / filepath)

    # Preprocessing training sample
    train_sample_pp = preprocess_sample(temp)

    # Feature engineering
    train_sample_fe = int_per_timebin(train_sample_pp).reset_index(drop=True)
    train_features_dict[sample_id] = train_sample_fe

train_features = pd.concat(
    train_features_dict, names=["sample_id", "dummy_index"]
).reset_index(level="dummy_index", drop=True)
Total number of train files:  809
100%|██████████████████████████████████████████████████████████████████████████████| 809/809 [15:44<00:00,  1.17s/it]
In [27]:
rounded_mass 0 ... 349
time_bin (0.0, 0.5] (0.5, 1.0] (1.0, 1.5] (1.5, 2.0] (2.0, 2.5] (2.5, 3.0] (3.0, 3.5] (3.5, 4.0] (4.0, 4.5] (4.5, 5.0] ... (20.0, 20.5] (20.5, 21.0] (21.0, 21.5] (21.5, 22.0] (22.0, 22.5] (22.5, 23.0] (23.0, 23.5] (23.5, 24.0] (24.0, 24.5] (24.5, 25.0]
S0000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
S0001 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
S0002 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000015 0.000020 0.000014 0.000013 0.000015 0.000017 0.000018 0.000012 0.000011 0.000012
S0003 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
S0004 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000004 0.000004 0.000002 0.000003 0.000006 0.000006 0.000005 0.000004 0.000004 0.000006

5 rows × 17500 columns

In [28]:
# Make sure that all sample IDs in features and labels are identical
assert train_features.index.equals(train_labels.index)

Perform modeling

This competition's task is a multi-label classification problem with 9 label classes—each observation can belong to any number of the label classes. One simple modeling approach for multi-label classification is "one vs. all", in which we create a binary classifier for each label class independently. Then, each binary classifier's predictions are simply concatenated together at the end for the overall prediction. For this benchmark, we will use logistic regression for each classifier as a first-pass modeling approach.

In order to have a reference point for our model, it can help to have a naive baseline. This challenge's metric is aggregated log-loss. Log-loss has the property that it is optimized by statistically calibrated predicted probabilities. Therefore, a good baseline model to use is to predict the proportion of each label class in the training set.

We will also use K-fold cross-validation to evaluate our model, rather than just a simple train-test split. K-fold cross-validation makes use of all of the data we have available for training and evaluation and is a useful technique for cases of having a relatively small amount of labeled data. When generating the cross-validation folds, we will also stratify according to class to make sure that there is even representation of each class for each fold. We will set a random seed in order to use the same cross-validation folds for both the dummy baseline model and the logistic regression model for an apples-to-apples comparison.

In [29]:
# Define stratified k-fold validation
skf = StratifiedKFold(n_splits=5, random_state=42, shuffle=True)

# Define log loss
log_loss_scorer = make_scorer(log_loss, needs_proba=True)

Baseline dummy classifier

In [30]:
# Check log loss score for baseline dummy model
def logloss_cross_val(clf, X, y):

    # Generate a score for each label class
    log_loss_cv = {}
    for col in y.columns:

        y_col = y[col]  # take one label at a time
        log_loss_cv[col] = np.mean(
            cross_val_score(clf, X.values, y_col, cv=skf, scoring=log_loss_scorer)

    avg_log_loss = np.mean(list(log_loss_cv.values()))

    return log_loss_cv, avg_log_loss
In [31]:
# Dummy classifier
dummy_clf = DummyClassifier(strategy="prior")

print("Dummy model cross-validation average log-loss:")
dummy_logloss = logloss_cross_val(dummy_clf, train_features, train_labels[target_cols])
print("\nAggregate log-loss")
Dummy model cross-validation average log-loss:
{'alcohol': 0.14214942747285955,
 'aromatic': 0.31989530853732984,
 'carboxylic_acid': 0.40885054104150764,
 'chlorine_bearing_compound': 0.10676820510533594,
 'hydrocarbon': 0.6324429889239704,
 'mineral': 0.4132828599514662,
 'nitrogen_bearing_compound': 0.3642707864294015,
 'other_oxygen_bearing_compound': 0.12046121550466556,
 'sulfur_bearing_compound': 0.16260642339623016}

Aggregate log-loss

Now that we have a log-loss value from a very naive approach, we have a baseline against which to compare the performance of future approaches.

Logistic regression model

For our logistic regression models, we will also use lasso (a.k.a. L1) regularization, which will reduce the number of features used in the prediction. This will be helpful for the much greater number of features (17,500) than observations. More sophisticated dimensionality reduction techniques, such as principal component analysis (PCA), may be useful, but we will just stick with a simple approach for this benchmark. This is also the same as the benchmark model for the previous competition.

We've found that a regularization weight parameter of C=2 seems to work okay. To further tune this model, we would recommend using a cross-validated hyperparameter optimizer.

In [32]:
# Define Lasso model
logreg_clf = LogisticRegression(penalty="l1", solver="liblinear", C=2)
print("Logistic regression model cross-validation average log-loss:\n")
logreg_logloss = logloss_cross_val(logreg_clf, train_features, train_labels[target_cols])
print("Aggregate log-loss")
Logistic regression model cross-validation average log-loss:

{'alcohol': 0.11342912712762176,
 'aromatic': 0.2379432910544823,
 'carboxylic_acid': 0.26706790063870767,
 'chlorine_bearing_compound': 0.08063769677441938,
 'hydrocarbon': 0.24141374391053233,
 'mineral': 0.24241004115182135,
 'nitrogen_bearing_compound': 0.22322762427634507,
 'other_oxygen_bearing_compound': 0.08595019024218636,
 'sulfur_bearing_compound': 0.10409595076572405}
Aggregate log-loss

Training the model on all of the data

Now that we've used cross-validation to get an idea of our performance, we want to use all of the training data to train our model.

In [33]:
# Train logistic regression model with l1 regularization, where C = 2
def logreg_train(X_train, y_train):

    # Initialize dict to hold fitted models
    logreg_model_dict = {}

    # Split into binary classifier for each class
    for col in y_train.columns:

        y_train_col = y_train[col]  # Train on one class at a time

        # Output the trained model, bind this to a var, then use as input to prediction function
        clf = LogisticRegression(penalty="l1", solver="liblinear", C=2, random_state=42)

        logreg_model_dict[col] =, y_train_col)  # Train

    return logreg_model_dict
In [34]:
fitted_logreg_dict = logreg_train(train_features, train_labels[target_cols])

Preparing a submission

Submissions must follow the format described in the competition guidelines here. We will apply our preprocessing and feature engineering steps on the test features, apply our trained model to them to generate predictions, and shape them into the correct format for submission.

In [37]:
# Create dict with both validation and test sample IDs and paths
all_test_files = val_files
print("Total test files: ", len(all_test_files))
Total test files:  775
In [38]:
# Import submission format
submission_template_df = pd.read_csv(
    DATA_PATH / "submission_format.csv", index_col="sample_id"
compounds_order = submission_template_df.columns
sample_order = submission_template_df.index
In [39]:
def predict_for_sample(sample_id, fitted_model_dict):

    # Import sample
    temp_sample = pd.read_csv(DATA_PATH / all_test_files[sample_id])

    # Preprocess sample
    temp_sample = preprocess_sample(temp_sample)

    # Feature engineering on sample
    temp_sample = int_per_timebin(temp_sample)

    # Generate predictions for each class
    temp_sample_preds_dict = {}

    for compound in compounds_order:
        clf = fitted_model_dict[compound]
        temp_sample_preds_dict[compound] = clf.predict_proba(temp_sample.values)[:, 1][0]

    return temp_sample_preds_dict
In [44]:
# Dataframe to store logreg submissions in
final_submission_df = pd.DataFrame(
        predict_for_sample(sample_id, fitted_logreg_dict)
        for sample_id in tqdm(sample_order)
100%|██████████████████████████████████████████████████████████████████████████████| 775/775 [27:12<00:00,  2.11s/it]
In [45]:
aromatic hydrocarbon carboxylic_acid nitrogen_bearing_compound chlorine_bearing_compound sulfur_bearing_compound alcohol other_oxygen_bearing_compound mineral
S0809 0.476621 0.475221 0.559667 0.592516 0.364717 0.548806 0.400521 0.331917 0.401229
S0810 0.028368 0.015957 0.000644 0.001458 0.003850 0.002616 0.002213 0.005221 0.962852
S0811 0.068873 0.079225 0.138025 0.110880 0.006039 0.011309 0.012507 0.010978 0.475257
S0812 0.288401 0.104584 0.203581 0.146409 0.127592 0.073890 0.093282 0.120167 0.120749
S0813 0.170273 0.037067 0.004696 0.002916 0.004856 0.002137 0.001451 0.004729 0.795318
In [46]:
# Check that columns and rows are the same between final submission and submission format
assert final_submission_df.index.equals(submission_template_df.index)
assert final_submission_df.columns.equals(submission_template_df.columns)

The shape and layout of our prediction dataframe looks good! Each row is a unique sample and each column corresponds to a class. We will now generate a csv to submit.

In [47]:

Upload submission

We can now head over to the competition submissions page to submit our predictions and see how we score! Click on "Make new submission" and upload your formatted prediction CSV file.

This is what we see when we submit our logistic regression-based predictions.

Screen Shot 2022-09-07 at 2.25.29 PM.png

Our logistic regression model got a 0.2278 on the validation set. On the leaderboard, you can see that the naive model (named as "Training Set Prior Distributions") scored 0.2824. Not a bad start! Now, let's try one last approach.

Simple CNN model

In our previous Mars Spectrometry competition, the first-place winner used a neural network-based image classification approach to take home the prize. For this challenge, we will provide a simple example of how to use PyTorch Lightning Flash's image classification task with an out-of-the-box architecture (resnet18) to make predictions on the GCMS data. Neural networks have been used to analyze mass spectra in a variety of domains; audio spectrograms, current and voltage spectrograms, and even spectrograms that come from chemical analyses similar to our own have all been analyzed by scientists using neural networks.

The robustness of the first-place winner's model relied in part on much more refined feature engineering and modeling than we provide here; we leave it to you to improve upon this model. Feel free to check out the winners' repo for the previous Mars Spectrometry competition for inspiration.


In order to train a simple neural network on this data, we will need to perform a few more preprocessing steps. While our previous approach relied on using tabular data as input, here we will use an image representation of the mass spectrogram that is based on the same features we used as input into our logistic regression model. As for the logistic regression model above, there are a variety of ways to engineer meaningful features, and we are simply using the features we have already generated for simplicity. There may be better ways to engineer features that yield better performance for image classification.

In [48]:
# We will turn each of the rows in train_features into an image

rounded_mass 0 ... 349
time_bin (0.0, 0.5] (0.5, 1.0] (1.0, 1.5] (1.5, 2.0] (2.0, 2.5] (2.5, 3.0] (3.0, 3.5] (3.5, 4.0] (4.0, 4.5] (4.5, 5.0] ... (20.0, 20.5] (20.5, 21.0] (21.0, 21.5] (21.5, 22.0] (22.0, 22.5] (22.5, 23.0] (23.0, 23.5] (23.5, 24.0] (24.0, 24.5] (24.5, 25.0]
S0000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
S0001 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
S0002 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000015 0.000020 0.000014 0.000013 0.000015 0.000017 0.000018 0.000012 0.000011 0.000012
S0003 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
S0004 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000004 0.000004 0.000002 0.000003 0.000006 0.000006 0.000005 0.000004 0.000004 0.000006
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
S0804 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
S0805 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
S0806 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
S0807 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
S0808 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000

809 rows × 17500 columns

In [49]:
# Let's create a few directories to store these

IMG_OUTPUT_PATH = PROJ_ROOT / "data/interim/benchmark_testing"

TRAIN_IMG_PATH = IMG_OUTPUT_PATH / "train_features"
VAL_IMG_PATH = IMG_OUTPUT_PATH / "val_features"
TEST_IMG_PATH = IMG_OUTPUT_PATH / "test_features"
In [51]:
# Remove any existing train, val, or test features image directories and make new blank ones

    except FileNotFoundError:

In order to represent our features in a way that makes it easy to visualize and provide as input to our image classification task, we will scale our 0-1 intensity values to 0-255 and replicate our features three times, one for each of the RGB channels.

In [68]:
def row2img(row, save=False):
    temp_arr = row.values.reshape(350, 50, 1)

    # Scale from 0-1 to 0-255, make uint8, squeeze to just two dims
    temp_arr = np.squeeze((temp_arr * 255).astype(np.uint8), axis=2)

    # Stack to 3 dims to mimic rgb image
    temp_arr = np.stack((temp_arr,) * 3, axis=-1)
    temp_img = Image.fromarray(temp_arr)

    # Save as image
    if save:
        outpath = IMG_OUTPUT_PATH / (
            str(metadata.loc[]["features_path"])[0:-4] + ".jpeg"

    return temp_img

Let's observe what our features look like represented as an image!

In [69]:
train_sample_img = row2img(train_features.iloc[0])
plt.imshow(train_sample_img, aspect="auto")
<matplotlib.image.AxesImage at 0x7f871cc70b50>

There are a few points to note here:

  1. Spectra vs. traditional images: While Lightning Flash has built-in pre-trained image classification model options, none of them were trained on images that quite looked like this. Transfer learning, which is the process of finetuning a model that has been trained on one corpus with another, can improve and speed up the training process. However, transfer learning works best when the original corpus the model was pre-trained on resembles the subsequent corpus.

    Here, it is unclear that our spectrogram representations resemble the original ImageNet images that many of the pre-trained models included in Lightning Flash were trained on.

    In this first pass, we will proceed with simply training a model using our own data, rather than finetuning a pretrained model. Pre-trained models may still be of use for spectrograms, but it may require more research and experimentation to understand how to tune them correctly. For example, some elements that may have been learned by a pre-trained image classification model, such as the position and orientation of lines, may also be useful in working with spectrograms. However some of the other elements that involve color, organic shapes, etc. may not be as useful. Selecting layers of pretrained models to "freeze" or "unfreeze" is an advanced strategy that is implementable using Lightning Flash, but we will forgo it here in favor of keeping our modeling approach simple.

  2. Experiment time and image size: Second, this image representation reminds us that although not all experiments were the same length, the images for each representation will be. This means that the times for which there was no data collected will be represented as 0. This "padding" of the image is sufficient for this benchmark, but there may be other approaches.

In [70]:
# Apply to all training set images
train_features.apply(lambda x: row2img(x, True), axis=1)
S0000    <PIL.Image.Image image mode=RGB size=50x350 at 0x7F876ECF0940>
S0001    <PIL.Image.Image image mode=RGB size=50x350 at 0x7F86487D9040>
S0002    <PIL.Image.Image image mode=RGB size=50x350 at 0x7F871CC70B80>
S0003    <PIL.Image.Image image mode=RGB size=50x350 at 0x7F8648FB3070>
S0004    <PIL.Image.Image image mode=RGB size=50x350 at 0x7F8648FB3400>
S0804    <PIL.Image.Image image mode=RGB size=50x350 at 0x7F86488B23D0>
S0805    <PIL.Image.Image image mode=RGB size=50x350 at 0x7F86488B2400>
S0806    <PIL.Image.Image image mode=RGB size=50x350 at 0x7F86488B2430>
S0807    <PIL.Image.Image image mode=RGB size=50x350 at 0x7F86488B2460>
S0808    <PIL.Image.Image image mode=RGB size=50x350 at 0x7F86488B2490>
Length: 809, dtype: object
In [55]:
# Assembling preprocessed and transformed training set

test_features_dict = {}
print("Total number of test files: ", len(all_test_files))

for i, (sample_id, filepath) in enumerate(tqdm(all_test_files.items())):

    # Load training sample
    temp = pd.read_csv(DATA_PATH / filepath)

    # Preprocessing training sample
    test_sample_pp = preprocess_sample(temp)

    # Feature engineering
    test_sample_fe = int_per_timebin(test_sample_pp).reset_index(drop=True)
    test_features_dict[sample_id] = test_sample_fe

test_features = pd.concat(
    test_features_dict, names=["sample_id", "dummy_index"]
).reset_index(level="dummy_index", drop=True)
Total number of test files:  775
100%|██████████████████████████████████████████████████████████████████████████████| 775/775 [15:27<00:00,  1.20s/it]
In [71]:
# Apply to all training set images
test_features.apply(lambda x: row2img(x, True), axis=1)
S0809    <PIL.Image.Image image mode=RGB size=50x350 at 0x7F871CC538B0>
S0810    <PIL.Image.Image image mode=RGB size=50x350 at 0x7F87086C3430>
S0811    <PIL.Image.Image image mode=RGB size=50x350 at 0x7F87086C3FD0>
S0812    <PIL.Image.Image image mode=RGB size=50x350 at 0x7F86A915D8E0>
S0813    <PIL.Image.Image image mode=RGB size=50x350 at 0x7F876ECFA0D0>
S1579    <PIL.Image.Image image mode=RGB size=50x350 at 0x7F8648748850>
S1580    <PIL.Image.Image image mode=RGB size=50x350 at 0x7F8648748880>
S1581    <PIL.Image.Image image mode=RGB size=50x350 at 0x7F86487488B0>
S1582    <PIL.Image.Image image mode=RGB size=50x350 at 0x7F86487488E0>
S1583    <PIL.Image.Image image mode=RGB size=50x350 at 0x7F8648748910>
Length: 775, dtype: object
In [72]:
# Store mean and std deviation for normalization later

train_mean = train_features.values.mean()
train_std = train_features.values.std()

print("Training set mean: ", train_mean)
print("Training set standard deviation: ", train_std)
Training set mean:  0.00164897646698432
Training set standard deviation:  0.02051133402028126

Let's also include where to find each corresponding image in the labels file, and make sure that each label is an integer.

In [73]:
train_labels = pd.read_csv(DATA_PATH / "train_labels.csv", index_col="sample_id")

# Collect target class names to use later
target_cols = train_labels.columns.tolist()

# Make sure each label is an integer
train_labels = train_labels.astype(int)

# Include path to jpeg
train_labels["fpath_jpeg"] = TRAIN_IMG_PATH / (train_labels.index + ".jpeg")
aromatic hydrocarbon carboxylic_acid nitrogen_bearing_compound chlorine_bearing_compound sulfur_bearing_compound alcohol other_oxygen_bearing_compound mineral fpath_jpeg
S0000 0 0 0 0 0 0 0 0 1 /Users/ishashah/DrivenData/competition-nasa-mars-gcms/data/interim/benchmark...
S0001 0 0 0 0 0 0 0 0 0 /Users/ishashah/DrivenData/competition-nasa-mars-gcms/data/interim/benchmark...
S0002 0 0 1 1 0 0 0 0 1 /Users/ishashah/DrivenData/competition-nasa-mars-gcms/data/interim/benchmark...
S0003 0 1 0 0 0 0 0 0 0 /Users/ishashah/DrivenData/competition-nasa-mars-gcms/data/interim/benchmark...
S0004 0 0 0 0 1 0 0 0 1 /Users/ishashah/DrivenData/competition-nasa-mars-gcms/data/interim/benchmark...
In [74]:
# Check to make sure paths look ok

In [82]:
# Let's do the same for the test features, using the submission format file

submission_format = pd.read_csv(
    DATA_PATH / "submission_format.csv", index_col="sample_id"
submission_format["fpath_jpeg"] = submission_format.index.to_series().apply(
    lambda x: IMG_OUTPUT_PATH
    / (str(metadata.loc[x]["split"]) + "_features")
    / (str(x) + ".jpeg")


Choosing which model to implement and how can be a complicated decision. Here, we have decided for simplicity to use resnet18, which is a deep learning architecture that has been used to success on some image classification tasks (a schema for resnet18 as well as its counterparts with more layers can be found here). The architecture is one of the built-in backbones that Lightning Flash offers. The winner for the previous Mars Spectrometry competition used resnet34.

In [76]:
# For reproducibility
seed_everything(42, workers=True)
Global seed set to 42

A DataModule is a Lightning class that allows for easy ingestion of data by Lightning processing mechanisms. It includes information about where the inputs can be found, their labels, the size of the batch they should be ingested in, and how they should be transformed. Here, we are specifying that the dataframe train_labels contains the path to our input images (fpath_jpeg), and the labels that correspond to each sample (the list target_cols).

Normalization is a common preprocessing step when using a neural network for image classification. Since we are training on our own data rather than finetuning a pre-trained model, we will normalize to our own dataset's mean and standard deviation. Specifying the image size is also a common tranformation step, which you may leverage if your preprocessing results in images of varying sizes. It is also common when using pre-trained models to choose to normalize and resize one's images with the same parameters as those of the initial training set.

Finally, while it is typical practice to split the training set and use some of it for local evaluation, we are not doing so in this example in the interest of keeping this example computationally inexpensive.

In [77]:
# 1. Create DataModule

datamodule_train = ImageClassificationData.from_data_frame(
        "mean": (train_mean, train_mean, train_mean),
        "std": (train_std, train_std, train_std),

As we build our ImageClassifier, we want to specify a few key parameters:

  • backbone: As mentioned earlier, resnet architectures are built-in backbone options for Lightning Flash, and all we need to do to use resnet18 is to set backbone="resnet18".

  • multi_label: The default ImageClassifier is for single-label data, so we must specify that our problem is a multi-label classification one.

In [78]:
# 2. Build the task

model_10 = ImageClassifier(

In order to retrieve the probabilities of every target class rather than the labels, we will attach ProbabilitiesOutput() to our model and specify that we are seeking multi-label classification. (The default value for the multi_label argument is False).

In [79]:
# 3. Attach the Output
model_10.output = ProbabilitiesOutput(multi_label=True)

When it comes to deciding on the number of epochs to train our model for, we want to account for the relatively small size of our dataset and a reasonable time constraint. 10 epochs seems like an appropriate starting point, as it is unlikely the model with overfit at this number, and we will be able to get predictions relatively quickly.

In [80]:
# 3. Create the trainer and train the model

trainer = flash.Trainer(max_epochs=10, gpus=torch.cuda.device_count()), datamodule=datamodule_train)
/Users/ishashah/opt/anaconda3/envs/competition-nasa-mars-gcms/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/ LightningDeprecationWarning: Setting `Trainer(gpus=0)` is deprecated in v1.7 and will be removed in v2.0. Please use `Trainer(accelerator='gpu', devices=0)` instead.
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/Users/ishashah/opt/anaconda3/envs/competition-nasa-mars-gcms/lib/python3.8/site-packages/pytorch_lightning/trainer/ PossibleUserWarning: You defined a `validation_step` but have no `val_dataloader`. Skipping val loop.

  | Name          | Type           | Params
0 | train_metrics | ModuleDict     | 0     
1 | val_metrics   | ModuleDict     | 0     
2 | test_metrics  | ModuleDict     | 0     
3 | adapter       | DefaultAdapter | 11.2 M
11.2 M    Trainable params
0         Non-trainable params
11.2 M    Total params
44.725    Total estimated model params size (MB)
/Users/ishashah/opt/anaconda3/envs/competition-nasa-mars-gcms/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/ PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 8 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
Epoch 9: 100%|█| 809/809 [02:52<00:00,  4.69it/s, loss=0.118, v_num=30, train_f1score_step=0.000, train_binary_cross_
`` stopped: `max_epochs=10` reached.
Epoch 9: 100%|█| 809/809 [02:53<00:00,  4.68it/s, loss=0.118, v_num=30, train_f1score_step=0.000, train_binary_cross_
In [83]:
# 4. Predict the targets

datamodule_predict = ImageClassificationData.from_data_frame(
        "mean": (train_mean, train_mean, train_mean),
        "std": (train_std, train_std, train_std),
In [84]:
predictions_10 = trainer.predict(
    model_10, datamodule=datamodule_predict, output=model_10.output

# 5. Save the model
/Users/ishashah/opt/anaconda3/envs/competition-nasa-mars-gcms/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/ PossibleUserWarning: The dataloader, predict_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 8 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
Predicting DataLoader 0: 100%|█████████████████████████████████████████████████████| 775/775 [00:40<00:00, 19.28it/s]
In [85]:
resnet18_e10_nopre_df = pd.DataFrame(
    np.array(predictions_10).squeeze(axis=1), index=sample_order, columns=target_cols
aromatic hydrocarbon carboxylic_acid nitrogen_bearing_compound chlorine_bearing_compound sulfur_bearing_compound alcohol other_oxygen_bearing_compound mineral
S0809 0.072457 0.077199 0.058062 0.017176 0.011595 0.006512 0.003139 0.022854 0.047951
S0810 0.028844 0.012427 0.021854 0.007058 0.009996 0.001647 0.002024 0.012938 0.170419
S0811 0.049311 0.257839 0.064799 0.032371 0.021331 0.011655 0.006261 0.044474 0.052181
S0812 0.070141 0.266545 0.064699 0.027525 0.037419 0.016334 0.010062 0.074294 0.085174
S0813 0.057369 0.231447 0.048100 0.022557 0.027260 0.011107 0.006816 0.051931 0.085969

Like for our logistic regression model, let's check to make sure that the row and column order match the submission format before submitting.

In [86]:
assert resnet18_e10_nopre_df.index.equals(submission_template_df.index)
assert resnet18_e10_nopre_df.columns.equals(submission_template_df.columns)
In [87]:

When we upload the submission to the platform, we will see this:

Screen Shot 2022-09-07 at 6.09.08 PM.png

While the performance for this particular model isn't too impressive compared to our logistic regression (it performs about the same as the naive model), it might be a good starting point upon which to iterate. We look forward to seeing what you come up with!


We hope this benchmark solution provided a helpful framework for getting started. All three of the benchmark models have been added to the leaderboard for reference. Head over to the Mars Spectrometry 2: Gas Chromatography homepage for more details on the competition, and you can also check out the forum. We can't wait to see what you build!

Image of Curiosity courtesy of NASA/JPL-Caltech/MSSS