
Genetic Engineering Attribution - Benchmark


by Christine Chung

Genetic engineering is a powerful tool with a wide array of applications. It's being used to do everything from creating safe alternatives to pesticides to producing critical ingredients for medicines and industry. However, its rise and increasing accessibility also come with the potential for lab accidents, purposeful wrongdoing, and abuse.

Genetic engineering attribution is a tool that can help promote accountability and deter bad actors. In our newest competition, we're asking you to create an algorithm that identifies the lab-of-origin for genetically engineered DNA as accurately as possible. The metric for this competition is top ten accuracy.

CDC scientist examines the results of a hemagglutinin inhibition (HI) test.

The DNA for this challenge comes from an online dataset of plasmids. Plasmids are small, circular DNA molecules that replicate independently from chromosomes. Plasmids commonly used in research come from cells in mammals, bacteria, worms, yeast, and insects. Plasmid information can be uploaded and accessed digitally thanks to AddGene, a large, non-profit plasmid repository based in Watertown, MA.

In this benchmark, we'll cover how to:

  • load the data
  • create text-based features from DNA sequences
  • train a random forest model and make predictions
  • create a custom scorer based on top-ten accuracy
  • submit predictions to the competition

To get started, we import the standard tools of the trade.

In [1]:
from pathlib import Path

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns


%matplotlib inline
sns.set()

Loading the Data

On the data download page, we provide everything you need to get started:

  • Training Features: These are the features you'll use to train a model. There are 40 columns: the raw plasmid DNA sequence plus 39 binary metadata features about the plasmids. Each row has an identifier called sequence_id, a unique 5-character alphanumeric string.
  • Training Labels: These are the labels for the training features. Every sequence_id in the training features has a corresponding label in this file, given by a lab_id. A lab_id is a unique, 8-character alphanumeric identifier for a particular lab. The labels are one-hot encoded.
  • Test Features: These are the features you'll use to make predictions after training a model. We don't give you the labels for these samples; it's up to you to predict labs for these sequence_ids.
  • Submission Format: This is what a submission should look like, including the file, column, and row names. Since evaluation is based on top ten accuracy, you'll need to submit class probabilities instead of a single class prediction. The submission format has random floats between 0 and 1 filled in as a baseline. Your submission to the leaderboard must be in this exact form (with different probability values, of course) in order to be scored successfully!

Since this is a benchmark, we're only going to use a subset of the features in the dataset. It's up to you to take advantage of all the information!

In [2]:
DATA_DIR = Path.cwd().parent / 'data' / 'final' / 'public'
In [3]:
train_values = pd.read_csv(DATA_DIR / 'train_values.csv', index_col='sequence_id')
train_labels = pd.read_csv(DATA_DIR / 'train_labels.csv', index_col='sequence_id')
test_values = pd.read_csv(DATA_DIR / 'test_values.csv', index_col='sequence_id')

Explore the Data

Let's take a quick look at all of the data we have available. First let's check out our features.

Training set

In [4]:
train_values.head()
Out[4]:
sequence bacterial_resistance_ampicillin bacterial_resistance_chloramphenicol bacterial_resistance_kanamycin bacterial_resistance_other bacterial_resistance_spectinomycin copy_number_high_copy copy_number_low_copy copy_number_unknown growth_strain_ccdb_survival ... species_budding_yeast species_fly species_human species_mouse species_mustard_weed species_nematode species_other species_rat species_synthetic species_zebrafish
sequence_id
9ZIMC CATGCATTAGTTATTAATAGTAATCAATTACGGGGTCATTAGTTCA... 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
5SAQC GCTGGATGGTTTGGGACATGTGCAGCCCCGTCTCTGTATGGAGTGA... 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
E7QRO NNCCGGGCTGTAGCTACACAGGGCGGAGATGAGAGCCCTACGAAAG... 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
CT5FP GCGGAGATGAAGAGCCCTACGAAAGCTGAGCCTGCGACTCCCGCAG... 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
7PTD8 CGCGCATTACTTCACATGGTCCTCAAGGGTAACATGAAAGTGATCC... 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 40 columns

We have one non-numeric column, sequence, which contains the DNA plasmid sequence.

In [5]:
train_values.dtypes
Out[5]:
sequence                                 object
bacterial_resistance_ampicillin         float64
bacterial_resistance_chloramphenicol    float64
bacterial_resistance_kanamycin          float64
bacterial_resistance_other              float64
bacterial_resistance_spectinomycin      float64
copy_number_high_copy                   float64
copy_number_low_copy                    float64
copy_number_unknown                     float64
growth_strain_ccdb_survival             float64
growth_strain_dh10b                     float64
growth_strain_dh5alpha                  float64
growth_strain_neb_stable                float64
growth_strain_other                     float64
growth_strain_stbl3                     float64
growth_strain_top10                     float64
growth_strain_xl1_blue                  float64
growth_temp_30                          float64
growth_temp_37                          float64
growth_temp_other                       float64
selectable_markers_blasticidin          float64
selectable_markers_his3                 float64
selectable_markers_hygromycin           float64
selectable_markers_leu2                 float64
selectable_markers_neomycin             float64
selectable_markers_other                float64
selectable_markers_puromycin            float64
selectable_markers_trp1                 float64
selectable_markers_ura3                 float64
selectable_markers_zeocin               float64
species_budding_yeast                   float64
species_fly                             float64
species_human                           float64
species_mouse                           float64
species_mustard_weed                    float64
species_nematode                        float64
species_other                           float64
species_rat                             float64
species_synthetic                       float64
species_zebrafish                       float64
dtype: object

The lengths of these sequences vary widely. Some are as short as 20 characters, and others are as long as 60,000!

In [6]:
sequence_lengths = train_values.sequence.apply(len)

sequence_lengths.describe()
Out[6]:
count    63017.000000
mean      4839.025501
std       3883.148431
min         20.000000
25%        909.000000
50%       4741.000000
75%       7490.000000
max      60099.000000
Name: sequence, dtype: float64
In [7]:
sequence_lengths.plot(
    kind='hist', 
    title='Distribution of DNA Sequence Lengths', 
    bins=100,
    xlim=(0, 20000));  # don't plot super long sequences

The rest of our features are binary, one-hot encoded columns.

In [8]:
# exclude the 0th column which is the dna sequence
train_values.iloc[:, 1:].apply(pd.value_counts)
Out[8]:
bacterial_resistance_ampicillin bacterial_resistance_chloramphenicol bacterial_resistance_kanamycin bacterial_resistance_other bacterial_resistance_spectinomycin copy_number_high_copy copy_number_low_copy copy_number_unknown growth_strain_ccdb_survival growth_strain_dh10b ... species_budding_yeast species_fly species_human species_mouse species_mustard_weed species_nematode species_other species_rat species_synthetic species_zebrafish
0.0 19590 60506 48706 62078 58592 18750 57313 50059 61841 61285 ... 61417 61478 36890 57732 61978 62322 56381 61949 55170 61742
1.0 43427 2511 14311 939 4425 44267 5704 12958 1176 1732 ... 1600 1539 26127 5285 1039 695 6636 1068 7847 1275

2 rows × 39 columns

Since these columns are binary, the mean of each column gives its prevalence: the proportion of sequences that have that characteristic. Check out the problem description page for definitions of each feature.

This plot shows us that a few features are common across sequences but most are quite rare.

In [9]:
sorted_binary_features = train_values.iloc[:, 1:].mean().sort_values()

ax = sorted_binary_features.plot(kind='barh',
                                 stacked=True,
                                 figsize=(5, 12),
                                 title='Prevalence of Binary Features')
ax.set_xlabel('Proportion of sequences');

Test set

Our test set features are set up in the same way as the training set.

In [10]:
test_values.head()
Out[10]:
sequence bacterial_resistance_ampicillin bacterial_resistance_chloramphenicol bacterial_resistance_kanamycin bacterial_resistance_other bacterial_resistance_spectinomycin copy_number_high_copy copy_number_low_copy copy_number_unknown growth_strain_ccdb_survival ... species_budding_yeast species_fly species_human species_mouse species_mustard_weed species_nematode species_other species_rat species_synthetic species_zebrafish
sequence_id
E0VFT AGATCTATACATTGAATCAATATTGGCAATTAGCCATATTAGTCAT... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
TTRK5 GCGCGCGTTGACATTGATTATTGACTAGTTATTAATAGTAATCAAT... 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2Z7FZ GCTTAAGCGGTCGACGGATCGGGAGATCTCCCGATCCCCTATGGTG... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
VJI6E ATGATGATGATGTCCCTGAACAGCAAGCAGGCGTTTAGCATGCCGC... 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
721FI GGTACCGAGCTCTTACGCGTGCTAGCCATACTATCAGCCACTTGTG... 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 40 columns

In [11]:
test_values.sequence.apply(len).describe()
Out[11]:
count    18816.000000
mean      4875.523810
std       4004.117614
min         19.000000
25%        894.750000
50%       4732.000000
75%       7341.250000
max      38638.000000
Name: sequence, dtype: float64

Training Labels

Now let's look at our labels. It's important to recognize that these are one-hot encoded. Each column represents a lab ID. The correct lab for a given sequence_id is indicated by a 1.0 in that column, and 0.0 otherwise.

In [12]:
train_labels.head()
Out[12]:
00Q4V31T 012VT4JK 028IO5W2 03GRNN7N 03Y3W51H 09MQV1TY 0A4AHRCT 0A9M05NC 0B9GCUVV 0CL7QVG8 ... ZQNGGY33 ZSHS4VJZ ZT1IP3T6 ZU6860XU ZU6TVFFU ZU75P59K ZUI6TDWV ZWFD8OHC ZX06ZDZN ZZJVE4HO
sequence_id
9ZIMC 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5SAQC 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
E7QRO 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
CT5FP 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7PTD8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 1314 columns

To make our modeling and analysis simpler, let's collapse the labels dataframe into one column that shows us the lab ID for each sequence.

In [13]:
# get the column with the max value in each row
lab_ids = pd.DataFrame(train_labels.idxmax(axis=1), columns=['lab_id'])
lab_ids.head()
Out[13]:
lab_id
sequence_id
9ZIMC RYUA3GVO
5SAQC RYUA3GVO
E7QRO RYUA3GVO
CT5FP RYUA3GVO
7PTD8 RYUA3GVO

Labs can have anywhere from 1 to over 8,000 sequences in the data. On average, there are about 48 sequences per lab in the training data.

In [14]:
# get the distribution of lab prevalence in the training set
lab_ids['lab_id'].value_counts().describe()
Out[14]:
count    1314.000000
mean       47.958143
std       262.552258
min         1.000000
25%         9.000000
50%        15.000000
75%        34.000000
max      8286.000000
Name: lab_id, dtype: float64

Sorting labs by prevalence, we see that lab ID I7FXTVDP accounts for 13% of the sequences in the training set. The top 3 labs account for more than 20% of the data. We'll have to be mindful of class imbalance in our training.

In [15]:
# Sort lab ids by prevalence
(lab_ids['lab_id'].value_counts(normalize=True)
        .sort_values(ascending=False)).head()
Out[15]:
I7FXTVDP    0.131488
RKJHZGDQ    0.043353
GTVTUGVY    0.042401
A18S09P2    0.016884
Q2K8NHZY    0.015440
Name: lab_id, dtype: float64

Construct Features from DNA Sequences

For this benchmark, we're going to use the DNA sequences as the basis for the features in our model. We've got some preprocessing work to do to turn these lengthy strings into useful features!

These sequences are composed of five characters. G, C, A, and T represent the four nucleotides commonly found in DNA (guanine, cytosine, adenine, thymine). N stands for any nucleotide (not a gap).

In [16]:
bases = set(''.join(train_values.sequence.values))
bases
Out[16]:
{'A', 'C', 'G', 'N', 'T'}

One common way to turn strings into useful features is to count n-grams, or contiguous subsequences of length n. Here, we'll split up the DNA sequences into four-grams, or subsequences consisting of 4 bases.

With 5 unique bases, we can produce 120 different sequence permutations consisting of 4 bases. You can play around with the length of n to see how it affects your model.

In [17]:
from itertools import permutations

n = 4
subsequences = [''.join(permutation) for permutation in permutations(bases, r=n)]
In [18]:
print(f"Number of subsequences: {len(subsequences)}")
subsequences[:10]
Number of subsequences: 120
Out[18]:
['ANTG',
 'ANTC',
 'ANGT',
 'ANGC',
 'ANCT',
 'ANCG',
 'ATNG',
 'ATNC',
 'ATGN',
 'ATGC']
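One thing to note: permutations never repeat a character, so subsequences like "AAAA" are not among the 120. If you'd like to experiment with all 5^4 = 625 possible 4-grams instead, here's a small sketch (not part of this benchmark) using itertools.product:

from itertools import product

# every possible 4-gram when repeated bases are allowed: 5**4 = 625
all_possible_subsequences = [''.join(p) for p in product(sorted(bases), repeat=n)]
len(all_possible_subsequences)  # 625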

We can now turn our strings into features by taking a simple count of each subsequence. We could do this in one of two ways:

  1. Overlapping substrings: To count overlapping substrings, we would use a sliding window such that the sequence "ATTATTA" will result in a count of 2 for the substring "ATTA" ("ATTA-TTA" and "ATT-ATTA").

  2. Non-overlapping substrings: To count non-overlapping substrings, we search for each substring. If we find it, we count it and then continue our search at the end of the substring. In this case, "ATTATTA" will result in a count of 1 for the substring "ATTA" ("ATTA-TTA").

For simplicity, we're going to use the second method of non-overlapping substrings. We can use the built-in count method on strings. Feel free to try both methods to see how it affects your model.

In [19]:
# Example of built-in count method on strings
# Because it's non-overlapping, "atta" is only counted twice (not three times)
"gattattattaca".count("atta")
Out[19]:
2
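If you'd rather count overlapping substrings (method 1 above), one lightweight option, sketched here purely for illustration, is a regex lookahead:

import re

# '(?=atta)' matches at every position where "atta" starts without consuming
# characters, so overlapping occurrences are counted: 3 here vs. 2 above
len(re.findall('(?=atta)', 'gattattattaca'))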

Now let's create a helper function to generate features. get_ngram_features will create a new dataframe that holds the count of each subsequence for each row in our data.

In [20]:
def get_ngram_features(data, subsequences):
    """Generates counts for each subsequence.

    Args:
        data (DataFrame): The data you want to create features from. Must include a "sequence" column.
        subsequences (list): A list of subsequences to count.

    Returns:
        DataFrame: A DataFrame with one column for each subsequence.
    """
    features = pd.DataFrame(index=data.index)
    for subseq in subsequences:
        # Series.str.count counts non-overlapping matches of the pattern,
        # which is exactly the method 2 counting described above
        features[subseq] = data.sequence.str.count(subseq)
    return features

Using our helper function, we can generate features for all the sequences in our training set! This will take a minute or two to run.

In [21]:
# Calculate n-gram features on our training set
ngram_features = get_ngram_features(train_values, subsequences)
ngram_features.head()
Out[21]:
ANTG ANTC ANGT ANGC ANCT ANCG ATNG ATNC ATGN ATGC ... CTNA CTNG CTGA CTGN CGAN CGAT CGNA CGNT CGTA CGTN
sequence_id
9ZIMC 0 0 0 0 0 0 0 0 0 30 ... 0 0 44 0 0 17 0 0 14 0
5SAQC 0 0 0 0 0 0 0 0 0 2 ... 0 0 6 0 0 0 0 0 0 0
E7QRO 0 0 0 1 0 1 0 0 0 1 ... 0 0 2 2 1 1 0 0 0 0
CT5FP 0 0 0 0 0 0 0 0 0 2 ... 0 0 8 0 0 0 0 0 1 0
7PTD8 0 0 0 1 0 0 0 0 0 6 ... 0 0 4 0 0 1 0 0 2 0

5 rows × 120 columns

In [22]:
ngram_features.shape
Out[22]:
(63017, 120)

We now have features for all 120 possible subsequences. Their values show the counts of each 4-gram within the full DNA sequence.

Let's join them with our one-hot encoded binary features.

In [23]:
all_features = ngram_features.join(train_values.drop('sequence', axis=1))
In [24]:
all_features.head()
Out[24]:
ANTG ANTC ANGT ANGC ANCT ANCG ATNG ATNC ATGN ATGC ... species_budding_yeast species_fly species_human species_mouse species_mustard_weed species_nematode species_other species_rat species_synthetic species_zebrafish
sequence_id
9ZIMC 0 0 0 0 0 0 0 0 0 30 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
5SAQC 0 0 0 0 0 0 0 0 0 2 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
E7QRO 0 0 0 1 0 1 0 0 0 1 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
CT5FP 0 0 0 0 0 0 0 0 0 2 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0
7PTD8 0 0 0 1 0 0 0 0 0 6 ... 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 159 columns

In [25]:
all_features.shape
Out[25]:
(63017, 159)

The Error Metric: Top Ten Accuracy

The goal for this competition is to help narrow down the field of possible labs-of-origin from thousands to just a few. To that end, predictions will be evaluated based on top-ten accuracy. That means we'll consider a prediction "correct" if the true lab-of-origin is in the top ten most likely labs.

There is no built-in evaluation metric for top-k accuracy in scikit-learn, so we'll be constructing a custom scorer. We'll use this to determine the final accuracy of our model.

First, we'll need to create a custom scorer that can take in an estimator, validation data, and labels, and output a score based on the top ten results from each prediction. Scikit-learn lets us do this if we adhere to a specific signature.

From the documentation:

For a callable to be a scorer, it needs to meet the protocol specified by the following two rules:

It can be called with parameters (estimator, X, y), where estimator is the model that should be evaluated, X is validation data, and y is the ground truth target for X (in the supervised case) or None (in the unsupervised case).

It returns a floating point number that quantifies the estimator prediction quality on X, with reference to y. Again, by convention higher numbers are better, so if your scorer returns loss, that value should be negated.

In [26]:
def top10_accuracy_scorer(estimator, X, y):
    """A custom scorer that evaluates a model on whether the correct label is in 
    the top 10 most probable predictions.

    Args:
        estimator (sklearn estimator): The sklearn model that should be evaluated.
        X (numpy array): The validation data.
        y (numpy array): The ground truth labels.

    Returns:
        float: Accuracy of the model as defined by the proportion of predictions
               in which the correct label was in the top 10. Higher is better.
    """
    # predict the probabilities across all possible labels for rows in our training set
    probas = estimator.predict_proba(X)
    
    # get the indices of the top 10 predictions for each row; these are the last ten in each row
    # Note: We use argpartition, which runs in O(n), rather than argsort, which is O(n log n).
    # We can do this because we only need the top ten partitioned off, not fully sorted.
    # Documentation: https://numpy.org/doc/1.18/reference/generated/numpy.argpartition.html
    top10_idx = np.argpartition(probas, -10, axis=1)[:, -10:]
    
    # index into the classes list using the top ten indices to get the class names
    top10_preds = estimator.classes_[top10_idx]

    # check if y-true is in top 10 for each set of predictions
    mask = top10_preds == y.reshape((y.size, 1))
    
    # take the mean
    top_10_accuracy = mask.any(axis=1).mean()
 
    return top_10_accuracy

Build the Model

Random forests are often a good first model to try, so we'll start there. We'll leave more complicated modeling and feature selection up to you!

Random Forest

It's easy to build a random forest model with scikit-learn. We're going to create a simple model with a few specified hyperparameters.

Tip: If you wanted to get fancy, you could perform a cross-validated grid search with GridSearchCV. You could even use the custom scorer we made earlier to train your model by passing it in via the scoring parameter.
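For instance, here's a minimal sketch of what that could look like; the parameter grid is purely illustrative, and the fit call is commented out since it uses the X and y we define below:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# an illustrative (hypothetical) parameter grid
param_grid = {'n_estimators': [100, 150], 'max_depth': [3, 5]}

grid_search = GridSearchCV(
    RandomForestClassifier(class_weight='balanced', random_state=0),
    param_grid=param_grid,
    scoring=top10_accuracy_scorer,  # our custom (estimator, X, y) callable
    cv=3,
)
# grid_search.fit(X, y)
# grid_search.best_params_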

In [27]:
from sklearn.ensemble import RandomForestClassifier

Let's rename our feature matrix to X and create our label array y to match.

In [28]:
# Rename our feature array
X = all_features

# Create our labels
y = lab_ids.values.ravel()

We've got our features and our labels. Time to train!

Wait a sec... aren't we forgetting something? Oh right, we still have to address the class imbalance we discovered earlier. Luckily, scikit-learn has an easy solution for us. We can set class_weight to "balanced". This will set class weights inversely proportional to the class frequency.
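To see what "balanced" actually computes, here's a quick sketch on hypothetical toy labels; the weights come out to n_samples / (n_classes * count_of_each_class):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# toy labels (hypothetical): three sequences from lab_a, one from lab_b
toy_y = np.array(['lab_a', 'lab_a', 'lab_a', 'lab_b'])

# lab_a -> 4 / (2 * 3) ≈ 0.67, lab_b -> 4 / (2 * 1) = 2.0,
# so the rarer lab gets three times the weight
compute_class_weight(class_weight='balanced', classes=np.unique(toy_y), y=toy_y)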

Great! Now let's generate and fit our model.

In [29]:
# instantiate our RF Classifier
rf = RandomForestClassifier(
    n_jobs=4, 
    n_estimators=150,
    class_weight='balanced', # balance classes
    max_depth=3, # shallow tree depth to prevent overfitting
    random_state=0 # set a seed for reproducibility
)

# fit our model
rf.fit(X, y)
Out[29]:
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight='balanced',
                       criterion='gini', max_depth=3, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=150, n_jobs=4,
                       oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

Let's see how our model scored on our training data.

In [30]:
rf.score(X, y)
Out[30]:
0.1929955408857927

Nice! 19.30% accuracy isn't too shabby for our first try. However, that's using the default scorer, which is vanilla top-1 accuracy. We should expect to do better on the competition metric, top-10 accuracy. Let's use our custom defined scorer to see how we did:

In [31]:
top10_accuracy_scorer(rf, X, y)
Out[31]:
0.3784534332005649

Our model got about 37.85% top ten accuracy. Not bad! Now let's make some predictions using our newly trained model.

Time to Predict and Submit

Now we'll create features from our test set and generate predictions. We need to make sure to generate probabilities for each lab ID so that the top 10 accuracy metric can be computed.

First, let's create the 4-gram features and join them with the binary features.

In [32]:
test_ngram_features = get_ngram_features(test_values, subsequences)
all_test_features = test_ngram_features.join(test_values.drop('sequence', axis=1))

Make Predictions

It's important to use predict_proba here, which gives us class probabilities, instead of predict, which would give us class predictions.

In [33]:
probas = rf.predict_proba(all_test_features)

# Examine first row
probas[0]
Out[33]:
array([0.00087402, 0.00107841, 0.00107348, ..., 0.00095991, 0.00026639,
       0.00017803])

Save Submission

Now let's get our matrix of probabilities ready for submission.

In [34]:
submission_format = pd.read_csv(DATA_DIR / 'submission_format.csv', index_col='sequence_id')
In [35]:
submission_format.head()
Out[35]:
00Q4V31T 012VT4JK 028IO5W2 03GRNN7N 03Y3W51H 09MQV1TY 0A4AHRCT 0A9M05NC 0B9GCUVV 0CL7QVG8 ... ZQNGGY33 ZSHS4VJZ ZT1IP3T6 ZU6860XU ZU6TVFFU ZU75P59K ZUI6TDWV ZWFD8OHC ZX06ZDZN ZZJVE4HO
sequence_id
E0VFT 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
TTRK5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2Z7FZ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
VJI6E 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
721FI 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 1314 columns

Before we create our submission CSV, let's make sure the submission format matches the shape of our predicted probabilities. We're also going to check that the predicted classes are in the same order as the submission format columns.

In [36]:
assert submission_format.shape == probas.shape
assert (rf.classes_ == submission_format.columns).all()
In [37]:
my_submission = pd.DataFrame(data=probas, 
                             columns=rf.classes_, 
                             index=submission_format.index)
In [38]:
my_submission.head()
Out[38]:
00Q4V31T 012VT4JK 028IO5W2 03GRNN7N 03Y3W51H 09MQV1TY 0A4AHRCT 0A9M05NC 0B9GCUVV 0CL7QVG8 ... ZQNGGY33 ZSHS4VJZ ZT1IP3T6 ZU6860XU ZU6TVFFU ZU75P59K ZUI6TDWV ZWFD8OHC ZX06ZDZN ZZJVE4HO
sequence_id
E0VFT 0.000874 0.001078 0.001073 0.000082 0.005858 0.001078 0.001029 0.001055 0.000215 0.000151 ... 0.000911 0.000564 0.000167 0.000182 0.001259 0.000672 0.001338 0.000960 0.000266 0.000178
TTRK5 0.000838 0.000989 0.001024 0.000162 0.000982 0.001868 0.000762 0.000967 0.000359 0.000110 ... 0.001487 0.000787 0.000294 0.000103 0.000813 0.000824 0.000936 0.002691 0.000228 0.000162
2Z7FZ 0.000665 0.000815 0.000799 0.000290 0.000686 0.001456 0.002528 0.001027 0.000163 0.000117 ... 0.001060 0.000793 0.000116 0.000119 0.001003 0.001022 0.001071 0.000761 0.000434 0.000328
VJI6E 0.000447 0.000135 0.000120 0.001868 0.000245 0.000108 0.000086 0.000122 0.002021 0.002008 ... 0.000158 0.000207 0.002026 0.001736 0.000174 0.000094 0.000115 0.000202 0.001320 0.001670
721FI 0.000947 0.001126 0.001121 0.000232 0.000970 0.000690 0.000904 0.001086 0.000393 0.000213 ... 0.001229 0.000660 0.000401 0.000183 0.000864 0.000709 0.001059 0.001484 0.000332 0.000271

5 rows × 1314 columns

Looks good! Let's save and submit!

In [39]:
my_submission.to_csv('submission.csv')

Submit to Leaderboard


Woohoo, we scored 33.2%! We hope this benchmark has shown you how to get started. We can't wait to see what you come up with. Happy modeling!