Benchmark - Hakuna Ma-data: Identify Wildlife on the Serengeti with AI for Earth¶

Welcome to the benchmark solution tutorial for our new competition in partnership with AI for Earth! In this computer-vision competition, you are tasked with identifying animal species in camera trap footage. The training data consists of over 2.5 million sequences of images collected using camera traps placed in the Serengeti region of Africa. The sequences are one-hot-labeled for 53 different species groups, or as empty. For each sequence (which may be multiple images), you will generate a submission that consists of probabilities for each possible class.

But the fun doesn't stop there! In this competition, you will not be submitting a csv of predictions. Instead, you will submit the code that performs inference on the test data, and we will execute that code in the cloud to generate and score your submission.

In this benchmark, we'll walk through a first-pass approach to loading, understanding, and preparing the data. We'll use an out-of-the-box transfer learning approach to train a Keras model on a subset of the training data. Then we'll explain how to package up your model and submit a file capable of running in our cloud-based execution environment. With all these pipes connected, you'll be ready to hit the ground running.

We've got a large territory to cover, so let's get started!

Data Exploration¶

Our training set consists of 10 "seasons" of footage. We can get information about each season's sequences and labels using the training metadata, which provides links between image filenames and sequences, as well as the training labels, which tell us what's in a given sequence.

You can download the metadata as well as the actual image data here. Keep in mind, this is a very large training set, so make sure you have a storage solution. The complete set of images for all 10 seasons is nearly 5 TB! But the image data is split by season, so you can download image files one season at a time. For this benchmark, we're only going to work with a couple of seasons.

Let's load the metadata and look at what we have.

In [1]:

import json
from pathlib import Path

import numpy as np
import pandas as pd

pd.set_option("max_colwidth", 80)

# This is where our downloaded images and metadata live locally
DATA_PATH = Path.cwd().parent / "data/final/public/"

In [2]:

train_metadata = pd.read_csv(DATA_PATH / "train_metadata.csv")
train_labels = pd.read_csv(DATA_PATH / "train_labels.csv", index_col="seq_id")

In [3]:

train_metadata.head()

Out[3]:

	file_name	seq_id	datetime
0	S1/B04/B04_R1/S1_B04_R1_PICT0003.JPG	SER_S1#B04#1#3	2010-07-20 06:14:06
1	S1/B04/B04_R1/S1_B04_R1_PICT0004.JPG	SER_S1#B04#1#4	2010-07-22 08:56:06
2	S1/B04/B04_R1/S1_B04_R1_PICT0005.JPG	SER_S1#B04#1#5	2010-07-24 01:16:28
3	S1/B04/B04_R1/S1_B04_R1_PICT0006.JPG	SER_S1#B04#1#6	2010-07-24 08:20:10
4	S1/B04/B04_R1/S1_B04_R1_PICT0007.JPG	SER_S1#B04#1#7	2010-07-24 10:14:32

Number of images in the train set.

In [4]:

train_metadata.shape[0]

Out[4]:

That's a lot of images! However, each image is associated with a sequence, and our predictions will be made at the sequence level.

A sequence is an ordered series, in this case of camera trap images ordered in time. When the camera trap is triggered, it often takes more than one image, yielding an image sequence. In the data, each sequence has a seq_id, which is the index for the labels. The order of images in a given sequence can be inferred from the last four digits in the filename.
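To make the ordering rule concrete, here is a minimal sketch that sorts the images of one sequence by the digits at the end of the filename. The helper names `frame_number` and `order_frames` are ours, not part of the competition code, and grouping these three files into one sequence is purely for illustration. It assumes every filename ends with a run of digits just before the extension.

```python
import re
from pathlib import PurePosixPath


def frame_number(file_name: str) -> int:
    """Return the trailing digits of the file stem, e.g. 3 for ..._PICT0003.JPG."""
    stem = PurePosixPath(file_name).stem
    match = re.search(r"(\d+)$", stem)
    if match is None:
        raise ValueError(f"no trailing digits in {file_name!r}")
    return int(match.group(1))


def order_frames(file_names):
    """Sort the file names of one sequence by their frame number."""
    return sorted(file_names, key=frame_number)


# Hypothetical sequence: three frames listed out of order
frames = [
    "S1/B04/B04_R1/S1_B04_R1_PICT0012.JPG",
    "S1/B04/B04_R1/S1_B04_R1_PICT0010.JPG",
    "S1/B04/B04_R1/S1_B04_R1_PICT0011.JPG",
]
for name in order_frames(frames):
    print(name)
```

Because the frame numbers are zero-padded, sorting the filenames as plain strings gives the same order within a sequence, which is what this benchmark relies on later.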

Since our predictions are made at the sequence level, there is one label for any given sequence. Imagine that a lion walks by and triggers the camera trap, which takes 4 pictures. If the lion keeps walking, it may only appear in the first three pictures in the sequence and be out of frame by the time the 4th image is taken. Despite the 4th frame being empty of any lion, that sequence of 4 images would still be labeled as lion.

Let's confirm that each label in train_labels corresponds to a unique seq_id.

In [5]:

assert train_metadata.seq_id.nunique() == train_labels.index.shape[0]

# number of sequences
train_metadata.seq_id.nunique()

Out[5]:

We have different seasons in the training set, and different numbers of images per season. We can see which season an image belongs to by looking at the first few characters of the sequence ID.

In [6]:

train_metadata["season"] = train_metadata.seq_id.map(lambda x: x.split("#")[0])
train_metadata.season.value_counts().sort_index()

Out[6]:

SER_S1     406612
SER_S10    682736
SER_S2     567428
SER_S3     388335
SER_S4     528504
SER_S5     822793
SER_S6     458417
SER_S7     827051
SER_S8     975394
SER_S9     977105
Name: season, dtype: int64

Keep in mind that the test set comes from seasons not represented in the training set. So our model needs to generalize to seasons it hasn't seen before.

Location values are not unique between seasons. So while our model will need to generalize across seasons, it might get to "revisit" the same location from season to season.
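One way to check this yourself is to pull the location out of each sequence ID and see which seasons it appears in. The sketch below assumes that the second `#`-separated field of a seq_id (for example `B04` in `SER_S1#B04#1#3`) is the camera location. That reading matches the folder names in the file paths, but it is our assumption, and the example IDs for season 3 are made up.

```python
from collections import defaultdict


def season_and_location(seq_id: str):
    """Split 'SER_S1#B04#1#3' into ('SER_S1', 'B04')."""
    parts = seq_id.split("#")
    return parts[0], parts[1]


def seasons_per_location(seq_ids):
    """Map each location to the set of seasons in which it appears."""
    seen = defaultdict(set)
    for seq_id in seq_ids:
        season, location = season_and_location(seq_id)
        seen[location].add(season)
    return dict(seen)


# Hypothetical IDs for illustration only
example_ids = [
    "SER_S1#B04#1#3",
    "SER_S1#B04#1#4",
    "SER_S3#B04#2#7",
    "SER_S3#C01#1#1",
]
# Keep only the locations that show up in more than one season
shared = {
    location: seasons
    for location, seasons in seasons_per_location(example_ids).items()
    if len(seasons) > 1
}
print(sorted(shared))  # ['B04']
```

On the real metadata you could apply `season_and_location` to the `seq_id` column with `.map`, much like the season column is built above.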

Below we see that the number of images we have for each sequence varies, but by far most sequences have between 1 and 3 images in them.

In [7]:

train_metadata.groupby("seq_id").size().value_counts().sort_index()

Out[7]:

1      359209
2       11457
3     2084000
4          15
5           1
6           3
8           1
9           4
10          1
12          2
13          1
14          1
27          1
37          1
dtype: int64

For this benchmark, we're going to simplify the problem by taking only the first image from each sequence. The justifying assumption here is that the period of time immediately after a camera trap is first triggered is the most likely time to see an animal in frame. However, this may not always be the case and ultimately you'll probably want to give your model as much information as you can by using more images per sequence.
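If you do go beyond the first frame, you will need some way to turn per-image predictions back into one prediction per sequence. A simple option is to average the class probabilities over the images in each sequence. The sketch below shows that idea in plain Python; it is not what this benchmark does, and the function name and toy numbers are ours.

```python
from collections import defaultdict


def average_by_sequence(seq_ids, image_probs):
    """Average per-image class probabilities over the images of each sequence."""
    grouped = defaultdict(list)
    for seq_id, probs in zip(seq_ids, image_probs):
        grouped[seq_id].append(probs)
    # zip(*rows) walks the rows column by column, i.e. class by class
    return {
        seq_id: [sum(column) / len(rows) for column in zip(*rows)]
        for seq_id, rows in grouped.items()
    }


# Toy example: two images for seq_a, one image for seq_b, two classes
seq_ids = ["seq_a", "seq_a", "seq_b"]
image_probs = [[0.75, 0.25], [0.25, 0.75], [1.0, 0.0]]
sequence_probs = average_by_sequence(seq_ids, image_probs)
print(sequence_probs)  # {'seq_a': [0.5, 0.5], 'seq_b': [1.0, 0.0]}
```

Averaging treats every frame equally. Taking the maximum per class is another option when an animal shows up in only one frame.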

In [8]:

# reduce to first frame only for all sequences
train_metadata = train_metadata.sort_values("file_name").groupby("seq_id").first()

Now, let's look at the labels. Each sequence label is a one-hot-encoded row vector with a 1 in the column corresponding to a species that is present, and a 0 otherwise. Each row corresponds to a unique sequence ID.

In [9]:

train_labels.head()

Out[9]:

	aardvark	aardwolf	baboon	bat	batearedfox	buffalo	bushbuck	caracal	cattle	cheetah	...	serval	steenbok	topi	vulture	warthog	waterbuck	wildcat	wildebeest	zebra	zorilla
seq_id
SER_S1#B04#1#10	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
SER_S1#B04#1#11	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
SER_S1#B04#1#12	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	1	0
SER_S1#B04#1#13	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
SER_S1#B04#1#14	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0

5 rows × 54 columns
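Reading a label row back into species names is just a matter of collecting the columns that hold a 1. Here is a small sketch; the helper name `species_present` is ours, and the row is a shortened, made-up example with only four of the 54 columns.

```python
def species_present(label_row):
    """Return the names of all columns whose value is 1 in a label row."""
    return [name for name, value in label_row.items() if value == 1]


# Shortened, hypothetical label row
row = {"aardvark": 0, "wildebeest": 0, "zebra": 1, "empty": 0}
print(species_present(row))  # ['zebra']
```

Since a pandas Series also has an `.items()` method, the same function should work on a row such as `train_labels.iloc[2]`.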

Though most sequences have only one label, a single sequence can contain multiple species.

In [10]:

train_labels.sum(axis=1).value_counts()

Out[10]:

1    2428805
2      25667
3        223
4          2
dtype: int64

Notice below that not only is empty a category, but it's the most dominant category by far! This is because, as useful as camera traps are, they tend to register many false positives. For example, they are often triggered by wind blowing plant life around, or by fluctuations of heat and light in the surrounding area.

In [11]:

train_labels.mean(axis=0).sort_values(ascending=False)

Out[11]:

empty              0.759851
wildebeest         0.076248
zebra              0.053718
gazellethomsons    0.037591
buffalo            0.009228
elephant           0.008200
hartebeest         0.008055
impala             0.006635
gazellegrants      0.006394
giraffe            0.006195
warthog            0.005640
guineafowl         0.004460
otherbird          0.004275
hyenaspotted       0.004108
lionfemale         0.002366
eland              0.002281
hippopotamus       0.002017
reedbuck           0.001916
topi               0.001663
baboon             0.001154
dikdik             0.001125
cheetah            0.000888
lionmale           0.000710
secretarybird      0.000695
serval             0.000551
jackal             0.000531
ostrich            0.000509
koribustard        0.000506
hare               0.000364
aardvark           0.000310
insectspider       0.000304
batearedfox        0.000245
monkeyvervet       0.000235
waterbuck          0.000232
porcupine          0.000231
mongoose           0.000230
aardwolf           0.000171
bushbuck           0.000160
leopard            0.000158
hyenastriped       0.000089
reptiles           0.000062
caracal            0.000047
rodents            0.000040
wildcat            0.000039
honeybadger        0.000033
vulture            0.000033
duiker             0.000032
civet              0.000030
genet              0.000028
rhinoceros         0.000027
zorilla            0.000016
cattle             0.000009
steenbok           0.000004
bat                0.000002
dtype: float64

Now that we have a sense of what the data means, let's get access to the images themselves!

Dealing With Large Amounts of Data¶

Since there are millions of high-quality images, unzipping all the data takes a long time. We recommend starting your model development with a single season or two of data while the rest of the data downloads. In this benchmark, we'll work with seasons 1 and 3.

In [12]:

train_metadata = train_metadata[train_metadata.season.isin(["SER_S1", "SER_S3"])]
train_labels = train_labels[train_labels.index.isin(train_metadata.index)]

Add Full Image Path to the Data¶

Our local data is mounted under the /databig/raw directory in folders that match the name of the zipfile.

In [13]:

IMAGE_DIR = Path("/databig/raw")

We'll convert the file_name column to a Path object with the full path to our data.

In [14]:

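# Note: x.season[-1] keeps only the last character of the season name, so this
# works for the single-digit seasons used here (S1, S3) but not for SER_S10.
train_metadata["file_name"] = train_metadata.apply(
    lambda x: (IMAGE_DIR / f"SnapshotSerengeti_S0{x.season[-1]}_v2_0" / x.file_name),
    axis=1,
)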
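The cell above builds the folder name from the last character of the season, which is fine for seasons 1 and 3. If you later add season 10, a version that parses the whole season number may be safer. The sketch below assumes the folders keep the same `SnapshotSerengeti_SNN_v2_0` pattern with a zero-padded two-digit number. We have not confirmed that for season 10, and the helper names are ours.

```python
from pathlib import PurePosixPath

# Same local mount point as used in this benchmark
IMAGE_DIR = PurePosixPath("/databig/raw")


def season_folder(season: str) -> str:
    """Turn 'SER_S1' into 'SnapshotSerengeti_S01_v2_0' (assumed naming pattern)."""
    number = int(season.rsplit("_S", 1)[1])
    return f"SnapshotSerengeti_S{number:02d}_v2_0"


def full_image_path(season: str, file_name: str) -> PurePosixPath:
    """Join the mount point, the season folder, and the relative file name."""
    return IMAGE_DIR / season_folder(season) / file_name


print(full_image_path("SER_S1", "S1/B04/B04_R1/S1_B04_R1_PICT0010.JPG"))
print(season_folder("SER_S10"))
```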

In [15]:

train_metadata.head()

Out[15]:

	file_name	datetime	season
seq_id
SER_S1#B04#1#10	/databig/raw/SnapshotSerengeti_S01_v2_0/S1/B04/B04_R1/S1_B04_R1_PICT0010.JPG	2010-07-30 05:24:50	SER_S1
SER_S1#B04#1#11	/databig/raw/SnapshotSerengeti_S01_v2_0/S1/B04/B04_R1/S1_B04_R1_PICT0011.JPG	2010-07-30 20:54:16	SER_S1
SER_S1#B04#1#12	/databig/raw/SnapshotSerengeti_S01_v2_0/S1/B04/B04_R1/S1_B04_R1_PICT0012.JPG	2010-07-30 20:57:28	SER_S1
SER_S1#B04#1#13	/databig/raw/SnapshotSerengeti_S01_v2_0/S1/B04/B04_R1/S1_B04_R1_PICT0013.JPG	2010-08-01 17:35:58	SER_S1
SER_S1#B04#1#14	/databig/raw/SnapshotSerengeti_S01_v2_0/S1/B04/B04_R1/S1_B04_R1_PICT0014.JPG	2010-08-02 11:43:14	SER_S1

Before we get into the weeds of modeling, let's take a quick break to look at some of the animals in this data!

In [16]:

from IPython.display import Image


def look_at_random_animal(name, random_state, width=500):
    seq_ids = train_labels[train_labels[name] == 1].index
    file_names = train_metadata.loc[seq_ids].file_name
    filename = file_names.sample(random_state=random_state).values[0]
    return Image(filename=str(filename), width=width)

A particular animal that catches our eye is the ... zorilla? Let's check it out.

In [17]:

look_at_random_animal("zorilla", random_state=111)

Out[17]:

Cute. What else?

In [18]:

look_at_random_animal("wildebeest", random_state=101)

Out[18]:

Wow! Ok maybe just one more...

In [19]:

look_at_random_animal("lionfemale", random_state=2019)

Out[19]:

Looks like someone wants a belly rub!

Ok, we really should get back to work.

Splitting the Data in a Reasonable Way¶

As we mentioned above, we know the test set for this competition involves seasons we have no access to. For training, however, we do have access to metadata and downloads for 10 seasons. We should set aside some portion of these seasons to validate our model during development. That way we are less likely to see misleading validation results due to overfitting one or more seasons.

In this benchmark, we're going to use season 1 for training, and season 3 for validation. You'll likely want to use multiple seasons for training and validation in your model.

In [20]:

train_seasons = ["SER_S1"]
val_seasons = ["SER_S3"]

# split out validation first
val_x = train_metadata[train_metadata.season.isin(val_seasons)]
val_y = train_labels[train_labels.index.isin(val_x.index)]

# reduce training
train_metadata = train_metadata[train_metadata.season.isin(train_seasons)]
train_labels = train_labels[train_labels.index.isin(train_metadata.index)]
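When you move to several seasons, the same split can be written as a small helper that works directly on sequence IDs. This is only a sketch in plain Python: the function name is ours and the example IDs are made up. Everything outside the validation seasons goes to training, so no season can end up on both sides.

```python
def split_by_season(seq_ids, val_seasons):
    """Send every sequence from a validation season to val, the rest to train."""
    val_seasons = set(val_seasons)
    train_ids, val_ids = [], []
    for seq_id in seq_ids:
        season = seq_id.split("#")[0]  # e.g. 'SER_S1'
        if season in val_seasons:
            val_ids.append(seq_id)
        else:
            train_ids.append(seq_id)
    return train_ids, val_ids


# Hypothetical IDs for illustration only
ids = ["SER_S1#B04#1#3", "SER_S3#B04#1#7", "SER_S1#C01#1#1"]
train_ids, val_ids = split_by_season(ids, val_seasons=["SER_S3"])
print(train_ids)  # ['SER_S1#B04#1#3', 'SER_S1#C01#1#1']
print(val_ids)    # ['SER_S3#B04#1#7']
```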

Using A Data Generator¶

We have way too many images to load into memory. We're going to need a data generator that can stream data to our model. Recall that to simplify our first-pass solution we aren't using the whole sequence, just the first image in each sequence. That allows us to treat the problem as a standard image classification problem. That is fine for this benchmark, but you'll probably want to generalize the approach to include all the information from the sequences!

Because we've reduced the problem to a simpler one, we can use the standard ImageDataGenerator packaged with Keras.

In [21]:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

One of the many cool things about this generator is that it has a method called .flow_from_dataframe, allowing us to get batches of images using a dataframe that includes paths to the files and the labels. First we'll need to join the file paths to the labels.

In [22]:

train_gen_df = train_labels.join(train_metadata.file_name.apply(lambda path: str(path)))
val_gen_df = val_y.join(val_x.file_name.apply(lambda path: str(path)))
label_columns = train_labels.columns.tolist()

Next we instantiate the generators, one for training and one for validation data. Once our model is compiled, we'll use the .fit_generator method instead of .fit to train, passing in the data generators instead of the traditional x_train, y_train, x_val, y_val. The generators will then give the model one batch at a time so that we can worry less about overloading the memory with too many awesome camera trap shots!

...Like this one!

In [23]:

look_at_random_animal("zebra", random_state=1111111)

Out[23]:

Anyway, since we have so many blanks, we're going to lazily balance our training set by dropping most of them, so that the model doesn't simply predict blank for everything. This is a start, but you'll probably want to develop a better approach for avoiding an overfit model.

In [24]:

# drop 90% of blank to avoid over-predicting blank
to_drop_train = (
    train_gen_df[(train_gen_df["empty"] == 1)].sample(frac=0.9, random_state=123).index
)
train_gen_df = train_gen_df.drop(to_drop_train)

# file_name is a string column, so average over the label columns only
train_gen_df[label_columns].mean(axis=0).sort_values(ascending=False)

Out[24]:

empty              0.340521
gazellethomsons    0.290418
zebra              0.052576
gazellegrants      0.047463
guineafowl         0.030484
hyenaspotted       0.026844
warthog            0.026677
otherbird          0.024120
hartebeest         0.019980
giraffe            0.019285
elephant           0.016673
lionfemale         0.013978
buffalo            0.013672
reedbuck           0.009115
dikdik             0.007281
impala             0.007253
wildebeest         0.006475
lionmale           0.006197
koribustard        0.005641
cheetah            0.005474
topi               0.004613
hippopotamus       0.003501
baboon             0.003474
reptiles           0.003307
hare               0.003140
jackal             0.003001
batearedfox        0.002918
mongoose           0.002723
ostrich            0.002418
secretarybird      0.001834
serval             0.001612
aardvark           0.001334
rodents            0.001195
porcupine          0.001112
eland              0.001084
hyenastriped       0.001056
aardwolf           0.000834
monkeyvervet       0.000611
caracal            0.000528
wildcat            0.000445
leopard            0.000389
honeybadger        0.000389
bushbuck           0.000333
genet              0.000278
waterbuck          0.000250
civet              0.000250
zorilla            0.000083
rhinoceros         0.000056
steenbok           0.000000
vulture            0.000000
bat                0.000000
duiker             0.000000
insectspider       0.000000
cattle             0.000000
dtype: float64

That looks at least like a two-class problem now, with empty and gazellethomsons each making up roughly a third of the training set. Note that we didn't change the validation set's class balance, so we should be able to get a fair sense of how our model generalizes.
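Dropping blanks is the lazy route. One common alternative, which this benchmark does not use, is to keep all the data and weight each class by the inverse of its frequency. Here is a minimal sketch of computing such weights. The function name and the `floor` parameter are ours; the three fractions are taken from the full training set class balance shown earlier.

```python
def inverse_frequency_weights(class_fractions, floor=1e-6):
    """Weight each class by 1 / frequency, normalised so the mean weight is 1."""
    # The floor avoids dividing by zero for classes that never appear
    raw = {name: 1.0 / max(frac, floor) for name, frac in class_fractions.items()}
    mean_raw = sum(raw.values()) / len(raw)
    return {name: weight / mean_raw for name, weight in raw.items()}


# Fractions of the three most common classes in the full training labels
fractions = {"empty": 0.759851, "wildebeest": 0.076248, "zebra": 0.053718}
weights = inverse_frequency_weights(fractions)
for name, weight in weights.items():
    print(f"{name}: {weight:.3f}")
```

Rare classes get the largest weights, so errors on them count for more. How you feed the weights into training (for example through a weighted loss) depends on your setup.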

Let's use this data to set up the generators.

In [25]:

# This must be set to load some images using PIL, which Keras uses.
from PIL import ImageFile

ImageFile.LOAD_TRUNCATED_IMAGES = True

# The pretrained model we'll use, explained more below.
# We pass its preprocessing function to our data generator.
from tensorflow.keras.applications import nasnet

# This will be the input size to our model.
target_size = (224, 224)
batch_size = 128

# Note that we pass the preprocessing function here
datagen = ImageDataGenerator(preprocessing_function=nasnet.preprocess_input)

train_datagen = datagen.flow_from_dataframe(
    dataframe=train_gen_df,
    x_col="file_name",
    y_col=label_columns,
    class_mode="other",
    target_size=target_size,
    batch_size=batch_size,
    shuffle=True,
)
val_datagen = datagen.flow_from_dataframe(
    dataframe=val_gen_df,
    x_col="file_name",
    y_col=label_columns,
    class_mode="other",
    target_size=target_size,
    batch_size=batch_size,
    shuffle=True,
)

Found 35986 images.
Found 145873 images.

Ok! We're now ready to create the model.

Creating the model¶

We want to keep the model relatively simple for a first pass, adding complexity only after we have tested a basic approach. As mentioned above, because we only consider the first image of each sequence, we don't need to (more like we don't get to) consider any of the exciting complications that arise from sequence modeling. Instead, we can treat the problem as a standard image classification problem.

We're going to use a fixed feature extractor approach to transfer learning. This just means that we're going to use a pretrained model but freeze all the weights inside of the model and swap out the classifier at the top of the model. In tensorflow.keras this can be done in just a few lines as follows:

- Import a pretrained model without its classification layer.
- Set the trainable attribute of the pretrained layers to False for faster training.
- Collect the outputs of the pretrained model and pass them along to additional layers, or the classifier.
- Add a classification layer that matches our problem.

Note: This modeling approach could be extended to handle sequences. You would probably have to write your own generator and use, for example, tensorflow.keras.layers.TimeDistributed layers. That exercise is left to the reader!

In [26]:

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Dropout, GlobalMaxPooling2D, Input, Lambda


def get_transfer_model(
    model_to_transfer, num_classes, img_height, img_width, num_channels=3
):
    inputs = Input(shape=(img_height, img_width, num_channels), name="inputs")

    # instantiate the model without the top (classifier at the end)
    transfer = model_to_transfer(include_top=False)

    # freeze layer weights in the transfer model to speed up training
    for layer in transfer.layers:
        layer.trainable = False

    transfer_out = transfer(inputs)
    pooled = GlobalMaxPooling2D(name="pooling")(transfer_out)
    drop_out = Dropout(0.2, name="dropout_1")(pooled)
    dense = Dense(256, activation="relu", name="dense")(drop_out)
    drop_out = Dropout(0.2, name="dropout_2")(dense)
    outputs = Dense(num_classes, activation="softmax", name="classifer")(drop_out)
    model = Model(inputs=inputs, outputs=outputs)
    return model

There are lots of pretrained models to choose from. We're going to choose NASNet, simply because it demonstrates state-of-the-art performance on the ImageNet dataset. Although the top-performing model is NASNetLarge, we're going to speed things up a bit by using the smaller NASNetMobile.

In [27]:

from tensorflow.keras.applications import nasnet

model = get_transfer_model(
    model_to_transfer=nasnet.NASNetMobile,
    num_classes=train_labels.shape[1],
    img_height=target_size[0],
    img_width=target_size[1],
)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
inputs (InputLayer)          (None, 224, 224, 3)       0         
_________________________________________________________________
NASNet (Model)               (None, 7, 7, 1056)        4269716   
_________________________________________________________________
pooling (GlobalMaxPooling2D) (None, 1056)              0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 1056)              0         
_________________________________________________________________
dense (Dense)                (None, 256)               270592    
_________________________________________________________________
dropout_2 (Dropout)          (None, 256)               0         
_________________________________________________________________
classifer (Dense)            (None, 54)                13878     
=================================================================
Total params: 4,554,186
Trainable params: 284,470
Non-trainable params: 4,269,716
_________________________________________________________________

We'll compile the model using a standard optimization with respect to categorical crossentropy loss. For metrics, in addition to loss we'll consider two versions of top-K accuracy. The K just means "the correct prediction was in the top K most-probable classes." We're used to top-1 accuracy, but in image problems, top-5 is typically considered as well. Top-5 accuracy can help us see if the right labels are "bubbling up" towards higher probability.

In [28]:

from tensorflow.keras.metrics import top_k_categorical_accuracy, categorical_crossentropy

metrics = ["acc", top_k_categorical_accuracy, categorical_crossentropy]
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=metrics)
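To make the top-K idea concrete, here is a tiny plain-Python version of the metric. It is only an illustration of the definition, not the Keras implementation, and the toy probabilities are made up. A prediction counts as a hit when the true class is among the K classes with the highest predicted probability.

```python
def top_k_accuracy(true_indices, probabilities, k=5):
    """Fraction of rows whose true class is among the k most probable classes."""
    hits = 0
    for true_index, probs in zip(true_indices, probabilities):
        # Class indices sorted from most to least probable
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        if true_index in ranked[:k]:
            hits += 1
    return hits / len(true_indices)


# Three toy predictions over three classes
probs = [
    [0.6, 0.3, 0.1],  # true class 0 is ranked first
    [0.5, 0.4, 0.1],  # true class 1 is ranked second
    [0.7, 0.2, 0.1],  # true class 2 is ranked third
]
truth = [0, 1, 2]
for k in (1, 2, 3):
    print(k, round(top_k_accuracy(truth, probs, k=k), 3))
```

Here the accuracy climbs from one in three at K=1 to a perfect score at K=3, which is the "bubbling up" effect described above.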

Time to train! We'll train on about 13,000 examples and validate on half of that.
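Those counts follow from the batch size of 128 set for the generators and the `steps_per_epoch=100` and `validation_steps=50` values used in this benchmark's training call. The arithmetic, plus the number of steps a full pass over the 35,986 training images would take, looks like this (the helper names are ours):

```python
import math


def examples_seen(steps: int, batch_size: int) -> int:
    """Number of examples drawn when running `steps` batches."""
    return steps * batch_size


def steps_to_cover(num_examples: int, batch_size: int) -> int:
    """Number of batches needed to see every example at least once."""
    return math.ceil(num_examples / batch_size)


print(examples_seen(100, 128))     # 12800 training examples
print(examples_seen(50, 128))      # 6400 validation examples
print(steps_to_cover(35986, 128))  # 282 steps for a full training pass
```

So one "epoch" here covers roughly a third of the reduced training set, which keeps the first pass quick.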

In [29]:

model.fit_generator(
    train_datagen,
    steps_per_epoch=100,
    validation_data=val_datagen,
    validation_steps=50,
    workers=12,
)

Epoch 1/1
100/100 [==============================] - 367s 4s/step - loss: 3.2300 - acc: 0.4326 - top_k_categorical_accuracy: 0.7342 - categorical_crossentropy: 3.2300 - val_loss: 2.0314 - val_acc: 0.3478 - val_top_k_categorical_accuracy: 0.8116 - val_categorical_crossentropy: 2.0314

Out[29]:

<tensorflow.python.keras.callbacks.History at 0x7f44822f1e48>

Make Submission¶

Submissions for this competition work differently than usual: the inference code itself is executed to generate the score. This means we'll need to:

- save our model
- factor out our data loading and preprocessing code
- write inference code to create a submission
- package everything up and upload it to the DrivenData + Microsoft execution environment in the cloud!

The key part of the submission process is the main.py file which loads the model and weights, performs inference on the test data, and saves out predictions to submission.csv. You can see this benchmark's example main.py file here.

Let's walk through the process. Below we will:

- Put all files needed to generate a submission in a folder called inference/. Our main.py script will live at the root level. The main.py script will be run in our cloud-based execution environment, which expects a submission.csv to be generated at the root level (next to main.py) when inference is complete.
- Use a subdirectory of inference/ called assets/ to support the inference process. It stores:
  - our trained model weights
  - the submission format, used to check the validity of our submission
  - the test metadata, used by our data generator to fetch batches of data to perform inference
- Zip the contents of the inference directory (not the directory itself) into a file called submission.zip.
- Upload submission.zip. The file will be unzipped and main.py will be run in a GPU-enabled Docker container with many data science libraries installed. The runtime environment can be seen here.
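If you would rather build the archive from Python than from the shell, the standard library's zipfile module can do it. The sketch below zips the contents of a directory so that main.py sits at the root of the archive rather than under an inference/ prefix. The function name is ours, and it assumes the zip file is written outside the directory being zipped.

```python
import zipfile
from pathlib import Path


def zip_directory_contents(source_dir, zip_path):
    """Zip everything inside source_dir, storing paths relative to source_dir."""
    source_dir = Path(source_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                # arcname drops the leading directory, so 'inference/main.py'
                # is stored in the archive as 'main.py'
                archive.write(path, arcname=path.relative_to(source_dir).as_posix())
    return zip_path


# Example usage (assumes an inference/ directory exists next to this script):
# zip_directory_contents("inference", "submission.zip")
```

You can confirm the layout afterwards with `zipfile.ZipFile("submission.zip").namelist()`, which should list `main.py` and the `assets/...` files without any leading folder.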

First, let's make the inference directory and its assets subdirectory.

In [30]:

!mkdir -p inference inference/assets

Now let's save our model!

In [31]:

model.save("inference/assets/my_awesome_model.h5")

We also need to make sure the script has access to the submission format and test metadata, so we copy test_metadata.csv and submission_format.csv into the assets folder as well.

In [32]:

!cp ../data/final/public/test_metadata.csv inference/assets/

In [33]:

!cp ../data/final/public/submission_format.csv inference/assets/

Below we will paste the entirety of our main.py script. You may want to study the script and comments closely. A few things to notice:

- DATA. During execution, our data lives in the cloud, mounted at inference/data/. We'll need to use test_metadata.csv and the cloud data path to construct the full paths used by our generator.
- WORKERS. Our model.predict_generator call uses the workers parameter to use multiple cores and threads for data loading and inference generation. This improves inference speed by more than a factor of 10 compared to the single-core default settings. You'll need to use parallelism or other efficiency measures to meet the execution time requirement, which requires your inference process to take no more than TIME hours.
- PREPROCESSING. Notice that we import our preprocessing function, tensorflow.keras.applications.nasnet.preprocess_input. This is crucial for keeping performance consistent with training, so make sure your preprocessing code is available to main.py!

Let's take a look at the contents we'll be zipping up, then you can study the main.py script.

In [2]:

!tree inference/

inference/
├── assets
│   └── my_awesome_model.h5
└── main.py

1 directory, 2 files

Great! Read over main.py below and read on to check out the results of our submission.

Our `main.py` submission script¶

from datetime import datetime
import logging
import multiprocessing
from pathlib import Path

import cv2
import numpy as np
from PIL import ImageFile
import pandas as pd
from tensorflow.keras.applications import nasnet
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# We get to see the log output for our execution, so log away!
logging.basicConfig(level=logging.INFO)

# This must be set to load some images using PIL, which Keras uses.
ImageFile.LOAD_TRUNCATED_IMAGES = True

ASSET_PATH = Path(__file__).parents[0] / "assets"
MODEL_PATH = ASSET_PATH / "my_awesome_model.h5"

# the images will live in a folder called 'data' in the container
DATA_PATH = Path(__file__).parents[0] / "data"


def perform_inference():
    """This is the main function executed at runtime in the cloud environment."""
    logging.info("Loading model.")
    model = load_model(MODEL_PATH)

    logging.info("Loading and processing metadata.")

    # our preprocessing selects the first image for each sequence
    test_metadata = pd.read_csv(DATA_PATH / "test_metadata.csv", index_col="seq_id")
    test_metadata = (
        test_metadata.sort_values("file_name").groupby("seq_id").first().reset_index()
    )

    # prepend the path to our filename since our data lives in a separate folder
    test_metadata["full_path"] = test_metadata.file_name.map(
        lambda x: str(DATA_PATH / x)
    )

    logging.info("Starting inference.")

    # Preallocate prediction output
    submission_format = pd.read_csv(DATA_PATH / "submission_format.csv", index_col=0)
    num_labels = submission_format.shape[1]
    output = np.zeros((test_metadata.shape[0], num_labels))

    # Instantiate test data generator
    datagen = ImageDataGenerator(preprocessing_function=nasnet.preprocess_input)

    batch_size = 256
    test_datagen = datagen.flow_from_dataframe(
        dataframe=test_metadata,
        x_col="full_path",
        y_col=None,
        class_mode=None,
        target_size=(224, 224),
        batch_size=batch_size,
        shuffle=False,
    )

    # Perform (and time) inference
    # np.ceil returns a float; the number of steps should be an integer
    steps = int(np.ceil(test_metadata.shape[0] / batch_size))
    inference_start = datetime.now()
    preds = model.predict_generator(
        test_datagen, steps=steps, verbose=1, workers=12, use_multiprocessing=False
    )
    inference_stop = datetime.now()
    logging.info(f"Inference complete. Took {inference_stop - inference_start}.")

    logging.info("Creating submission.")

    # Check our predictions are in the same order as the submission format
    assert np.all(
        test_metadata.seq_id.unique().tolist() == submission_format.index.to_list()
    )

    output[: preds.shape[0], :] = preds[: output.shape[0], :]
    my_submission = pd.DataFrame(
        np.stack(output),
        # remember that we are predicting at the sequence, not image level
        index=test_metadata.seq_id,
        columns=submission_format.columns,
    )

    # We want to ensure all of our data are floats, not integers
    my_submission = my_submission.astype(float)

    # Save out submission to root of directory
    my_submission.to_csv("submission.csv", index=True)
    logging.info("Submission saved.")


if __name__ == "__main__":
    perform_inference()
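Before uploading, it can be worth checking a locally generated submission.csv against the submission format. The sketch below uses only the standard library and checks three things that we are assuming matter: the header matches, the sequence IDs match in order, and every value is a probability between 0 and 1. The function name and the toy CSV strings are ours, and the platform's own validation may check more.

```python
import csv
import io
import math


def check_submission(submission_csv, format_csv, tolerance=1e-6):
    """Compare a submission against the submission format; return a list of problems."""
    sub_rows = list(csv.reader(io.StringIO(submission_csv)))
    fmt_rows = list(csv.reader(io.StringIO(format_csv)))
    problems = []
    if sub_rows[0] != fmt_rows[0]:
        problems.append("header differs from submission format")
    if [row[0] for row in sub_rows[1:]] != [row[0] for row in fmt_rows[1:]]:
        problems.append("sequence ids differ from submission format")
    for row in sub_rows[1:]:
        values = [float(value) for value in row[1:]]
        if any(math.isnan(v) or v < -tolerance or v > 1 + tolerance for v in values):
            problems.append(f"{row[0]}: value outside [0, 1]")
    return problems


# Toy CSV contents standing in for submission_format.csv and submission.csv
fmt = "seq_id,empty,zebra\nA,0.0,0.0\nB,0.0,0.0\n"
good = "seq_id,empty,zebra\nA,0.9,0.1\nB,0.2,0.8\n"
bad = "seq_id,empty,zebra\nA,1.5,0.1\nC,0.2,0.8\n"
print(check_submission(good, fmt))  # []
print(check_submission(bad, fmt))
```

For real files you could pass in `Path("submission.csv").read_text()` and the contents of the submission format file.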

Now we'll zip everything up, submit it, and get our score!

In [3]:

# remember to avoid including the inference dir itself
!cd inference; zip -r ../submission.zip *

  adding: assets/ (stored 0%)
  adding: assets/my_awesome_model.h5 (stored 0%)
  adding: main.py (deflated 59%)

Upload submission¶

Alright! We're ready to submit!

Our submission takes about 20 minutes to execute. Once it's complete, we can see our score is around 0.05! Awesome!

Hopefully this benchmark helps you understand the submission process for this exciting new type of DrivenData competition.

We can't wait to see what you come up with! Happy importing!

Oh, and don't forget:

In [36]:

look_at_random_animal("giraffe", random_state=1111)

Out[36]:
