by
Casey Fitzpatrick
Benchmark - Hakuna Ma-data: Identify Wildlife on the Serengeti with AI for Earth¶
Welcome to the benchmark solution tutorial for our new competition in partnership with AI for Earth! In this computer-vision competition, you are tasked with identifying animal species in camera trap footage. The training data consists of over 2.5 million sequences of images collected using camera traps placed in the Serengeti region of Africa. The sequences are one-hot-labeled for 53 different species groups, or as empty. For each sequence (which may contain multiple images), you will generate a submission that consists of probabilities for each possible class.
But the fun doesn't stop there! In this competition, you will not be submitting a csv of predictions. Instead, you will submit the code that performs inference on the test data, and we will execute that code in the cloud to generate and score your submission.
In this benchmark, we'll walk through a first-pass approach to loading, understanding, and preparing the data. We'll use an out-of-the-box transfer learning approach to train a Keras model on a subset of the training data. Then we'll explain how to package up your model and submit a file capable of running in our cloud-based execution environment. With all these pipes connected, you'll be ready to hit the ground running.
We've got a large territory to cover, so let's get started!
Data Exploration¶
Our training set consists of 10 "seasons" of footage. We can get information about each season's sequences and labels using the training metadata, which provides links between image filenames and sequences, as well as the training labels, which tell us what's in a given sequence.
You can download the metadata as well as the actual image data here. Keep in mind, this is a very large training set, so make sure you have a storage solution. The complete set of images for all 10 seasons is nearly 5 TB! But the image data is split by season, so you can download image files one season at a time. For this benchmark, we're only going to work with a couple of seasons.
Let's load the metadata and look at what we have.
import json
from pathlib import Path
import numpy as np
import pandas as pd
pd.set_option('max_colwidth', 80)
# This is where our downloaded images and metadata live locally
DATA_PATH = Path.cwd().parent / "data/final/public/"
train_metadata = pd.read_csv(DATA_PATH / "train_metadata.csv")
train_labels = pd.read_csv(DATA_PATH / "train_labels.csv", index_col="seq_id")
train_metadata.head()
Number of images in the train set.
train_metadata.shape[0]
That's a lot of images! However, each image is associated with a sequence, and our predictions will be made at the sequence level.
A sequence is an ordered series, in this case of camera trap images ordered in time. When the camera trap is triggered, it often takes more than one image, yielding an image sequence. In the data, each sequence has a seq_id, which is the index for the labels. The order of images in a given sequence can be inferred from the last four digits in the filename.
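For example, here's a quick way to see that ordering in the metadata. This is just a sanity-check sketch relying on the filename sort order; it isn't needed for the rest of the benchmark.
# Group filenames by sequence, relying on the fact that sorting by filename
# also sorts frames within a sequence chronologically.
frames_per_seq = train_metadata.sort_values("file_name").groupby("seq_id")["file_name"].apply(list)
frames_per_seq.head()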
Since our predictions are made at the sequence level, there is one label for any given sequence. Imagine that a lion walks by and triggers the camera trap, which takes four pictures. If the lion keeps walking, it may only appear in the first three pictures in the sequence and be out of frame by the time the fourth image is taken. Despite the fourth frame containing no lion, that sequence of four images would still be labeled as lion.
Let's confirm that each label in train_labels corresponds to a unique seq_id.
assert train_metadata.seq_id.nunique() == train_labels.index.shape[0]
# number of sequences
train_metadata.seq_id.nunique()
We have different seasons in the training set, and different numbers of images per season. We can see which season an image belongs to by looking at the first few characters of the sequence ID.
train_metadata['season'] = train_metadata.seq_id.map(lambda x: x.split('#')[0])
train_metadata.season.value_counts().sort_index()
Keep in mind that the test set comes from seasons not represented in the training set. So our model needs to generalize to seasons it hasn't seen before.
Location values are also shared across seasons. So while our model will need to generalize across seasons, it might get to "revisit" the same location from season to season.
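If you want to check that for yourself, here's a rough sketch. It assumes the camera site code is the second component of file_name (something like S1/B04/...), which may not match your download exactly, so adjust the parsing to however your metadata encodes location.
# Hypothetical: derive a site code from the file path and count how many
# seasons each site appears in.
site = train_metadata.file_name.map(lambda f: str(f).split("/")[1])
train_metadata.groupby(site).season.nunique().sort_values(ascending=False).head()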
Below we see that the number of images we have for each sequence varies, but by far most sequences have between 1 and 3 images in them.
train_metadata.groupby('seq_id').size().value_counts().sort_index()
For this benchmark, we're going to simplify the problem by taking only the first image from each sequence. The justifying assumption here is that the period of time immediately after a camera trap is first triggered is the most likely time to see an animal in frame. However, this may not always be the case and ultimately you'll probably want to give your model as much information as you can by using more images per sequence.
# reduce to first frame only for all sequences
train_metadata = train_metadata.sort_values('file_name').groupby('seq_id').first()
Now, let's look at the labels. Each sequence label is a one-hot-encoded row vector with a 1 in the column corresponding to each species that is present, and a 0 otherwise. Each row corresponds to a unique sequence ID.
train_labels.head()
Though most sequences have only one animal in them, it is possible for multiple animals to appear in a single sequence.
train_labels.sum(axis=1).value_counts()
Notice below that not only is empty a category, it's by far the most common one! This is because, as useful as camera traps are, they tend to register many false positives. For example, they are often triggered by wind blowing plant life around, or by fluctuations of heat and light in the surrounding area.
train_labels.mean(axis=0).sort_values(ascending=False)
Now that we have a sense of what the data means, let's get access to the images themselves!
Dealing With Large Amounts of Data¶
Since there are millions of high-quality images, unzipping all the data takes a long time. We recommend starting your model development with a single season or two of data while the rest of the data downloads. In this benchmark, we'll work with seasons 1 and 3.
train_metadata = train_metadata[train_metadata.season.isin(['SER_S1', 'SER_S3'])]
train_labels = train_labels[train_labels.index.isin(train_metadata.index)]
Add Full Image Path to the Data¶
Our local data is mounted under the /databig/raw directory in folders that match the name of the zipfile.
IMAGE_DIR = Path("/databig/raw")
We'll convert the file_name column to a Path object with the full path to our data.
train_metadata['file_name'] = train_metadata.apply(
    lambda x: (IMAGE_DIR / f'SnapshotSerengeti_S0{x.season[-1]}_v2_0' / x.file_name), axis=1
)
train_metadata.head()
Before we get into the weeds of modeling, let's take a quick break to look at some of the animals in this data!
from IPython.display import Image
def look_at_random_animal(name, random_state, width=500):
    seq_ids = train_labels[train_labels[name] == 1].index
    file_names = train_metadata.loc[seq_ids].file_name
    filename = file_names.sample(random_state=random_state).values[0]
    return Image(filename=str(filename), width=width)
A particular animal that catches our eye is the ... zorilla? Let's check it out.
look_at_random_animal("zorilla", random_state=111)
Cute. What else?
look_at_random_animal("wildebeest", random_state=101)
Wow! Ok maybe just one more...
look_at_random_animal("lionfemale", random_state=2019)
Looks like someone wants a belly rub!
Ok, we really should get back to work.
Splitting the Data in a Reasonable Way¶
As we mentioned above, we know the test set for this competition involves seasons we have no access to. For training, however, we do have access to metadata and downloads for 10 seasons. We should set aside some portion of these seasons to validate our model during development. That way we are less likely to see misleading validation results due to overfitting one or more seasons.
In this benchmark, we're going to use season 1 for training, and season 3 for validation. You'll likely want to use multiple seasons for training and validation in your model.
train_seasons = ["SER_S1"]
val_seasons = ["SER_S3"]
# split out validation first
val_x = train_metadata[train_metadata.season.isin(val_seasons)]
val_y = train_labels[train_labels.index.isin(val_x.index)]
# reduce training
train_metadata = train_metadata[train_metadata.season.isin(train_seasons)]
train_labels = train_labels[train_labels.index.isin(train_metadata.index)]
Using A Data Generator¶
We have way too many images to load into memory. We're going to need a data generator that can stream data to our model. Recall that to simplify our first-pass solution we aren't using the whole sequence, just the first image in each sequence. That allows us to treat the problem as a standard image classification problem, which is fine for this benchmark, but you'll probably want to generalize the approach to include all the information from the sequences!
Because we've reduced the problem to a simpler one, we can use the standard ImageDataGenerator packaged with Keras.
from tensorflow.keras.preprocessing.image import ImageDataGenerator
One of the many cool things about this generator is that it has a method called .flow_from_dataframe, which allows us to get batches of images using a dataframe that includes paths to the files and the labels. First we'll need to join the file paths to the labels.
train_gen_df = train_labels.join(train_metadata.file_name.apply(lambda path: str(path)))
val_gen_df = val_y.join(val_x.file_name.apply(lambda path: str(path)))
label_columns = train_labels.columns.tolist()
Next we instantiate the generators, one for training and one for validation data. Once our model is compiled, we'll use the .fit_generator method instead of .fit to train, passing in the data generators instead of the traditional x_train, y_train, x_val, y_val. The generators will then give the model one batch at a time so that we can worry less about overloading memory with too many awesome camera trap shots!
...Like this one!
look_at_random_animal("zebra", random_state=1111111)
Anyway, since we have so many empty sequences, we're going to lazily balance our training set by dropping most of them, which helps keep the model from predicting empty for everything. This is a start, but you'll probably want to develop a better approach for avoiding an overfit model.
# drop 90% of empty sequences to avoid over-predicting empty
to_drop_train = train_gen_df[(train_gen_df["empty"] == 1)].sample(frac=.9, random_state=123).index
train_gen_df = train_gen_df.drop(to_drop_train)
train_gen_df.mean(axis=0).sort_values(ascending=False)
Now it looks like at least a two-class problem. Note that we didn't change the validation set's class balance, so we should still be able to get a fair sense of how our model generalizes.
Let's use this data to set up the generators.
# This must be set to load some images using PIL, which Keras uses.
from PIL import ImageFile
ImageFile.LOAD_TRUNCATED_IMAGES = True
# The pretrained model we'll use, explained more below.
# We pass its preprocessing function to our data generator.
from tensorflow.keras.applications import nasnet
# This will be the input size to our model.
target_size = (224, 224)
batch_size = 128
# Note that we pass the preprocessing function here
datagen = ImageDataGenerator(
    preprocessing_function=nasnet.preprocess_input
)
train_datagen = datagen.flow_from_dataframe(
    dataframe=train_gen_df,
    x_col="file_name",
    y_col=label_columns,
    class_mode="other",
    target_size=target_size,
    batch_size=batch_size,
    shuffle=True,
)
val_datagen = datagen.flow_from_dataframe(
    dataframe=val_gen_df,
    x_col="file_name",
    y_col=label_columns,
    class_mode="other",
    target_size=target_size,
    batch_size=batch_size,
    shuffle=True,
)
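Before moving on, it can be worth pulling a single batch to confirm the shapes are what we expect. This is an optional sanity check, not part of the benchmark pipeline.
# Grab one batch from the training generator and inspect its shapes.
batch_x, batch_y = next(train_datagen)
print(batch_x.shape)  # (batch_size, 224, 224, 3)
print(batch_y.shape)  # (batch_size, number of label columns)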
Ok! We're now ready to create the model.
Creating the model¶
We want to keep the model relatively simple for a first pass, adding complexity only after we have tested a basic approach. As mentioned above, because we only consider the first image of each sequence, we don't need to (more like we don't get to) consider any of the exciting complications that arise from sequence modeling. Instead, we can treat the problem as a standard image classification problem.
We're going to use a fixed feature extractor approach to transfer learning. This just means that we're going to use a pretrained model but freeze all the weights inside of the model and swap out the classifier at the top of the model. In tensorflow.keras this can be done in just a few lines as follows:
- Import a pretrained model without its classification layer
- Turn off the pretrained layers' trainable attributes for faster training
- Collect the outputs of the pretrained model and pass them along to additional layers, or the classifier
- Add a classification layer that matches our problem
Note: This modeling approach could be extended to handle sequences, but you'll probably have to write your own generator in addition to using, for example, tensorflow.keras.layers.TimeDistributed layers. That exercise is left to the reader!
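To make the idea concrete, here's a very rough sketch of what that could look like. It's an illustration only: the fixed sequence length SEQ_LEN, the choice of pooling over time, and the layer names are all assumptions, and you'd still need a matching generator that yields batches of image sequences.
from tensorflow.keras.applications import nasnet
from tensorflow.keras.layers import Dense, GlobalAveragePooling1D, Input, TimeDistributed
from tensorflow.keras.models import Model

SEQ_LEN = 3  # hypothetical: pad or truncate every sequence to a fixed number of frames

# one input of shape (frames, height, width, channels)
frames = Input(shape=(SEQ_LEN, 224, 224, 3), name="frames")

# frozen NASNetMobile backbone applied to every frame independently
backbone = nasnet.NASNetMobile(include_top=False, pooling="avg")
backbone.trainable = False
per_frame_features = TimeDistributed(backbone, name="per_frame_features")(frames)

# average the per-frame feature vectors over time, then classify the whole sequence
pooled = GlobalAveragePooling1D(name="pool_over_time")(per_frame_features)
outputs = Dense(train_labels.shape[1], activation="softmax", name="classifier")(pooled)

sequence_model = Model(inputs=frames, outputs=outputs)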
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Dropout, GlobalMaxPooling2D, Input, Lambda
def get_transfer_model(model_to_transfer, num_classes, img_height, img_width, num_channels=3):
    inputs = Input(shape=(img_height, img_width, num_channels), name="inputs")
    # instantiate the model without the top (classifier at the end)
    transfer = model_to_transfer(include_top=False)
    # freeze layer weights in the transfer model to speed up training
    for layer in transfer.layers:
        layer.trainable = False
    transfer_out = transfer(inputs)
    pooled = GlobalMaxPooling2D(name="pooling")(transfer_out)
    drop_out = Dropout(0.2, name="dropout_1")(pooled)
    dense = Dense(256, activation="relu", name="dense")(drop_out)
    drop_out = Dropout(0.2, name="dropout_2")(dense)
    outputs = Dense(num_classes, activation="softmax", name="classifier")(drop_out)
    model = Model(inputs=inputs, outputs=outputs)
    return model
There are lots of pretrained models to choose from. We're going to choose NASNet, simply because it demonstrates state-of-the-art performance on the ImageNet dataset. Although the top-performing model is NASNetLarge, we're going to speed things up a bit by using the smaller NASNetMobile.
from tensorflow.keras.applications import nasnet
model = get_transfer_model(
    model_to_transfer=nasnet.NASNetMobile,
    num_classes=train_labels.shape[1],
    img_height=target_size[0],
    img_width=target_size[1],
)
model.summary()
We'll compile the model using a standard optimization with respect to categorical crossentropy loss. For metrics, in addition to loss we'll consider two versions of top-K accuracy. The K just means "the correct prediction was in the top K most-probable classes." We're used to top-1 accuracy, but in image problems, top-5 is typically considered as well. Top-5 accuracy can help us see if the right labels are "bubbling up" towards higher probability.
from tensorflow.keras.metrics import top_k_categorical_accuracy, categorical_crossentropy
metrics=["acc", top_k_categorical_accuracy, categorical_crossentropy]
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=metrics)
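If you wanted something between top-1 and top-5, a small wrapper like the one below would work. This is just an aside, not something the benchmark uses.
def top_3_accuracy(y_true, y_pred):
    # counts a prediction as correct if the true label is among the 3 most probable classes
    return top_k_categorical_accuracy(y_true, y_pred, k=3)

# e.g. metrics = ["acc", top_3_accuracy, top_k_categorical_accuracy]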
Time to train! We'll train on about 13,000 examples and validate on half of that.
model.fit_generator(
    train_datagen,
    steps_per_epoch=100,
    validation_data=val_datagen,
    validation_steps=50,
    workers=12,
)
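If you plan to train for more than an epoch or two, it may be worth checkpointing weights as you go. Here's one possible way, using a hypothetical filename; the benchmark itself doesn't do this.
from tensorflow.keras.callbacks import ModelCheckpoint

# keep only the weights from the epoch with the best validation loss
checkpoint = ModelCheckpoint(
    "best_benchmark_weights.h5", monitor="val_loss", save_best_only=True, verbose=1
)

# then pass callbacks=[checkpoint] to model.fit_generator above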
Make Submission¶
Submissions for this competition work in a different way than usual in that the inference code is actually executed to generate the score. This means we'll need to
- save our model
- factor our data loading and preprocessing code
- write inference code to create a submission
- package everything up and upload to the DrivenData + Microsoft execution environment in the cloud!
The key part of the submission process is the main.py file, which loads the model and weights, performs inference on the test data, and saves out predictions to submission.csv. You can see this benchmark's example main.py file here.
Let's walk through the process. Below we will
- Put all files needed to generate a submission in a folder called inference/. Our main.py script will live at the root level. The main.py script will be run in our cloud-based execution environment, which expects a submission.csv to be generated at the root level (next to main.py) when inference is complete.
- To support the inference process, we'll use a subdirectory of inference/ called assets/ to store
  - our trained model weights
  - the submission format, used to check the validity of our submission
  - the test metadata, used by our data generator to fetch batches of data to perform inference
- We'll zip the contents of the inference directory (not the directory itself) into a file called submission.zip.
- When we upload submission.zip, the file will be unzipped and main.py will be run in a GPU-enabled Docker container with many data science libraries installed. The runtime environment can be seen here.
First, let's make the inference directory and its assets subdirectory.
!mkdir -p inference inference/assets
Now let's save our model!
model.save('inference/assets/my_awesome_model.h5')
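It's worth confirming that the saved file actually loads back cleanly, since this is the same call main.py will make in the cloud. This is optional, but cheap insurance.
from tensorflow.keras.models import load_model

# reload the model we just saved to make sure nothing is missing from the .h5 file
reloaded_model = load_model('inference/assets/my_awesome_model.h5')
reloaded_model.summary()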
We also need to make sure the script has access to the submission format and test metadata, so we copy test_metadata.csv and submission_format.csv into the assets folder as well.
!cp ../data/final/public/test_metadata.csv inference/assets/
!cp ../data/final/public/submission_format.csv inference/assets/
Below we will paste the entirety of our main.py script. You may want to study the script and comments closely. A few things to notice:
- DATA. During execution, our data lives in the cloud mounted at inference/data/. We'll need to use the test_metadata.csv and the cloud data path to construct the full paths to be used by our generator.
- WORKERS. Our model.predict_generator call uses the workers parameter to use multiple cores and threads for data loading and inference. This improves inference speed by more than a factor of 10 compared to the single-core default settings. You'll need to utilize parallelism or other efficiency measures to meet the execution time requirement, which requires your inference process to take no more than TIME hours.
- PREPROCESSING. Notice that we import our preprocessing function tensorflow.keras.applications.nasnet.preprocess_input. This is crucial for consistency with training, so make sure your preprocessing code is available to main.py!
Let's take a look at the contents we'll be zipping up, then you can study the main.py script.
!tree inference/
Great! Read over main.py below, and read on to check out the results of our submission.
Our main.py submission script¶
from datetime import datetime
import logging
import multiprocessing
from pathlib import Path
import cv2
import numpy as np
from PIL import ImageFile
import pandas as pd
from tensorflow.keras.applications import nasnet
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.image import ImageDataGenerator
# We get to see the log output for our execution, so log away!
logging.basicConfig(level=logging.INFO)
# This must be set to load some images using PIL, which Keras uses.
ImageFile.LOAD_TRUNCATED_IMAGES = True
ASSET_PATH = Path(__file__).parents[0] / "assets"
MODEL_PATH = ASSET_PATH / "my_awesome_model.h5"
# the images will live in a folder called 'data' in the container
DATA_PATH = Path(__file__).parents[0] / "data"
def perform_inference():
    """This is the main function executed at runtime in the cloud environment."""
    logging.info("Loading model.")
    model = load_model(MODEL_PATH)

    logging.info("Loading and processing metadata.")
    # our preprocessing selects the first image for each sequence
    test_metadata = pd.read_csv(DATA_PATH / "test_metadata.csv", index_col="seq_id")
    test_metadata = (
        test_metadata.sort_values("file_name").groupby("seq_id").first().reset_index()
    )
    # prepend the path to our filename since our data lives in a separate folder
    test_metadata["full_path"] = test_metadata.file_name.map(
        lambda x: str(DATA_PATH / x)
    )

    logging.info("Starting inference.")
    # Preallocate prediction output
    submission_format = pd.read_csv(DATA_PATH / "submission_format.csv", index_col=0)
    num_labels = submission_format.shape[1]
    output = np.zeros((test_metadata.shape[0], num_labels))

    # Instantiate test data generator
    datagen = ImageDataGenerator(preprocessing_function=nasnet.preprocess_input)
    batch_size = 256
    test_datagen = datagen.flow_from_dataframe(
        dataframe=test_metadata,
        x_col="full_path",
        y_col=None,
        class_mode=None,
        target_size=(224, 224),
        batch_size=batch_size,
        shuffle=False,
    )

    # Perform (and time) inference
    steps = np.ceil(test_metadata.shape[0] / batch_size)
    inference_start = datetime.now()
    preds = model.predict_generator(
        test_datagen, steps=steps, verbose=1, workers=12, use_multiprocessing=False
    )
    inference_stop = datetime.now()
    logging.info(f"Inference complete. Took {inference_stop - inference_start}.")

    logging.info("Creating submission.")
    # Check our predictions are in the same order as the submission format
    assert np.all(
        test_metadata.seq_id.unique().tolist() == submission_format.index.to_list()
    )
    output[: preds.shape[0], :] = preds[: output.shape[0], :]
    my_submission = pd.DataFrame(
        np.stack(output),
        # remember that we are predicting at the sequence, not image level
        index=test_metadata.seq_id,
        columns=submission_format.columns,
    )

    # We want to ensure all of our data are floats, not integers
    my_submission = my_submission.astype(np.float)

    # Save out submission to root of directory
    my_submission.to_csv("submission.csv", index=True)
    logging.info("Submission saved.")


if __name__ == "__main__":
    perform_inference()
Now we'll zip everything up, submit it, and get our score!
# remember to avoid including the inference dir itself
!cd inference; zip -r ../submission.zip *
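Before uploading, a quick listing of the archive helps confirm that main.py sits at the root of the zip rather than nested inside an inference/ folder.
!unzip -l submission.zip | head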
Upload submission¶
Alright! We're ready to submit!
Our submission takes about 20 minutes to execute. Once it's complete, we can see our score is around 0.05. Awesome!
Hopefully this benchmark helps you understand the submission process for this exciting new type of DrivenData competition.
We can't wait to see what you come up with! Happy importing!
Oh, and don't forget:
look_at_random_animal("giraffe", random_state=1111)