How to Use Deep Learning to Identify Wildlife

Benchmark - Hakuna Ma-data: Identify Wildlife on the Serengeti with AI for Earth

Welcome to the benchmark solution tutorial for our new competition in partnership with AI for Earth! In this computer-vision competition, you are tasked with identifying animal species in camera trap footage. The training data consists of over 2.5 million sequences of images collected using camera traps placed in the Serengeti region of Africa. The sequences are one-hot-labeled for 53 different species groups, or as empty. For each sequence (which may be multiple images), you will generate a submission that consists of probabilities for each possible class.

But the fun doesn't stop there! In this competition, you will not be submitting a csv of predictions. Instead, you will submit the code that performs inference on the test data, and we will execute that code in the cloud to generate and score your submission.

In this benchmark, we'll walk through a first-pass approach to loading, understanding, and preparing the data. We'll use an out-of-the-box transfer learning approach to train a Keras model on a subset of the training data. Then we'll explain how to package up your model and submit a file capable of running in our cloud-based execution environment. With all these pipes connected, you'll be ready to hit the ground running.

We've got a large territory to cover, so let's get started!

Data Exploration

Our training set consists of 10 "seasons" of footage. We can get information about each season's sequences and labels using the training metadata, which provides links between image filenames and sequences, as well as the training labels, which tell us what's in a given sequence.

You can download the metadata as well as the actual image data here. Keep in mind, this is a very large training set, so make sure you have a storage solution. The complete set of images for all 10 seasons is nearly 5 TB! But the image data is split by season, so you can download image files one season at a time. For this benchmark, we're only going to work with a couple of seasons.

Let's load the metadata and look at what we have.

In [1]:
import json
from pathlib import Path

import numpy as np
import pandas as pd

pd.set_option('display.max_colwidth', 80)

# This is where our downloaded images and metadata live locally
DATA_PATH = Path.cwd().parent / "data/final/public/"
In [2]:
train_metadata = pd.read_csv(DATA_PATH / "train_metadata.csv")
train_labels = pd.read_csv(DATA_PATH / "train_labels.csv", index_col="seq_id")
In [3]:
train_metadata.head()
Out[3]:
file_name seq_id datetime
0 S1/B04/B04_R1/S1_B04_R1_PICT0003.JPG SER_S1#B04#1#3 2010-07-20 06:14:06
1 S1/B04/B04_R1/S1_B04_R1_PICT0004.JPG SER_S1#B04#1#4 2010-07-22 08:56:06
2 S1/B04/B04_R1/S1_B04_R1_PICT0005.JPG SER_S1#B04#1#5 2010-07-24 01:16:28
3 S1/B04/B04_R1/S1_B04_R1_PICT0006.JPG SER_S1#B04#1#6 2010-07-24 08:20:10
4 S1/B04/B04_R1/S1_B04_R1_PICT0007.JPG SER_S1#B04#1#7 2010-07-24 10:14:32

Let's count the number of images in the training set.

In [4]:
train_metadata.shape[0]
Out[4]:
6634375

That's a lot of images! However, each image is associated with a sequence, and our predictions will be at the sequence level.

A sequence is an ordered series, in this case of camera trap images ordered in time. When the camera trap is triggered, it often takes more than one image, yielding an image sequence. In the data, each sequence has a seq_id, which is the index for the labels. The order of images in a given sequence can be inferred from the last four digits in the filename.
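
As a minimal sketch of that last point, here is one way to pull the frame number out of a filename and use it to order images. The frame_number helper is our own illustration, not part of the competition tooling, and it assumes filenames always end in digits as in the examples above. The three filenames are grouped together purely for demonstration.

```python
import re
from pathlib import Path


def frame_number(file_name):
    """Return the trailing digits of the file stem, e.g. 3 for ..._PICT0003.JPG."""
    match = re.search(r"(\d+)$", Path(file_name).stem)
    if match is None:
        raise ValueError(f"no trailing frame number in {file_name!r}")
    return int(match.group(1))


# File names in the same format as the metadata; treating them as a single
# sequence is an assumption made only for this example
files = [
    "S1/B04/B04_R1/S1_B04_R1_PICT0005.JPG",
    "S1/B04/B04_R1/S1_B04_R1_PICT0003.JPG",
    "S1/B04/B04_R1/S1_B04_R1_PICT0004.JPG",
]

# Sort by the parsed frame number to recover the capture order
ordered = sorted(files, key=frame_number)
print(ordered)
```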

Since our predictions are made at the sequence level, this means that there is one label for any given sequence. Imagine that a lion walks by and triggers the camera trap, which takes 4 pictures. If the lion keeps walking, it may only appear in the first three pictures in the sequence and be out of frame by the time the 4th image is taken. Despite the 4th frame being empty of any lion, that sequence of 4 images would still be labeled as lion.

Let's confirm that each label in train_labels corresponds to a unique seq_id.

In [5]:
assert train_metadata.seq_id.nunique() == train_labels.index.shape[0]

# number of sequences
train_metadata.seq_id.nunique()
Out[5]:
2454697

We have different seasons in the training set, and different numbers of images per season. We can see which season an image belongs to by looking at the first few characters of the sequence ID.

In [6]:
train_metadata['season'] = train_metadata.seq_id.map(lambda x: x.split('#')[0])
train_metadata.season.value_counts().sort_index()
Out[6]:
SER_S1     406612
SER_S10    682736
SER_S2     567428
SER_S3     388335
SER_S4     528504
SER_S5     822793
SER_S6     458417
SER_S7     827051
SER_S8     975394
SER_S9     977105
Name: season, dtype: int64

Keep in mind that the test set comes from seasons not represented in the training set. So our model needs to generalize to seasons it hasn't seen before.

Note that location values, which appear as the second field of the sequence ID (for example, B04), are not unique between seasons. So while our model will need to generalize across seasons, it might get to "revisit" the same location from season to season.
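
To check for shared locations yourself, you could split each sequence ID on "#" and compare the sets of locations per season. This is a rough sketch: the field names (season, location, roll, number) are our reading of the ID format based on the file paths, not an official schema, and the two SER_S3 IDs below are invented for the example.

```python
from collections import defaultdict


def parse_seq_id(seq_id):
    """Split an ID such as 'SER_S1#B04#1#3' into its four '#'-separated fields."""
    season, location, roll, number = seq_id.split("#")
    return {"season": season, "location": location, "roll": roll, "number": number}


# The SER_S1 IDs follow the metadata shown earlier; the SER_S3 IDs are hypothetical
seq_ids = [
    "SER_S1#B04#1#3",
    "SER_S1#B04#1#4",
    "SER_S3#B04#1#1",
    "SER_S3#C02#1#1",
]

# Collect the set of locations seen in each season
locations_by_season = defaultdict(set)
for seq_id in seq_ids:
    parts = parse_seq_id(seq_id)
    locations_by_season[parts["season"]].add(parts["location"])

# Locations that appear in every season
shared = set.intersection(*locations_by_season.values())
print(shared)
```

On the real data, you would pass train_metadata.seq_id instead of the small list above.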

Below we see that the number of images we have for each sequence varies, but the vast majority of sequences have between 1 and 3 images, and most have exactly 3.

In [7]:
train_metadata.groupby('seq_id').size().value_counts().sort_index()
Out[7]:
1      359209
2       11457
3     2084000
4          15
5           1
6           3
8           1
9           4
10          1
12          2
13          1
14          1
27          1
37          1
dtype: int64

For this benchmark, we're going to simplify the problem by taking only the first image from each sequence. The justifying assumption here is that the period of time immediately after a camera trap is first triggered is the most likely time to see an animal in frame. However, this may not always be the case and ultimately you'll probably want to give your model as much information as you can by using more images per sequence.

In [8]:
# Reduce to the first frame only for all sequences
train_metadata = train_metadata.sort_values('file_name').groupby('seq_id').first()
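
If you do decide to use more than one image per sequence, you'll need some way to turn per-image predictions into a single sequence-level prediction. One simple option, sketched below, is to average the predicted probabilities across the frames of each sequence. This is not part of the benchmark: the frame_preds table, its seq_id values, and the two class columns are made up for illustration.

```python
import pandas as pd

# Hypothetical per-image predictions: one row per frame, one column per class
frame_preds = pd.DataFrame(
    {
        "seq_id": ["seq_a", "seq_a", "seq_a", "seq_b"],
        "lion": [0.9, 0.7, 0.2, 0.0],
        "empty": [0.1, 0.3, 0.8, 1.0],
    }
)

# Average the class probabilities over all frames that share a seq_id,
# giving one row of probabilities per sequence
seq_preds = frame_preds.groupby("seq_id").mean()
print(seq_preds)
```

Averaging is only one choice. Taking the per-class maximum across frames is another reasonable starting point when an animal is visible in only some of the frames.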

Now, let's look at the labels. Each sequence label is a one-hot-encoded row vector with a 1 in the column corresponding to each species that is present, and a 0 otherwise. Each row corresponds to a unique sequence ID.

In [9]:
train_labels.head()
Out[9]:
aardvark aardwolf baboon bat batearedfox buffalo bushbuck caracal cattle cheetah ... serval steenbok topi vulture warthog waterbuck wildcat wildebeest zebra zorilla
seq_id
SER_S1#B04#1#10 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
SER_S1#B04#1#11 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
SER_S1#B04#1#12 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
SER_S1#B04#1#13 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
SER_S1#B04#1#14 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 54 columns

Though most sequences have only one label, it is possible to have multiple species in a single sequence.

In [10]:
train_labels.sum(axis=1).value_counts()
Out[10]:
1    2428805
2      25667
3        223
4          2
dtype: int64
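
To see which species co-occur, you can filter for rows whose labels sum to more than 1 and list the columns that are set. Here is a small sketch on a toy label table; the sequence IDs and the three-column layout are invented, but the same filter should work on train_labels.

```python
import pandas as pd

# Toy stand-in for train_labels: rows are sequences, columns are classes
labels = pd.DataFrame(
    {"zebra": [1, 0, 1], "wildebeest": [1, 0, 0], "empty": [0, 1, 0]},
    index=pd.Index(["seq_a", "seq_b", "seq_c"], name="seq_id"),
)

# Keep only sequences with more than one class marked as present
multi = labels[labels.sum(axis=1) > 1]

# For each of those sequences, list the names of the classes that are set
species_present = {
    seq_id: [col for col in labels.columns if row[col] == 1]
    for seq_id, row in multi.iterrows()
}
print(species_present)
```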

Notice below that not only is empty a category, but it's the most dominant category, by far! This is because, as useful as camera traps are, they tend to register many false positives. For example, they are often triggered by wind blowing plant life around, or fluctuations of heat and light in the surrounding area.

In [11]:
train_labels.mean(axis=0).sort_values(ascending=False)
Out[11]:
empty              0.759851
wildebeest         0.076248
zebra              0.053718
gazellethomsons    0.037591
buffalo            0.009228
elephant           0.008200
hartebeest         0.008055
impala             0.006635
gazellegrants      0.006394
giraffe            0.006195
warthog            0.005640
guineafowl         0.004460
otherbird          0.004275
hyenaspotted       0.004108
lionfemale         0.002366
eland              0.002281
hippopotamus       0.002017
reedbuck           0.001916
topi               0.001663
baboon             0.001154
dikdik             0.001125
cheetah            0.000888
lionmale           0.000710
secretarybird      0.000695
serval             0.000551
jackal             0.000531
ostrich            0.000509
koribustard        0.000506
hare               0.000364
aardvark           0.000310
insectspider       0.000304
batearedfox        0.000245
monkeyvervet       0.000235
waterbuck          0.000232
porcupine          0.000231
mongoose           0.000230
aardwolf           0.000171
bushbuck           0.000160
leopard            0.000158
hyenastriped       0.000089
reptiles           0.000062
caracal            0.000047
rodents            0.000040
wildcat            0.000039
honeybadger        0.000033
vulture            0.000033
duiker             0.000032
civet              0.000030
genet              0.000028
rhinoceros         0.000027
zorilla            0.000016
cattle             0.000009
steenbok           0.000004
bat                0.000002
dtype: float64
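
These fractions are easier to appreciate as rough counts. Multiplying a few of them by the 2,454,697 training sequences gives a back-of-the-envelope estimate of how many examples each class has. The results are approximate because the fractions above are rounded to six decimal places.

```python
# Total number of training sequences, from the metadata above
n_sequences = 2454697

# A few class frequencies copied from the output above
frequencies = {
    "empty": 0.759851,
    "wildebeest": 0.076248,
    "zorilla": 0.000016,
    "bat": 0.000002,
}

# Approximate number of sequences per class
approx_counts = {
    name: round(freq * n_sequences) for name, freq in frequencies.items()
}
print(approx_counts)
```

The rarest classes come out to only a handful of sequences in the entire training set, which is worth keeping in mind when you split off validation data.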

Now that we have a sense of what the data means, let's get access to the images themselves!

Dealing With Large Amounts of Data

Since there are millions of high-quality images, unzipping all the data takes a long time. We recommend starting your model development with a single season or two of data while the rest of the data downloads. In this benchmark, we'll work with seasons 1 and 3.

In [12]:
train_metadata = train_metadata[train_metadata.season.isin(['SER_S1', 'SER_S3'])]
train_labels = train_labels[train_labels.index.isin(train_metadata.index)]
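
One thing to watch for when subsetting: filtering with isin keeps the matching label rows, but not necessarily in the same order as the metadata. If your training code relies on the two tables lining up row for row, indexing the labels with the metadata index makes the alignment explicit. The sketch below uses tiny made-up tables with hypothetical sequence IDs to show the idea.

```python
import pandas as pd

# Toy metadata and labels, deliberately stored in different row orders
metadata = pd.DataFrame(
    {"season": ["SER_S1", "SER_S2", "SER_S3"]},
    index=pd.Index(["a", "b", "c"], name="seq_id"),
)
labels = pd.DataFrame(
    {"zebra": [0, 1, 0], "empty": [1, 0, 1]},
    index=pd.Index(["c", "b", "a"], name="seq_id"),
)

# Subset the metadata by season, as in the cell above
subset = metadata[metadata.season.isin(["SER_S1", "SER_S3"])]

# Select the label rows in exactly the order of the metadata index
aligned = labels.loc[subset.index]

# Both tables now share the same index, in the same order
assert aligned.index.equals(subset.index)
print(aligned)
```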

Add Full Image Path to the Data

Our local data is mounted under the /databig/raw directory in folders that match the name of the zipfile.

In [13]:
IMAGE_DIR = Path("/databig/raw")

We'll convert each entry in the file_name column to a Path object with the full path to our data.

In [14]:
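# Zero-pad the season number ('SER_S1' -> 'S01'). Taking only the last character
# would turn 'SER_S10' into 'S00'; we assume season 10's folder is named 'S10'.
train_metadata['file_name'] = train_metadata.apply(
    lambda x: (IMAGE_DIR / f"SnapshotSerengeti_S{int(x.season.split('_S')[1]):02d}_v2_0" / x.file_name), axis=1
)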
In [15]:
train_metadata.head()
Out[15]:
file_name datetime season
seq_id
SER_S1#B04#1#10 /databig/raw/SnapshotSerengeti_S01_v2_0/S1/B04/B04_R1/S1_B04_R1_PICT0010.JPG 2010-07-30 05:24:50 SER_S1
SER_S1#B04#1#11 /databig/raw/SnapshotSerengeti_S01_v2_0/S1/B04/B04_R1/S1_B04_R1_PICT0011.JPG 2010-07-30 20:54:16 SER_S1
SER_S1#B04#1#12 /databig/raw/SnapshotSerengeti_S01_v2_0/S1/B04/B04_R1/S1_B04_R1_PICT0012.JPG 2010-07-30 20:57:28 SER_S1
SER_S1#B04#1#13 /databig/raw/SnapshotSerengeti_S01_v2_0/S1/B04/B04_R1/S1_B04_R1_PICT0013.JPG 2010-08-01 17:35:58 SER_S1
SER_S1#B04#1#14 /databig/raw/SnapshotSerengeti_S01_v2_0/S1/B04/B04_R1/S1_B04_R1_PICT0014.JPG 2010-08-02 11:43:14 SER_S1
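
With full paths in place, it can be worth checking that the files actually exist before starting a long training run, especially while some seasons are still downloading. Here is a minimal sketch; find_missing is a hypothetical helper of our own, and the temporary directory just stands in for the real image folder.

```python
import tempfile
from pathlib import Path


def find_missing(paths):
    """Return the paths that do not point at an existing file."""
    return [Path(p) for p in paths if not Path(p).is_file()]


# Simulate an image folder where one of two expected files is absent
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "S1_B04_R1_PICT0010.JPG"
    present.write_bytes(b"")  # create an empty placeholder file
    absent = Path(tmp) / "S1_B04_R1_PICT0011.JPG"

    missing = find_missing([present, absent])
    missing_names = [p.name for p in missing]

print(missing_names)
```

On the real data, you would call find_missing(train_metadata.file_name) and drop or re-download anything it returns.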

Before we get into the weeds of modeling, let's take a quick break to look at some of the animals in this data!

In [16]:
from IPython.display import Image 

def look_at_random_animal(name, random_state, width=500):
    seq_ids = train_labels[train_labels[name] == 1].index
    file_names = train_metadata.loc[seq_ids].file_name
    filename = file_names.sample(random_state=random_state).values[0]
    return Image(filename=str(filename), width=width)

A particular animal that catches our eye is the ... zorilla? Let's check it out.

In [17]:
look_at_random_animal("zorilla", random_state=111)
Out[17]:

Cute. What else?

In [18]:
look_at_random_animal("wildebeest", random_state=101)
Out[18]: