A chimp next to the Zooniverse logo, the organization behind the crowd-sourced labels used in this competition.

Most of us don't get the chance to stroll around the jungle, gazing upon its Fantastic Beasts in awe. But thanks to motion- and heat-triggered camera traps, the jungle can now come to us. Camera traps enable researchers to observe the most sensitive habitats with minimal impact. Once set, camera traps can passively monitor a site, collecting many hours of footage.

However, annotating the footage collected by camera traps can be quite time-consuming. Even expert researchers must spend hundreds of hours on simple species annotation, time that they'd rather spend exploring the deeper questions related to the wildlife ecology of the seat of life on Earth.

In our brand new competition, we're helping make it easier for research teams to study camera trap footage by predicting the species present in a given video. Automated video species tagging could save many human hours of annotation, allowing researchers to focus on higher-level research and conservation efforts.

In this post, we'll walk through a very simple first pass model for species classification in camera trap footage. Video data can be intimidating, but this post will show how to load the data, make some predictions, and then submit those predictions to the competition.

Okay, to get things rolling, let's load up some basic tools of the trade.

Note: we're using Python 3 in this notebook. You can check which version of Python you're using by running `python -V`.
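If you would rather check from inside the notebook, something like the following sketch works. The 3.6 floor is our assumption, based on the f-strings used later in this post; `MIN_VERSION` and `python_is_recent_enough` are illustrative names, not part of any library.

```python
import sys

# Assumed minimum: the f-strings used later in this post need Python 3.6+.
MIN_VERSION = (3, 6)


def python_is_recent_enough(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True when the given interpreter version meets the minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)


# str.format is used here on purpose so the check itself runs on old versions.
print("Running Python {}.{}".format(sys.version_info[0], sys.version_info[1]))
print("OK" if python_is_recent_enough() else "Please upgrade to Python 3.6+")
```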

In [1]:

%matplotlib inline

import os

# let's not pollute this blog post with warnings
from warnings import filterwarnings

filterwarnings("ignore")

import keras
import numpy as np
import pandas as pd
import skvideo.io as skv
from tqdm import tqdm

Using TensorFlow backend.

Loading the data¶

Camera trap footage of some elephants strolling along, taken from the dataset!

On the data download page, we provide a couple of datasets to get started:

Camera trap footage: we have a few hundred thousand clips from camera traps around Africa. These are our main model inputs. The raw data is over 1TB, so we've created extremely downsampled versions of the dataset to facilitate faster prototyping. There is the micro version of the data, which is about 3.5 GB, and the nano version, which is about 1.5 GB. All versions are hosted as direct downloads as well as torrent files. For this benchmark we'll use the nano data.
Crowd-sourced species labels for camera trap training set: generated by thousands of citizen scientists at Chimp&See. These are our labels. Each row is indexed by a video filename and each column corresponds to a species that may or may not be present in the video as indicated by a 1 or 0 respectively.
Submission format: This gives us the filenames and columns of our submission prediction, filled with all zeros as a baseline. The filenames should be used to index into the video directory (e.g., nano) to generate test predictions.

One of the fun things about this challenge is that multiple species may be present in a given video, making this a multilabel classification challenge. That's why each video has so many columns associated with it.
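To make the label layout concrete, here is a tiny sketch of how a set of species maps onto a 0/1 row. The four-column `SPECIES` list is a made-up subset (the real file has 24 columns) and `encode_labels` is a hypothetical helper for illustration only.

```python
# Hypothetical subset of the label columns; the real file has 24 of them.
SPECIES = ["blank", "chimpanzee", "duiker", "elephant"]


def encode_labels(present, species=SPECIES):
    """Turn a set of species names into a 0/1 row, one entry per column."""
    unknown = set(present) - set(species)
    if unknown:
        raise ValueError("unknown species: {}".format(sorted(unknown)))
    return [1.0 if name in present else 0.0 for name in species]


# A video with one species, and a video where two species appear together.
only_duiker = encode_labels({"duiker"})
chimp_and_elephant = encode_labels({"chimpanzee", "elephant"})

print(only_duiker)         # exactly one column is set
print(chimp_and_elephant)  # two columns are set, which is what makes this multilabel
```

A multiclass problem would allow only one 1 per row; here any number of columns can be set at once.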

Let's check out some of the training labels!

In [19]:

# load the data
labelpath = os.path.join("..", "data", "final", "train_labels.csv")
train_labels = pd.read_csv(labelpath, index_col="filename")

In [20]:

train_labels.head()

Out[20]:

	bird	blank	cattle	chimpanzee	elephant	forest buffalo	gorilla	hippopotamus	human	hyena	...	other (primate)	pangolin	porcupine	reptile	rodent	small antelope	small cat	wild dog	duiker	hog
filename
000libDc84.mp4	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
003TeGtbkD.mp4	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
006jFoesFi.mp4	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0
008uxqP8IN.mp4	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0
0094UxdyyZ.mp4	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

5 rows × 24 columns

In [21]:

# How many training videos do we have, and what species are present?
train_labels.info()

<class 'pandas.core.frame.DataFrame'>
Index: 204130 entries, 000libDc84.mp4 to zzzu2lK8bC.mp4
Data columns (total 24 columns):
bird                   204130 non-null float64
blank                  204130 non-null float64
cattle                 204130 non-null float64
chimpanzee             204130 non-null float64
elephant               204130 non-null float64
forest buffalo         204130 non-null float64
gorilla                204130 non-null float64
hippopotamus           204130 non-null float64
human                  204130 non-null float64
hyena                  204130 non-null float64
large ungulate         204130 non-null float64
leopard                204130 non-null float64
lion                   204130 non-null float64
other (non-primate)    204130 non-null float64
other (primate)        204130 non-null float64
pangolin               204130 non-null float64
porcupine              204130 non-null float64
reptile                204130 non-null float64
rodent                 204130 non-null float64
small antelope         204130 non-null float64
small cat              204130 non-null float64
wild dog               204130 non-null float64
duiker                 204130 non-null float64
hog                    204130 non-null float64
dtypes: float64(24)
memory usage: 38.9+ MB

There are a lot of cool species in these videos! Also, no NaNs in sight. It's going to be a good day.

In [24]:

# How many of each species?
train_labels.sum(axis=0).sort_values(ascending=False)

Out[24]:

blank                  122270.0
duiker                  21601.0
other (primate)         20453.0
human                   20034.0
chimpanzee               5045.0
hog                      4650.0
rodent                   2911.0
bird                     2386.0
other (non-primate)      1883.0
elephant                 1085.0
porcupine                 569.0
cattle                    372.0
small antelope            273.0
large ungulate            224.0
leopard                   209.0
hippopotamus              175.0
gorilla                   174.0
small cat                  79.0
pangolin                   63.0
wild dog                   21.0
hyena                      10.0
forest buffalo              9.0
reptile                     8.0
lion                        2.0
dtype: float64

Looking through the data, we see that most of the videos are blank, meaning there is no species present. This could mean that the traps are triggered too easily, but in any case it's useful to keep in mind for modeling.
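To put numbers on that imbalance, the counts above can be turned into fractions of the training set. This is plain arithmetic on values copied from the outputs in this post (only a handful of species are included); `label_fractions` is an illustrative helper name.

```python
# Values copied from the outputs above; only a few species are listed here.
N_VIDEOS = 204130
counts = {
    "blank": 122270,
    "duiker": 21601,
    "other (primate)": 20453,
    "human": 20034,
    "chimpanzee": 5045,
    "lion": 2,
}


def label_fractions(counts, n_videos):
    """Fraction of videos in which each label is set."""
    return {name: count / n_videos for name, count in counts.items()}


fractions = label_fractions(counts, N_VIDEOS)
for name, frac in fractions.items():
    print(f"{name:<18}{frac:8.4%}")
```

Roughly six in ten videos are blank, while the rarest species appear in a vanishing share of clips, so a model can reach a low loss while learning very little about rare animals.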

On the upside, there are thousands of chimps, a bunch of elephants, and tons of DrivenData's new official mascot: the duiker.

In [25]:

# How many videos have more than one species present?
(train_labels.sum(axis=1) > 1).sum()

Out[25]:

OK, not too many in the training data, but still worth considering since we have the power of deep learning at our fingertips.
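If you want to know which species show up together, not just how often, a small co-occurrence count is one way to look. The sketch below uses a made-up toy label matrix (not competition data) and standard-library tools only; `cooccurring_pairs` is a hypothetical helper.

```python
from collections import Counter
from itertools import combinations

# Toy label rows, invented for illustration; columns follow SPECIES order.
SPECIES = ["blank", "chimpanzee", "duiker", "hog"]
rows = [
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]


def cooccurring_pairs(rows, species):
    """Count how often each pair of species is labeled in the same video."""
    pairs = Counter()
    for row in rows:
        present = [name for name, flag in zip(species, row) if flag]
        pairs.update(combinations(present, 2))
    return pairs


pair_counts = cooccurring_pairs(rows, SPECIES)
n_multi = sum(1 for row in rows if sum(row) > 1)

print(n_multi, "videos with more than one species")
print(pair_counts.most_common())
```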

We're almost ready to turn to the prediction task, but first a word on working with the videos themselves.

Working with video can be annoying, to say the least. To facilitate faster model prototyping, we have written a custom Dataset class that handles batch generation and stores predictions. Its batch generators can be passed to the keras .fit_generator() method to serve batches of training data. It uses the filenames in the data csvs to index into the video directory. It also stores useful information about the dataset, such as the number of samples, the size of the videos, and even validation splits!
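For readers who have not written a batch generator before, here is a minimal sketch of the general pattern: an endless generator that yields (inputs, labels) pairs. It is not the actual Dataset class; `toy_batches` is a hypothetical name, the data is all zeros, and the 15-frame, 16x16, single-channel shape is an assumption based on the model summary later in this post.

```python
import numpy as np


def toy_batches(videos, labels, batch_size, seed=0):
    """Yield (videos, labels) batches forever, reshuffling after each pass."""
    rng = np.random.RandomState(seed)
    n = len(videos)
    while True:
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            yield videos[idx], labels[idx]


# Fake data: 10 clips, 15 frames each, 16x16 pixels, 1 channel, 24 labels.
videos = np.zeros((10, 15, 16, 16, 1), dtype=np.float32)
labels = np.zeros((10, 24), dtype=np.float32)

gen = toy_batches(videos, labels, batch_size=4)
x, y = next(gen)
print(x.shape, y.shape)
```

Because the generator never ends, the training loop has to be told how many batches make up an epoch, which is the role of steps_per_epoch.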

The class, which is available if you download these two files and place them in the same directory as your notebook, has only been tested with the nano and micro versions of the datasets. If you try to use it with the raw version, some edits will likely be necessary since the raw videos aren't square.

The dataset class also assumes that the datapath directory contains:

a directory for the dataset_type, named nano, micro, or raw
train_labels.csv
submission_format.csv
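Checking that layout up front can save a confusing stack trace later. A minimal sketch using only the standard library follows; `missing_inputs` is a hypothetical helper, not part of the Dataset class, and the demo builds a throwaway directory rather than touching your real data.

```python
import tempfile
from pathlib import Path


def missing_inputs(datapath, dataset_type="nano"):
    """Return the expected paths that do not exist under datapath."""
    root = Path(datapath)
    expected = [
        root / dataset_type,
        root / "train_labels.csv",
        root / "submission_format.csv",
    ]
    return [str(path) for path in expected if not path.exists()]


# Demo on a temporary directory so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nano").mkdir()
    (root / "train_labels.csv").touch()
    before = missing_inputs(root)  # submission_format.csv is still missing
    (root / "submission_format.csv").touch()
    after = missing_inputs(root)   # everything is in place now

print(before)
print(after)
```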

We're going to store the instance of the dataset as data, so data.anything is using the class. Feel free to play around with this and build it out more, or abandon it completely!

In [2]:

# import the custom data handler
from primatrix_dataset_utils import Dataset

In [3]:

datapath = os.path.join("..", "data", "final")
data = Dataset(datapath=datapath, reduce_frames=True, batch_size=32, test=False)

In [4]:

# confirm number of classes
data.num_classes

Out[4]:

24

In [5]:

# reduced frame count for faster processing
data.num_frames

Out[5]:

15

In [6]:

# check our batch size
data.batch_size

Out[6]:

32

In [7]:

# number of training samples
data.num_samples

Out[7]:

We're not going to train on all of those samples. We'll instead use around 30,000.

The Error Metric - AggregatedLogLoss¶

Performance is evaluated according to an aggregated log loss. This is similar to the binary log loss but, to account for the possibility of multiple labels, it treats each column as its own independent binary log loss and sums the results over all labels.
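As a rough sketch of that description, the function below computes a binary log loss for each label column and sums them. The normalization (mean over videos within each column) and the clipping constant are our assumptions; check the competition's problem description for the exact definition before relying on it.

```python
import numpy as np


def aggregated_log_loss(y_true, y_pred, eps=1e-15):
    """Sum over label columns of the per-column mean binary log loss.

    Assumption: each column's loss is averaged over videos before summing.
    """
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs for confident wrong predictions.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    per_entry = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    per_label = per_entry.mean(axis=0)  # one binary log loss per column
    return per_label.sum()


y_true = [[1, 0], [0, 1]]
uninformed = aggregated_log_loss(y_true, [[0.5, 0.5], [0.5, 0.5]])
perfect = aggregated_log_loss(y_true, [[1, 0], [0, 1]])
print(uninformed, perfect)
```

Predicting 0.5 everywhere costs ln 2 per column under this definition, which gives a handy reference point when reading scores.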

To see how this metric manifests in our Keras model below, note the sigmoid activation of the final layer of the network, as well as the binary_crossentropy loss function specified in model.compile(). Keras does not need to be told that the problem is multilabel: binary_crossentropy is computed separately for each output unit and then averaged across them, so with sigmoid outputs and 0/1 label rows each species is treated as its own binary problem.
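The reason for sigmoid rather than softmax is easy to see numerically. In this small NumPy sketch (the logits are invented for illustration), sigmoid scores each label independently, while softmax forces the scores to compete and sum to one.

```python
import numpy as np


def sigmoid(z):
    """Independent per-label probabilities."""
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    """Mutually exclusive class probabilities that sum to one."""
    shifted = z - z.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


# Invented scores for three labels: two strongly present, one absent.
logits = np.array([3.0, 2.5, -4.0])
sig = sigmoid(logits)
soft = softmax(logits)

print("sigmoid:", sig.round(3), "sum =", sig.sum().round(3))
print("softmax:", soft.round(3), "sum =", soft.sum().round(3))
```

With sigmoid, both of the first two labels can be near 1 at once; with softmax they would have to split the probability mass between them.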

Building a Model¶

What can't it do?

There are many ways we could approach this modeling problem. One of the simplest might be to extract a frame from each video and train a basic image classifier on the result. Of course, animals may move in and out of frame, which makes the choice of frame very important. The most sophisticated approaches might use the raw video data as input to get the most out of every pixel. Here, we'll stick with something in between.
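As an illustration of the frame-based options, here is a sketch of two ways to pick frames from a video tensor: the single middle frame, or a few evenly spaced ones. Both helpers are hypothetical, and this is not necessarily how the Dataset class reduces frames; the 15-frame, 16x16 shape is an assumption.

```python
import numpy as np


def middle_frame(video):
    """video: (frames, height, width, channels) -> one (height, width, channels) frame."""
    return video[video.shape[0] // 2]


def evenly_spaced_frames(video, n):
    """Pick n frames spread across the clip, including the first and last."""
    idx = np.linspace(0, video.shape[0] - 1, num=n).round().astype(int)
    return video[idx]


# Fake clip where every pixel of frame i has the value i, so picks are easy to see.
video = np.arange(15, dtype=float).reshape(15, 1, 1, 1) * np.ones((15, 16, 16, 1))

print(middle_frame(video).shape)
print(evenly_spaced_frames(video, 3)[:, 0, 0, 0])
```

A single frame turns the task into image classification; several frames keep some of the motion information that a recurrent layer can use.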

We're going to use keras (not PyTorch, sorry ;-)) to train a multilabel video classifier on the nano dataset, taking a downsampled version of the nano videos as input. Additionally, in the interest of training time, we're going to train our model on a subset of training data. However, we'll predict on the full set.

Our general workflow will be to:

Build a model architecture
Train for a couple epochs with validation
Generate predictions for the entire test set

First, let's consider a couple of key aspects of the model environment.

Model Environment¶

The goal of this benchmark is to provide a clear path from data download to prediction submission. In that spirit, we're going to train a simple model on a subset of the nano version of the camera trap footage.

Since we're processing video tensors, using a GPU will still provide a substantial speedup. If using Amazon Web Services is an option for you, we recommend:

Spinning up an EC2 instance with a GPU
Installing FFMPEG to the instance for video processing
Setting up your deep learning environment with Jupyter (we're going to use keras with a tensorflow backend)
Creating an SSH tunnel so that you can access your GPU-powered Jupyter notebook from the comfort of your own browser!

As always, make sure to avoid unnecessary charges by stopping your EC2 instance when not training or editing code!

GPU¶

First, we check that this AWS instance sees our GPU!

If the below doesn't work for you, here is a relatively painless guide to setting up your GPU and tensorflow on AWS Ubuntu 16.04.

In [8]:

%%bash
nvidia-smi

Wed Oct 18 09:02:15 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.88                 Driver Version: 375.88                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:00:1E.0     Off |                    0 |
| N/A   50C    P0    61W / 149W |      0MiB / 11439MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Great, now let's make sure our tensorflow backend can see the GPU, using the following handy function.

In [9]:

from tensorflow.python.client import device_lib


def get_available_gpus():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos if x.device_type == "GPU"]


print(f"Available GPUs:\t{get_available_gpus()}")

Available GPUs:	['/gpu:0']

Alright! Now we're ready to go deep...

Build the Model¶

We're going to classify our data using a very simple version of the Long-term Recurrent Convolutional Network deep learning architecture, also known as the LRCN architecture:

LRCNs extract features using convolutional layers and pass those as inputs to a Long Short-Term Memory (LSTM) network for classification.

We can use the built-in keras TimeDistributed wrapper to easily enable temporal convolutional processing of video tensors.
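Conceptually, applying a layer in a time-distributed way means running the same per-frame function on every frame of every video. The NumPy sketch below imitates that idea by folding the time axis into the batch axis and unfolding it afterwards. It illustrates the concept only and is not how Keras implements the wrapper; `mean_brightness` is a stand-in for a convolutional stack.

```python
import numpy as np


def time_distributed(frame_fn, videos):
    """Apply frame_fn to every frame of a (batch, frames, ...) tensor."""
    batch, frames = videos.shape[:2]
    # Fold time into the batch axis so frame_fn sees a plain batch of images.
    flat = videos.reshape((batch * frames,) + videos.shape[2:])
    out = frame_fn(flat)
    # Unfold back to one feature vector per frame per video.
    return out.reshape((batch, frames) + out.shape[1:])


def mean_brightness(frames):
    """Stand-in feature extractor: (n, h, w, c) -> (n, 1)."""
    return frames.mean(axis=(1, 2, 3)).reshape(-1, 1)


videos = np.random.RandomState(0).rand(2, 15, 16, 16, 1)
features = time_distributed(mean_brightness, videos)
print(features.shape)
```

The result keeps the frame axis, so a recurrent layer can then read the per-frame features in order.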

Let's import the keras objects we need.

In [10]:

from keras.models import Sequential
from keras.layers import TimeDistributed, Conv2D, MaxPooling2D, Flatten, Dropout, Dense
from keras.layers.recurrent import LSTM

Simple Keras LRCN¶

The model we build below is by no means optimized, but it's a start! The goal of this benchmark is to present one possible workflow from data download to competition submission.

In [11]:

# instantiate model
model = Sequential()

# add three time-distributed convolutional layers for feature extraction
model.add(
    TimeDistributed(
        Conv2D(64, (3, 3), activation="relu"),
        input_shape=(data.num_frames, data.width, data.height, 1),
    )
)
model.add(TimeDistributed(MaxPooling2D((2, 2), strides=(1, 1))))

model.add(TimeDistributed(Conv2D(128, (4, 4), activation="relu")))
model.add(TimeDistributed(MaxPooling2D((2, 2), strides=(2, 2))))

model.add(TimeDistributed(Conv2D(256, (4, 4), activation="relu")))
model.add(TimeDistributed(MaxPooling2D((2, 2), strides=(2, 2))))


# extract features and dropout
model.add(TimeDistributed(Flatten()))
model.add(Dropout(0.5))

# input to LSTM
model.add(LSTM(256, return_sequences=False, dropout=0.5))

# classifier with sigmoid activation for multilabel
model.add(Dense(data.num_classes, activation="sigmoid"))

# compile the model with binary_crossentropy loss for multilabel
model.compile(optimizer="rmsprop", loss="binary_crossentropy")

# look at the params before training
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
time_distributed_1 (TimeDist (None, 15, 14, 14, 64)    640       
_________________________________________________________________
time_distributed_2 (TimeDist (None, 15, 13, 13, 64)    0         
_________________________________________________________________
time_distributed_3 (TimeDist (None, 15, 10, 10, 128)   131200    
_________________________________________________________________
time_distributed_4 (TimeDist (None, 15, 5, 5, 128)     0         
_________________________________________________________________
time_distributed_5 (TimeDist (None, 15, 2, 2, 256)     524544    
_________________________________________________________________
time_distributed_6 (TimeDist (None, 15, 1, 1, 256)     0         
_________________________________________________________________
time_distributed_7 (TimeDist (None, 15, 256)           0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 15, 256)           0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 256)               525312    
_________________________________________________________________
dense_1 (Dense)              (None, 24)                6168      
=================================================================
Total params: 1,187,864
Trainable params: 1,187,864
Non-trainable params: 0
_________________________________________________________________
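The shapes and parameter counts in this summary can be reproduced by hand, which is a useful sanity check when you start changing layers. The sketch assumes 16x16 single-channel frames (inferred from the 14x14 output of the first 3x3 convolution, not stated in the post), unpadded convolutions, and the standard LSTM parameter formula; the helper names are ours.

```python
def conv_out(size, kernel, stride=1):
    """Output side length of an unpadded convolution or pooling window."""
    return (size - kernel) // stride + 1


def conv2d_params(kernel, in_channels, filters):
    """Weights plus one bias per filter."""
    return (kernel * kernel * in_channels + 1) * filters


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (input_dim + units + 1) * units


def dense_params(input_dim, units):
    return (input_dim + 1) * units


# (kernel, stride) for conv, pool, conv, pool, conv, pool in the model above.
size = 16  # assumed input side length
sizes = []
for kernel, stride in [(3, 1), (2, 1), (4, 1), (2, 2), (4, 1), (2, 2)]:
    size = conv_out(size, kernel, stride)
    sizes.append(size)

params = [
    conv2d_params(3, 1, 64),
    conv2d_params(4, 64, 128),
    conv2d_params(4, 128, 256),
    lstm_params(sizes[-1] * sizes[-1] * 256, 256),
    dense_params(256, 24),
]

print(sizes)
print(params, sum(params))
```

Nearly all of the weights sit in the last convolution and the LSTM, so those are the first places to look if you need a smaller or larger model.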

Train the Model¶

Alright, let's train this net! Notice below that the inline comments give the values that would train and validate on the entire dataset. We've also set epochs=2 because, well, we don't have all day over here. Training a winning model is your job!

In [12]:

# train the model with validation
model.fit_generator(
    data.batches(),
    steps_per_epoch=500,  # data.num_batches to train on full set
    epochs=2,
    validation_data=data.val_batches(),
    validation_steps=30,  # data.num_val_batches to validate on full set
)

Epoch 1/2
500/500 [==============================] - 2276s - loss: 0.1087 - val_loss: 0.1032
Epoch 2/2
500/500 [==============================] - 2247s - loss: 0.0952 - val_loss: 0.0903

Out[12]:

<keras.callbacks.History at 0x7f53f0fe1588>
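It helps to translate the step counts from the training call into sample counts. This is plain arithmetic on the settings used in this post; whether the batches in different epochs contain distinct videos depends on how the Dataset class shuffles, which we do not assume here.

```python
# Settings copied from the Dataset constructor and the fit_generator call.
BATCH_SIZE = 32
STEPS_PER_EPOCH = 500
EPOCHS = 2
VALIDATION_STEPS = 30

samples_per_epoch = BATCH_SIZE * STEPS_PER_EPOCH
total_samples_seen = samples_per_epoch * EPOCHS
validation_samples = BATCH_SIZE * VALIDATION_STEPS

print(f"training samples per epoch: {samples_per_epoch}")
print(f"training samples over all epochs: {total_samples_seen}")
print(f"validation samples per epoch: {validation_samples}")
```

So the benchmark sees about 32,000 training samples in total, a small slice of the 204,130 labeled videos, which leaves plenty of room for improvement by simply training longer.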

Save the Model¶

Deep learning networks can take a long time to train, so it's always a good idea to save the learned parameters!

In [13]:

# save model
benchmark_model_name = "benchmark-model.h5"
model.save(benchmark_model_name)

Time to Predict and Submit¶

And now we make our predictions! We will load the saved model and test on every video in the index of the submission_format.csv. As before, the batch generation is handled by our custom Dataset class, which is available for you to download.

In [14]:

# load model
from keras.models import load_model

trained_model = load_model(benchmark_model_name)

In [15]:

# generate predictions
# create the test generator once so that every next() call advances to a new batch
test_batches = data.test_batches()

for _ in tqdm(range(data.num_test_batches)):
    # make predictions on batch
    results = trained_model.predict_proba(
        next(test_batches), batch_size=data.batch_size, verbose=0
    )

    # update submission format dataframe stored in dataset object
    data.update_predictions(results)

100%|██████████| 2733/2733 [3:20:20<00:00,  4.40s/it]

Save Predictions¶

All we have to do now is save our predictions and make a submission. Just to confirm that we're following the submission format, let's look at the first few rows:

In [17]:

# save results!
data.predictions.to_csv(os.path.join(data.datapath, "predictions.csv"))

In [18]:

!head -n 5 ../data/final/predictions.csv

filename,bird,blank,cattle,chimpanzee,elephant,forest buffalo,gorilla,hippopotamus,human,hyena,large ungulate,leopard,lion,other (non-primate),other (primate),pangolin,porcupine,reptile,rodent,small antelope,small cat,wild dog,duiker,hog
001Ay6nkVD.mp4,0.009775553829967976,0.6810401678085327,0.000768320809584111,0.018967922776937485,0.003553479677066207,1.962547235834222e-09,1.8560382386567653e-06,8.171169838533388e-07,0.05778587982058525,6.748495096076113e-10,4.941477982356446e-07,1.8580927871880704e-06,2.797627240980205e-09,0.006838700734078884,0.09401832520961761,1.117014813978301e-09,0.001039090333506465,1.2601309995474708e-09,0.010130636394023895,2.946824679384008e-06,1.2481896627392075e-09,1.2969243456950608e-09,0.0625377669930458,0.016587795689702034
009Of0AOw7.mp4,0.010196062736213207,0.6818720102310181,0.0007494853343814611,0.018015071749687195,0.003547474043443799,1.8182472194538946e-09,1.6828773823363008e-06,7.917624884612451e-07,0.056747425347566605,6.304161637160632e-10,4.778618745149288e-07,1.8334073956793873e-06,2.644408914065366e-09,0.006735641974955797,0.09705659002065659,1.018643502881389e-09,0.0009638641495257616,1.1905175734128193e-09,0.009979983791708946,2.7651515210891375e-06,1.2102715496453698e-09,1.1860050719292303e-09,0.061693765223026276,0.015941549092531204
00BqeS6Kmm.mp4,0.009766797535121441,0.6863900423049927,0.000777286768425256,0.020364506170153618,0.0036751865409314632,2.250680974924535e-09,2.0353256786620477e-06,8.941434543885407e-07,0.058506984263658524,7.514975863820439e-10,5.292344553708972e-07,2.106661213474581e-06,3.124856151615063e-09,0.007320872042328119,0.0929027795791626,1.313774977695914e-09,0.0011194864055141807,1.3882051064229017e-09,0.010845153592526913,3.3483854622318177e-06,1.282071449004718e-09,1.419261819179951e-09,0.062566377222538,0.017946207895874977
00L6LKgIh9.mp4,0.009923667646944523,0.6859549880027771,0.0007874212460592389,0.02031836286187172,0.003745612921193242,2.2426076551340657e-09,2.0369450339785544e-06,8.862347158355988e-07,0.05904819071292877,7.500398080395598e-10,5.217544298830035e-07,2.1110597572260303e-06,3.1254938637204077e-09,0.007293570786714554,0.09231553971767426,1.2889048717212859e-09,0.0011299143079668283,1.3915241181550186e-09,0.010941104032099247,3.381444685146562e-06,1.2906540280965828e-09,1.418379635964584e-09,0.06264828890562057,0.01783192902803421

Looks good, now we can submit it to the competition.
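Eyeballing the first rows is a good start; a quick programmatic check can catch problems further down the file. The sketch below uses only the standard library and a two-column toy header. `check_submission` is a hypothetical helper and the rules it enforces (matching header, numeric values in [0, 1]) are our assumptions about what a valid submission looks like.

```python
import csv
import io


def check_submission(csv_text, expected_columns):
    """Return a list of human-readable problems found in a submission CSV."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    problems = []
    if header != ["filename"] + list(expected_columns):
        problems.append("header does not match the expected columns")
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(header):
            problems.append(f"line {line_no}: expected {len(header)} fields, got {len(row)}")
            continue
        try:
            probs = [float(value) for value in row[1:]]
        except ValueError:
            problems.append(f"line {line_no}: non-numeric prediction")
            continue
        if any(not 0.0 <= p <= 1.0 for p in probs):
            problems.append(f"line {line_no}: prediction outside [0, 1]")
    return problems


columns = ["bird", "blank"]  # toy subset of the 24 species columns
good = "filename,bird,blank\n001Ay6nkVD.mp4,0.0098,0.6810\n"
bad = "filename,bird,blank\n001Ay6nkVD.mp4,1.2,0.5\n009Of0AOw7.mp4,0.1\n"

good_problems = check_submission(good, columns)
bad_problems = check_submission(bad, columns)
print(good_problems)
print(bad_problems)
```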

Submit to Leaderboard¶

Woohoo! It's a start! And that's exactly what we intend with these benchmarks. We're sure you'll be able to top this model in no time, and we can't wait to see what you come up with.

Just don't be fooled by imposters!
