by Michael Schlauch
Introduction to image classification using camera trap images¶
Camera traps are a tool used by conservationists to study and monitor a wide range of ecologies while limiting human interference. However, they also generate a vast amount of data that quickly exceeds the capacity of humans to sift through. That's where machine learning can help! Advances in computer vision can help automate tasks like species detection and identification, so that humans can spend more time learning from and protecting these ecologies.
This post walks through an initial approach to the Conservision Practice Area challenge on DrivenData, a practice competition where you identify animal species in a real-world dataset of wildlife images from Taï National Park in Côte d'Ivoire. The competition is designed to be accessible to participants at all levels, which makes it a great place to dive into the world of data science competitions and computer vision.
We will go through the following steps in order to train a PyTorch model that can be used to identify the species of animal in a given image:
- Set up your environment (feel free to skip)
- Download the data
- Explore the data
- Split into train and evaluation sets
- Build the Model
- Training
- Evaluation
- Create submission
The only prerequisite is a basic familiarity with Python and some of the core concepts behind deep learning. We'll guide you step by step through the rest.
Let's get started!
1. Set up your environment¶
Feel free to skip this step if you already have an environment set up.
The folks on our team typically use conda to manage environments. Once you have conda installed you can create a new "conserviz" environment (name it whatever you like) with:
conda create -n conserviz python=3.8
Then we activate the new environment and install the required libraries with pip. The pip command below includes all the libraries we'll need for this notebook. Once everything is installed, launch a Jupyter notebook from this new environment.
conda activate conserviz
pip install pandas matplotlib Pillow tqdm scikit-learn torch torchvision
2. Download the data¶
Download the competition data from the Data Download page. You'll need to first register for the competition by clicking on "Compete" and agreeing to the rules.
The competition.zip file contains everything you need to take part in this competition, including this notebook benchmark.ipynb. Unzip the archive into a location of your choice. The file structure should look like this:
├── benchmark.ipynb
├── submission_format.csv
├── test_features
│ ├── ZJ000000.jpg
│ ├── ZJ000001.jpg
│ └── ...
├── test_features.csv
├── train_features
│ ├── ZJ016488.jpg
│ ├── ZJ016489.jpg
│ └── ...
├── train_features.csv
└── train_labels.csv
Next, let's import some of the usual suspects:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
from PIL import Image
from tqdm import tqdm
Read in the train and test CSVs first and see what they look like.
train_features = pd.read_csv("train_features.csv", index_col="id")
test_features = pd.read_csv("test_features.csv", index_col="id")
train_labels = pd.read_csv("train_labels.csv", index_col="id")
The features CSVs contain the image ID, filepath, and site ID for each image.
train_features.head()
test_features.head()
The train_labels CSV is an indicator matrix of the species identified in each of the training images. Some images are labeled as "blank" if no animal was detected.
train_labels.head()
Let's store a sorted list of the labels, so that we can sort the inputs and outputs to our model in a consistent way.
species_labels = sorted(train_labels.columns.unique())
species_labels
3. Explore the data¶
Now let's see what some of the actual images look like. The code below iterates through the list of species and selects a single random image from each species to display, along with its image ID and label. You can try changing the random_state variable to display a new set of images.
import matplotlib.image as mpimg
random_state = 42
# we'll create a grid with 8 positions, one for each label (7 species, plus blanks)
fig, axes = plt.subplots(nrows=4, ncols=2, figsize=(20, 20))
# iterate through each species
for species, ax in zip(species_labels, axes.flat):
    # get an image ID for this species
    img_id = (
        train_labels[train_labels.loc[:, species] == 1]
        .sample(1, random_state=random_state)
        .index[0]
    )
    # reads the filepath and returns a numpy array
    img = mpimg.imread(train_features.loc[img_id].filepath)
    # plot the image with its ID and species as the title
    ax.imshow(img)
    ax.set_title(f"{img_id} | {species}")
Can you spot the animals? I'm still not sure where the rodent is. Birds can be tough to spot too.
Let's look at the distribution of species across the training set, first in terms of overall counts and then in percentage terms.
train_labels.sum().sort_values(ascending=False)
train_labels.sum().divide(train_labels.shape[0]).sort_values(ascending=False)
In case you're curious, this distribution is not exactly what we find in the wild. The competition dataset has been curated a little bit to produce a more uniform distribution than we would see in the actual data.
There's a lot more data exploration to do. For example, you might also want to look at the distribution of image dimensions or camera trap sites (there's an optional sketch of this below). But since our primary goal here is to develop a benchmark, let's move on to the modeling!
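If you do want a quick look first, here's a minimal, optional sketch of that extra exploration. It assumes the site column in train_features is named site (matching the description of the features CSVs above), and it only opens a small random sample of images to keep things fast:
# optional exploration: camera trap sites and image dimensions
# (assumes the site column in train_features is named "site")
print(train_features["site"].value_counts().head(10))

# image dimensions for a random sample of 100 training images
sample_paths = train_features["filepath"].sample(100, random_state=1)
dims = pd.Series([Image.open(path).size for path in sample_paths])
print(dims.value_counts().head(10))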
4. Split into train and evaluation sets¶
First, we'll need to split the images into train and eval sets. We'll put aside 25% of the data for evaluation and stratify by the target labels to ensure we have similar relative frequencies of each class in the train and eval sets.
For the purposes of this benchmark, we're also going to limit ourselves to a 50% subset of the training data, just so that things run faster. But feel free to adjust frac or remove it entirely if you want to run the training on the full set.
from sklearn.model_selection import train_test_split
frac = 0.5
y = train_labels.sample(frac=frac, random_state=1)
x = train_features.loc[y.index].filepath.to_frame()
# note that we are casting the species labels to an indicator/dummy matrix
x_train, x_eval, y_train, y_eval = train_test_split(
    x, y, stratify=y, test_size=0.25
)
Here's what x_train and y_train look like now:
x_train.head()
y_train.head()
x_train.shape, y_train.shape, x_eval.shape, y_eval.shape
Next, let's validate that our split has resulted in roughly similar relative distributions of species across the train and eval sets (because of how we passed stratify=y above).
split_pcts = pd.DataFrame(
    {
        "train": y_train.idxmax(axis=1).value_counts(normalize=True),
        "eval": y_eval.idxmax(axis=1).value_counts(normalize=True),
    }
)
print("Species percentages by split")
(split_pcts.fillna(0) * 100).astype(int)
Good, this looks as expected.
5. Build the Model¶
Now we can start building our model.
The Dataset¶
First, we'll create an ImagesDataset class that will define how we access our data and any transformations we might want to apply.
This new class will inherit from the PyTorch Dataset class, but we'll also need to define our own __init__, __len__ and __getitem__ special methods:
- __init__ will instantiate the dataset object with two dataframes: an x_train df containing image IDs and image file paths, and a y_train df containing image IDs and labels. This will run once when we first create the dataset object, e.g. with dataset = ImagesDataset(x_train, y_train).
- __getitem__ will define how we access a sample from the data. This method gets called whenever we use an indexing operation like dataset[index]. In this case, whenever we access a particular image sample (for example, to get the first image we'd do dataset[0]), the following will happen:
  - look up the image filepath using the index
  - load the image with PIL.Image
  - apply some transformations (more on this below)
  - return a dictionary containing the image ID, the image itself as a Tensor, and a label (if it exists)
- __len__ simply returns the size of the dataset, which we do by calling len on the input dataframe.
import torch
from torch.utils.data import Dataset
from torchvision import transforms
class ImagesDataset(Dataset):
    """Reads in an image, transforms pixel values, and serves
    a dictionary containing the image id, image tensors, and label.
    """

    def __init__(self, x_df, y_df=None):
        self.data = x_df
        self.label = y_df
        self.transform = transforms.Compose(
            [
                transforms.Resize((224, 224)),
                transforms.ToTensor(),
                transforms.Normalize(
                    mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)
                ),
            ]
        )

    def __getitem__(self, index):
        image = Image.open(self.data.iloc[index]["filepath"]).convert("RGB")
        image = self.transform(image)
        image_id = self.data.index[index]
        # if we don't have labels (e.g. for the test set), just return the image and image id
        if self.label is None:
            sample = {"image_id": image_id, "image": image}
        else:
            label = torch.tensor(
                self.label.iloc[index].values, dtype=torch.float
            )
            sample = {"image_id": image_id, "image": image, "label": label}
        return sample

    def __len__(self):
        return len(self.data)
Notice that we are also defining a set of transformations, which are defined in the __init__ and called in the __getitem__ special methods. These are applied to each image before returning it. Here's what each of those transformations does and why:
- transforms.Resize((224, 224)): ResNet50 was trained on images of size 224x224, so we resize to the same dimensions here. See the PyTorch docs and the ResNet paper.
- transforms.ToTensor() converts the image to a tensor. Since we are passing in a PIL Image at this point, PyTorch can recognize it as an RGB image and will automatically convert the input values from the range [0, 255] to the range [0, 1]. See more from the PyTorch docs.
- transforms.Normalize(...) normalizes the image tensors using the mean and standard deviation of ImageNet images. Because this transformation was applied to the images used to train the ResNet model, we want to do the same here with our images. See more from the PyTorch docs on pretrained models.
The DataLoader¶
Next, we need to load the dataset into a dataloader. The DataLoader class lets us iterate through our dataset in batches.
from torch.utils.data import DataLoader
train_dataset = ImagesDataset(x_train, y_train)
train_dataloader = DataLoader(train_dataset, batch_size=32)
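As a quick sanity check (not strictly necessary), we can pull a single batch from the dataloader and confirm the tensor shapes are what we expect: with a batch size of 32, 224x224 RGB images, and 8 label columns, the shapes should be [32, 3, 224, 224] and [32, 8].
# grab one batch and inspect the shapes of its tensors
batch = next(iter(train_dataloader))
print(batch["image"].shape)  # expect torch.Size([32, 3, 224, 224])
print(batch["label"].shape)  # expect torch.Size([32, 8])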
The data pieces are now largely in place!
6. Training¶
Now it's time to start building our model and then training it.
We'll use a pretrained ResNet50 model as our backbone. ResNets are one of the more popular networks for image classification tasks. The pretrained model outputs a 2048-dimension embedding, which we will then connect to two more dense layers, with a ReLU and Dropout step in between.
These final layers, defined in model.fc, are the new "head" of our model, and allow us to transform the image embeddings produced by the pretrained "backbone" into the 8-dimensional output required to learn the species classification task we're tackling here. Prior to redefining it below, model.fc would be the final dense layer connecting the 2048-dimension embedding to a 1000-dimension output (corresponding to the 1000 ImageNet classes that the pretrained model was trained on). We will instead prepare the model for the current task by redefining model.fc to produce an 8-dimensional output corresponding to our 8 species classes (including blanks).
We'll also add a couple more layers in between. The ReLU layer introduces non-linearity into the model head, in effect activating important features and suppressing noise. And the Dropout layer is a commonly used regularization component that randomly drops some nodes from the previous layer's outputs (10% of nodes in this case) during each training step, mitigating our risk of overfitting.
from torch import nn
import torchvision.models as models
model = models.resnet50(pretrained=True)
model.fc = nn.Sequential(
    nn.Linear(2048, 100),  # dense layer takes a 2048-dim input and outputs 100-dim
    nn.ReLU(inplace=True),  # ReLU activation introduces non-linearity
    nn.Dropout(0.1),  # common technique to mitigate overfitting
    nn.Linear(100, 8),  # final dense layer outputs 8-dim corresponding to our target classes
)
Define our loss function¶
Cross entropy loss (or log loss) is a commonly used loss function for multi-class (not multi-label) image classification. We'll use this to compute loss for each training batch and then update our parameters accordingly.
import torch.optim as optim
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
Train the model¶
We're now ready to train our model!
We'll start simple and just run it for one epoch, but feel free to run it for more num_epochs if you've got the time. We hope to see a decreasing loss as training progresses, which will provide some evidence that the model is learning. Note that we haven't frozen any weights in the pretrained model, a choice which you may want to revisit and which we discuss in a little more detail below.
For each epoch we'll iterate through the batches, and for each batch we'll do the following:
- Zero out the gradients. PyTorch will sum the gradients from past batches when doing its backward pass, so in order to make sure we are only using the gradients computed for the current batch, we zero out the gradients at the beginning of each batch.
- Run the forward pass.
- Compute the loss and track it.
- Compute our gradients and update our weight parameters.
num_epochs = 1

tracking_loss = {}

for epoch in range(1, num_epochs + 1):
    print(f"Starting epoch {epoch}")

    # iterate through the dataloader batches. tqdm keeps track of progress.
    for batch_n, batch in tqdm(
        enumerate(train_dataloader), total=len(train_dataloader)
    ):
        # 1) zero out the parameter gradients so that gradients from previous batches are not used in this step
        optimizer.zero_grad()

        # 2) run the forward step on this batch of images
        outputs = model(batch["image"])

        # 3) compute the loss
        loss = criterion(outputs, batch["label"])
        # let's keep track of the loss by epoch and batch
        tracking_loss[(epoch, batch_n)] = float(loss)

        # 4) compute our gradients
        loss.backward()
        # update our weights
        optimizer.step()
Now let's plot the loss by epoch and batch. The x-axis here is a tuple of (epoch, batch).
tracking_loss = pd.Series(tracking_loss)
plt.figure(figsize=(10, 5))
tracking_loss.plot(alpha=0.2, label="loss")
tracking_loss.rolling(center=True, min_periods=1, window=10).mean().plot(
    label="loss (moving avg)"
)
plt.xlabel("(Epoch, Batch)")
plt.ylabel("Loss")
plt.legend(loc=0)
Good news, the loss is going down! This is an encouraging start, especially since we haven't done anything fancy yet.
Save the model¶
We have the model loaded in memory already, so we don't strictly need to save it, but doing so lets us reuse the model later without retraining.
Here's how:
torch.save(model, "model.pth")
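As an aside, torch.save(model, ...) pickles the entire model object. A common alternative is to save just the weights via the model's state_dict, which tends to be more robust if your code changes later. Here's a minimal sketch; the reloading lines are commented out because they assume you rebuild the same architecture first.
# alternative: save only the model weights
torch.save(model.state_dict(), "model_weights.pth")

# to reload later, rebuild the same architecture and then load the weights:
# reloaded = models.resnet50(pretrained=True)
# reloaded.fc = nn.Sequential(
#     nn.Linear(2048, 100), nn.ReLU(inplace=True), nn.Dropout(0.1), nn.Linear(100, 8)
# )
# reloaded.load_state_dict(torch.load("model_weights.pth"))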
7. Evaluation¶
So far, not so bad. We've shown an improvement in the loss on the training set, but that tells us little about how our model will do on new data. Let's reload our saved model and try generating some predictions on the evaluation split we created earlier.
loaded_model = torch.load("model.pth")
We create the eval dataset and dataloader just like we did earlier with the training dataset and dataloader:
eval_dataset = ImagesDataset(x_eval, y_eval)
eval_dataloader = DataLoader(eval_dataset, batch_size=32)
Make predictions¶
We'll iterate through the eval dataloader in batches, just like we did for training, but this time we aren't going to need to compute gradients or update weights. For each batch, we'll do the following:
- Run the forward pass to get the model output, or logits
- Apply a softmax function to convert the logits into probability space with range [0, 1]. During training, the softmax operation was handled internally by nn.CrossEntropyLoss. We aren't computing the loss now because we are just doing evaluation, but we still want the predictions to be in the range [0, 1].
- Store the results in a dataframe for further analysis
preds_collector = []

# put the model in eval mode so we don't update any parameters
model.eval()

# we aren't updating our weights so no need to calculate gradients
with torch.no_grad():
    for batch in tqdm(eval_dataloader, total=len(eval_dataloader)):
        # 1) run the forward step
        logits = model.forward(batch["image"])
        # 2) apply softmax so that model outputs are in range [0,1]
        preds = nn.functional.softmax(logits, dim=1)
        # 3) store this batch's predictions in df
        # note that PyTorch Tensors need to first be detached from their computational graph before converting to numpy arrays
        preds_df = pd.DataFrame(
            preds.detach().numpy(),
            index=batch["image_id"],
            columns=species_labels,
        )
        preds_collector.append(preds_df)
eval_preds_df = pd.concat(preds_collector)
eval_preds_df
Predicted labels distribution¶
First let's review the species distribution we saw in the training set.
print("True labels (training):")
y_train.idxmax(axis=1).value_counts()
Here's the distribution of our predictions on the eval set.
print("Predicted labels (eval):")
eval_preds_df.idxmax(axis=1).value_counts()
The actual evaluation set is more evenly distributed than our predictions, so we already know there is some room for improvement here.
print("True labels (eval):")
y_eval.idxmax(axis=1).value_counts()
Accuracy¶
Now let's compute how accurate our model is and compare that against some trivial baseline models. First let's get the labels with the highest score for each image.
eval_predictions = eval_preds_df.idxmax(axis=1)
eval_predictions.head()
Random guessing across 8 classes would yield an accuracy of 12.5% (1/8). But we could construct a slightly better trivial model by always guessing the most common class, which in this case is monkey_prosimian. Always guessing monkey_prosimian would achieve an accuracy of about 15.8%.
eval_true = y_eval.idxmax(axis=1)
(eval_true == "monkey_prosimian").sum() / len(eval_predictions)
Let's see how our model compares. We take the species with the highest score for each image (eval_predictions) and compare that to the true labels.
correct = (eval_predictions == eval_true).sum()
accuracy = correct / len(eval_predictions)
accuracy
Our accuracy on the evaluation set is about 50%, which is not a bad start for a very simple first pass and one epoch of training.
Let's look at the predictions from another angle.
We can see from the confusion matrix below that our model does reasonably well on some species, but we have plenty of room for improvement on antelopes, birds, hogs and blanks.
from sklearn.metrics import ConfusionMatrixDisplay
fig, ax = plt.subplots(figsize=(10, 10))
cm = ConfusionMatrixDisplay.from_predictions(
    y_eval.idxmax(axis=1),
    eval_preds_df.idxmax(axis=1),
    ax=ax,
    xticks_rotation=90,
    colorbar=True,
)
That's where you come in! What can you do to improve on this benchmark?
Here are some ideas you might want to try:
- Train on the full training dataset. We've only used 50% of the training data so far.
- Train for more epochs. We've only done 1 so far.
- Try another pretrained model. For example, you may have more success with EfficientNet, or another ResNet model with more layers like ResNet152. See what's available from pytorch here. You may also want to review which models are or have been state of the art for image classification tasks, for example on paperswithcode.com. Keep in mind that different models will require different input and output dimensions, so you'll need to update how you construct model above.
- Experiment with different loss functions.
- Experiment with different learning rates or learning rate schedulers.
- Add more layers to the model head (model.fc).
- You may also want to consider freezing the weights in the backbone model and only training the head (model.fc); see the sketch after this list for one way to set this up. If this results in higher accuracy, that suggests the current approach may be overwriting the backbone weights in a problematic way. One approach here would be to train just the model head, and then unfreeze the backbone but train at a lower learning rate.
- Training will be much faster using GPUs, but you will need to make some small tweaks to the code (also covered in the sketch after this list).
- As you become more comfortable iterating through different versions of the model, you may want to try out PyTorch Lightning or Lightning Flash, which build upon PyTorch and eliminate a lot of boilerplate code, in addition to providing a more complete research framework for deep learning problems.
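To make the freezing and GPU ideas above a bit more concrete, here is a minimal, untested sketch of how you might freeze the backbone and move training onto a GPU. It reuses model, optim, and criterion from above; anything beyond that is an assumption to adapt to your own setup.
# freeze every parameter in the pretrained backbone...
for param in model.parameters():
    param.requires_grad = False
# ...then unfreeze just the new head so only model.fc is trained
for param in model.fc.parameters():
    param.requires_grad = True

# only hand the trainable parameters to the optimizer
optimizer = optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=0.001, momentum=0.9
)

# train on a GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# inside the training loop, move each batch to the same device, for example:
#     outputs = model(batch["image"].to(device))
#     loss = criterion(outputs, batch["label"].to(device))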
8. Create submission¶
Last but not least, we'll want to participate in the competition and see where we stand on the leaderboard.
To do this we need to create predictions for the competition test set (not the eval set we used above). You don't have labels for these.
We'll create predictions in the same way we did for the eval set, but this time using the test_features we downloaded from the competition website.
test_dataset = ImagesDataset(test_features.filepath.to_frame())
test_dataloader = DataLoader(test_dataset, batch_size=32)
preds_collector = []

# put the model in eval mode so we don't update any parameters
model.eval()

# we aren't updating our weights so no need to calculate gradients
with torch.no_grad():
    for batch in tqdm(test_dataloader, total=len(test_dataloader)):
        # run the forward step
        logits = model.forward(batch["image"])
        # apply softmax so that model outputs are in range [0,1]
        preds = nn.functional.softmax(logits, dim=1)
        # store this batch's predictions in df
        # note that PyTorch Tensors need to first be detached from their computational graph before converting to numpy arrays
        preds_df = pd.DataFrame(
            preds.detach().numpy(),
            index=batch["image_id"],
            columns=species_labels,
        )
        preds_collector.append(preds_df)
submission_df = pd.concat(preds_collector)
submission_df
Let's check a couple of things on submission_df before submitting to the platform. We'll want to make sure our submission's index and column labels match the submission format. (The DrivenData platform will do these data integrity checks as well, but it will be quicker to detect problems this way.)
submission_format = pd.read_csv("submission_format.csv", index_col="id")
assert all(submission_df.index == submission_format.index)
assert all(submission_df.columns == submission_format.columns)
Looks like we're ready to submit! Save the dataframe out to a CSV file and then upload it via the Submissions page on the competition website.
submission_df.to_csv("submission_df.csv")
How did we do? We should get a score of ~1.8, though your results may differ slightly due to non-determinism in model training. (For reference, a randomly generated submission yields a score of something like ~2.4.)
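For a rough sense of scale (assuming the competition metric is mean multiclass log loss), a submission that predicts a uniform 1/8 probability for every class on every image would score ln(8) ≈ 2.08. Our ~1.8 beats that trivial baseline, while a submission of randomly generated probabilities does somewhat worse.
import math

# log loss of always predicting 1/8 for each of the 8 classes
print(-math.log(1 / 8))  # ≈ 2.079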
Now it is up to you to improve on this benchmark!
Head over to the competition for data and more background info, or the competition forum if you have any questions. Good luck!