
Where's Whale-Do? Data Exploration and Benchmark


by Michael Schlauch

[Image: baby beluga]

Welcome to the benchmark notebook for the Where's Whale-do? competition!

If you are just getting started, first check out the competition homepage and problem description.

Where's Whale-do?

Cook Inlet beluga whales are at risk of extinction. This beluga population began declining in the 1990s due to overhunting and currently numbers fewer than 300 surviving individuals. As a result, NOAA Fisheries conducts an annual photo-identification survey to more closely monitor and track individual whales. This is where we need your help!

The goal of this $35,000 challenge is to help wildlife researchers accurately match beluga whale individuals from photographic images. Accelerated and scalable photo-identification is critical to population assessment, management, and protection of this endangered whale population.

For this competition, you will be identifying which images in a database contain the same individual beluga whale seen in a query image.

You will be provided with a set of queries, each one specifying a single image of a beluga whale and a corresponding database to search for matches to that same individual. The database will include images of both matching and non-matching belugas. This is a learning-to-rank information retrieval task.

This notebook covers two main areas:

  • Section 1. Data exploration: An introduction to the beluga images dataset, including examples of the different image types and visual features to be aware of.

  • Section 2. Demo submission: A demonstration of how to run the benchmark example and produce a valid code submission.

Section 1: Data exploration

Download the data

First, download the images and metadata.csv files from the competition website.

Save the files in the data directory so that your tree looks like this.

boem-belugas-runtime/             # This repository's root
└── data/                         # Competition data directory
    ├── databases/                # Directory containing the database image IDs for 
    │      │                          each scenario
    │      ├── scenario01.csv
    │      └── scenario02.csv
    ├── images/                   # Directory containing all the images
    │      ├── train0001.jpg
    │      ├── train0002.jpg
    │      ├── train0003.jpg
    │      └── ...
    ├── queries/                  # Directory containing the query image IDs for 
    │      │                          each scenario
    │      ├── scenario01.csv
    │      └── scenario02.csv
    ├── metadata.csv              # CSV file with image metadata (image dimensions, 
    │                                 viewpoint, date)
    └── query_scenarios.csv       # CSV file that lists all test scenarios with paths

If you're working from a clone of this runtime repository, you should already have copies of the databases, queries, and query_scenarios.csv files.
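Before going further, it can be worth confirming that everything landed in the right place. The helper below is a minimal sketch (not part of the competition code) that checks the tree shown above and reports anything missing; `check_data_layout` is a hypothetical name introduced here for illustration.

```python
from pathlib import Path

def check_data_layout(data_dir):
    """Return the expected competition files/directories missing under data_dir."""
    expected = [
        "databases/scenario01.csv",
        "databases/scenario02.csv",
        "images",
        "queries/scenario01.csv",
        "queries/scenario02.csv",
        "metadata.csv",
        "query_scenarios.csv",
    ]
    # keep only the relative paths that do not exist yet
    return [rel for rel in expected if not (Path(data_dir) / rel).exists()]
```

An empty list from `check_data_layout("data")` means the layout matches the tree above.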

Explore the data

First, let's load a couple of the data files we just downloaded. Initially we'll focus on the metadata file.

In [1]:
from pathlib import Path

import pandas as pd

PROJ_DIRECTORY = Path.cwd().parent
DATA_DIRECTORY = PROJ_DIRECTORY / "data"
SUBM_DIRECTORY = PROJ_DIRECTORY / "submission"

metadata = pd.read_csv(DATA_DIRECTORY / "metadata.csv", index_col="image_id")
query_scenarios = pd.read_csv(DATA_DIRECTORY / "query_scenarios.csv", index_col="scenario_id")

Look at some sample images

Let's begin by looking at some images with our regular old human eyes, before handing things over to the computer.

The function below shows a random sample of images (change random_state to get a new set) for a given viewpoint.

In [2]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

def display_images(viewpoint="top", random_state=1, metadata=metadata):
    # set plot layout depending on image viewpoint
    nrows, ncols = (1, 5) if viewpoint == "top" else (4, 2)
    fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(15,8))
    # get a random sample of images
    sample = metadata[metadata.viewpoint == viewpoint].sample(nrows*ncols, random_state=random_state)
    # plot in grid
    for img_path, ax in zip(sample.path, axes.flat):
        img = mpimg.imread(DATA_DIRECTORY / img_path)
        ax.imshow(img)    

Let's look at a random sample of "top" images, taken from overhead by drone. Note the differences in color, marks and scarring, as well as which regions of the body are visible. Also note that other factors like the water and lighting conditions can affect the quality of the image.

In [3]:
display_images("top", random_state=1)
In [4]:
display_images("top", random_state=2)