by Michael Schlauch
Welcome to the benchmark notebook for the Where's Whale-do? competition!
If you are just getting started, first check out the competition homepage and problem description.
Where's Whale-do?¶
Cook Inlet beluga whales are at risk of extinction. This beluga population began declining in the 1990s due to overhunting and currently numbers fewer than 300 surviving individuals. As a result, NOAA Fisheries conducts an annual photo-identification survey to more closely monitor and track individual whales. This is where we need your help!
The goal of this $35,000 challenge is to help wildlife researchers accurately match beluga whale individuals from photographic images. Accelerated and scalable photo-identification is critical to population assessment, management, and protection of this endangered whale population.
For this competition, you will be identifying which images in a database contain the same individual beluga whale seen in a query image.
You will be provided with a set of queries, each one specifying a single image of a beluga whale and a corresponding database to search for matches to that same individual. The database will include images of both matching and non-matching belugas. This is a learning-to-rank information retrieval task.
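To make the retrieval setup concrete, here is a minimal, hypothetical sketch of what ranking a database against a single query might look like once each image has been reduced to a feature vector. The embedding size and random vectors below are placeholders for illustration only; they are not part of the competition code.
import numpy as np
# Hypothetical illustration only: rank database images by cosine similarity
# to a query image, assuming each image has already been embedded as a vector.
rng = np.random.default_rng(0)
query_embedding = rng.normal(size=128)            # placeholder feature vector for the query
database_embeddings = rng.normal(size=(10, 128))  # placeholder vectors for 10 database images
def rank_database(query, database):
    # normalize, then score every database vector by cosine similarity to the query
    query = query / np.linalg.norm(query)
    database = database / np.linalg.norm(database, axis=1, keepdims=True)
    scores = database @ query
    return np.argsort(-scores), scores  # database indices, best match first
order, scores = rank_database(query_embedding, database_embeddings)
print(order[:5])  # indices of the five most similar database images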
This notebook covers two main areas:
Section 1. Data exploration: An introduction to the beluga images dataset, including examples of the different image types and visual features to be aware of.
Section 2. Demo submission: A demonstration of how to run the benchmark example and produce a valid code submission.
Section 1: Data exploration¶
Download the data¶
First, download the images and metadata.csv files from the competition website. Save the files in the data directory so that your tree looks like this.
boem-belugas-runtime/          # This repository's root
└── data/                      # Competition data directory
    ├── databases/             # Directory containing the database image IDs for
    │   │                          each scenario
    │   ├── scenario01.csv
    │   └── scenario02.csv
    ├── images/                # Directory containing all the images
    │   ├── train0001.jpg
    │   ├── train0002.jpg
    │   ├── train0003.jpg
    │   └── ...
    ├── queries/               # Directory containing the query image IDs for
    │   │                          each scenario
    │   ├── scenario01.csv
    │   └── scenario02.csv
    ├── metadata.csv           # CSV file with image metadata (image dimensions,
    │                              viewpoint, date)
    └── query_scenarios.csv    # CSV file that lists all test scenarios with paths
If you're working off a clone of this runtime repository, you should already have copies of the databases, queries and query_scenarios.csv files.
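If you want to confirm that everything ended up in the right place, a quick check like the one below can help. It assumes the notebook runs one level below the repository root (the same path convention used in the next cell); the file and directory names come from the tree above.
from pathlib import Path
# Sanity check that the expected competition files and directories exist.
DATA_DIRECTORY = Path.cwd().parent / "data"
expected = ["metadata.csv", "query_scenarios.csv", "databases", "queries", "images"]
for name in expected:
    path = DATA_DIRECTORY / name
    print(f"{name}: {'found' if path.exists() else 'MISSING'}")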
Explore the data¶
First, let's load a couple of the data files we just downloaded. Initially we'll focus on the metadata file.
from pathlib import Path
import pandas as pd
# Project paths (the notebook is assumed to live one level below the repository root)
PROJ_DIRECTORY = Path.cwd().parent
DATA_DIRECTORY = PROJ_DIRECTORY / "data"
SUBM_DIRECTORY = PROJ_DIRECTORY / "submission"
# Image metadata indexed by image ID, and the list of query scenarios
metadata = pd.read_csv(DATA_DIRECTORY / "metadata.csv", index_col="image_id")
query_scenarios = pd.read_csv(DATA_DIRECTORY / "query_scenarios.csv", index_col="scenario_id")
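It's worth peeking at the metadata before plotting anything. The viewpoint and path columns are the ones used by the plotting helper later in this notebook; other columns such as the image dimensions and date should also be present, per the description above.
# Preview the metadata and count images per viewpoint.
print(metadata.shape)
print(metadata.head())
print(metadata.viewpoint.value_counts())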
Look at some sample images¶
Let's begin by looking at some images with our regular old human eyes, before handing things over to the computer.
The function below shows a random sample of images for a given viewpoint (change random_state to get a new set).
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
def display_images(viewpoint="top", random_state=1, metadata=metadata):
    # set plot layout depending on image viewpoint
    nrows, ncols = (1, 5) if viewpoint == "top" else (4, 2)
    fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(15, 8))
    # get a random sample of images for the requested viewpoint
    sample = metadata[metadata.viewpoint == viewpoint].sample(nrows * ncols, random_state=random_state)
    # plot each sampled image in the grid
    for img_path, ax in zip(sample.path, axes.flat):
        img = mpimg.imread(DATA_DIRECTORY / img_path)
        ax.imshow(img)
Let's look at a random sample of "top" images, taken from overhead by drone. Note the differences in color, marks and scarring, as well as which regions of the body are visible. Also note that other factors like the water and lighting conditions can affect the quality of the image.
display_images("top", random_state=1)
display_images("top", random_state=2)