Bumble bee or honey bee?

We're excited to launch a new comeptition with our partner Metis. The question at hand is: can you identify a bee as a honey bee or a bumble bee?

Our Naïve Bees Classifier

We're excited to launch a new comeptition with our partner Metis. If you're not sure what we're talking about, head to:

The Competition Homepage

The question at hand is: can you identify a bee as a honey bee or a bumble bee? These bees have different behaviors and appearances, but given the variety of backgrounds, positions, and image resolutions it can be a challenge for machines to tell them apart. You can win part of the $5,000 prize pool by competing!

Being able to identify bee species from images is a task that ultimately would allow researchers to more quickly and effectively collect field data. Pollinating bees have critical roles in both ecology and agriculture, and diseases like colony collapse disorder threaten these species.

Here, we'll walk through what a first-pass approach to this question could look like. We'll use relatively simple features and models to show that even without GPU computing and neural nets, we can start to answer this question.

You can view and download this notebook on NBViewer.

This notebook contains three sections:

We'll get started by loading the python modules in our analysis toolkit.

In [1]:
import os
from tqdm import tqdm # smart progress bar


# ================  GRAPHICS
from PIL import Image

import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib import rc_params
 
# get our styles
mpl_default = rc_params()
 
import seaborn as sns
sns.set(rc=mpl_default)
 
%matplotlib inline
plt.rcParams = mpl_default

# ================  NUMBERS AND MATHS AND THINGS
import pandas as pd
import numpy as np

# BEES colormap! Black -> Yellow
CMAP = [(0,0,0),
        (22,0,0),
        (43,0,0),
        (77,14,0),
        (149,68,0),
        (220,123,0),
        (255,165,0),
        (255,192,0),
        (255,220,0),
        (255,235,0),
        (255,245,0),
        (255,255,0)]


bees_cm = mpl.colors.ListedColormap(np.array(CMAP)/255.)

Load the data

The first step is to look at the files we downloaded from the competition webpage and load the training labels and submission format. The data for this competition is very simple. It's a set of training and test images, and then a set of labels for the training images. The submission format has a placeholder for your predictions. As noted on the problem description page, a prediction of 1.0 means you are sure that the bee pictured is a Bombus. A prediction of 0.0 means that you are sure it's an Apis. A prediction of 0.5 means you don't know which it is.

In [2]:
# list the contents of the data directory
!tree --filelimit=10 data/
data/
├── SubmissionFormat.csv
├── images
│   ├── test [992 entries exceeds filelimit, not opening dir]
│   └── train [3969 entries exceeds filelimit, not opening dir]
└── train_labels.csv

3 directories, 2 files
In [3]:
# load the labels using pandas
labels = pd.read_csv("data/train_labels.csv",
                     index_col=0)

submission_format = pd.read_csv("data/SubmissionFormat.csv",
                                index_col=0)

print "Number of training examples is: ", labels.shape[0]
print "Predictions should be type:", labels.dtypes[0]
labels.head()
Number of training examples is:  3969
Predictions should be type: float64
Out[3]:
genus
id
520 1
3800 1
3289 1
2695 1
4922 1
In [4]:
# let's see what's in one of the image folders
!ls data/images/train/ | head -n 10
1.jpg
100.jpg
1000.jpg
1002.jpg
1003.jpg
1004.jpg
1007.jpg
1008.jpg
1009.jpg
1010.jpg

We can see that the image file names match the id in labels and submission_format. This is how we match an image to whether the bee is an Apis or Bombus.

Ok, finally, let's look at some bees. The PIL library (we recommend the Pillow fork) is a great way to interact with images in Python. We'll use PIL for accessing images, but do most of our processing by converting the images to Numpy arrays. Also, note that all of the images are the same size (200x200 px), which we've done to make your lives easier.

In [7]:
def get_image(row_or_str, root="data/images/"):
    # if we have an instance from the data frame, pull out the id
    # otherwise, it is a string for the image id
    if isinstance(row_or_str, pd.core.series.Series):
        row_id = row_or_str.name
    else:
        row_id = row_or_str
    
    filename = "{}.jpg".format(row_id)
    
    # test both of these so we don't have to specify. Image should be
    # in one of these two. If not, we let Image.open raise an exception.
    train_path = os.path.join(root, "train", filename)
    test_path = os.path.join(root, "test", filename)
    
    file_path = train_path if os.path.exists(train_path) else test_path
    
    return np.array(Image.open(file_path), dtype=np.int32)
        

# make sure it works for strings and series
assert (get_image(labels.iloc[0]) == get_image("520")).all()

# confirm images are the same shape
print "Size of image 520:  ", get_image("520").shape
print "Size of image 1001: ", get_image("1001").shape
Size of image 520:   (200, 200, 3)
Size of image 1001:  (200, 200, 3)
In [8]:
# imshow expects floats between 0-1, but we have
# integers 0-255 from PIL so we need to rescale to view the image
plt.imshow((get_image("520") / 255.))
Out[8]:
<matplotlib.image.AxesImage at 0x10c5d9950>

Build features from the image

Now we need to turn these images into something that a machine learning algorithm can understand. We've got a matrix of pixel values, but those don't contain enough interesting information on their own for most algorithms. We need to help the algorithms along by picking out some of the salient features for them.

We'll be building three kinds of features:

First, we'll start by extracting the color information since both HOG and DAISY work on grayscale images.

In [22]:
def extract_rgb_info(rgb, ax=None):
    """Extract color statistics as features:
        - pixel values (flattened)
        - X, Y sums per channel
        - percentiles per channel
        - percentile diffs per channel

        Plots if ax is passed
    """
    # toss alpha if it exists
    if rgb.shape[2] == 4:
        rgb = rgb[:, :, :3]

    # start with raw pixel values as features
    features = [rgb.flatten()]

    # add some basic statistics on the color channels (R, G, B)
    for channel in range(3):
        this_channel = rgb[:, :, channel].astype(np.float)
        sums = np.hstack([this_channel.sum(),
                          this_channel.sum(axis=0),
                          this_channel.sum(axis=1)])

        # percentiles
        ps = [1, 3, 5, 10, 50, 90, 95, 97, 99]
        percentiles = np.array(np.percentile(this_channel, ps))
        diff = percentiles[-4:] - percentiles[:4]
        
        # plot if we have been passed an axis
        if ax is not None:
            channel_name = ['r', 'g', 'b'][channel]       
            sns.kdeplot(this_channel.flatten(),
                        ax=ax,
                        label=channel_name,
                        color=channel_name)
            ax.set_title("Color channels")

        
        # store the features for this channel
        features += [sums, percentiles, diff]
        
    # return all the color features as a flat array
    return np.hstack(features).flatten()

extract_rgb_info(get_image("1974"), ax=plt.gca())
Out[22]:
array([ 201.,  194.,  127., ...,  179.,  177.,  171.])

Now we can build a big matrix that includes the raw pixel values, our extracted color information, and the HOG and DAISY feature descriptors. Lucky for us, there are great implementations of HOG and DAISY in the skimage package. They're also available as part of the most robust free computer vision library, OpenCV. OpenCV also has some very good Python tutorials, if you want to get started. However, the setup is just a bit more arduous than pip install scikit-image.

We'll create a function, preprocess, that will load an image and generate a flat array for all of the features we care about. Also, completely for free, I've included a demo paramater that will plot a pretty image of all of the features that we're constructing.

In [23]:
from skimage.feature import daisy
from skimage.feature import hog
from skimage.color import rgb2gray
from skimage.exposure import equalize_hist

def preprocess(img, demo=False):
    """ Turn raw pixel values into features.
    """
    
    def _demo_plot(img, stage="", is_ints=False, axes_idx=0):
        """ Utility to visualize the features we're building
        """
        if demo:
            axes[axes_idx].imshow(img / 255. if is_ints else img,
                                  cmap=bees_cm)
            axes[axes_idx].set_title(stage)
        return axes_idx + 1

    if demo:
        fig, axes = plt.subplots(3, 2, figsize=(15, 20))
        axes = axes.flatten()
    
    # track which subplot we're plotting to
    axes_idx = 0
    axes_idx = _demo_plot(img, stage="Raw Image", is_ints=True, axes_idx=axes_idx)
        
    # FEATURE 1: Raw image and color data    
    if demo:
        color_info = extract_rgb_info(img, ax=axes[axes_idx])
        axes_idx += 1
    else:
        color_info = extract_rgb_info(img)
    
    # remove color information (hog and daisy only work on grayscale)
    gray = rgb2gray(img)
    axes_idx = _demo_plot(gray, stage="Convert to grayscale", axes_idx=axes_idx)
    
    # equalize the image
    gray = equalize_hist(gray)
    axes_idx = _demo_plot(gray, stage="Equalized histogram", axes_idx=axes_idx)
    
    # FEATURE 2: histogram of oriented gradients features
    hog_features = hog(gray,
                       orientations=12,
                       pixels_per_cell=(8, 8),
                       cells_per_block=(1, 1),
                       visualise=demo)
    
    # if demo, we actually got a tuple back; unpack it and plot
    if demo:
        hog_features, hog_image = hog_features
        axes_idx = _demo_plot(hog_image, stage="HOG features", axes_idx=axes_idx)
        
    # FEATURE 3: DAISY features - sparser for demo so can be visualized
    params = {'step': 25, 'radius': 25, 'rings': 3} if demo \
             else {'step': 10, 'radius': 15, 'rings': 4}
    daisy_features = daisy(gray,
                           histograms=4,
                           orientations=8,
                           normalization='l1',
                           visualize=demo,
                           **params)
        
    if demo:
        daisy_features, daisy_image = daisy_features
        axes_idx = _demo_plot(daisy_image, stage="DAISY features", axes_idx=axes_idx)
    
    # return a flat array of the raw, hog and daisy features
    return np.hstack([color_info, hog_features, daisy_features.flatten()])


preprocess_result = preprocess(get_image("520"), demo=True)

Ooo! It's so pretty! But, now it's time to really get down to business. We need to turn all of our thousands of images into a BIG DATA MATRIX (Ok, not really big data. But, it is a big-ish matrix).

Also, now is a good chance to give a shout-out to the best tiny Python package that I have found in a while, tqdm. Stop writing counters with modulos to track long-running for-loops. Just wrap the iterator after the in in a call to tqdm(), and you magically get a progress bar. It's the perfect way to know if you have time to grab coffee.

In [24]:
def create_feature_matrix(label_dataframe):
    n_imgs = label_dataframe.shape[0]
    
    # initialized after first call to 
    feature_matrix = None
    
    for i, img_id in tqdm(enumerate(label_dataframe.index)):
        features = preprocess(get_image(img_id))
        
        # initialize the results matrix if we need to
        # this is so n_features can change as preprocess changes
        if feature_matrix is None:
            n_features = features.shape[0]
            feature_matrix = np.zeros((n_imgs, n_features), dtype=np.float32)
            
        if not features.shape[0] == n_features:
            print "Error on image {}".format(img_id)
            features = features[:n_features]
        
        feature_matrix[i, :] = features
        
    return feature_matrix

# turn those images into features!
bees_features = create_feature_matrix(labels)

# save in case we need to load later; creating this takes
# a little while and it ends up being ~3GBs
np.save("bees_features.npy", bees_features)
In [11]:
bees_features = np.load("bees_features.npy")

Our features aren't quite done yet. Many machine learning methods are built to work best with data that has a mean of 0 and unit variance. Luckily, scikit-learn provides a simple way to rescale your data to work well. They've got a more thorough explanation of why that is the linked docs.

In [34]:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
bees_stand = ss.fit_transform(bees_features)
bees_stand.shape
Out[34]:
(3969, 168046)

Modeling using SVMs

Ok, now let's build our benchmark model. We partly chose SVMs because they are a popular method that we haven't covered before, and we partly chose them because it's easy to test both linear and non-linear classifiers with an SVM. Given that we've generated a huge number of features and (somewhat arbitrarily) decided to use an SVM, it may make you sad to see in the sklearn docs:

If the number of features is much greater than the number of samples, the method is likely to give poor performances.

So, let's reduce the number of features that we have. One way to select features, is to perform PCA. We'll aim to have an approximately square feature matrix by keeping 4,000 components.

In [17]:
from sklearn.decomposition import PCA

pca = PCA(n_components=4e3)
bees_pca = pca.fit_transform(bees_stand)

Remember that New York Times article about "data janitors"? It feels pretty real. We've spent 80% of our time cleaning and setting up our problem. And now, finally, we're getting to the model building.

First, we break the data into a local train/test split. Then, we use one of sklearn's killer features, GridSearchCV. The grid search allows us to to search over a set of possible parameters all at once (without manually tweaking and remembering the outcomes). For the SVM, we vary the C and gamma parameters--C controls the cost of misclassifying a particular image; gamma is a kernel parameter that defines the influence of any given point. For all of the possible parameters one could search, check out the SVC docs.

In [36]:
from sklearn import svm

from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV

# split into a local train/test set
X_train, X_test, y_train, y_test = train_test_split(bees_pca,
                                                    labels.genus.values,
                                                    test_size=.2,
                                                    random_state=1234123)

# SVM parameters often set on log scale
params_to_try = {
    'C': [10**i for i in range(0, 2)],
    'gamma': [10**i for i in range(-7, -5)],
}

gs = GridSearchCV(svm.SVC(class_weight='auto', kernel='rbf', probability=True),
                  param_grid=params_to_try,
                  cv=3,
                  scoring='roc_auc',
                  n_jobs=-1)

gs.fit(X_train, y_train)

print "Best parameters:", gs.best_params_
print "Best score:", gs.best_score_
Best parameters: {'C': 1, 'gamma': 1e-06}
Best score: 0.678907848193

Now we'll take the best performing model and make probability predictions on our local holdout set.

In [18]:
preds = gs.best_estimator_.predict_proba(X_test)

We can score and plot our predictions using the competition metric, AUROC, to see how well our best model performs on this local holdout set.

In [20]:
from sklearn.metrics import roc_curve, auc

false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test,
                                                                preds[:, 1],
                                                                pos_label=1)

roc_auc = auc(false_positive_rate, true_positive_rate)

plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate,
         true_positive_rate,
         label='AUC = {:0.2f}'.format(roc_auc))

plt.legend(loc=0)
plt.plot([0,1], [0,1], ls='--')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

Great! So we're getting about a 0.7 AUC from this model. That's a good start, but there's a lot of room to improve. In general, models with an AUC of 0.5 have no predictive power (that's the dotted green line) and models with perfect class separation have an AUC of 1.0.

The last thing we'll do is run the competition test set through our entire pipeline and make our submission on the DrivenData platform.

In [29]:
sf = pd.read_csv("data/SubmissionFormat.csv", index_col=0)

def make_predictions(submission_format_df,output_filepath="my_submission.csv"):
    prediction_df = submission_format_df.copy()
    
    # create features
    test_features = create_feature_matrix(submission_format_df)
    test_features_stand = ss.transform(test_features)   
    test_features_pca = pca.transform(test_features_stand)
    
    # predict with the best estimator from the grid search
    preds = gs.best_estimator_.predict_proba(test_features_pca)
    
    # copy the predictions to the dataframe with the ids and
    # write it out to a file
    prediction_df.genus = preds[:, 1]
    prediction_df.to_csv(output_filepath)
    
    return prediction_df
    
pred_df = make_predictions(sf, output_filepath="raw_hog_daisy_rbf.csv")
pred_df.head()

Out[29]:
genus
id
2783 0.847273
2175 0.875663
4517 0.879757
2831 0.732987
3556 0.828222

Yay, we scored around 70% accurate! We can see that our cross-validation and holdout scores were pretty reliable in this case. Now it's up to you to make the performance even better!

www.drivendata.org

Stay updated

Join our newsletter or follow us for the latest on our social impact projects, data science competitions and open source work.

There was a problem. Please try again.
Subscribe successful!
Protected by reCAPTCHA. The Google Privacy Policy and Terms of Service apply.

Latest posts

All posts

winners

Meet the winners of Phase 2 of the PREPARE Challenge

Learn about how winners detected cognitive decline using speech recordings and social determinants of health survey data

resources

Open-source packages for using speech data in ML

Overview of key open-source packages for extracting features from voice data to support ML applications

tutorial

Getting started with LLMs: a benchmark for the 'What's Up, Docs?' challenge

An introduction to using large language models via the benchmark to a document summarization challenge.

winners

Meet the Winners of the Goodnight Moon, Hello Early Literacy Screening Challenge

Learn about the results and winning methods from the early literacy screening challenge.

resources

Where to find a data job for a good cause

Finding data jobs for good causes can be difficult. Learn strategies, job lists, and tips to find organizations with open positions working on causes you care about.

winners

Meet the Winners of the Youth Mental Health Narratives Challenge

Learn about the winning solutions from the Youth Mental Health Challenge Automated Abstraction and Novel Variables Tracks

winners

Meet the winners of the Forecast and Final Prize Stages of the Water Supply Forecast Rodeo

Learn about the winners and winning solutions from the final stages of the Water Supply Forecast Rodeo.

insights

10 takeaways from 10 years of data science for social good

This year DrivenData celebrates our 10th birthday! We've spent the past decade working to use data science and AI for social good. Here are some lessons we've learned along the way.

tutorial

Goodnight Moon, Hello Early Literacy Screening Benchmark

In this guest post from the MIT Gabrieli Lab, we'll show you how to get started with the literacy screening challenge!

tutorial

Youth Mental Health: Automated Abstraction Benchmark

Learn how to process text narratives using open-source LLMs for the Youth Mental Health: Automated Abstraction challenge

winners

Meet the winners of Phase 1 of the PREPARE Challenge

Learn about the top datasets sourced for Phase 1 of the PREPARE Challenge.

resources

Teaching with DrivenData Competitions

Inspiration and resources for teaching students data science, machine learning, and AI skills with DrivenData competitions.

winners

Meet the winners of the Pose Bowl challenge

Learn about the top solutions submitted for the Pose Bowl: Spacecraft Detection and Pose Estimation Challenge.

winners

Meet the winners of the Water Supply Forecast Rodeo Hindcast Stage

Learn about the winning models for forecasting seasonal water supply from the first stage of the Water Supply Forecast Rodeo.

tools

Cookiecutter Data Science V2

Announcing the V2 release of Cookiecutter Data Science, the most widely adopted data science project template.

resources

How to make data science projects more open and inclusive

Key practices from the field of open science for making data science work more transparent, inclusive, and equitable.

winners

Meet the winners of the Kelp Wanted challenge

Dive into the solutions from the super segmenters who best detected kelp in Landsat imagery!

winners

Meet the winners of the SNOMED CT Entity Linking Challenge

Meet the winners with the best systems for detecting clinical terms in medical notes.

winners

Meet the winners of the Pale Blue Dot challenge

Learn about the top visuals created for the Pale Blue Dot: Visualization Challenge and the solvers behind them.

tutorial

NASA Pose Bowl - Benchmark

An introduction to the NASA Pose Bowl competition, with a benchmark solution for the object detection track

Work with us to build a better world

Learn more about how our team is bringing the transformative power of data science and AI to organizations tackling the world's biggest challenges.