Benchmark for Pover-T Test - Predicting Poverty

Measuring poverty is hard, but The World Bank plans to end extreme poverty by 2030 – and they need your help! Poverty levels are often estimated at the country level by extrapolating the results of surveys taken on a subset of the population at the household or individual level.

The surveys are incredibly informative, but they are also incredibly long. A typical poverty survey has hundreds of questions, ranging from region-specific questions to questions about the last time a participant bought bread. In order to track progress towards its goal, The World Bank needs the most efficient survey possible. That's where you come in.

For more information, explore this recent World Bank report on ending the cycle of poverty.

In our brand new competition, we're asking you to predict poverty at the household level by building a great classification model. The strongest poverty predictors could be used by statisticians at The World Bank to design new, shorter, equally informative surveys. With these improvements, The World Bank can more easily track progress towards their ambitious and inspiring goal.

In this post, we'll walk through a very simple first pass model for poverty prediction from survey data, showing you how to load the data, make some predictions, and then submit those predictions to the competition.

To get started, we summon the tools of the trade.

In [1]:
%matplotlib inline

import os

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

# data directory
DATA_DIR = os.path.join('..', 'data', 'processed')

Loading the Data

Check out this research focused on children and the effort to end extreme poverty.

On the data download page, we provide a couple of datasets to get started:

  • Household-level survey data: This is obfuscated data from surveys conducted by The World Bank, focusing on household-level statistics. The data come from three different countries, and are separated into different files for convenience.
  • Individual-level survey data: This is obfuscated data from related surveys conducted by The World Bank, only these focus on individual-level statistics. The set of interviewees and countries involved are the same as the household data, as indicated by shared id indices, but this data includes detailed (obfuscated) information about household members.
  • Submission format: This gives us the filenames and columns of our submission prediction, filled with all 0.5 as a baseline.

In classic benchmark fashion, we're going to keep this analysis short, sweet, and simple. As such, we're not going to use any of the individual-level data that's included with the competition. Even without individual-level data, we have a lot of files to deal with. It's probably worth it to store our paths in an easily-accessible dictionary:

In [2]:
data_paths = {'A': {'train': os.path.join(DATA_DIR, 'A', 'A_hhold_train.csv'), 
                    'test':  os.path.join(DATA_DIR, 'A', 'A_hhold_test.csv')}, 
              'B': {'train': os.path.join(DATA_DIR, 'B', 'B_hhold_train.csv'), 
                    'test':  os.path.join(DATA_DIR, 'B', 'B_hhold_test.csv')}, 
              'C': {'train': os.path.join(DATA_DIR, 'C', 'C_hhold_train.csv'), 
                    'test':  os.path.join(DATA_DIR, 'C', 'C_hhold_test.csv')}}
In [3]:
# load training data
a_train = pd.read_csv(data_paths['A']['train'], index_col='id')
b_train = pd.read_csv(data_paths['B']['train'], index_col='id')
c_train = pd.read_csv(data_paths['C']['train'], index_col='id')

As usual, let's take a quick look at the head.

In [4]:
a_train.head()
wBXbHZmp SlDKnCuu KAJOWiiw DsKacCdL rtPrBBPl tMJrvvut jdetlNNF maLAYXwi vwpsXRGk sArDRIyX ... sDGibZrP CsGvKKBJ OLpGAaEu LrDrWRjC JCDeZBXq HGPWuGlV GDUPaBQs WuwrCsIY AlDbXTlZ country
46107 JhtDR GUusz TuovO ZYabk feupP PHMVg NDTCU cLAGr XAmOF MwLvg ... JqHnW MaXfS etZsD idRwx LPtkN vkbkA qQxrL AITFl aQeIm A
82739 JhtDR GUusz TuovO ZYabk feupP PHMVg NDTCU sehIp lwCkE MwLvg ... JqHnW MaXfS HxnJy idRwx UyAms vkbkA qQxrL AITFl cecIq A
9646 JhtDR GUusz BIZns ZYabk uxuSS PHMVg NDTCU sehIp qNABl MwLvg ... JqHnW MaXfS USRak idRwx UyAms vkbkA qQxrL AITFl cecIq A
10975 JhtDR GUusz TuovO ZYabk feupP PHMVg NDTCU sehIp sPNOc MwLvg ... JqHnW MaXfS USRak idRwx UyAms vkbkA qQxrL AITFl cecIq A
16463 JhtDR alLXR TuovO ZYabk feupP PHMVg NDTCU cLAGr NdlDR MwLvg ... JqHnW MaXfS etZsD idRwx UyAms vkbkA qQxrL GAZGl aQeIm A

5 rows × 345 columns

In [5]:
b_train.head()
RzaXNcgd LfWEhutI jXOqJdNL wJthinfa PTLgvdlQ ZvEApWrk euTESpHe bDVMMSYY aSzMhjgD ZehDbxxy ... YVMuyCUV AZVtosGB toZzckhe BkiXyuSp ggucvVUs VMvwrYds VlNidRNP rljjAmaN ChbSWYhO country
57071 zTghO pYfmQ lNhMv 42 RQnVj 103 jpSeC FDqwJ rxJJI IbWRL ... nZcTi pdvWY LLuZj qpzpO kZRgh VwGOP DScEf SKBnS Enull B
18973 zTghO pYfmQ lNhMv 34 iuxWN -2 OLVWN FDqwJ ufugi IbWRL ... nZcTi XrijK LLuZj qpzpO kZRgh VwGOP JOdCB SKBnS Enull B
20151 zTghO pYfmQ lNhMv 34 iuxWN 313 OMRWa FDqwJ rxJJI IbWRL ... nZcTi FEjSW lmjln qpzpO kZRgh VwGOP JOdCB SKBnS Enull B
5730 zTghO pYfmQ lNhMv 58 iuxWN 138 jpSeC FDqwJ rxJJI IbWRL ... nZcTi XrijK lmjln ZZzXr kZRgh VwGOP ZwQQe SKBnS Enull B
35033 zTghO pYfmQ lNhMv 122 iuxWN 68 OLVWN FDqwJ rxJJI IbWRL ... nZcTi CRHYU lmjln qpzpO kZRgh VwGOP WFgZH SKBnS Enull B

5 rows × 442 columns

In [6]:
c_train.head()
GRGAYimk DNnBfiSI cNDTCUPU GvTJUYOo vmKoAlVH LhUIIEHQ DTNyjXJp PNAiwXUz ABnhybHK yiuxBjHP ... AJHrHUkH PaEKIlvv bFEsoTgJ ihACfisf obIQUcpS lAvdypjD ARWytYMz eqJPmiPb mmoCpqWS country
57211 RslOh SuNUt gJLrc EPKkJ qKiiE 7 XuMYE -5 QqETe umyco ... laFxs kBQRJ qcUVH AmPtx YXwVA jSoky NwjRA wnPqZ 52 C
62519 jPUAt boDkI gJLrc EPKkJ YXkKd 7 XuMYE 331 sEJgr yYwlq ... laFxs kBQRJ eusFW AmPtx LSPRW jSoky NwjRA wnPqZ 100 C
11614 OpTiw boDkI vURog EPKkJ qKiiE 9 XuMYE -1 sEJgr umyco ... laFxs oUXSJ eusFW AmPtx YXwVA jSoky NwjRA wnPqZ 70 C
6470 RslOh VgxgY gJLrc EPKkJ YXkKd 9 zfhKi -5 sEJgr umyco ... laFxs kBQRJ jqrBN AmPtx YXwVA jSoky NwjRA wnPqZ 10 C
33558 IXFlv VgxgY kPTaD EPKkJ YXkKd 9 XuMYE 23 sEJgr umyco ... laFxs kBQRJ eusFW AmPtx LSPRW jSoky herus wnPqZ -5 C

5 rows × 164 columns

The first thing to notice is that each country's surveys have wildly different numbers of columns, so we'll plan on training separate models for each country and combining our predictions for submission at the end.

Poverty Distributions

Let's take a look at the class distributions for each country. In classification tasks, it's crucial to know the balance of class labels!

In [7]:
a_train.poor.value_counts().plot.bar(title='Number of Poor for country A')
<matplotlib.axes._subplots.AxesSubplot at 0x116015fd0>
In [8]:
b_train.poor.value_counts().plot.bar(title='Number of Poor for country B')
<matplotlib.axes._subplots.AxesSubplot at 0x1175d5e48>
In [9]:
c_train.poor.value_counts().plot.bar(title='Number of Poor for country C')
<matplotlib.axes._subplots.AxesSubplot at 0x118300128>

Country A is well-balanced, but countries B and C are quite unbalanced. This could definitely impact the confidence of our predictor. But solving that problem is up to you – it's outside the scope of this humble benchmark.
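
To put a number on the imbalance, it can help to look at class proportions rather than raw counts. The snippet below is a minimal sketch on a made-up label column (the `toy` frame is an illustration, not the competition data); the same calls should work on `a_train.poor`, `b_train.poor`, and `c_train.poor`.

```python
import pandas as pd

# Toy stand-in for a country's `poor` column (made-up values, not competition data)
toy = pd.DataFrame({"poor": [False] * 9 + [True] * 1})

# normalize=True returns fractions of the total instead of counts
proportions = toy.poor.value_counts(normalize=True)
print(proportions)

# The mean of a boolean column is the share of True values
poor_rate = toy.poor.mean()
print(f"Share of poor households: {poor_rate:.0%}")
```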

We expect most of the data types here to be the dreaded object type, but let's make sure.

In [10]:
a_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8203 entries, 46107 to 39832
Columns: 345 entries, wBXbHZmp to country
dtypes: bool(1), float64(2), int64(2), object(340)
memory usage: 21.6+ MB
In [11]:
b_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3255 entries, 57071 to 4923
Columns: 442 entries, RzaXNcgd to country
dtypes: bool(1), float64(9), int64(14), object(418)
memory usage: 11.0+ MB
In [12]:
c_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6469 entries, 57211 to 7646
Columns: 164 entries, GRGAYimk to country
dtypes: bool(1), float64(1), int64(29), object(133)
memory usage: 8.1+ MB

Sure enough, the bool types are our labels--the poor column--then there are a few numeric types with the rest being object. We'll need to convert the object columns to categorical variables before training anything.

Pre-process the Data

We're going to do some simple pre-processing here. Standardizing the data and converting the object types to categoricals should get us pretty far. Let's write a couple of simple functions to help this effort.

In [13]:
# Standardize features
def standardize(df, numeric_only=True):
    numeric = df.select_dtypes(include=['int64', 'float64'])
    # subtract mean and divide by std
    df[numeric.columns] = (numeric - numeric.mean()) / numeric.std()
    return df

def pre_process_data(df, enforce_cols=None):
    print("Input shape:\t{}".format(df.shape))

    df = standardize(df)
    print("After standardization {}".format(df.shape))
    # create dummy variables for categoricals
    df = pd.get_dummies(df)
    print("After converting categoricals:\t{}".format(df.shape))

    # match test set and training set columns
    if enforce_cols is not None:
        to_drop = np.setdiff1d(df.columns, enforce_cols)
        to_add = np.setdiff1d(enforce_cols, df.columns)

        df.drop(to_drop, axis=1, inplace=True)
        df = df.assign(**{c: 0 for c in to_add})
    df.fillna(0, inplace=True)
    return df
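
To make the two steps concrete, here is a small sketch on a made-up frame with one numeric and one object column (the column names `rooms` and `roof` are invented for illustration). It applies the same standardization formula and the same `pd.get_dummies` call as the functions above.

```python
import pandas as pd

# Toy survey with one numeric and one categorical (object) column; values are made up
toy = pd.DataFrame({"rooms": [1.0, 2.0, 3.0], "roof": ["tin", "tile", "tin"]})

# Standardize numeric columns: subtract the mean, divide by the sample std
numeric = toy.select_dtypes(include=["int64", "float64"])
toy[numeric.columns] = (numeric - numeric.mean()) / numeric.std()

# One-hot encode the object column: one indicator column per category
encoded = pd.get_dummies(toy)
print(encoded)
```

Each distinct category value becomes its own indicator column, which is why the column counts grow so much after conversion.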

Time to convert these surveys!

In [14]:
print("Country A")
aX_train = pre_process_data(a_train.drop('poor', axis=1))
ay_train = np.ravel(a_train.poor)

print("\nCountry B")
bX_train = pre_process_data(b_train.drop('poor', axis=1))
by_train = np.ravel(b_train.poor)

print("\nCountry C")
cX_train = pre_process_data(c_train.drop('poor', axis=1))
cy_train = np.ravel(c_train.poor)
Country A
Input shape:	(8203, 344)
After standardization (8203, 344)
After converting categoricals:	(8203, 859)

Country B
Input shape:	(3255, 441)
After standardization (3255, 441)
After converting categoricals:	(3255, 1432)

Country C
Input shape:	(6469, 163)
After standardization (6469, 163)
After converting categoricals:	(6469, 795)

The data is probably looking pretty different now. Let's take a peek at country A.

In [15]:
aX_train.head()
nEsgxvAq OMtioXZZ YFMZwKrU TiwRslOh wBXbHZmp_DkQlr wBXbHZmp_JhtDR SlDKnCuu_GUusz SlDKnCuu_alLXR KAJOWiiw_BIZns KAJOWiiw_TuovO ... JCDeZBXq_UyAms HGPWuGlV_WKNwg HGPWuGlV_vkbkA GDUPaBQs_qCEuA GDUPaBQs_qQxrL WuwrCsIY_AITFl WuwrCsIY_GAZGl AlDbXTlZ_aQeIm AlDbXTlZ_cecIq country_A
46107 -1.447160 0.325746 1.099716 -0.628045 0 1 1 0 0 1 ... 0 0 1 0 1 1 0 1 0 1
82739 -0.414625 -0.503468 -0.016050 0.713467 0 1 1 0 0 1 ... 1 0 1 0 1 1 0 0 1 1
9646 0.617910 -0.503468 -0.016050 -0.628045 0 1 1 0 1 0 ... 1 0 1 0 1 1 0 0 1 1
10975 0.617910 -1.332682 -1.131816 0.713467 0 1 1 0 0 1 ... 1 0 1 0 1 1 0 0 1 1
16463 0.617910 0.325746 -1.131816 -0.180874 0 1 0 1 0 1 ... 1 0 1 0 1 0 1 1 0 1

5 rows × 859 columns

Oh yeah, now that looks like the kind of matrix scikit-learn wants to process!

The Error Metric - MeanLogLoss

The error metric for this competition is our old friend, log loss ... with a twist. Since we're predicting for three countries, our overall score is going to be the mean of the log losses for each country. However, the countries' labels are conditionally independent, so in practice we can train three independent models and combine their predictions for submission.

See the competition submission page for more info on the metric!
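
As a rough illustration of the metric described above, the sketch below computes a binary log loss per country and averages the three values. The helper names `binary_log_loss` and `mean_log_loss`, the clipping constant, and the toy labels are all assumptions made for this example; the competition's exact definition lives on the submission page.

```python
import numpy as np

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Binary log loss for one country's labels and predicted P(poor)."""
    y_true = np.asarray(y_true, dtype=float)
    # clip probabilities so that log(0) never occurs
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

def mean_log_loss(per_country):
    """Average the per-country losses; `per_country` maps code -> (labels, probs)."""
    return np.mean([binary_log_loss(y, p) for y, p in per_country.values()])

# Made-up labels with every prediction set to 0.5, like the submission-format baseline
toy = {
    "A": ([1, 0], [0.5, 0.5]),
    "B": ([1, 0], [0.5, 0.5]),
    "C": ([0, 0], [0.5, 0.5]),
}
score = mean_log_loss(toy)
print(score)
```

With every probability at 0.5, each country's loss equals ln 2 regardless of the labels, so a model has to beat roughly 0.693 to improve on the all-0.5 baseline.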

Build the Model

As mentioned above, we're keeping this benchmark short, sweet, and simple. So where do we turn when looking for a great out-of-the-box model? If you answered "Random Forests!" then we may just be two trees of the same ensemble. No? Then perhaps we're... splitting on the same node? At any rate, random forests are often a good model to try first, especially when we have numeric and categorical variables in our feature space.

Random Forest

In scikit-learn, it almost couldn't be easier to grow a random forest with a few lines of code.

In [16]:
from sklearn.ensemble import RandomForestClassifier

def train_model(features, labels, **kwargs):
    # instantiate model
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    # train model
    model.fit(features, labels)
    # get a (not-very-useful) sense of performance
    accuracy = model.score(features, labels)
    print(f"In-sample accuracy: {accuracy:0.2%}")
    return model

That's it as far as model building is concerned. Let's grow some trees!

In [17]:
model_a = train_model(aX_train, ay_train)
In-sample accuracy: 100.00%
In [18]:
model_b = train_model(bX_train, by_train)
In-sample accuracy: 99.94%
In [19]:
model_c = train_model(cX_train, cy_train)
In-sample accuracy: 100.00%
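
In-sample accuracy near 100% says little about how a forest will score on unseen households. One common way to get a more honest estimate is cross-validated log loss. The sketch below uses synthetic data from `make_classification` as a stand-in for one country's features and labels; the fold count and dataset shape are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in for one country's features and labels
X, y = make_classification(n_samples=300, n_features=20,
                           weights=[0.8, 0.2], random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# scikit-learn reports negated log loss so that "higher is better"; flip the sign back
scores = -cross_val_score(model, X, y, cv=5, scoring="neg_log_loss")
print(f"Cross-validated log loss: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Swapping in `aX_train` and `ay_train` for `X` and `y` should give a per-country estimate on the same scale as the competition metric.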

Time to Predict and Submit

Remember, accuracy is not a very informative metric, especially when dealing with imbalanced classes. Furthermore, accuracy is not the metric for this competition!

The above scores suggest little more than an overfit training set. But it's confidence that counts – we'll need to use the .predict_proba() method to generate our submissions. Let's load up the test data, process it, and see what we get.

In [20]:
# load test data
a_test = pd.read_csv(data_paths['A']['test'], index_col='id')
b_test = pd.read_csv(data_paths['B']['test'], index_col='id')
c_test = pd.read_csv(data_paths['C']['test'], index_col='id')
In [21]:
# process the test data
a_test = pre_process_data(a_test, enforce_cols=aX_train.columns)
b_test = pre_process_data(b_test, enforce_cols=bX_train.columns)
c_test = pre_process_data(c_test, enforce_cols=cX_train.columns)
Input shape:	(4041, 344)
After standardization (4041, 344)
After converting categoricals:	(4041, 851)
Input shape:	(1604, 441)
After standardization (1604, 441)
After converting categoricals:	(1604, 1419)
Input shape:	(3187, 163)
After standardization (3187, 163)
After converting categoricals:	(3187, 773)

Note that we're taking a very simple approach to filling missing values, as well as enforcing column consistency after converting to categoricals. (See the preprocessing function again to see what enforce_cols actually does.)
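
One thing to watch with the drop-then-assign approach in `pre_process_data`: columns added through `enforce_cols` are appended at the end, so the test matrix may not share the training matrix's column order, and estimators generally expect columns in the order seen during fit. A possible alternative, sketched here on made-up frames, is `DataFrame.reindex`, which drops, adds, and reorders in one step.

```python
import pandas as pd

# Made-up encoded frames: the test set lacks one training column and has one extra
train_cols = pd.Index(["rooms", "roof_tile", "roof_tin"])
test = pd.DataFrame({"roof_tin": [1, 0], "rooms": [0.5, -0.5], "roof_straw": [0, 1]})

# reindex drops columns not seen in training, adds missing ones filled with 0,
# and puts the columns in the same order as the training matrix
aligned = test.reindex(columns=train_cols, fill_value=0)
print(aligned)
```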

Make Predictions

To return the confidence probabilities that the submission format requires, we need to call the predict_proba() method on our models.

In [22]:
a_preds = model_a.predict_proba(a_test)
b_preds = model_b.predict_proba(b_test)
c_preds = model_c.predict_proba(c_test)

That was easy enough. Time to format the predictions and send them on their way.
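
A detail worth knowing before slicing the output: `predict_proba` returns one column per class, ordered as in the fitted model's `classes_` attribute. The sketch below, on a tiny made-up dataset, looks up the column for the positive class rather than assuming it sits at index 1.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny made-up training set with boolean labels, like the `poor` column
X = np.array([[0.0], [0.1], [0.9], [1.0]])
y = np.array([False, False, True, True])

model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Columns of predict_proba follow the order of model.classes_
print(model.classes_)
probs = model.predict_proba(X)

# Look up the column for the positive class instead of hard-coding an index
poor_col = list(model.classes_).index(True)
p_poor = probs[:, poor_col]
print(p_poor)
```

For boolean labels the classes sort as `[False, True]`, so the probability of `poor` ends up in column 1.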

Save Submission

We'll write a simple function that converts the predictions into a DataFrame and adds a column for the correct country code.

In [23]:
def make_country_sub(preds, test_feat, country):
    # make sure we code the country correctly
    country_codes = ['A', 'B', 'C']
    # get just the poor probabilities
    country_sub = pd.DataFrame(data=preds[:, 1],  # proba p=1
                               columns=['poor'],
                               index=test_feat.index)

    # add the country code for joining later
    country_sub["country"] = country
    return country_sub[["country", "poor"]]
In [24]:
# convert preds to data frames
a_sub = make_country_sub(a_preds, a_test, 'A')
b_sub = make_country_sub(b_preds, b_test, 'B')
c_sub = make_country_sub(c_preds, c_test, 'C')

Finally, it's time to combine our predictions and save for submission!

In [25]:
submission = pd.concat([a_sub, b_sub, c_sub])

How about one last look at the fruits of our hard work...

In [26]:
submission.head()
country poor
418 A 0.32
41249 A 0.28
16205 A 0.26
97501 A 0.36
67756 A 0.26
In [27]:
submission.tail()
country poor
6775 C 0.30
88300 C 0.20
35424 C 0.20
81668 C 0.28
98377 C 0.18

Looks good, let's save and send'er off!
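
A minimal sketch of the save step, assuming a hypothetical output filename `submission.csv` and an index named `id`; both are assumptions for illustration, and the real file should match the provided submission format. The `submission` frame here is a made-up stand-in for the combined predictions.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for the combined `submission` frame built above
submission = pd.DataFrame(
    {"country": ["A", "C"], "poor": [0.32, 0.30]},
    index=pd.Index([418, 6775], name="id"),
)

# Hypothetical output location; the index is written as the first column
out_path = os.path.join(tempfile.gettempdir(), "submission.csv")
submission.to_csv(out_path)

# Read the file back to confirm it round-trips with the expected columns
check = pd.read_csv(out_path, index_col="id")
print(check)
```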

In [28]:

Submit to Leaderboard

Woohoo! It's a start! And that's exactly what we intend with these benchmarks. We're sure you'll be able to top this model in no time, and we can't wait to see what you come up with.

Visit The World Bank's site to learn more about how poverty is measured.