Benchmark for Pover-T Test - Predicting Poverty

We're launching a new competition to predict poverty. In this post, we'll show you how to get started!

Casey Fitzpatrick
DrivenData

Measuring poverty is hard, but The World Bank plans to end extreme poverty by 2030 – and they need your help! Poverty levels are often estimated at the country level by extrapolating the results of surveys taken on a subset of the population at the household or individual level.

The surveys are incredibly informative, but they are also incredibly long. A typical poverty survey has hundreds of questions, ranging from region-specific questions to questions about the last time a participant bought bread. In order to track progress towards its goal, The World Bank needs the most efficient survey possible. That's where you come in.

For more information, explore this recent World Bank report on ending the cycle of poverty.

In our brand new competition, we're asking you to predict poverty at the household level by building a great classification model. The strongest poverty predictors could be used by statisticians at The World Bank to design new, shorter, equally informative surveys. With these improvements, The World Bank can more easily track progress towards their ambitious and inspiring goal.

In this post, we'll walk through a very simple first pass model for poverty prediction from survey data, showing you how to load the data, make some predictions, and then submit those predictions to the competition.

To get started, we summon the tools of the trade.

In [1]:
%matplotlib inline

import os

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

# data directory
DATA_DIR = os.path.join('..', 'data', 'processed')

Loading the Data

Check out this research focused on children and the effort to end extreme poverty.

On the data download page, we provide a couple of datasets to get started:

  • Household-level survey data: This is obfuscated data from surveys conducted by The World Bank, focusing on household-level statistics. The data come from three different countries, and are separated into different files for convenience.
  • Individual-level survey data: This is obfuscated data from related surveys conducted by The World Bank, only these focus on individual-level statistics. The set of interviewees and countries involved are the same as the household data, as indicated by shared id indices, but this data includes detailed (obfuscated) information about household members.
  • Submission format: This gives us the filenames and columns of our submission prediction, filled with all 0.5 as a baseline.

In classic benchmark fashion, we're going to keep this analysis short, sweet, and simple. As such, we're not going to use any of the individual-level data that's included with the competition. Even without individual-level data, we have a lot of files to deal with. It's probably worth it to store our paths in an easily-accessible dictionary:

In [2]:
data_paths = {'A': {'train': os.path.join(DATA_DIR, 'A', 'A_hhold_train.csv'), 
                    'test':  os.path.join(DATA_DIR, 'A', 'A_hhold_test.csv')}, 
              
              'B': {'train': os.path.join(DATA_DIR, 'B', 'B_hhold_train.csv'), 
                    'test':  os.path.join(DATA_DIR, 'B', 'B_hhold_test.csv')}, 
              
              'C': {'train': os.path.join(DATA_DIR, 'C', 'C_hhold_train.csv'), 
                    'test':  os.path.join(DATA_DIR, 'C', 'C_hhold_test.csv')}}
In [3]:
# load training data
a_train = pd.read_csv(data_paths['A']['train'], index_col='id')
b_train = pd.read_csv(data_paths['B']['train'], index_col='id')
c_train = pd.read_csv(data_paths['C']['train'], index_col='id')

As usual, let's take a quick look at the head.

In [4]:
a_train.head()
Out[4]:
wBXbHZmp SlDKnCuu KAJOWiiw DsKacCdL rtPrBBPl tMJrvvut jdetlNNF maLAYXwi vwpsXRGk sArDRIyX ... sDGibZrP CsGvKKBJ OLpGAaEu LrDrWRjC JCDeZBXq HGPWuGlV GDUPaBQs WuwrCsIY AlDbXTlZ country
id
46107 JhtDR GUusz TuovO ZYabk feupP PHMVg NDTCU cLAGr XAmOF MwLvg ... JqHnW MaXfS etZsD idRwx LPtkN vkbkA qQxrL AITFl aQeIm A
82739 JhtDR GUusz TuovO ZYabk feupP PHMVg NDTCU sehIp lwCkE MwLvg ... JqHnW MaXfS HxnJy idRwx UyAms vkbkA qQxrL AITFl cecIq A
9646 JhtDR GUusz BIZns ZYabk uxuSS PHMVg NDTCU sehIp qNABl MwLvg ... JqHnW MaXfS USRak idRwx UyAms vkbkA qQxrL AITFl cecIq A
10975 JhtDR GUusz TuovO ZYabk feupP PHMVg NDTCU sehIp sPNOc MwLvg ... JqHnW MaXfS USRak idRwx UyAms vkbkA qQxrL AITFl cecIq A
16463 JhtDR alLXR TuovO ZYabk feupP PHMVg NDTCU cLAGr NdlDR MwLvg ... JqHnW MaXfS etZsD idRwx UyAms vkbkA qQxrL GAZGl aQeIm A

5 rows × 345 columns

In [5]:
b_train.head()
Out[5]:
RzaXNcgd LfWEhutI jXOqJdNL wJthinfa PTLgvdlQ ZvEApWrk euTESpHe bDVMMSYY aSzMhjgD ZehDbxxy ... YVMuyCUV AZVtosGB toZzckhe BkiXyuSp ggucvVUs VMvwrYds VlNidRNP rljjAmaN ChbSWYhO country
id
57071 zTghO pYfmQ lNhMv 42 RQnVj 103 jpSeC FDqwJ rxJJI IbWRL ... nZcTi pdvWY LLuZj qpzpO kZRgh VwGOP DScEf SKBnS Enull B
18973 zTghO pYfmQ lNhMv 34 iuxWN -2 OLVWN FDqwJ ufugi IbWRL ... nZcTi XrijK LLuZj qpzpO kZRgh VwGOP JOdCB SKBnS Enull B
20151 zTghO pYfmQ lNhMv 34 iuxWN 313 OMRWa FDqwJ rxJJI IbWRL ... nZcTi FEjSW lmjln qpzpO kZRgh VwGOP JOdCB SKBnS Enull B
5730 zTghO pYfmQ lNhMv 58 iuxWN 138 jpSeC FDqwJ rxJJI IbWRL ... nZcTi XrijK lmjln ZZzXr kZRgh VwGOP ZwQQe SKBnS Enull B
35033 zTghO pYfmQ lNhMv 122 iuxWN 68 OLVWN FDqwJ rxJJI IbWRL ... nZcTi CRHYU lmjln qpzpO kZRgh VwGOP WFgZH SKBnS Enull B

5 rows × 442 columns

In [6]:
c_train.head()
Out[6]:
GRGAYimk DNnBfiSI cNDTCUPU GvTJUYOo vmKoAlVH LhUIIEHQ DTNyjXJp PNAiwXUz ABnhybHK yiuxBjHP ... AJHrHUkH PaEKIlvv bFEsoTgJ ihACfisf obIQUcpS lAvdypjD ARWytYMz eqJPmiPb mmoCpqWS country
id
57211 RslOh SuNUt gJLrc EPKkJ qKiiE 7 XuMYE -5 QqETe umyco ... laFxs kBQRJ qcUVH AmPtx YXwVA jSoky NwjRA wnPqZ 52 C
62519 jPUAt boDkI gJLrc EPKkJ YXkKd 7 XuMYE 331 sEJgr yYwlq ... laFxs kBQRJ eusFW AmPtx LSPRW jSoky NwjRA wnPqZ 100 C
11614 OpTiw boDkI vURog EPKkJ qKiiE 9 XuMYE -1 sEJgr umyco ... laFxs oUXSJ eusFW AmPtx YXwVA jSoky NwjRA wnPqZ 70 C
6470 RslOh VgxgY gJLrc EPKkJ YXkKd 9 zfhKi -5 sEJgr umyco ... laFxs kBQRJ jqrBN AmPtx YXwVA jSoky NwjRA wnPqZ 10 C
33558 IXFlv VgxgY kPTaD EPKkJ YXkKd 9 XuMYE 23 sEJgr umyco ... laFxs kBQRJ eusFW AmPtx LSPRW jSoky herus wnPqZ -5 C

5 rows × 164 columns

The first thing to notice is that each country's surveys have wildly different numbers of columns, so we'll plan on training separate models for each country and combining our predictions for submission at the end.

Poverty Distributions

Let's take a look at the class distributions for each country. In classification tasks, it's crucial to know the balance of class labels!

In [7]:
a_train.poor.value_counts().plot.bar(title='Number of Poor for country A')
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x116015fd0>
In [8]:
b_train.poor.value_counts().plot.bar(title='Number of Poor for country B')
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x1175d5e48>
In [9]:
c_train.poor.value_counts().plot.bar(title='Number of Poor for country C')
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x118300128>

Country A is well-balanced, but countries B and C are quite unbalanced. This could definitely impact the confidence of our predictor. But solving that problem is up to you – it's outside the scope of this humble benchmark.

We expect most of the data types here to be the dreaded object type, but let's make sure.

In [10]:
a_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8203 entries, 46107 to 39832
Columns: 345 entries, wBXbHZmp to country
dtypes: bool(1), float64(2), int64(2), object(340)
memory usage: 21.6+ MB
In [11]:
b_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3255 entries, 57071 to 4923
Columns: 442 entries, RzaXNcgd to country
dtypes: bool(1), float64(9), int64(14), object(418)
memory usage: 11.0+ MB
In [12]:
c_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6469 entries, 57211 to 7646
Columns: 164 entries, GRGAYimk to country
dtypes: bool(1), float64(1), int64(29), object(133)
memory usage: 8.1+ MB

Sure enough, the bool types are our labels--the poor column--then there are a few numeric types with the rest being object. We'll need to convert the object columns to categorical variables before training anything.

Pre-process the Data

We're going to do some simple pre-processing here. Standardizing the data and converting the object types to categoricals should get us pretty far. Let's write a couple of simple functions to help this effort.

In [13]:
# Standardize features
def standardize(df, numeric_only=True):
    numeric = df.select_dtypes(include=['int64', 'float64'])
    
    # subtracy mean and divide by std
    df[numeric.columns] = (numeric - numeric.mean()) / numeric.std()
    
    return df
    

def pre_process_data(df, enforce_cols=None):
    print("Input shape:\t{}".format(df.shape))
        

    df = standardize(df)
    print("After standardization {}".format(df.shape))
        
    # create dummy variables for categoricals
    df = pd.get_dummies(df)
    print("After converting categoricals:\t{}".format(df.shape))
    

    # match test set and training set columns
    if enforce_cols is not None:
        to_drop = np.setdiff1d(df.columns, enforce_cols)
        to_add = np.setdiff1d(enforce_cols, df.columns)

        df.drop(to_drop, axis=1, inplace=True)
        df = df.assign(**{c: 0 for c in to_add})
    
    df.fillna(0, inplace=True)
    
    return df

Time to convert these surveys!

In [14]:
print("Country A")
aX_train = pre_process_data(a_train.drop('poor', axis=1))
ay_train = np.ravel(a_train.poor)

print("\nCountry B")
bX_train = pre_process_data(b_train.drop('poor', axis=1))
by_train = np.ravel(b_train.poor)

print("\nCountry C")
cX_train = pre_process_data(c_train.drop('poor', axis=1))
cy_train = np.ravel(c_train.poor)
Country A
Input shape:	(8203, 344)
After standardization (8203, 344)
After converting categoricals:	(8203, 859)

Country B
Input shape:	(3255, 441)
After standardization (3255, 441)
After converting categoricals:	(3255, 1432)

Country C
Input shape:	(6469, 163)
After standardization (6469, 163)
After converting categoricals:	(6469, 795)

The data is probably looking pretty different now. Let's take a peek at country A.

In [15]:
aX_train.head()
Out[15]:
nEsgxvAq OMtioXZZ YFMZwKrU TiwRslOh wBXbHZmp_DkQlr wBXbHZmp_JhtDR SlDKnCuu_GUusz SlDKnCuu_alLXR KAJOWiiw_BIZns KAJOWiiw_TuovO ... JCDeZBXq_UyAms HGPWuGlV_WKNwg HGPWuGlV_vkbkA GDUPaBQs_qCEuA GDUPaBQs_qQxrL WuwrCsIY_AITFl WuwrCsIY_GAZGl AlDbXTlZ_aQeIm AlDbXTlZ_cecIq country_A
id
46107 -1.447160 0.325746 1.099716 -0.628045 0 1 1 0 0 1 ... 0 0 1 0 1 1 0 1 0 1
82739 -0.414625 -0.503468 -0.016050 0.713467 0 1 1 0 0 1 ... 1 0 1 0 1 1 0 0 1 1
9646 0.617910 -0.503468 -0.016050 -0.628045 0 1 1 0 1 0 ... 1 0 1 0 1 1 0 0 1 1
10975 0.617910 -1.332682 -1.131816 0.713467 0 1 1 0 0 1 ... 1 0 1 0 1 1 0 0 1 1
16463 0.617910 0.325746 -1.131816 -0.180874 0 1 0 1 0 1 ... 1 0 1 0 1 0 1 1 0 1

5 rows × 859 columns

Oh yeah, now that looks like the kind of matrix scikit-learn wants to process!

The Error Metric - MeanLogLoss

The error metric for this competition is our old friend, log loss ... with a twist. Since we're predicting for three countries, our overall score is going to be the mean of the log losses for each country. However, the countries labels are conditionally independent, so in practice we should be able to train three independent models and combine their predictions for submission.

See the competition submission page for more info on the metric!

Build the Model

As mentioned above, we're keeping this benchmark short, sweet, and simple. So where do we turn when looking for a great out-of-the-box model? If you answered "Random Forests!" then we may just be two trees of the same ensemble. No? Then perhaps we're... splitting on the same node? At any rate, random forests are often a good model to try first, especially when we have numeric and categorical variables in our feature space.

Random Forest

In scikit-learn, it almost couldn't be easier to grow a random forest with a few lines of code.

In [16]:
from sklearn.ensemble import RandomForestClassifier

def train_model(features, labels, **kwargs):
    
    # instantiate model
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    
    # train model
    model.fit(features, labels)
    
    # get a (not-very-useful) sense of performance
    accuracy = model.score(features, labels)
    print(f"In-sample accuracy: {accuracy:0.2%}")
    
    return model
Another classic from xkcd.

That's it as far model building is concerned. Let's grow some trees!

In [17]:
model_a = train_model(aX_train, ay_train)
In-sample accuracy: 100.00%
In [18]:
model_b = train_model(bX_train, by_train)
In-sample accuracy: 99.94%
In [19]:
model_c = train_model(cX_train, cy_train)
In-sample accuracy: 100.00%

Time to Predict and Submit

Remember, accuracy is not a very informative metric, especially when dealing with imbalanced classes. Furthermore, accuracy is not the metric for this competition!

The above scores suggest little more than an overfit training set. But it's confidence that counts – we'll need to use the .predict_proba() method to generate our submissions. Let's load up the test data, process it, and see what we get.

In [20]:
# load test data
a_test = pd.read_csv(data_paths['A']['test'], index_col='id')
b_test = pd.read_csv(data_paths['B']['test'], index_col='id')
c_test = pd.read_csv(data_paths['C']['test'], index_col='id')
In [21]:
# process the test data
a_test = pre_process_data(a_test, enforce_cols=aX_train.columns)
b_test = pre_process_data(b_test, enforce_cols=bX_train.columns)
c_test = pre_process_data(c_test, enforce_cols=cX_train.columns)
Input shape:	(4041, 344)
After standardization (4041, 344)
After converting categoricals:	(4041, 851)
Input shape:	(1604, 441)
After standardization (1604, 441)
After converting categoricals:	(1604, 1419)
Input shape:	(3187, 163)
After standardization (3187, 163)
After converting categoricals:	(3187, 773)

Note that we're taking a very simple approach to filling missing values, as well as enforcing column consistency after converting to categoricals. (See the preprocessing function again to see what enforce_cols actually does.)

Make Predictions

To return the confidence probabilities that the submission format requires, we need to call the predict_proba() method on our models.

In [22]:
a_preds = model_a.predict_proba(a_test)
b_preds = model_b.predict_proba(b_test)
c_preds = model_c.predict_proba(c_test)

That was easy enough. Time to format the predictions and send them on their way.

Save Submission

We'll write a simple function that converts the predictions a DataFrame and adds a column for the correct country code.

In [23]:
def make_country_sub(preds, test_feat, country):
    # make sure we code the country correctly
    country_codes = ['A', 'B', 'C']
    
    # get just the poor probabilities
    country_sub = pd.DataFrame(data=preds[:, 1],  # proba p=1
                               columns=['poor'], 
                               index=test_feat.index)

    
    # add the country code for joining later
    country_sub["country"] = country
    return country_sub[["country", "poor"]]
In [24]:
# convert preds to data frames
a_sub = make_country_sub(a_preds, a_test, 'A')
b_sub = make_country_sub(b_preds, b_test, 'B')
c_sub = make_country_sub(c_preds, c_test, 'C')

Finally, it's time to combine our predictions and save for submission!

In [25]:
submission = pd.concat([a_sub, b_sub, c_sub])

How about one last look at the fruits of or hard work...

In [26]:
submission.head()
Out[26]:
country poor
id
418 A 0.32
41249 A 0.28
16205 A 0.26
97501 A 0.36
67756 A 0.26
In [27]:
submission.tail()
Out[27]:
country poor
id
6775 C 0.30
88300 C 0.20
35424 C 0.20
81668 C 0.28
98377 C 0.18

Looks good, let's save and send'er off!

In [28]:
submission.to_csv('submission.csv')

Submit to Leaderboard

Woohoo! It's a start! And that's exactly what we intend with these benchmarks. We're sure you'll be able to top this model in no time, and we can't wait to see what you come up with.

Visit The World Bank's site to learn more about how poverty is measured.

Stay updated

Join our newsletter or follow us for the latest on our social impact projects, data science competitions and open source work.

There was a problem. Please try again.
Subscribe successful!
Protected by reCAPTCHA. The Google Privacy Policy and Terms of Service apply.

Latest posts

All posts

insights

Life beyond the leaderboard

What happens to winning solutions after a machine learning competition?

winners

Meet the winners of Phase 2 of the PREPARE Challenge

Learn about how winners detected cognitive decline using speech recordings and social determinants of health survey data

resources

Open-source packages for using speech data in ML

Overview of key open-source packages for extracting features from voice data to support ML applications

tutorial

Getting started with LLMs: a benchmark for the 'What's Up, Docs?' challenge

An introduction to using large language models via the benchmark to a document summarization challenge.

winners

Meet the Winners of the Goodnight Moon, Hello Early Literacy Screening Challenge

Learn about the results and winning methods from the early literacy screening challenge.

resources

Where to find a data job for a good cause

Finding data jobs for good causes can be difficult. Learn strategies, job lists, and tips to find organizations with open positions working on causes you care about.

winners

Meet the Winners of the Youth Mental Health Narratives Challenge

Learn about the winning solutions from the Youth Mental Health Challenge Automated Abstraction and Novel Variables Tracks

winners

Meet the winners of the Forecast and Final Prize Stages of the Water Supply Forecast Rodeo

Learn about the winners and winning solutions from the final stages of the Water Supply Forecast Rodeo.

insights

10 takeaways from 10 years of data science for social good

This year DrivenData celebrates our 10th birthday! We've spent the past decade working to use data science and AI for social good. Here are some lessons we've learned along the way.

tutorial

Goodnight Moon, Hello Early Literacy Screening Benchmark

In this guest post from the MIT Gabrieli Lab, we'll show you how to get started with the literacy screening challenge!

tutorial

Youth Mental Health: Automated Abstraction Benchmark

Learn how to process text narratives using open-source LLMs for the Youth Mental Health: Automated Abstraction challenge

winners

Meet the winners of Phase 1 of the PREPARE Challenge

Learn about the top datasets sourced for Phase 1 of the PREPARE Challenge.

resources

Teaching with DrivenData Competitions

Inspiration and resources for teaching students data science, machine learning, and AI skills with DrivenData competitions.

winners

Meet the winners of the Pose Bowl challenge

Learn about the top solutions submitted for the Pose Bowl: Spacecraft Detection and Pose Estimation Challenge.

winners

Meet the winners of the Water Supply Forecast Rodeo Hindcast Stage

Learn about the winning models for forecasting seasonal water supply from the first stage of the Water Supply Forecast Rodeo.

tools

Cookiecutter Data Science V2

Announcing the V2 release of Cookiecutter Data Science, the most widely adopted data science project template.

resources

How to make data science projects more open and inclusive

Key practices from the field of open science for making data science work more transparent, inclusive, and equitable.

winners

Meet the winners of the Kelp Wanted challenge

Dive into the solutions from the super segmenters who best detected kelp in Landsat imagery!

winners

Meet the winners of the SNOMED CT Entity Linking Challenge

Meet the winners with the best systems for detecting clinical terms in medical notes.

winners

Meet the winners of the Pale Blue Dot challenge

Learn about the top visuals created for the Pale Blue Dot: Visualization Challenge and the solvers behind them.

Work with us to build a better world

Learn more about how our team is bringing the transformative power of data science and AI to organizations tackling the world's biggest challenges.