Measuring poverty is hard, but The World Bank plans to end extreme poverty by 2030 – and they need your help! Poverty levels are often estimated at the country level by extrapolating the results of surveys taken on a subset of the population at the household or individual level.

The surveys are incredibly informative, but they are also incredibly long. A typical poverty survey has hundreds of questions, ranging from region-specific questions to questions about the last time a participant bought bread. In order to track progress towards its goal, The World Bank needs the most efficient survey possible. That's where you come in.

For more information, explore this recent World Bank report on ending the cycle of poverty.

In our brand new competition, we're asking you to predict poverty at the household level by building a great classification model. The strongest poverty predictors could be used by statisticians at The World Bank to design new, shorter, equally informative surveys. With these improvements, The World Bank can more easily track progress towards their ambitious and inspiring goal.

In this post, we'll walk through a very simple first pass model for poverty prediction from survey data, showing you how to load the data, make some predictions, and then submit those predictions to the competition.

To get started, we summon the tools of the trade.

In [1]:

%matplotlib inline

import os

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

# data directory
DATA_DIR = os.path.join("..", "data", "processed")

Loading the Data¶

Check out this research focused on children and the effort to end extreme poverty.

On the data download page, we provide a couple of datasets to get started:

Household-level survey data: This is obfuscated data from surveys conducted by The World Bank, focusing on household-level statistics. The data come from three different countries, and are separated into different files for convenience.
Individual-level survey data: This is obfuscated data from related surveys conducted by The World Bank, only these focus on individual-level statistics. The set of interviewees and countries involved are the same as the household data, as indicated by shared id indices, but this data includes detailed (obfuscated) information about household members.
Submission format: This gives us the filenames and columns of our submission prediction, filled with all 0.5 as a baseline.

In classic benchmark fashion, we're going to keep this analysis short, sweet, and simple. As such, we're not going to use any of the individual-level data that's included with the competition. Even without individual-level data, we have a lot of files to deal with. It's probably worth it to store our paths in an easily-accessible dictionary:

In [2]:

data_paths = {
    "A": {
        "train": os.path.join(DATA_DIR, "A", "A_hhold_train.csv"),
        "test": os.path.join(DATA_DIR, "A", "A_hhold_test.csv"),
    },
    "B": {
        "train": os.path.join(DATA_DIR, "B", "B_hhold_train.csv"),
        "test": os.path.join(DATA_DIR, "B", "B_hhold_test.csv"),
    },
    "C": {
        "train": os.path.join(DATA_DIR, "C", "C_hhold_train.csv"),
        "test": os.path.join(DATA_DIR, "C", "C_hhold_test.csv"),
    },
}

In [3]:

# load training data
a_train = pd.read_csv(data_paths["A"]["train"], index_col="id")
b_train = pd.read_csv(data_paths["B"]["train"], index_col="id")
c_train = pd.read_csv(data_paths["C"]["train"], index_col="id")

As usual, let's take a quick look at the head.

In [4]:

a_train.head()

Out[4]:

	wBXbHZmp	SlDKnCuu	KAJOWiiw	DsKacCdL	rtPrBBPl	tMJrvvut	jdetlNNF	maLAYXwi	vwpsXRGk	sArDRIyX	...	sDGibZrP	CsGvKKBJ	OLpGAaEu	LrDrWRjC	JCDeZBXq	HGPWuGlV	GDUPaBQs	WuwrCsIY	AlDbXTlZ	country
id
46107	JhtDR	GUusz	TuovO	ZYabk	feupP	PHMVg	NDTCU	cLAGr	XAmOF	MwLvg	...	JqHnW	MaXfS	etZsD	idRwx	LPtkN	vkbkA	qQxrL	AITFl	aQeIm	A
82739	JhtDR	GUusz	TuovO	ZYabk	feupP	PHMVg	NDTCU	sehIp	lwCkE	MwLvg	...	JqHnW	MaXfS	HxnJy	idRwx	UyAms	vkbkA	qQxrL	AITFl	cecIq	A
9646	JhtDR	GUusz	BIZns	ZYabk	uxuSS	PHMVg	NDTCU	sehIp	qNABl	MwLvg	...	JqHnW	MaXfS	USRak	idRwx	UyAms	vkbkA	qQxrL	AITFl	cecIq	A
10975	JhtDR	GUusz	TuovO	ZYabk	feupP	PHMVg	NDTCU	sehIp	sPNOc	MwLvg	...	JqHnW	MaXfS	USRak	idRwx	UyAms	vkbkA	qQxrL	AITFl	cecIq	A
16463	JhtDR	alLXR	TuovO	ZYabk	feupP	PHMVg	NDTCU	cLAGr	NdlDR	MwLvg	...	JqHnW	MaXfS	etZsD	idRwx	UyAms	vkbkA	qQxrL	GAZGl	aQeIm	A

5 rows × 345 columns

In [5]:

b_train.head()

Out[5]:

	RzaXNcgd	LfWEhutI	jXOqJdNL	wJthinfa	PTLgvdlQ	ZvEApWrk	euTESpHe	bDVMMSYY	aSzMhjgD	ZehDbxxy	...	YVMuyCUV	AZVtosGB	toZzckhe	BkiXyuSp	ggucvVUs	VMvwrYds	VlNidRNP	rljjAmaN	ChbSWYhO	country
id
57071	zTghO	pYfmQ	lNhMv	42	RQnVj	103	jpSeC	FDqwJ	rxJJI	IbWRL	...	nZcTi	pdvWY	LLuZj	qpzpO	kZRgh	VwGOP	DScEf	SKBnS	Enull	B
18973	zTghO	pYfmQ	lNhMv	34	iuxWN	-2	OLVWN	FDqwJ	ufugi	IbWRL	...	nZcTi	XrijK	LLuZj	qpzpO	kZRgh	VwGOP	JOdCB	SKBnS	Enull	B
20151	zTghO	pYfmQ	lNhMv	34	iuxWN	313	OMRWa	FDqwJ	rxJJI	IbWRL	...	nZcTi	FEjSW	lmjln	qpzpO	kZRgh	VwGOP	JOdCB	SKBnS	Enull	B
5730	zTghO	pYfmQ	lNhMv	58	iuxWN	138	jpSeC	FDqwJ	rxJJI	IbWRL	...	nZcTi	XrijK	lmjln	ZZzXr	kZRgh	VwGOP	ZwQQe	SKBnS	Enull	B
35033	zTghO	pYfmQ	lNhMv	122	iuxWN	68	OLVWN	FDqwJ	rxJJI	IbWRL	...	nZcTi	CRHYU	lmjln	qpzpO	kZRgh	VwGOP	WFgZH	SKBnS	Enull	B

5 rows × 442 columns

In [6]:

c_train.head()

Out[6]:

	GRGAYimk	DNnBfiSI	cNDTCUPU	GvTJUYOo	vmKoAlVH	LhUIIEHQ	DTNyjXJp	PNAiwXUz	ABnhybHK	yiuxBjHP	...	AJHrHUkH	PaEKIlvv	bFEsoTgJ	ihACfisf	obIQUcpS	lAvdypjD	ARWytYMz	eqJPmiPb	mmoCpqWS	country
id
57211	RslOh	SuNUt	gJLrc	EPKkJ	qKiiE	7	XuMYE	-5	QqETe	umyco	...	laFxs	kBQRJ	qcUVH	AmPtx	YXwVA	jSoky	NwjRA	wnPqZ	52	C
62519	jPUAt	boDkI	gJLrc	EPKkJ	YXkKd	7	XuMYE	331	sEJgr	yYwlq	...	laFxs	kBQRJ	eusFW	AmPtx	LSPRW	jSoky	NwjRA	wnPqZ	100	C
11614	OpTiw	boDkI	vURog	EPKkJ	qKiiE	9	XuMYE	-1	sEJgr	umyco	...	laFxs	oUXSJ	eusFW	AmPtx	YXwVA	jSoky	NwjRA	wnPqZ	70	C
6470	RslOh	VgxgY	gJLrc	EPKkJ	YXkKd	9	zfhKi	-5	sEJgr	umyco	...	laFxs	kBQRJ	jqrBN	AmPtx	YXwVA	jSoky	NwjRA	wnPqZ	10	C
33558	IXFlv	VgxgY	kPTaD	EPKkJ	YXkKd	9	XuMYE	23	sEJgr	umyco	...	laFxs	kBQRJ	eusFW	AmPtx	LSPRW	jSoky	herus	wnPqZ	-5	C

5 rows × 164 columns

The first thing to notice is that each country's surveys have wildly different numbers of columns, so we'll plan on training separate models for each country and combining our predictions for submission at the end.

Poverty Distributions¶

Let's take a look at the class distributions for each country. In classification tasks, it's crucial to know the balance of class labels!

In [7]:

a_train.poor.value_counts().plot.bar(title="Number of Poor for country A")

Out[7]:

<matplotlib.axes._subplots.AxesSubplot at 0x116015fd0>

In [8]:

b_train.poor.value_counts().plot.bar(title="Number of Poor for country B")

Out[8]:

<matplotlib.axes._subplots.AxesSubplot at 0x1175d5e48>

In [9]:

c_train.poor.value_counts().plot.bar(title="Number of Poor for country C")

Out[9]:

<matplotlib.axes._subplots.AxesSubplot at 0x118300128>

Country A is well-balanced, but countries B and C are quite unbalanced. This could definitely impact the confidence of our predictor. But solving that problem is up to you – it's outside the scope of this humble benchmark.

We expect most of the data types here to be the dreaded object type, but let's make sure.

In [10]:

a_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8203 entries, 46107 to 39832
Columns: 345 entries, wBXbHZmp to country
dtypes: bool(1), float64(2), int64(2), object(340)
memory usage: 21.6+ MB

In [11]:

b_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3255 entries, 57071 to 4923
Columns: 442 entries, RzaXNcgd to country
dtypes: bool(1), float64(9), int64(14), object(418)
memory usage: 11.0+ MB

In [12]:

c_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6469 entries, 57211 to 7646
Columns: 164 entries, GRGAYimk to country
dtypes: bool(1), float64(1), int64(29), object(133)
memory usage: 8.1+ MB

Sure enough, the bool types are our labels--the poor column--then there are a few numeric types with the rest being object. We'll need to convert the object columns to categorical variables before training anything.

Pre-process the Data¶

We're going to do some simple pre-processing here. Standardizing the data and converting the object types to categoricals should get us pretty far. Let's write a couple of simple functions to help this effort.

In [13]:

# Standardize features
def standardize(df, numeric_only=True):
    numeric = df.select_dtypes(include=["int64", "float64"])

    # subtracy mean and divide by std
    df[numeric.columns] = (numeric - numeric.mean()) / numeric.std()

    return df


def pre_process_data(df, enforce_cols=None):
    print("Input shape:\t{}".format(df.shape))

    df = standardize(df)
    print("After standardization {}".format(df.shape))

    # create dummy variables for categoricals
    df = pd.get_dummies(df)
    print("After converting categoricals:\t{}".format(df.shape))

    # match test set and training set columns
    if enforce_cols is not None:
        to_drop = np.setdiff1d(df.columns, enforce_cols)
        to_add = np.setdiff1d(enforce_cols, df.columns)

        df.drop(to_drop, axis=1, inplace=True)
        df = df.assign(**{c: 0 for c in to_add})

    df.fillna(0, inplace=True)

    return df

Time to convert these surveys!

In [14]:

print("Country A")
aX_train = pre_process_data(a_train.drop("poor", axis=1))
ay_train = np.ravel(a_train.poor)

print("\nCountry B")
bX_train = pre_process_data(b_train.drop("poor", axis=1))
by_train = np.ravel(b_train.poor)

print("\nCountry C")
cX_train = pre_process_data(c_train.drop("poor", axis=1))
cy_train = np.ravel(c_train.poor)

Country A
Input shape:	(8203, 344)
After standardization (8203, 344)
After converting categoricals:	(8203, 859)

Country B
Input shape:	(3255, 441)
After standardization (3255, 441)
After converting categoricals:	(3255, 1432)

Country C
Input shape:	(6469, 163)
After standardization (6469, 163)
After converting categoricals:	(6469, 795)

The data is probably looking pretty different now. Let's take a peek at country A.

In [15]:

aX_train.head()

Out[15]:

	nEsgxvAq	OMtioXZZ	YFMZwKrU	TiwRslOh	wBXbHZmp_DkQlr	wBXbHZmp_JhtDR	SlDKnCuu_GUusz	SlDKnCuu_alLXR	KAJOWiiw_BIZns	KAJOWiiw_TuovO	...	JCDeZBXq_UyAms	HGPWuGlV_WKNwg	HGPWuGlV_vkbkA	GDUPaBQs_qCEuA	GDUPaBQs_qQxrL	WuwrCsIY_AITFl	WuwrCsIY_GAZGl	AlDbXTlZ_aQeIm	AlDbXTlZ_cecIq	country_A
id
46107	-1.447160	0.325746	1.099716	-0.628045	0	1	1	0	0	1	...	0	0	1	0	1	1	0	1	0	1
82739	-0.414625	-0.503468	-0.016050	0.713467	0	1	1	0	0	1	...	1	0	1	0	1	1	0	0	1	1
9646	0.617910	-0.503468	-0.016050	-0.628045	0	1	1	0	1	0	...	1	0	1	0	1	1	0	0	1	1
10975	0.617910	-1.332682	-1.131816	0.713467	0	1	1	0	0	1	...	1	0	1	0	1	1	0	0	1	1
16463	0.617910	0.325746	-1.131816	-0.180874	0	1	0	1	0	1	...	1	0	1	0	1	0	1	1	0	1

5 rows × 859 columns

Oh yeah, now that looks like the kind of matrix scikit-learn wants to process!

The Error Metric - MeanLogLoss¶

The error metric for this competition is our old friend, log loss ... with a twist. Since we're predicting for three countries, our overall score is going to be the mean of the log losses for each country. However, the countries labels are conditionally independent, so in practice we should be able to train three independent models and combine their predictions for submission.

See the competition submission page for more info on the metric!

Build the Model¶

As mentioned above, we're keeping this benchmark short, sweet, and simple. So where do we turn when looking for a great out-of-the-box model? If you answered "Random Forests!" then we may just be two trees of the same ensemble. No? Then perhaps we're... splitting on the same node? At any rate, random forests are often a good model to try first, especially when we have numeric and categorical variables in our feature space.

Random Forest¶

In scikit-learn, it almost couldn't be easier to grow a random forest with a few lines of code.

In [16]:

from sklearn.ensemble import RandomForestClassifier


def train_model(features, labels, **kwargs):
    # instantiate model
    model = RandomForestClassifier(n_estimators=50, random_state=0)

    # train model
    model.fit(features, labels)

    # get a (not-very-useful) sense of performance
    accuracy = model.score(features, labels)
    print(f"In-sample accuracy: {accuracy:0.2%}")

    return model

Another classic from xkcd.

That's it as far model building is concerned. Let's grow some trees!

In [17]:

model_a = train_model(aX_train, ay_train)

In-sample accuracy: 100.00%

In [18]:

model_b = train_model(bX_train, by_train)

In-sample accuracy: 99.94%

In [19]:

model_c = train_model(cX_train, cy_train)

In-sample accuracy: 100.00%

Time to Predict and Submit¶

Remember, accuracy is not a very informative metric, especially when dealing with imbalanced classes. Furthermore, accuracy is not the metric for this competition!

The above scores suggest little more than an overfit training set. But it's confidence that counts – we'll need to use the .predict_proba() method to generate our submissions. Let's load up the test data, process it, and see what we get.

In [20]:

# load test data
a_test = pd.read_csv(data_paths["A"]["test"], index_col="id")
b_test = pd.read_csv(data_paths["B"]["test"], index_col="id")
c_test = pd.read_csv(data_paths["C"]["test"], index_col="id")

In [21]:

# process the test data
a_test = pre_process_data(a_test, enforce_cols=aX_train.columns)
b_test = pre_process_data(b_test, enforce_cols=bX_train.columns)
c_test = pre_process_data(c_test, enforce_cols=cX_train.columns)

Input shape:	(4041, 344)
After standardization (4041, 344)
After converting categoricals:	(4041, 851)
Input shape:	(1604, 441)
After standardization (1604, 441)
After converting categoricals:	(1604, 1419)
Input shape:	(3187, 163)
After standardization (3187, 163)
After converting categoricals:	(3187, 773)

Note that we're taking a very simple approach to filling missing values, as well as enforcing column consistency after converting to categoricals. (See the preprocessing function again to see what enforce_cols actually does.)

Make Predictions¶

To return the confidence probabilities that the submission format requires, we need to call the predict_proba() method on our models.

In [22]:

a_preds = model_a.predict_proba(a_test)
b_preds = model_b.predict_proba(b_test)
c_preds = model_c.predict_proba(c_test)

That was easy enough. Time to format the predictions and send them on their way.

Save Submission¶

We'll write a simple function that converts the predictions a DataFrame and adds a column for the correct country code.

In [23]:

def make_country_sub(preds, test_feat, country):
    # make sure we code the country correctly
    country_codes = ["A", "B", "C"]

    # get just the poor probabilities
    country_sub = pd.DataFrame(
        data=preds[:, 1],  # proba p=1
        columns=["poor"],
        index=test_feat.index,
    )

    # add the country code for joining later
    country_sub["country"] = country
    return country_sub[["country", "poor"]]

In [24]:

# convert preds to data frames
a_sub = make_country_sub(a_preds, a_test, "A")
b_sub = make_country_sub(b_preds, b_test, "B")
c_sub = make_country_sub(c_preds, c_test, "C")

Finally, it's time to combine our predictions and save for submission!

In [25]:

submission = pd.concat([a_sub, b_sub, c_sub])

How about one last look at the fruits of or hard work...

In [26]:

submission.head()

Out[26]:

	country	poor
id
418	A	0.32
41249	A	0.28
16205	A	0.26
97501	A	0.36
67756	A	0.26

In [27]:

submission.tail()

Out[27]:

	country	poor
id
6775	C	0.30
88300	C	0.20
35424	C	0.20
81668	C	0.28
98377	C	0.18

Looks good, let's save and send'er off!

In [28]:

submission.to_csv("submission.csv")

Submit to Leaderboard¶

Woohoo! It's a start! And that's exactly what we intend with these benchmarks. We're sure you'll be able to top this model in no time, and we can't wait to see what you come up with.

Visit The World Bank's site to learn more about how poverty is measured.

Benchmark for Pover-T Test - Predicting Poverty

Loading the Data¶

Poverty Distributions¶

Pre-process the Data¶

The Error Metric - MeanLogLoss¶

Build the Model¶

Random Forest¶

Time to Predict and Submit¶

Make Predictions¶

Save Submission¶

Submit to Leaderboard¶

Tags

Latest posts

Community Spotlight: Paola Ruiz, Néstor González, Daniel Crovo

Community Spotlight: Kirill Brodt

Jump-starting data infrastructure and in-house data expertise

A production application to support survivors of human trafficking

Life beyond the leaderboard

(Tech) Infrastructure Week for the Nonprofit Sector

Meet the winners of Phase 2 of the PREPARE Challenge

AI sauce on everything: Reflections on ASU+GSV 2025

Open-source packages for using speech data in ML

Getting started with LLMs: a benchmark for the 'What's Up, Docs?' challenge

Meet the Winners of the Goodnight Moon, Hello Early Literacy Screening Challenge

Crowdsourcing solutions for AI-assisted early literacy screening

Where to find a data job for a good cause

Meet the Winners of the Youth Mental Health Narratives Challenge

Meet the winners of the Forecast and Final Prize Stages of the Water Supply Forecast Rodeo

10 takeaways from 10 years of data science for social good

Making higher education data more accessible

Goodnight Moon, Hello Early Literacy Screening Benchmark

Youth Mental Health: Automated Abstraction Benchmark

Meet the winners of Phase 1 of the PREPARE Challenge

Work with us to build a better world

Loading the Data¶

Poverty Distributions¶

Pre-process the Data¶

The Error Metric - MeanLogLoss¶

Build the Model¶

Random Forest¶

Time to Predict and Submit¶

Make Predictions¶

Save Submission¶

Submit to Leaderboard¶

Tags

Stay updated

Latest posts

Work with us to build a better world