by
Peter Bull
We're excited to launch a new competition. If you're not sure what we're talking about, head to: America's Next Top (Statistical) Model.
US presidential elections come but once every 4 years, and this one's a big one. The new president will help shape policies on education, healthcare, energy, the environment, international relations, aid, and more. There are lots of people trying to predict what will happen. Can you top them?
In this challenge, you'll predict the percent of each state that will vote for each candidate. You can use any data that's available to the public. Come election night, we'll see whose model had the best vision for the country!
%matplotlib inline
import seaborn as sns
# no warnings in our blog post, plz
import warnings
warnings.filterwarnings('ignore')
What do the pollsters say?
First things first, we need some data. The bread and butter of election forecasting is polling data, though some believe that is changing rapidly. The Huffington Post makes available an excellent API for getting data from election polls.
We've gone ahead and collected polling data by state for 2012, and we can read in that CSV using pandas.
import pandas as pd
polls2012 = pd.read_csv("all-polls-2012.csv", index_col=0)
polls2012.head()
We can see that the polls have different dates, methodologies, numbers of observations, and margins of error, and are run by different pollsters. We won't dig into the method, margin of error, or the number of observations in our first-pass model, but you can see how these would be helpful by looking at other forecasts, for example the 538 model.
We'll also need ground-truth labels to train our model--in this case, we're using the results of the most recent presidential election, 2012. You can imagine that including more recent elections (e.g., congressional or gubernatorial races) may help improve the forecast that we create for the 2016 presidential election.
results2012 = pd.read_csv("data/final/private/2012-actual-returns.csv", index_col=0)
results2012.head()
Turn polls into features
In order to predict the vote percentage for a candidate in each state, we need state-level features. For each state, we have a different number of polls, conducted using a different methodology and with varying recency (polls are less often conducted in states that aren't up for grabs). Our first decision is how many polls to use. Given that we expect voters to change their minds over time, we'll just work with up to 5 of the most recent polls. With that in mind, for each state, for each of the 5 most recent polls, we'll create the following features:
- Number of days to the election: This helps the model understand how recent the poll is (therefore, how useful its measures will be).
- Margin of error: If the poll has a margin of error, we'll want to incorporate that as a feature in the model.
- Democrat percentage: The percentage of respondents who say they will vote for the Democratic candidate (for 2012, this is Obama; for 2016, this is Clinton).
- Republican percentage: The percentage of respondents who say they will vote for the Republican candidate (for 2012, this is Romney; for 2016, this is Trump).
- Stein percentage: The percentage of respondents who say they will vote for Jill Stein, the Green Party candidate in both 2012 and 2016. Not all polls include third-party candidates.
- Johnson percentage: The percentage of respondents who say they will vote for Gary Johnson, the Libertarian Party candidate in both 2012 and 2016. Not all polls include third-party candidates.
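Taken together, this scheme produces up to 30 feature columns per state (5 polls × 6 fields). As a quick sketch of the column names we'll generate (the loop here is illustrative, but the format strings match those used in the function below):

```python
# Sketch of the feature column names generated per state:
# 5 polls x 6 fields = 30 columns
n_polls = 5
fields = ['days_to_election', 'moe', 'democrat', 'republican', 'johnson', 'stein']
columns = ['poll_{}_{}'.format(i, f) for i in range(n_polls) for f in fields]
```

The first few entries are `poll_0_days_to_election`, `poll_0_moe`, and so on, through `poll_4_stein`.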
from datetime import datetime
def build_features_from_polls(states, all_polls, is2012=True, n_polls=5):
    """Builds a dataframe where each row is a state, and each column is a
    property of one of the last n_polls polls in that state.
    """
    all_states_rows = []
    for st in states:
        # sort this state's polls from most to least recent
        # (reassign rather than sort in place to avoid a SettingWithCopyWarning)
        st_polls = all_polls[all_polls.state == st]
        st_polls = st_polls.sort_values('date', ascending=False)
        row = {}
        limit = min(st_polls.shape[0], n_polls)
        for i in range(limit):
            this_poll = st_polls.iloc[i]
            # calculate the number of days until the election
            election_day = datetime(2012, 11, 6) if is2012 else datetime(2016, 11, 8)
            days_to_election = (election_day - pd.to_datetime(this_poll.date)).days
            # get the dem and rep candidates
            dem_pct = this_poll.obama if is2012 else this_poll.clinton
            rep_pct = this_poll.romney if is2012 else this_poll.trump
            poll_data = {'poll_{}_days_to_election'.format(i): days_to_election,
                         'poll_{}_moe'.format(i): this_poll.moe,
                         'poll_{}_democrat'.format(i): dem_pct,
                         'poll_{}_republican'.format(i): rep_pct,
                         'poll_{}_johnson'.format(i): this_poll.johnson,
                         'poll_{}_stein'.format(i): this_poll.stein}
            row.update(poll_data)
        all_states_rows.append(row)
    features = pd.DataFrame(all_states_rows, index=states)
    # for unavailable data, generally fill in the mean for that column
    features.fillna(features.mean(axis=0), inplace=True)
    # if a column has no data at all (all NaNs), fill in 0
    features.fillna(0, inplace=True)
    return features
features2012 = build_features_from_polls(results2012.index, polls2012)
features2012.head()
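The two fillna calls at the end of the function deserve a note: the first fills missing values with their column's mean, but a column that is entirely NaN has a NaN mean and stays empty, so a second pass fills those with 0. A minimal illustration with made-up data:

```python
import pandas as pd

# toy frame: column 'a' has one gap, column 'b' is entirely missing
df = pd.DataFrame({'a': [1.0, None, 3.0], 'b': [None, None, None]})

# stage 1: fill gaps with each column's mean ('b' stays NaN, its mean is NaN)
# stage 2: fill whatever remains with 0
df = df.fillna(df.mean(axis=0)).fillna(0)
```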
Time to do some forecasting!
Now it's time to make some predictions. We'll start off with a straightforward model: ordinary least squares regression. OLS can be written as:
$$ y_{i,c} = x_i^T \beta_c + \varepsilon_i $$

In our model, we can think of $i$ as a state. $y_{i,c}$ is the percent of the vote that candidate $c$ received in state $i$ in the election results. $x_i$ is the vector of poll features for state $i$. $\beta_c$ is the vector of coefficients that correspond to each property of the polls for candidate $c$. For example, we would expect that $\beta_\textrm{stein}$ will be large for the poll_stein variables, but otherwise not very important for predicting the success of other candidates.
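To make the setup concrete, here's a toy sketch of recovering $\beta$ by least squares on synthetic, noise-free data (the names and values here are made up for illustration; the real model below is fit on the poll features):

```python
import numpy as np

# synthetic design matrix and known coefficients
rng = np.random.default_rng(0)
X = rng.random((50, 3))
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta  # noise-free targets

# least-squares solution to y = X beta
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With no noise, the least-squares fit recovers the true coefficients (up to floating-point precision).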
We'll also fit this model using GridSearchCV from sklearn, which compares the cross-validated scores for different hyperparameter values on the model and chooses the best one. For the LinearRegression model that we're starting with, we only look at one possible hyperparameter: whether or not we fit an intercept term in our model.
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
ss = MinMaxScaler()
gscv = GridSearchCV(LinearRegression(),
                    dict(fit_intercept=[True, False]),
                    scoring='neg_mean_squared_error')
clf = make_pipeline(ss, gscv)
clf.fit(features2012, results2012)
We've trained a linear regression model that scales the features we use to be between 0 and 1. Since all of the variables are on the same scale, we can plot the coefficients to get a sense for their relative effect on the prediction. Unsurprisingly, the results from the most recent poll are the most important feature!
# get the model coefficients
coeffs = clf.steps[1][1].best_estimator_.coef_
# plot the coefficients
(pd.DataFrame(coeffs, index=results2012.columns, columns=features2012.columns)
   .T
   .plot
   .barh(figsize=(5, 15),
         linewidth=0,
         width=1.0))
Listen to the voice of the people
Now that we have the model, we can predict data for 2012 and evaluate how good the fit of that model is. To do that, we'll predict the percentages that each candidate will receive in each state.
# predict vote percentages with our trained model
preds = clf.predict(features2012)
# get submission format
submission2012 = pd.read_csv("2012-submission-format.csv", index_col=0)
preds2012 = submission2012.copy()
# fill in our predicted values and write to csv
preds2012.iloc[:, :] = preds
preds2012.to_csv("linear-model.csv")
preds2012.head()
If we submit on DrivenData, we can see our score against the 2012 election:
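Scoring happens server-side on DrivenData, but we can sanity-check a submission locally too. Assuming the leaderboard metric is root mean squared error (consistent with the neg_mean_squared_error scoring used in the grid search above), a minimal sketch:

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two arrays of vote percentages."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.sqrt(np.mean((actual - predicted) ** 2))

# toy comparison of actual vs. predicted percentages
score = rmse([50.0, 48.0, 2.0], [52.0, 46.0, 2.0])
```

In practice you'd pass `results2012.values` and `preds` here instead of the toy arrays.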
Feel the will of the people
Finally, it's time to see what we think about Trump v. Clinton. We've now got a model to predict presidential election outcomes based on poll results. We can use this model to predict the current presidential election.
# raw poll data for 2016
polls2016 = pd.read_csv("all-polls-2016.csv", index_col=0)
# submission format for 2016 election
submission2016 = pd.read_csv("data/final/public/2016-submission-format.csv", index_col=0)
# create our features
features2016 = build_features_from_polls(submission2016.index,
                                         polls2016,
                                         is2012=False)
# make predictions (ensuring that the order of the 2016 columns matches the 2012 columns)
preds = clf.predict(features2016[features2012.columns])
# fill the predictions into the submission format
preds2016 = pd.DataFrame(preds,
                         index=submission2016.index,
                         columns=submission2016.columns)
preds2016.head()
We can see we now have predictions for every major candidate, for every state. We can submit this to DrivenData and mark it as our "Evaluation Submission" to indicate that these are our predictions for 2016:
But, one important question remains: who will win the election? That, of course, is subject to the rules of the Electoral College. The winner needs at least 270 electoral votes to become president. We can calculate that using data about how many electoral votes each state gets:
import numpy as np
electoral_data = pd.read_csv("2012-electoral-college.csv")
electoral_data.sort_values(by='State', inplace=True)
preds2016.sort_index(inplace=True)
preds2016['Dems'] = np.where(preds2016.Clinton > preds2016.Trump,
                             electoral_data.Electors,
                             0)
preds2016['Reps'] = np.where(preds2016.Clinton < preds2016.Trump,
                             electoral_data.Electors,
                             0)
print("===== PREDICTED ELECTORAL VOTES FOR EACH PARTY =======")
print(preds2016[['Dems', 'Reps']].sum())
That's not too bad!
We can see that our model predicts a Democratic victory given that 270 electoral votes are needed to win the election.
If we look at other election forecasts like 538, the NY Times, and Fox News, we can see our model is not wildly out of line with current forecasts.
However, we can almost certainly do better! It's now up to you to MAKE THIS MODEL GREAT AGAIN...
For ideas on how to make the model even better, check out our election resources page on the competition website.
Prediction Map
For fun, here's the familiar electoral map to see what our predictions look like. Looking at other prediction maps, we see that it might be worth gathering more data and digging into the modeling decisions for Iowa, Virginia, and DC (which we have going Republican, unlike most other models).
import json
import folium
# make percentages 0 - 100
mapdata = (preds2016 * 100).astype(float)
# state json has full names, so we need those in our data
mapdata.rename(index={'DC': 'District of Columbia'}, inplace=True)
mapdata.rename(index=electoral_data.set_index('State').Name.to_dict(), inplace=True)
# missing regions
mapdata.loc['Puerto Rico'] = [0., 0., 0., 0., 0., 0.]
# clip to a minimum and maximum that make sense for %
mapdata.loc[:,:] = np.clip(mapdata.values, 0.0, 100.0)
# add a pctg for the winner
mapdata['winner'] = np.where(mapdata.Clinton > mapdata.Trump,
                             mapdata.Clinton,
                             mapdata.Trump * -1)
# create map centered on the US
election_map = folium.Map(location=[39.833, -98.583],
                          tiles="Mapbox Bright",
                          zoom_start=4)
# fill in the colors for who wins the state
election_map.choropleth(geo_path="states.json",
                        fill_opacity=0.8,
                        data=mapdata.reset_index(),
                        columns=['STATE ABBREVIATION', 'winner'],
                        fill_color='RdBu',
                        key_on='feature.properties.NAME')
election_map