blog

How to Predict Disturbances in the Geomagentic Field with LSTMs - Benchmark


by Christine Chung

MagNet: Model the Geomagnetic Field - Benchmark

Geomagnetic storms are caused by the interaction of solar wind with Earth's magnetic field. The resulting disturbances in the geomagnetic field can wreak havoc on GPS systems, satellite communication, electric power transmission, and more. These disturbances are measured by the Disturbance Storm-Time Index, or Dst.

In the MagNet: Model the Geomagnetic Field competition, your task is to forecast Dst in real-time to help satellite operators, power grid operators, and users of magnetic navigation systems prepare for magnetic disturbances.

The primary input data for the challenge is provided by sensor data from two satellites, NASA's ACE and NOAA's DSCOVR. This space weather data includes sensor readings related to both the interplanetary magnetic field and plasma from solar wind.

Solar flares interacting with Earth's magnetic field.
A cartoon of the Earth's magnetosphere. The Dst or disturbance-storm-time index is a measure of the “ring current” (blue) around the Earth. The ring current is an electric current carried by charged particles trapped in the magnetosphere.

Long Short Term Memory networks or LSTMs are a special kind of recurrent neural network especially suited to time series data. In this benchmark solution tutorial, we will show you how to implement a first-pass LSTM model for predicting Dst.

Specifically, we'll cover how to:

Let's get started!

Getting Set Up

First, we'll import the basic tools we'll need to load our data and do some exploration.

In [1]:
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

%load_ext nb_black
%matplotlib inline

Now we can read in our data. You can find everything you need on the data download page.

We're provided three datasets that could potentially be used as inputs:

  • solar_wind.csv
  • satellite_position.csv
  • sunspots.csv

For now, we're going to ignore satellite_position.csv. This data could be useful because it describes the position of the satellite in relation to the Earth when the solar wind data was recorded. Feel free to experiment with using satellite positions as features in your model!

For the purposes of this tutorial, we're going to focus on the solar_wind and sunspot data. We'll also load up our labels, dst_labels.csv.

After reading in each csv, we're going to convert the timedelta columns to an actual timedelta object using pd.to_timdelta. That way we can take advantage of its convenient API.

Finally, we're going to set the indices of each dataframe to a multi-index composed of period and timedelta so that it's easier to join our dataframes.

In [2]:
DATA_PATH = Path("../data/final/public/")

dst = pd.read_csv(DATA_PATH / "dst_labels.csv")
dst.timedelta = pd.to_timedelta(dst.timedelta)
dst.set_index(["period", "timedelta"], inplace=True)

sunspots = pd.read_csv(DATA_PATH / "sunspots.csv")
sunspots.timedelta = pd.to_timedelta(sunspots.timedelta)
sunspots.set_index(["period", "timedelta"], inplace=True)

solar_wind = pd.read_csv(DATA_PATH / "solar_wind.csv")
solar_wind.timedelta = pd.to_timedelta(solar_wind.timedelta)
solar_wind.set_index(["period", "timedelta"], inplace=True)

Data Exploration

Let's start by looking at our label, dst. We have hourly values split into three periods - let's inspect the distribution of dst values across those periods.

In [3]:
print("Dst shape: ", dst.shape)
dst.head()
Dst shape:  (139872, 1)
Out[3]:
dst
period timedelta
train_a 0 days 00:00:00 -7
0 days 01:00:00 -10
0 days 02:00:00 -10
0 days 03:00:00 -6
0 days 04:00:00 -2
In [4]:
dst.groupby("period").describe()
Out[4]:
dst
count mean std min 25% 50% 75% max
period
train_a 28824.0 -16.576707 26.083191 -387.0 -26.0 -12.0 -1.0 65.0
train_b 52584.0 -9.695154 16.443049 -223.0 -17.0 -7.0 1.0 59.0
train_c 58464.0 -9.556325 16.506404 -374.0 -16.0 -7.0 0.0 67.0

We have nearly 140,000 observations of hourly dst data, representing over 15 years. There are almost twice as many observations in either the train_b or train_c periods than there are train_a. It also seems train_a represents a more intense period, given that it has a lower mean and higher standard deviation. Also note that most of the values are negative.

A very strong magnetic field disturbance has a large Dst value, measured in nano-Teslas (nT). Because these disturbances are usually flowing towards the Earth, the values are negative. Sometimes Dst can be highly positive. During calm conditions, Dst values are situated at or just below 0.

We can visualize the difference between the periods by plotting the distributions.

In [5]:
fig, ax = plt.subplots()
colors = ["r", "b", "y"]
for i, period in enumerate(dst.groupby("period")):
    period_name, df = period
    ax.hist(df, alpha=0.5, color=colors[i], bins=100, label=period_name)
plt.legend()
plt.title("Distribution of Dst across Periods")
plt.show()

Indeed, we can see that train_c is the most intense of the periods, followed by train_b. Let's keep this difference in distribution in mind when it comes to splitting the data.

For now, we can move on to exploring our features. The primary set of input data comes from the solar wind sensor readings gathered from the ACE and DSCOVR satellites. These measure characteristics of the interplanetary magnetic field (IMF) and plasma from solar wind.

For a complete description of the variables and what they represent, check out the competition Problem Description page.

Let's take a look:

In [6]:
print("Solar wind shape: ", solar_wind.shape)
solar_wind.head()
Solar wind shape:  (8392320, 15)
Out[6]:
bx_gse by_gse bz_gse theta_gse phi_gse bx_gsm by_gsm bz_gsm theta_gsm phi_gsm bt density speed temperature source
period timedelta
train_a 0 days 00:00:00 -5.55 3.00 1.25 11.09 153.37 -5.55 3.00 1.25 11.09 153.37 6.80 1.53 383.92 110237.0 ac
0 days 00:01:00 -5.58 3.16 1.17 10.10 151.91 -5.58 3.16 1.17 10.10 151.91 6.83 1.69 381.79 123825.0 ac
0 days 00:02:00 -5.15 3.66 0.85 7.87 146.04 -5.15 3.66 0.85 7.87 146.04 6.77 1.97 389.11 82548.0 ac
0 days 00:03:00 -5.20 3.68 0.68 6.17 146.17 -5.20 3.68 0.68 6.17 146.17 6.74 1.97 389.11 82548.0 ac
0 days 00:04:00 -5.12 3.68 0.49 4.62 145.72 -5.12 3.68 0.49 4.62 145.72 6.65 1.77 384.26 94269.0 ac
In [7]:
print("Sunspot shape: ", sunspots.shape)
sunspots.head()
Sunspot shape:  (192, 1)
Out[7]:
smoothed_ssn
period timedelta
train_a 0 days 65.4
13 days 72.0
44 days 76.9
74 days 80.8
105 days 85.4

We can see that we're mostly working with floats, except for the solar wind source column which tells us which of the two satellites recorded the data. We also see that the size of the solar_wind data is fairly large, close to 8.4 million rows. That makes sense, given we're working with minutely values.

On the other hand, we only have 192 observations of monthly sunspot data. When it comes to feature generation, we should think about the best ways to combine these features given their different frequencies.

Now let's look at the distribution of values across periods. We know that the distribution of dst is quite different, and we expect a similar pattern here.

In [8]:
solar_wind.groupby("period").describe().T
Out[8]:
period train_a train_b train_c
bx_gse count 1.575012e+06 3.084130e+06 3.407290e+06
mean -1.781301e+00 -3.088789e-01 -4.619076e-01
std 4.339212e+00 3.627830e+00 3.245485e+00
min -5.463000e+01 -2.937000e+01 -4.546000e+01
25% -4.960000e+00 -3.070000e+00 -2.800000e+00
... ... ... ... ...
temperature min 1.000000e+04 1.496000e+03 0.000000e+00
25% 4.364900e+04 3.741400e+04 4.007400e+04
50% 7.923800e+04 8.552400e+04 7.152100e+04
75% 1.325500e+05 1.873250e+05 1.310880e+05
max 6.223700e+06 4.206672e+06 5.751308e+06

112 rows × 3 columns

In [9]:
sunspots.groupby("period").describe().T
Out[9]:
period train_a train_b train_c
smoothed_ssn count 40.000000 72.000000 80.000000
mean 136.902500 51.850000 24.313750
std 34.563168 39.200266 19.020414
min 65.400000 3.900000 2.200000
25% 108.375000 15.325000 7.775000
50% 151.500000 43.150000 20.500000
75% 164.400000 91.225000 38.525000
max 175.200000 116.400000 69.500000

No surprise here. The mean and standard deviation of values in train_a is generally more intense than in the other two periods.

We also notice that our values exist across very different scales. For instance, temperature values are quite high - reaching into the hundreds of thousands Kelvin. Meanwhile, IMF readings are fairly small values, and usually negative. One of the best practices when training deep learning models is scaling your features; we'll definitely want to think about that later on.

Let's do a few visualizations before we move onto feature generation. First, we'll just plot the first 1,000 rows for some of our time series data to get a sense of its shape. Instead of plotting all of our features, let's just choose a few IMF features and a few plasma features.

In [10]:
def show_raw_visualization(data):
    fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(15, 15), dpi=80)
    for i, key in enumerate(data.columns):
        t_data = data[key]
        ax = t_data.plot(
            ax=axes[i // 2, i % 2],
            title=f"{key.capitalize()}",
            rot=25,
        )

    fig.subplots_adjust(hspace=0.8)
    plt.tight_layout()


cols_to_plot = ["bx_gse", "bx_gsm", "bt", "density", "speed", "temperature"]
show_raw_visualization(solar_wind[cols_to_plot].iloc[:1000])

The first thing to notice is there are data gaps that we'll have to contend with. Let's count those:

In [11]:
solar_wind.isna().sum()
Out[11]:
bx_gse         325888
by_gse         325888
bz_gse         325888
theta_gse      325888
phi_gse        326388
bx_gsm         325888
by_gsm         325888
bz_gsm         325888
theta_gsm      325888
phi_gsm        326388
bt             325888
density        684890
speed          689555
temperature    811768
source         316816
dtype: int64

The proportion varies by feature, but every one will require some kind of imputation. Turns out sensor readings from space aren't always reliable - instruments are subject to all kinds of nasty space weather. It's up to us to figure out a sensible way of dealing with these missing values.

Another observation is that the IMF features bx_gsm and bx_gse are closely related. This could present multicollinearity issues if both are present in our model. As our last exploratory step, let's plot a correlation matrix to find other relationships between our features.

In [12]:
corr = solar_wind.join(sunspots).join(dst).fillna(method="ffill").corr()

plt.figure(figsize=(10, 5))
plt.matshow(corr, fignum=1)
plt.xticks(range(corr.shape[1]), corr.columns, fontsize=14, rotation=90)
plt.gca().xaxis.tick_bottom()
plt.yticks(range(corr.shape[1]), corr.columns, fontsize=14)


cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)
plt.title("Feature Correlation Heatmap", fontsize=16)
plt.show()

Interesting! The plasma related features like speed and temperature look to be strongly anti-correlated with Dst. The IMF feature bt also exhibits strong anti-correlation.

As we suspected, the gsm and gse variables are strongly correlated. We probably want to leave some of those out.

With this information in hand, let's move onto feature generation.

Feature Generation

Before we get too far, let's first set seeds to make sure that all of our work is reproducible. We'll be using Keras, which makes use of both TensorFlow and numpy on the backend, so we'll be setting a seed for both.

In [13]:
from numpy.random import seed
from tensorflow.random import set_seed

seed(2020)
set_seed(2021)

Great! Now let's think about our feature processing pipeline. Let's recap what we've learned from data exploration:

  1. features exist across different scales
  2. features are provided at different frequencies
  3. certain IMF features are highly correlated
  4. there are many missing values

For the purposes of this tutorial, we'll address each of these issues using basic solutions. For your own implementations, you should get creative!

1. Features exist across different scales

We'll fix this using the StandardScaler from scikit-learn. It's a pretty standard affair - it'll help us subtract the mean and divide by the standard deviation for each feature. What's nice about it is that it'll save the parameters used for scaling so that we can re-use it later during prediction. You could also experiment with using the MinMaxScaler or other scaling methods instead.

2. Features are provided at different frequencies

One easy way to fix this is to aggregate values to the same frequency. Since our dst values are provided hourly, we'll aggregate solar_wind data to the hour. The timedelta API will make that really easy for us to group by the hour. We'll take both the mean and std of each value for each hour. This would be a great place to experiment with different frequencies and aggregations to try and get the best performance out of your model.

3. Certain IMF features are highly correlated

We'll solve this by taking a subset of the data. Let's use the plasma variables temperature, speed, and density, as well as the IMF feature bt. Since we saw that gse and gsm variables were highly correlated, let's only take the gse variables. This is a first pass at subsetting the variables - you may want to take a more principled approach in your implementation.

4. There are many missing values

We've already decided to aggregate our minutely solar wind data to an hourly frequency. That'll help with some of the missing values. For the remaining, there are a few different methods we could try. We want something that will be effective that will also work in a real-time environment. For simplicity, we're going to interpolate between missing values using df.interpolate(). Again, this is another great place for experimentation! You could impute using the mean or median, or you might even think about developing another model to estimate the missing values.

Finally, we have our monthly sunspot numbers. These aren't really "missing" - they're just provided at a coarser frequency. We'll fix that by using the forward fill or ffill method of imputation so that the correct monthly number is assigned to each row.

In [14]:
from sklearn.preprocessing import StandardScaler

# subset of solar wind features to use for modeling
SOLAR_WIND_FEATURES = [
    "bt",
    "temperature",
    "bx_gse",
    "by_gse",
    "bz_gse",
    "speed",
    "density",
]

# all of the features we'll use, including sunspot numbers
XCOLS = (
    [col + "_mean" for col in SOLAR_WIND_FEATURES]
    + [col + "_std" for col in SOLAR_WIND_FEATURES]
    + ["smoothed_ssn"]
)


def impute_features(feature_df):
    """Imputes data using the following methods:
    - `smoothed_ssn`: forward fill
    - `solar_wind`: interpolation
    """
    # forward fill sunspot data for the rest of the month
    feature_df.smoothed_ssn = feature_df.smoothed_ssn.fillna(method="ffill")
    # interpolate between missing solar wind values
    feature_df = feature_df.interpolate()
    return feature_df


def aggregate_hourly(feature_df, aggs=["mean", "std"]):
    """Aggregates features to the floor of each hour using mean and standard deviation.
    e.g. All values from "11:00:00" to "11:59:00" will be aggregated to "11:00:00".
    """
    # group by the floor of each hour use timedelta index
    agged = feature_df.groupby(
        ["period", feature_df.index.get_level_values(1).floor("H")]
    ).agg(aggs)
    # flatten hierachical column index
    agged.columns = ["_".join(x) for x in agged.columns]
    return agged


def preprocess_features(solar_wind, sunspots, scaler=None, subset=None):
    """
    Preprocessing steps:
        - Subset the data
        - Aggregate hourly
        - Join solar wind and sunspot data
        - Scale using standard scaler
        - Impute missing values
    """
    # select features we want to use
    if subset:
        solar_wind = solar_wind[subset]

    # aggregate solar wind data and join with sunspots
    hourly_features = aggregate_hourly(solar_wind).join(sunspots)

    # subtract mean and divide by standard deviation
    if scaler is None:
        scaler = StandardScaler()
        scaler.fit(hourly_features)

    normalized = pd.DataFrame(
        scaler.transform(hourly_features),
        index=hourly_features.index,
        columns=hourly_features.columns,
    )

    # impute missing values
    imputed = impute_features(normalized)

    # we want to return the scaler object as well to use later during prediction
    return imputed, scaler

Now let's apply our pipeline and see what your new data look like.

In [15]:
features, scaler = preprocess_features(solar_wind, sunspots, subset=SOLAR_WIND_FEATURES)
print(features.shape)
features.head()
(139872, 15)
Out[15]:
bt_mean bt_std temperature_mean temperature_std bx_gse_mean bx_gse_std by_gse_mean by_gse_std bz_gse_mean bz_gse_std speed_mean speed_std density_mean density_std smoothed_ssn
period timedelta
train_a 0 days 00:00:00 0.499705 2.443614 -0.375267 0.383941 -1.599207 -0.381502 0.419516 0.031658 0.300358 -0.651645 -0.738546 0.862524 -0.775827 -0.205724 0.139444
0 days 01:00:00 0.547177 -0.224580 -0.479430 0.953178 -1.757995 -0.867747 0.179257 -0.272971 0.446103 -0.517913 -0.986904 0.995063 -0.861692 -0.058215 0.139444
0 days 02:00:00 0.739905 -0.770240 -0.574831 -0.192518 -1.912116 -1.114317 0.183266 -0.822786 0.770174 -0.876490 -1.013548 0.554085 -0.846222 -0.220012 0.139444
0 days 03:00:00 0.699098 -0.278783 -0.324709 0.325491 -1.809045 -0.783042 -0.378111 0.341156 0.621194 -0.290211 -0.826469 -0.211185 -0.404306 0.218373 0.139444
0 days 04:00:00 0.223933 -0.225168 -0.313432 0.201600 -1.338802 -0.484910 0.072745 1.023019 0.467629 -0.478080 -0.601238 1.016033 -0.371487 0.097253 0.139444
In [16]:
# check to make sure missing values are filled
assert (features.isna().sum() == 0).all()

Our final feature set is composed of 15 features - the mean and standard deviation of seven solar_wind features, along with smoothed_ssn. We've also saved our scaler object. We'll serialize this later along with our model so that it can be used to preprocess features during prediction.

Before we start modeling, we also have to reshape our label, dst. Remember that we have to predict both t0, the current timestep, and t+1, an hour ahead. We'll train our LSTM to do multi-step prediction by providing it both steps. To do that, we just add another column called t1 that is dst shifted by 1. We'll also renamed our dst column to t0 for consistency.

In [17]:
YCOLS = ["t0", "t1"]


def process_labels(dst):
    y = dst.copy()
    y["t0"] = y.groupby("period").dst.shift(-1)
    y["t1"] = y.groupby("period").dst.shift(-2)
    return y[YCOLS]


labels = process_labels(dst)
labels.head()
Out[17]:
t0 t1
period timedelta
train_a 0 days 00:00:00 -7 -10.0
0 days 01:00:00 -10 -10.0
0 days 02:00:00 -10 -6.0
0 days 03:00:00 -6 -2.0
0 days 04:00:00 -2 3.0

Et voilà! Now we have our features and our labels. Let's join them together into one data df so that it's easier to keep the appropriate rows together when splitting into our train, test, and validation sets.

In [18]:
data = labels.join(features)
data.head()
Out[18]:
t0 t1 bt_mean bt_std temperature_mean temperature_std bx_gse_mean bx_gse_std by_gse_mean by_gse_std bz_gse_mean bz_gse_std speed_mean speed_std density_mean density_std smoothed_ssn
period timedelta
train_a 0 days 00:00:00 -7 -10.0 0.499705 2.443614 -0.375267 0.383941 -1.599207 -0.381502 0.419516 0.031658 0.300358 -0.651645 -0.738546 0.862524 -0.775827 -0.205724 0.139444
0 days 01:00:00 -10 -10.0 0.547177 -0.224580 -0.479430 0.953178 -1.757995 -0.867747 0.179257 -0.272971 0.446103 -0.517913 -0.986904 0.995063 -0.861692 -0.058215 0.139444
0 days 02:00:00 -10 -6.0 0.739905 -0.770240 -0.574831 -0.192518 -1.912116 -1.114317 0.183266 -0.822786 0.770174 -0.876490 -1.013548 0.554085 -0.846222 -0.220012 0.139444
0 days 03:00:00 -6 -2.0 0.699098 -0.278783 -0.324709 0.325491 -1.809045 -0.783042 -0.378111 0.341156 0.621194 -0.290211 -0.826469 -0.211185 -0.404306 0.218373 0.139444
0 days 04:00:00 -2 3.0 0.223933 -0.225168 -0.313432 0.201600 -1.338802 -0.484910 0.072745 1.023019 0.467629 -0.478080 -0.601238 1.016033 -0.371487 0.097253 0.139444

Splitting the Data

We want to split our data into three datasets. As you might have guessed, the train set will be used for training. We'll also pass a validation set to keras as it's training to monitor the modeling fitting over epochs. Finally, we'll use a test set to evaluate our model for over or under fitting before we submit to the competition.

We have two atypical concerns to consider when splitting this dataset:

First, we're dealing with timeseries data. Observations in a time series are not indepedent, so we cannot randomly assign observations across our datasets. We also don't want to "cheat" by leaking future information into our training data. In the real-world, we will never be able to train on data from the future, so we should emulate those same contraints here.

Secondly, we have three non-contiguous periods, meaning we have gaps in our data. We don't know how long each gap is or in what order each period occured. We also know that the three periods are differently distributed. That suggests that observations from each period should be included in our train set, instead of reserving one wholesale as our test or validation set.

To solve these problems, we'll hold out the last 6,000 rows from each period for our test set, and reserve last 3,000 before that for our validation set. The remaining rows will be used in the training set.

In [19]:
def get_train_test_val(data, test_per_period, val_per_period):
    """Splits data across periods into train, test, and validation"""
    # assign the last `test_per_period` rows from each period to test
    test = data.groupby("period").tail(test_per_period)
    interim = data[~data.index.isin(test.index)]
    # assign the last `val_per_period` from the remaining rows to validation
    val = interim.groupby("period").tail(val_per_period)
    # the remaining rows are assigned to train
    train = interim[~interim.index.isin(val.index)]
    return train, test, val


train, test, val = get_train_test_val(data, test_per_period=6_000, val_per_period=3_000)

We can visualize our splits by plotting a horizontal barchart depicting the counts across periods.

In [20]:
ind = [0, 1, 2]
names = ["train_a", "train_b", "train_c"]
width = 0.75
train_cnts = [len(df) for _, df in train.groupby("period")]
val_cnts = [len(df) for _, df in val.groupby("period")]
test_cnts = [len(df) for _, df in test.groupby("period")]

p1 = plt.barh(ind, train_cnts, width)
p2 = plt.barh(ind, val_cnts, width, left=train_cnts)
p3 = plt.barh(ind, test_cnts, width, left=np.add(val_cnts, train_cnts).tolist())

plt.yticks(ind, names)
plt.ylabel("Period")
plt.xlabel("Hourly Timesteps")
plt.title("Train/Validation/Test Splits over Periods", fontsize=16)
plt.legend(["Train", "Validation", "Test"])
Out[20]:
<matplotlib.legend.Legend at 0x7f073dc5f940>

Let's do a final visual inspection of each of our datasets before we begin building our model.

In [21]:
print(train.shape)
train.head()
(121872, 17)
Out[21]:
t0 t1 bt_mean bt_std temperature_mean temperature_std bx_gse_mean bx_gse_std by_gse_mean by_gse_std bz_gse_mean bz_gse_std speed_mean speed_std density_mean density_std smoothed_ssn
period timedelta
train_a 0 days 00:00:00 -7 -10.0 0.499705 2.443614 -0.375267 0.383941 -1.599207 -0.381502 0.419516 0.031658 0.300358 -0.651645 -0.738546 0.862524 -0.775827 -0.205724 0.139444
0 days 01:00:00 -10 -10.0 0.547177 -0.224580 -0.479430 0.953178 -1.757995 -0.867747 0.179257 -0.272971 0.446103 -0.517913 -0.986904 0.995063 -0.861692 -0.058215 0.139444
0 days 02:00:00 -10 -6.0 0.739905 -0.770240 -0.574831 -0.192518 -1.912116 -1.114317 0.183266 -0.822786 0.770174 -0.876490 -1.013548 0.554085 -0.846222 -0.220012 0.139444
0 days 03:00:00 -6 -2.0 0.699098 -0.278783 -0.324709 0.325491 -1.809045 -0.783042 -0.378111 0.341156 0.621194 -0.290211 -0.826469 -0.211185 -0.404306 0.218373 0.139444
0 days 04:00:00 -2 3.0 0.223933 -0.225168 -0.313432 0.201600 -1.338802 -0.484910 0.072745 1.023019 0.467629 -0.478080 -0.601238 1.016033 -0.371487 0.097253 0.139444
In [22]:
print(test.shape)
test.head()
(18000, 17)
Out[22]:
t0 t1 bt_mean bt_std temperature_mean temperature_std bx_gse_mean bx_gse_std by_gse_mean by_gse_std bz_gse_mean bz_gse_std speed_mean speed_std density_mean density_std smoothed_ssn
period timedelta
train_a 951 days 00:00:00 -9 -9.0 -0.750759 0.234958 -0.321374 -0.263304 0.437479 -0.443221 -0.202748 0.318096 0.849393 -0.020478 -0.145924 -0.503193 -0.407614 -0.315154 2.113449
951 days 01:00:00 -9 -9.0 -0.791318 0.284752 -0.319784 -0.301259 0.321370 0.422124 -0.240443 0.166232 0.776471 -0.220270 -0.127991 -0.084971 -0.427620 -0.198320 2.113449
951 days 02:00:00 -9 -6.0 -0.980331 0.573975 -0.257948 -0.344924 0.530721 0.818555 -0.137579 0.157993 0.147218 -0.276363 -0.120626 -0.310757 -0.435274 -0.374498 2.113449
951 days 03:00:00 -6 -7.0 -1.043536 -0.412378 -0.209809 -0.397676 0.753977 -0.432534 -0.157578 -0.430909 0.121454 -0.282263 -0.180313 -0.531222 -0.414209 -0.365694 2.113449
951 days 04:00:00 -7 -7.0 -0.993770 -0.603035 -0.189010 -0.402766 0.789420 -0.391195 -0.252788 -0.372368 -0.149991 -0.510691 -0.219717 -0.649039 -0.437909 -0.267683 2.113449
In [23]:
print(val.shape)
val.head()
(9000, 17)
Out[23]:
t0 t1 bt_mean bt_std temperature_mean temperature_std bx_gse_mean bx_gse_std by_gse_mean by_gse_std bz_gse_mean bz_gse_std speed_mean speed_std density_mean density_std smoothed_ssn
period timedelta
train_a 1076 days 00:00:00 -10 -14.0 -0.500369 -0.556432 -0.565764 -0.566800 0.595906 -1.050481 -0.790488 -1.016119 -0.663002 -0.983979 -0.993809 -0.803758 0.255588 -0.321380 1.912994
1076 days 01:00:00 -14 -17.0 -0.491465 -0.540997 -0.414549 -0.492732 0.975872 -1.190283 -0.498741 -1.035309 -0.900495 -1.013405 -1.025679 -0.817710 -0.280802 -0.439676 1.912994
1076 days 02:00:00 -17 -17.0 -0.401423 -0.438652 -0.401142 -0.392821 0.787625 -0.642754 -0.876114 -0.601185 -0.756347 -0.762486 -1.056768 -0.634338 -0.388496 -0.323616 1.912994
1076 days 03:00:00 -17 -19.0 -0.326379 -0.231738 -0.468758 -0.049836 0.608997 -0.259921 -1.158882 -0.548127 -0.322424 -0.579160 -1.100709 -0.427564 -0.529769 -0.297505 1.912994
1076 days 04:00:00 -19 -19.0 -0.137671 -0.510389 -0.687011 -0.550497 0.073255 -0.722199 -1.426605 -0.961700 0.014622 -0.504161 -1.089066 -0.359673 -0.661522 -0.375966 1.912994

Building our Model

Now we can get started with model building!

The first thing we have to do is separate our data into sequences and batches for modeling. We have to decide on:

  • timesteps: this determines the sequence length, ie. how many timesteps in the past to use to predict each step at t0 and t1. Our data is aggregated hourly, so timesteps is equal to the number of hours we want to use for each prediction.
  • batch_size: this determines the number of samples to work through before a model's parameters are updated.

For this tutorial, we'll choose fairly standard numbers of 32 timesteps per sequence and 32 sequences per batch. These numbers will likely have a large impact on your model, so feel free to experiment.

Now we need to use these numbers to separate our training data and labels into batches of sequences that will be fed into the model. Luckily, keras has just the tool with their timeseries_dataset_from_array function. According to the documentation:

If targets was passed, the dataset yields tuple (batch_of_sequences, batch_of_targets). If not, the dataset yields only batch_of_sequences.

This is exactly what we need! We can easily specify timesteps (referred to as sequence_length in the documentation) and batchsize to get a feature generator and a target generator that we can pass to model.fit(). For your implementation, you can experiment with different sequence_lengths along with sequence_stride (how many observations to skip between sequences) and sampling_rate (how many observations to sample per sequence).

There's one hiccup - we need to make sure that our sequences don't span across periods. To get around that, we'll iterate through our periods and generate a timeseries dataset for each one. Then we'll concatenate them at the end to rejoin our training set and validation set. And let's not forget - since we're only allowed to use feature data up until t-1, we'll need to realign our features and labels. We'll do that during our loop as well.

In [24]:
from keras import preprocessing


data_config = {
    "timesteps": 32,
    "batch_size": 32,
}


def timeseries_dataset_from_df(df, batch_size):
    dataset = None
    timesteps = data_config["timesteps"]

    # iterate through periods
    for _, period_df in df.groupby("period"):
        # realign features and labels so that first sequence of 32 is aligned with the 33rd target
        inputs = period_df[XCOLS][:-timesteps]
        outputs = period_df[YCOLS][timesteps:]

        period_ds = preprocessing.timeseries_dataset_from_array(
            inputs,
            outputs,
            timesteps,
            batch_size=batch_size,
        )

        if dataset is None:
            dataset = period_ds
        else:
            dataset = dataset.concatenate(period_ds)

    return dataset


train_ds = timeseries_dataset_from_df(train, data_config["batch_size"])
val_ds = timeseries_dataset_from_df(val, data_config["batch_size"])

print(f"Number of train batches: {len(train_ds)}")
print(f"Number of val batches: {len(val_ds)}")
Number of train batches: 3804
Number of val batches: 276

Designing an LSTM

Finally, we need to design our LSTM network. We're going to build a simple sequential model with one hidden LSTM layer and one output later with 2 output values (t0 and t1). You can experiment by making your model deep (many layers) or wide (adding more neurons).

There are many hyperparameters that we can tune. We'll concentrate on a few here:

  • n_epochs: This determines the number of complete passes your model takes through the training data. For this tutorial, we'll choose 20 epochs. You'll want to monitor your rate of convergence to tune this number.
  • n_neurons: The number of hidden nodes. Usually increases by powers of 2. We'll use 512.
  • dropout: This regularizes by randomly "ignoring" a dropout fraction of a layer's neurons during each pass through the network during training, so that no particular neuron overfits its input. We'll start with a value of 0.4.
  • stateful: This determines whether the model keeps track of the historical data that its seen within each batch. Since each sample within a batch encodes the entire sequence we care about, we can set this to False.

Tuning these values will impact how fast your model learns, whether it converges, and affect over/under fitting. Play around to see what works best.

After instantiating the model with these hyperparameters, we'll compile it with mean_squared_error as our loss function and adam as our optimizer. We'll have to remember to take the square root of our loss to get our competition metric, root_mean_squared_error.

In [25]:
from keras.layers import Dense, LSTM
from keras.models import Sequential

# define our model
model_config = {"n_epochs": 20, "n_neurons": 512, "dropout": 0.4, "stateful": False}

model = Sequential()
model.add(
    LSTM(
        model_config["n_neurons"],
        # usually set to (`batch_size`, `sequence_length`, `n_features`)
        # setting the batch size to None allows for variable length batches
        batch_input_shape=(None, data_config["timesteps"], len(XCOLS)),
        stateful=model_config["stateful"],
        dropout=model_config["dropout"],
    )
)
model.add(Dense(len(YCOLS)))
model.compile(
    loss="mean_squared_error",
    optimizer="adam",
)

model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm (LSTM)                  (None, 512)               1081344   
_________________________________________________________________
dense (Dense)                (None, 2)                 1026      
=================================================================
Total params: 1,082,370
Trainable params: 1,082,370
Non-trainable params: 0
_________________________________________________________________

Train Model

Now let's get to training! All we have to do is call model.fit() with our timeseries_dataset, batch_size, and epochs. We'll also set verbose to 0 to keep things tidy. At home, you can set it to 1 to print out a nice progress bar and the training and validation loss after each epoch.

Finally, we'll also pass it our valdation timeseries_dataset so that we can calcuate our validation loss along with our training loss after each epoch.

In [26]:
history = model.fit(
    train_ds,
    batch_size=data_config["batch_size"],
    epochs=model_config["n_epochs"],
    verbose=0,
    shuffle=False,
    validation_data=val_ds,
)

Evaluation

Huzzah! Our model finished training. We can plot the convergence by using our history object.

In [27]:
for name, values in history.history.items():
    plt.plot(values)

It looks like it was just starting to level out. On our next iteration, we could let it train for longer or try to adjust the hyperparemeters to get better convergence. Let' see how we did on our test set.

In [28]:
test_ds = timeseries_dataset_from_df(test, data_config["batch_size"])
In [29]:
mse = model.evaluate(test_ds)
print(f"Test RMSE: {mse**.5:.2f}")
558/558 [==============================] - 12s 22ms/step - loss: 202.7513
Test RMSE: 14.24

Since our metric is RMSE, we have to remember to take the square root. Looks like we ended up with a test RMSE of 14.24. Not bad! Let's prepare our code submission to see how we did on the competition test set.

Preparing for submission

First, we have to serialize everything we'll need for prediction (serialization just means convert to bytes so that we can store it on disk). That includes our StandardScaler, some of our configurations we used during preprocessing, and last but not least our model.

Keras provides a handy model.save() method for serializing models. We'll pickle our scaler and save our config to json.

In [32]:
import json
import pickle

model.save("model")

with open("scaler.pck", "wb") as f:
    pickle.dump(scaler, f)

data_config["solar_wind_subset"] = SOLAR_WIND_FEATURES
print(data_config)
with open("config.json", "w") as f:
    json.dump(data_config, f)
INFO:tensorflow:Assets written to: model/assets
{'timesteps': 32, 'batch_size': 32, 'solar_wind_subset': ['bt', 'temperature', 'bx_gse', 'by_gse', 'bz_gse', 'speed', 'density']}

Finally, we have to prepare our predict.py file. This pulls together everything we need from our preprocessing pipeline down to prediction. We'll also have to load in any pieces that we serialized.

The following cell includes everything that we would write in predict.py.

Most crucially, it defines predict_dst. This is the function that that the code execution environment will call to get your predictions for t0 and t1 for any observations. In it, we'll process the data that we're passed and then call model.predict().

We'll also implement a simple safeguard to detect infinite or null predictions. If we return infinite or null values for t0 or t1, our submission will be automatically canceled. To avoid that, we'll return the mean dst of -12 when we detect any unexpected values. In reality, you'll want to debug your solution or implement a more sophisticated safeguard.

Check out the run time repo and code submission format page for details and how to put your code submission together.

In [31]:
import json
import pickle
from typing import Tuple

import keras
import numpy as np
import pandas as pd

# Load in serialized model, config, and scaler
model = keras.models.load_model("model")

with open("config.json", "r") as f:
    CONFIG = json.load(f)

with open("scaler.pck", "rb") as f:
    scaler = pickle.load(f)

# Set global variables    
TIMESTEPS = CONFIG["timesteps"]
SOLAR_WIND_FEATURES = [
    "bt",
    "temperature",
    "bx_gse",
    "by_gse",
    "bz_gse",
    "speed",
    "density",
]
XCOLS = (
    [col + "_mean" for col in SOLAR_WIND_FEATURES]
    + [col + "_std" for col in SOLAR_WIND_FEATURES]
    + ["smoothed_ssn"]
)


# Define functions for preprocessing
def impute_features(feature_df):
    """Imputes data using the following methods:
    - `smoothed_ssn`: forward fill
    - `solar_wind`: interpolation
    """
    # forward fill sunspot data for the rest of the month
    feature_df.smoothed_ssn = feature_df.smoothed_ssn.fillna(method="ffill")
    # interpolate between missing solar wind values
    feature_df = feature_df.interpolate()
    return feature_df


def aggregate_hourly(feature_df, aggs=["mean", "std"]):
    """Aggregates features to the floor of each hour using mean and standard deviation.
    e.g. All values from "11:00:00" to "11:59:00" will be aggregated to "11:00:00".
    """
    # group by the floor of each hour use timedelta index
    agged = feature_df.groupby([feature_df.index.floor("H")]).agg(aggs)

    # flatten hierachical column index
    agged.columns = ["_".join(x) for x in agged.columns]
    return agged


def preprocess_features(solar_wind, sunspots, scaler=None, subset=None):
    """
    Preprocessing steps:
        - Subset the data
        - Aggregate hourly
        - Join solar wind and sunspot data
        - Scale using standard scaler
        - Impute missing values
    """
    # select features we want to use
    if subset:
        solar_wind = solar_wind[subset]

    # aggregate solar wind data and join with sunspots
    hourly_features = aggregate_hourly(solar_wind).join(sunspots)

    # subtract mean and divide by standard deviation
    if scaler is None:
        scaler = StandardScaler()
        scaler.fit(hourly_features)

    normalized = pd.DataFrame(
        scaler.transform(hourly_features),
        index=hourly_features.index,
        columns=hourly_features.columns,
    )

    # impute missing values
    imputed = impute_features(normalized)

    # we want to return the scaler object as well to use later during prediction
    return imputed, scaler


# THIS MUST BE DEFINED FOR YOUR SUBMISSION TO RUN
def predict_dst(
    solar_wind_7d: pd.DataFrame,
    satellite_positions_7d: pd.DataFrame,
    latest_sunspot_number: float,
) -> Tuple[float, float]:
    """
    Take all of the data up until time t-1, and then make predictions for
    times t and t+1.
    Parameters
    ----------
    solar_wind_7d: pd.DataFrame
        The last 7 days of satellite data up until (t - 1) minutes [exclusive of t]
    satellite_positions_7d: pd.DataFrame
        The last 7 days of satellite position data up until the present time [inclusive of t]
    latest_sunspot_number: float
        The latest monthly sunspot number (SSN) to be available
    Returns
    -------
    predictions : Tuple[float, float]
        A tuple of two predictions, for (t and t + 1 hour) respectively; these should
        be between -2,000 and 500.
    """
    # Re-format data to fit into our pipeline
    sunspots = pd.DataFrame(index=solar_wind_7d.index, columns=["smoothed_ssn"])
    sunspots["smoothed_ssn"] = latest_sunspot_number
    
    # Process our features and grab last 32 (timesteps) hours
    features, s = preprocess_features(
        solar_wind_7d, sunspots, scaler=scaler, subset=SOLAR_WIND_FEATURES
    )
    model_input = features[-TIMESTEPS:][XCOLS].values.reshape(
        (1, TIMESTEPS, features.shape[1])
    )
    
    # Make a prediction
    prediction_at_t0, prediction_at_t1 = model.predict(model_input)[0]

    # Optional check for unexpected values
    if not np.isfinite(prediction_at_t0):
        prediction_at_t0 = -12
    if not np.isfinite(prediction_at_t1):
        prediction_at_t1 = -12

    return prediction_at_t0, prediction_at_t1

We can zip this file up along with our serialized model and other pieces into a file called submission.zip.

Protip: In the noaa-runtime directory, you can put all of these files into /benchmark. Running make pack-benchmark will create a submission.zip in the /submission directory.

Now we can head over to the submissions page to submit our code.

Solar flares interacting with Earth's magnetic field.

Hooray! We got an RMSE of 16.3293. That's not a bad start, but now it's up to you to improve on it.

We hope this benchmark solution provided a helpful framework for exploring the solar wind data, designing an LSTM to predict Dst, and creating a code submission. We can't wait to see what else you come up with!

Head over to the MagNet: Model the Geomagnetic Field for more details on the competition. You can also check out the forum. Happy modeling!