
Goodnight Moon, Hello Early Literacy Screening Benchmark


by Meral Hacikamiloglu

Welcome! This guest post from our partners at the MIT Gabrieli Lab will guide you through building a simple baseline model for the Goodnight Moon, Hello Early Literacy Screening Challenge. The benchmark model predicts scores for literacy screening tasks using features extracted from the audio recordings. For access to the data used in this benchmark notebook, sign up for the competition here.

This notebook will:

  • Load the dataset
  • Perform exploratory data analysis
  • Create feature representations
  • Split the data into train and test sets
  • Train an XGBoost model
  • Predict and evaluate locally
  • Prepare model code and assets for submission

Background

Literacy skills are critical to a child’s success in school and beyond, yet a significant portion of students in the US struggle with reading. Early intervention is crucial, but the current approach to literacy screening in classrooms relies heavily on teachers administering and manually scoring assessments, a process that can be time-consuming and sometimes inconsistent due to variations in scorer training and interpretation.

Reach Every Reader has developed a comprehensive literacy screening assessment. This assessment includes tasks designed to measure key language skills, such as phonological awareness and working memory. Specifically, tasks like deletion, blending, nonword repetition, and sentence repetition capture critical aspects of early literacy development. While the information gathered from these tasks is invaluable, the manual scoring process limits its potential impact.

This competition invites participants to develop machine learning models that can automatically and accurately score these audio-based literacy tasks. By building reliable models, competitors can help ease administrative load on teachers, increase scoring accuracy, and ensure more consistent support for students at risk.

Let's get started!

In [1]:
# Utilities (pathlib from the standard library; joblib for saving the model later)
import joblib
from pathlib import Path

# Audio processing libraries
import librosa
import librosa.display
import opensmile
import webrtcvad

# Machine learning and data handling
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss
from tqdm import tqdm
import xgboost as xgb
from xgboost import XGBClassifier

# Visualization
import matplotlib.pyplot as plt

Step 1: Load the Data

Ensure that the train_labels.csv and train_metadata.csv files are available in the data/ folder before running this step. In this setup, the audio files are in a subfolder of data/ called audio/. You can change the paths to match your setup.

In [2]:
DATA_PATH = Path("data")
AUDIO_PATH = DATA_PATH / "audio"
In [3]:
labels = pd.read_csv(DATA_PATH / "train_labels.csv")
print(f"Train labels shape: {labels.shape}")
labels.head()
Train labels shape: (38095, 2)
Out[3]:
filename score
0 hgxrel.wav 0.0
1 ltbona.wav 0.0
2 bfaiol.wav 1.0
3 ktvyww.wav 1.0
4 htfbnp.wav 1.0
In [4]:
metadata = pd.read_csv(DATA_PATH / "train_metadata.csv")
print(f"Train metadata shape: {metadata.shape}")
metadata.head()
Train metadata shape: (38095, 4)
Out[4]:
filename task expected_text grade
0 hgxrel.wav deletion old KG
1 ltbona.wav sentence_repetition he wouldnt go with his sister because he was t... KG
2 bfaiol.wav nonword_repetition chav KG
3 ktvyww.wav sentence_repetition ring the bell on the desk to get her attention 2
4 htfbnp.wav blending kite KG

We'll join these datasets together to help with our exploratory data analysis.

In [5]:
df = labels.merge(metadata, on="filename", validate="1:1")
print(f"df shape: {df.shape}")
df.head()
df shape: (38095, 5)
Out[5]:
filename score task expected_text grade
0 hgxrel.wav 0.0 deletion old KG
1 ltbona.wav 0.0 sentence_repetition he wouldnt go with his sister because he was t... KG
2 bfaiol.wav 1.0 nonword_repetition chav KG
3 ktvyww.wav 1.0 sentence_repetition ring the bell on the desk to get her attention 2
4 htfbnp.wav 1.0 blending kite KG

Step 2: Exploratory Data Analysis

We will now explore the dataset and visualize some features.

In [6]:
def plot_waveform(filepath):
    # Load the audio file
    audio_data, sr = librosa.load(filepath, sr=None)

    # Plot the waveform
    plt.figure(figsize=(10, 4))
    librosa.display.waveshow(audio_data, sr=sr)
    plt.title("Waveform")
    plt.xlabel("Time (s)")
    plt.ylabel("Amplitude")
    plt.show()

    return audio_data, sr
In [7]:
def plot_spectrogram(audio_data, sr):
    # Generate the spectrogram
    S = librosa.stft(audio_data)
    S_db = librosa.amplitude_to_db(np.abs(S), ref=np.max)

    # Plot the spectrogram
    plt.figure(figsize=(10, 4))
    librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="log")
    plt.colorbar(format="%+2.0f dB")
    plt.title("Spectrogram")
    plt.xlabel("Time (s)")
    plt.ylabel("Frequency (Hz)")
    plt.show()
In [8]:
def voice_activity_detection(filepath, aggressiveness=2):
    vad = webrtcvad.Vad(aggressiveness)  # Aggressiveness from 0 to 3
    audio_data, sr = librosa.load(filepath, sr=16000)  # VAD prefers 16kHz audio
    audio_data = (audio_data * 32767).astype(np.int16)  # Scale to int16 for VAD

    frame_duration = 30  # Frame duration in ms
    frame_length = int(sr * frame_duration / 1000)

    # Collect VAD results (webrtcvad only accepts complete 10/20/30 ms frames,
    # so skip any shorter trailing frame)
    vad_results = []
    for start in range(0, len(audio_data), frame_length):
        frame = audio_data[start : start + frame_length]
        if len(frame) < frame_length:
            break
        vad_results.append(vad.is_speech(frame.tobytes(), sr))

    # Plot VAD output
    time_axis = np.linspace(0, len(audio_data) / sr, num=len(vad_results))
    plt.figure(figsize=(10, 2))
    plt.plot(time_axis, vad_results, label="VAD Output")
    plt.title("Voice Activity Detection (VAD) Output")
    plt.xlabel("Time (s)")
    plt.ylabel("Speech Detected")
    plt.ylim(-0.1, 1.1)
    plt.show()
In [9]:
def analyze_audio(filepath):
    print("Plotting waveform...")
    audio_data, sr = plot_waveform(filepath)

    print("Plotting spectrogram...")
    plot_spectrogram(audio_data, sr)

    print("Performing Voice Activity Detection...")
    voice_activity_detection(filepath)
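
Before looking at individual recordings, it can also help to see how the labels break down by task and grade. A quick sketch using the merged df from above (not part of the original benchmark; exact counts will depend on your copy of the data):

# Distribution of tasks and grades, and the share of responses scored correct
print(df.task.value_counts())
print(df.grade.value_counts())
print(df.groupby("task").score.mean().round(3))   # fraction scored correct per task
print(df.groupby("grade").score.mean().round(3))  # fraction scored correct per grade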

Let’s take a closer look at each task and its selected examples. For each task, we analyze paired examples — one correct and one incorrect — to better understand how variations in responses manifest in the dataset.

Deletion

Deletion evaluates a child’s phonological awareness by asking them to listen to a word and then remove part of it to form a new, real word. In this example, the child is prompted with “haircut without cut,” and the expected response is “hair.” In the audio file gpksml.wav, the child responds with “cut,” failing to produce the target word “hair,” so the response is scored as incorrect.

In [10]:
incorrect_deletion = "gpksml.wav"

df[df.filename == incorrect_deletion]
Out[10]:
filename score task expected_text grade
9468 gpksml.wav 0.0 deletion hair KG
In [11]:
analyze_audio(AUDIO_PATH / incorrect_deletion)
Plotting waveform...
Plotting spectrogram...
Performing Voice Activity Detection...

In contrast, the audio file faudzc.wav demonstrates a correct response to the same task. The child successfully identifies and removes the portion “cut” from “haircut” to produce “hair,” indicating strong phonological awareness and the ability to modify spoken words accurately.

In [12]:
correct_deletion = "faudzc.wav"

df[df.filename == correct_deletion]
Out[12]:
filename score task expected_text grade
26274 faudzc.wav 1.0 deletion hair KG
In [13]:
analyze_audio(AUDIO_PATH / correct_deletion)
Plotting waveform...
Plotting spectrogram...
Performing Voice Activity Detection...

Sentence Repetition

Sentence repetition assesses the child’s ability to repeat a sentence verbatim, maintaining both structure and meaning. In this case, the sentence is “ring the bell on the desk to get her attention.” The response in khlzie.wav modifies the wording slightly, saying “ring the bell on her desk to get her attention.” While the meaning remains largely intact, the deviation from the original wording marks the response as incorrect.

In [14]:
incorrect_sentrep = "khlzie.wav"
df[df.filename == incorrect_sentrep]
Out[14]:
filename score task expected_text grade
20806 khlzie.wav 0.0 sentence_repetition ring the bell on the desk to get her attention 2
In [15]:
analyze_audio(AUDIO_PATH / incorrect_sentrep)
Plotting waveform...
Plotting spectrogram...
Performing Voice Activity Detection...

The correct response in loqrbr.wav accurately replicates the entire sentence without any alterations, demonstrating the child’s strong linguistic processing and auditory memory skills.

In [16]:
correct_sentrep = "loqrbr.wav"
df[df.filename == correct_sentrep]
Out[16]:
filename score task expected_text grade
35798 loqrbr.wav 1.0 sentence_repetition ring the bell on the desk to get her attention 2
In [17]:
analyze_audio(AUDIO_PATH / correct_sentrep)
Plotting waveform...
Plotting spectrogram...
Performing Voice Activity Detection...

Nonword Repetition

Nonword repetition evaluates phonological working memory by asking the child to repeat a nonsensical word. In this case, the child was asked to repeat “gowfdoikeem.” The response in dxpwed.wav contains slight phonetic inaccuracies, leading to an incorrect score.

In [18]:
incorrect_nonword = "dxpwed.wav"
df[df.filename == incorrect_nonword]
Out[18]:
filename score task expected_text grade
11168 dxpwed.wav 0.0 nonword_repetition gowfdoikeem KG
In [19]:
analyze_audio(AUDIO_PATH / incorrect_nonword)
Plotting waveform...
Plotting spectrogram...
Performing Voice Activity Detection...

In contrast, the response in bnafxc.wav correctly reproduces the nonword “gowfdoikeem,” showcasing the child’s ability to retain and articulate unfamiliar sound patterns. This indicates strong phonological working memory.

In [20]:
correct_nonword = "bnafxc.wav"
df[df.filename == correct_nonword]
Out[20]:
filename score task expected_text grade
22051 bnafxc.wav 1.0 nonword_repetition gowfdoikeem KG
In [21]:
analyze_audio(AUDIO_PATH / correct_nonword)
Plotting waveform...
Plotting spectrogram...
Performing Voice Activity Detection...

Blending

Blending assesses the child’s ability to combine separate phonemes to form a complete word. In this example, the task asks the child to blend the sounds m - ou - se to produce “mouse.” The response in hvqvny.wav contains slight phonetic differences from the target, so it is scored as incorrect.

In [22]:
incorrect_blending = "hvqvny.wav"
df[df.filename == incorrect_blending]
Out[22]:
filename score task expected_text grade
12550 hvqvny.wav 0.0 blending mouse KG
In [23]:
analyze_audio(AUDIO_PATH / incorrect_blending)
Plotting waveform...
Plotting spectrogram...
Performing Voice Activity Detection...

The correct response in jkvyty.wav illustrates a successful blending of the same sounds m - ou - se into the target word “mouse.” This response demonstrates phonological awareness and the ability to process sounds accurately.

In [24]:
correct_blending = "jkvyty.wav"
df[df.filename == correct_blending]
Out[24]:
filename score task expected_text grade
34318 jkvyty.wav 1.0 blending mouse KG
In [25]:
analyze_audio(AUDIO_PATH / correct_blending)
Plotting waveform...
Plotting spectrogram...
Performing Voice Activity Detection...

Step 3: Feature Engineering

In this step, we will generate features that will be used by our model.

The eGeMAPS (Extended Geneva Minimalistic Acoustic Parameter Set) feature set is a tool for analyzing speech and is implemented through the openSMILE framework (open-source Speech and Music Interpretation by Large-space Extraction). This toolkit extracts 88 key acoustic features, focusing on elements like pitch, loudness, and pauses, which are critical for evaluating speech clarity and fluency.

eGeMAPS is compact yet effective, capturing important speech characteristics with relatively little computation. Its wide use in speech research makes it a strong fit for scoring literacy assessments. By leveraging openSMILE, we provide our model with detailed, meaningful input features.

In [26]:
# Initialize OpenSMILE with eGeMAPS configuration for extracting features
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,  # Use eGeMAPS for feature extraction
    feature_level=opensmile.FeatureLevel.Functionals,  # Extract summary statistics
)
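
To get a feel for what openSMILE returns, you can process a single clip first. The sketch below (not part of the original benchmark) uses hgxrel.wav from the labels shown earlier; with the Functionals feature level, the result should be a one-row DataFrame of 88 summary features:

# Preview the eGeMAPS functionals for a single clip
example_features = smile.process_file(AUDIO_PATH / "hgxrel.wav")
print(example_features.shape)              # expected: (1, 88)
print(list(example_features.columns[:5]))  # a few eGeMAPS feature names
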
In [27]:
# Extract features with tqdm progress bar
feature_list = []
for filename in tqdm(df.filename, desc="Extracting OpenSMILE Features", unit="file"):
    features = smile.process_file(AUDIO_PATH / filename)  # Extract features for each file
    feature_list.append(features.mean(axis=0))  # Take the mean across time for stability
Extracting OpenSMILE Features: 100%|█████████████████| 38095/38095 [37:02<00:00, 17.14file/s]
In [28]:
# Convert extracted features to DataFrame
X = pd.DataFrame(feature_list, index=df.filename)

# Set up target variable
y = df.score
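
Extracting features for all ~38,000 clips takes a while (about 37 minutes in the run above), so you may want to cache the feature matrix locally rather than recompute it each time you iterate. A minimal sketch using joblib, with a hypothetical cache path:

# Hypothetical cache path; adjust to match your setup
FEATURES_CACHE = DATA_PATH / "egemaps_features.joblib"

if not FEATURES_CACHE.exists():
    joblib.dump(X, FEATURES_CACHE)   # save the feature matrix built above
else:
    X = joblib.load(FEATURES_CACHE)  # reuse previously extracted features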

Step 4: Train-Test Split

We will split the data into training and testing sets to evaluate the model.

In [29]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")
Training set size: (30476, 88)
Test set size: (7619, 88)
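
If you want the local split to preserve the overall ratio of correct to incorrect responses, one optional variation (not used in the rest of this notebook) is to stratify on the label:

# Optional: stratified split so both sets keep the same correct/incorrect ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)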

Step 5: Train a Baseline Model

We will train a simple XGBoost classifier using the generated features and evaluate its performance.

In [30]:
# Initialize and train the XGBoost model
xgb_model = XGBClassifier(n_estimators=100, random_state=42, eval_metric="logloss")

calibrated_model = CalibratedClassifierCV(xgb_model, cv=3)
calibrated_model.fit(X_train, y_train)

# Make predictions and evaluate using log loss
y_pred_proba = calibrated_model.predict_proba(X_test)[
    :, 1
]  # Probability of 'correct' (class 1)
logloss = log_loss(y_test, y_pred_proba)
print(f"Log Loss on the test set: {logloss}")
Log Loss on the test set: 0.6104908164544623
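
The competition also reports AUROC alongside log loss (see the benchmark submission score below), so it can be worth tracking locally as well. A quick sketch:

from sklearn.metrics import roc_auc_score

auroc = roc_auc_score(y_test, y_pred_proba)
print(f"AUROC on the test set: {auroc:.4f}")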


Since this is a code execution competition, we will submit our model weights and code rather than predictions. Let's first train our model on the full training set (X) instead of the 80% split (X_train) we used for local iteration.

In [31]:
calibrated_model.fit(X, y)
Out[31]:
CalibratedClassifierCV(cv=3,
                       estimator=XGBClassifier(base_score=None, booster=None,
                                               callbacks=None,
                                               colsample_bylevel=None,
                                               colsample_bynode=None,
                                               colsample_bytree=None,
                                               device=None,
                                               early_stopping_rounds=None,
                                               enable_categorical=False,
                                               eval_metric='logloss',
                                               feature_types=None, gamma=None,
                                               grow_policy=None,
                                               importance_type=None,
                                               interaction_constraints=None,
                                               learning_rate=None, max_bin=None,
                                               max_cat_threshold=None,
                                               max_cat_to_onehot=None,
                                               max_delta_step=None,
                                               max_depth=None, max_leaves=None,
                                               min_child_weight=None,
                                               missing=nan,
                                               monotone_constraints=None,
                                               multi_strategy=None,
                                               n_estimators=100, n_jobs=None,
                                               num_parallel_tree=None,
                                               random_state=42, ...))

Step 6: Write main.py and save out assets for submission

Now that we have a trained model, we will package up our model and inference code for predicting on the test set in the runtime container. Predictions will be written out in the expected submission.csv format. For more details, see the Code Submission Format page.

Submission format requirements

  • Your submission must be a .zip archive (e.g., submission.zip) containing a main.py file at the root level.
  • The main.py script should:
    • Load your pretrained model (which should be included in your submission) and perform inference on the test audio clips.
    • Write predictions to a file named submission.csv in the same directory as main.py.

Competition runtime limits

Before preparing your submission, please ensure you understand the runtime environment and its constraints:

  • Your submission must run using Python 3.12 with packages specified in the runtime repository.
  • The submission must complete execution within 4 hours. Most submissions are expected to run much faster.
  • The runtime container provides access to a single GPU. All code must execute within the GPU environment, although computations may still occur on the CPU. (A CPU environment is available within the container for local debugging.)
  • The container has access to the following resources:
    • 16 vCPUs
    • 110GB RAM
    • A single NVIDIA T4 GPU with 16GB VRAM
  • The container does not have network access. All required files, including code and model assets, must be included in your submission.
  • The container execution will not have root access to the filesystem.

Please ensure your submission complies with these limits to guarantee successful execution.

Preparing our submission

Our first step is to save the calibrated XGBoost model we just trained on the entire training dataset. We'll save it as a .joblib file, though there are other formats you can use.

In [32]:
ASSETS_DIR = Path("assets")
ASSETS_DIR.mkdir(exist_ok=True)

joblib.dump(calibrated_model, ASSETS_DIR / "calibrated_model.joblib")
Out[32]:
['assets/calibrated_model.joblib']
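
Before writing main.py, it can be worth a quick sanity check that the saved asset reloads cleanly and produces probabilities (a minimal sketch, not part of the original benchmark):

# Reload the saved model and score a few rows of the training features
reloaded_model = joblib.load(ASSETS_DIR / "calibrated_model.joblib")
print(reloaded_model.predict_proba(X.head())[:, 1])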


Now we'll write out our main.py file. Below is an example of a properly formatted main.py file that runs our model pipeline:

import joblib
from pathlib import Path

import opensmile
import pandas as pd


DATA_PATH = Path("data")


def main():
    # load submission format
    sub_format = pd.read_csv(DATA_PATH / "submission_format.csv", index_col="filename")

    # initialize OpenSMILE with eGeMAPS configuration for extracting features
    smile = opensmile.Smile(
        feature_set=opensmile.FeatureSet.eGeMAPSv02,  # Use eGeMAPS for feature extraction
        feature_level=opensmile.FeatureLevel.Functionals  # Extract summary statistics
    )

    # create features
    feature_list = []
    for filename in sub_format.index:
        features = smile.process_file(DATA_PATH / filename)  # extract features for each file
        feature_list.append(features.mean(axis=0))  # take the mean across time for stability

    features = pd.DataFrame(feature_list, index=sub_format.index)

    # load model
    model = joblib.load("assets/calibrated_model.joblib")

    # make predictions
    preds = model.predict_proba(features)[:, 1]

    # write out to submission format
    sub_format["score"] = preds
    sub_format.to_csv("submission.csv")


if __name__ == "__main__":
    main()

Step 7: Test your submission locally

The competition's runtime repository provides a streamlined process for testing submissions locally using a Docker container that mimics the runtime environment. Since the repository's README has detailed instructions, we'll just provide a concise guide here.

  1. Clone the runtime repo:

    git clone https://github.com/drivendataorg/literacy-screening-runtime.git
    
  2. Download the official competition Docker image:

    make pull
    
  3. Set up your data directory:

    • Download the smoke test data from the data download page
    • Extract this into your data/ directory for local testing with:
      tar xzvf smoke.tar.gz --strip-components=1 -C data/
      
    • In the local container, /code_execution/data is a mounted version of your local data/ folder. The official runtime replaces this with the actual test data.
  4. Prepare your submission:

    • Save all submission files (e.g., main.py, model weights) in the submission_src folder.
    • Create the submission.zip file with:
      make pack-submission
      
  5. Run your submission locally against the smoke test data:

    • Test your submission in the runtime container:
      make test-submission
      
    • This runs your main.py script in the container and generates submission.csv in the submission/ folder.
    • Use the logs saved in submission/log.txt to debug any errors. These logs will help you identify issues with your code or the runtime environment.

Make sure your code runs smoothly locally before submitting to the competition platform!

Step 8: Submit to the competition!

Now that we've saved out our model and written our main.py, this is what our submission_src directory looks like:

❯ tree submission_src
submission_src
├── assets
│   └── calibrated_model.joblib
└── main.py


Let's generate our submission zipfile with make pack-submission:

❯ make pack-submission
mkdir -p submission/
cd submission_src; zip -r ../submission/submission.zip ./*
  adding: assets/ (stored 0%)
  adding: assets/calibrated_model.joblib (deflated 65%)
  adding: main.py (deflated 52%)


Now we're ready to submit our submission.zip to the competition! We have the option of submitting a normal submission or a smoke test. Smoke tests run your submission against a small subset of the training data for faster debugging. These tests won’t count for prizes but are helpful for identifying errors. You should run a smoke test submission before a normal submission to ensure your code executes properly.

[Screenshot of the competition submission page showing the options for a normal submission or a smoke test submission.]

Benchmark submission score

Our benchmark model achieved a log loss of 0.6063 and an AUROC of 0.7327. While there’s room for improvement, these results demonstrate the model’s potential for automating literacy task scoring and provide a strong foundation for further refinement.

Conclusion

This notebook provides a foundational pipeline for building a machine learning model for the Goodnight Moon, Hello Early Literacy Screening Challenge. It demonstrates how to preprocess the dataset, extract meaningful features, and train a benchmark model. While this is a strong starting point, there are many opportunities to improve performance by experimenting with advanced feature engineering, alternative model architectures, and optimized hyperparameters.

This challenge represents an exciting opportunity to make a tangible impact on early childhood education by helping automate and improve literacy assessments. We encourage you to iterate on this approach, explore innovative ideas, and push the boundaries of what machine learning can achieve in this domain.

If you want to share any of your findings or have questions, feel free to post on the community forum.

Good luck, and we’re excited to see how your models contribute to improving literacy outcomes!