
Goodnight Moon, Hello Early Literacy Screening Benchmark


by Meral Hacikamiloglu

Welcome! This guest post from our partners at the MIT Gabrieli Lab will guide you through building a simple baseline model for the Goodnight Moon, Hello Early Literacy Screening Challenge. The benchmark model predicts scores for literacy screening tasks using features extracted from the audio recordings. For access to the data used in this benchmark notebook, sign up for the competition here.

This notebook will:

  • Load the dataset
  • Perform exploratory data analysis
  • Create feature representations
  • Split the data into train and test sets
  • Train an XGBoost model
  • Predict and evaluate locally
  • Prepare model code and assets for submission

Background

Literacy skills are critical to a child’s success in school and beyond, yet a significant portion of students in the US struggle with reading. Early intervention is crucial, but the current approach to literacy screening in classrooms relies heavily on teachers administering and manually scoring assessments, a process that can be time-consuming and sometimes inconsistent due to variations in scorer training and interpretation.

Reach Every Reader has developed a comprehensive literacy screening assessment. This assessment includes tasks designed to measure key language skills, such as phonological awareness and working memory. Specifically, tasks like deletion, blending, nonword repetition, and sentence repetition capture critical aspects of early literacy development. While the information gathered from these tasks is invaluable, the manual scoring process limits its potential impact.

This competition invites participants to develop machine learning models that can automatically and accurately score these audio-based literacy tasks. By building reliable models, competitors can help ease administrative load on teachers, increase scoring accuracy, and ensure more consistent support for students at risk.

Let's get started!

In [1]:
# Standard library
from pathlib import Path

# Model persistence
import joblib

# Audio processing libraries
import librosa
import librosa.display
import opensmile
import webrtcvad

# Machine learning and data handling
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss
from tqdm import tqdm
import xgboost as xgb
from xgboost import XGBClassifier

# Visualization
import matplotlib.pyplot as plt

Step 1: Load the Data

Ensure that the train_labels.csv file is available in the data/ folder before running this step. In this setup, the audio files are in an audio/ subfolder of data/. You can change the paths to match your setup.

In [2]:
DATA_PATH = Path("data")
AUDIO_PATH = DATA_PATH / "audio"
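
If you want to confirm that the paths match your local setup before loading anything, a quick sanity check like the following can help (this assumes the .wav files live under data/audio/ as described above):

# Sanity check: the label file and audio folder should exist under data/
assert (DATA_PATH / "train_labels.csv").exists(), "train_labels.csv not found in data/"
print(f"Found {len(list(AUDIO_PATH.glob('*.wav')))} .wav files in {AUDIO_PATH}")
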
In [3]:
labels = pd.read_csv(DATA_PATH / "train_labels.csv")
print(f"Train labels shape: {labels.shape}")
labels.head()
Train labels shape: (38095, 2)
Out[3]:
filename score
0 hgxrel.wav 0.0
1 ltbona.wav 0.0
2 bfaiol.wav 1.0
3 ktvyww.wav 1.0
4 htfbnp.wav 1.0
In [4]:
metadata = pd.read_csv(DATA_PATH / "train_metadata.csv")
print(f"Train metadata shape: {metadata.shape}")
metadata.head()
Train metadata shape: (38095, 4)
Out[4]:
filename task expected_text grade
0 hgxrel.wav deletion old KG
1 ltbona.wav sentence_repetition he wouldnt go with his sister because he was t... KG
2 bfaiol.wav nonword_repetition chav KG
3 ktvyww.wav sentence_repetition ring the bell on the desk to get her attention 2
4 htfbnp.wav blending kite KG

We'll join these datasets together to help with our exploratory data analysis.

In [5]:
df = labels.merge(metadata, on="filename", validate="1:1")
print(f"df shape: {df.shape}")
df.head()
df shape: (38095, 5)
Out[5]:
filename score task expected_text grade
0 hgxrel.wav 0.0 deletion old KG
1 ltbona.wav 0.0 sentence_repetition he wouldnt go with his sister because he was t... KG
2 bfaiol.wav 1.0 nonword_repetition chav KG
3 ktvyww.wav 1.0 sentence_repetition ring the bell on the desk to get her attention 2
4 htfbnp.wav 1.0 blending kite KG

Step 2: Exploratory Data Analysis

We will now explore the dataset and visualize some features.
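
Before plotting individual audio files, it can help to see how the scores break down by task and grade. For example, a quick check on the merged df might look like this:

# Share of responses scored as correct, by task
print(df.groupby("task")["score"].mean().round(3))

# Number of recordings per grade
print(df["grade"].value_counts())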

In [6]:
def plot_waveform(filepath):
    # Load the audio file
    audio_data, sr = librosa.load(filepath, sr=None)

    # Plot the waveform
    plt.figure(figsize=(10, 4))
    librosa.display.waveshow(audio_data, sr=sr)
    plt.title("Waveform")
    plt.xlabel("Time (s)")
    plt.ylabel("Amplitude")
    plt.show()

    return audio_data, sr
In [7]:
def plot_spectrogram(audio_data, sr):
    # Generate the spectrogram
    S = librosa.stft(audio_data)
    S_db = librosa.amplitude_to_db(np.abs(S), ref=np.max)

    # Plot the spectrogram
    plt.figure(figsize=(10, 4))
    librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="log")
    plt.colorbar(format="%+2.0f dB")
    plt.title("Spectrogram")
    plt.xlabel("Time (s)")
    plt.ylabel("Frequency (Hz)")
    plt.show()
In [8]:
def voice_activity_detection(filepath, aggressiveness=2):
    vad = webrtcvad.Vad(aggressiveness)  # Aggressiveness from 0 to 3
    audio_data, sr = librosa.load(filepath, sr=16000)  # VAD prefers 16kHz audio
    audio_data = (audio_data * 32767).astype(np.int16)  # Scale to int16 for VAD

    frame_duration = 30  # Frame duration in ms
    frame_length = int(sr * frame_duration / 1000)

    # Collect VAD results
    vad_results = []
    for start in range(0, len(audio_data), frame_length):
        frame = audio_data[start : start + frame_length]
        if len(frame) < frame_length:
            break  # webrtcvad only accepts complete 10/20/30 ms frames, so skip the trailing partial frame
        vad_results.append(vad.is_speech(frame.tobytes(), sr))

    # Plot VAD output
    time_axis = np.linspace(0, len(audio_data) / sr, num=len(vad_results))
    plt.figure(figsize=(10, 2))
    plt.plot(time_axis, vad_results, label="VAD Output")
    plt.title("Voice Activity Detection (VAD) Output")
    plt.xlabel("Time (s)")
    plt.ylabel("Speech Detected")
    plt.ylim(-0.1, 1.1)
    plt.show()
In [9]:
def analyze_audio(filepath):
    print("Plotting waveform...")
    audio_data, sr = plot_waveform(filepath)

    print("Plotting spectrogram...")
    plot_spectrogram(audio_data, sr)

    print("Performing Voice Activity Detection...")
    voice_activity_detection(filepath)
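
The plots above are handy for eyeballing individual clips. If you later want a single number to feed into a model, one simple option is the fraction of frames that the VAD flags as speech. Here is a minimal sketch reusing the same VAD setup (illustrative only, not the benchmark's final feature set):

def speech_fraction(filepath, aggressiveness=2, frame_duration=30):
    """Fraction of frames that webrtcvad flags as speech."""
    vad = webrtcvad.Vad(aggressiveness)
    audio_data, sr = librosa.load(filepath, sr=16000)
    audio_data = (audio_data * 32767).astype(np.int16)
    frame_length = int(sr * frame_duration / 1000)

    flags = []
    for start in range(0, len(audio_data), frame_length):
        frame = audio_data[start : start + frame_length]
        if len(frame) < frame_length:
            break  # webrtcvad only accepts complete 10/20/30 ms frames
        flags.append(vad.is_speech(frame.tobytes(), sr))
    return float(np.mean(flags)) if flags else 0.0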

Let’s take a closer look at each task using selected examples. For each task, we analyze a pair of examples, one correct and one incorrect, to better understand how variations in responses show up in the dataset.

Deletion

Deletion evaluates a child’s phonological awareness by asking them to listen to a word and then delete part of it to form a new word. In this example, the child is prompted with “haircut without cut,” and the expected response is “hair.” In the audio file gpksml.wav, the child responds incorrectly with “cut,” failing to produce the target word “hair.”

In [10]:
incorrect_deletion = "gpksml.wav"

df[df.filename == incorrect_deletion]
Out[10]:
filename score task expected_text grade
9468 gpksml.wav 0.0 deletion hair KG
In [11]:
analyze_audio(AUDIO_PATH / incorrect_deletion)
Plotting waveform...
Plotting spectrogram...
Performing Voice Activity Detection...
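
The plots give a visual sense of the recording, but it can be just as informative to listen to a clip directly. In a Jupyter environment, one way to do this inline is IPython's Audio widget, for example:

from IPython.display import Audio

# Play the incorrect deletion example inline
Audio(filename=str(AUDIO_PATH / incorrect_deletion))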

In contrast, the audio file faudzc.wav demonstrates a correct response to the same task. The child successfully identifies and removes the portion “cut” from “haircut” to produce “hair,” indicating strong phonological awareness and the ability to modify spoken words accurately.

In [12]:
correct_deletion = "faudzc.wav"

df[df.filename == correct_deletion]
Out[12]:
filename score task expected_text grade
26274 faudzc.wav 1.0 deletion hair KG
In [13]:
analyze_audio(AUDIO_PATH / correct_deletion)
Plotting waveform...
Plotting spectrogram...
Performing Voice Activity Detection...

Sentence Repetition

Sentence repetition assesses the child’s ability to repeat a sentence verbatim, maintaining both structure and meaning. In this case, the sentence is “ring the bell on the desk to get her attention.” The response in khlzie.wav modifies the wording slightly, saying “ring the bell on her desk to get her attention.” While the meaning remains largely intact, the deviation from the original wording marks the response as incorrect.

In [14]:
incorrect_sentrep = "khlzie.wav"
df[df.filename == incorrect_sentrep]
Out[14]:
filename score task expected_text grade
20806 khlzie.wav 0.0 sentence_repetition ring the bell on the desk to get her attention 2
In [15]:
analyze_audio(AUDIO_PATH / incorrect_sentrep)
Plotting waveform...
Plotting spectrogram...
Performing Voice Activity Detection...

The correct response in loqrbr.wav accurately replicates the entire sentence without any alterations, demonstrating the child’s strong linguistic processing and auditory memory skills.

In [16]:
correct_sentrep = "loqrbr.wav"
df[df.filename == correct_sentrep]
Out[16]:
filename score task expected_text grade
35798 loqrbr.wav 1.0 sentence_repetition ring the bell on the desk to get her attention 2
In [17]:
analyze_audio(AUDIO_PATH / correct_sentrep)
Plotting waveform...
Plotting spectrogram...