by Meral Hacikamiloglu
Welcome! This guest post from our partners at the MIT Gabrieli Lab will guide you through building a simple baseline model for the Goodnight Moon, Hello Early Literacy Screening Challenge. The baseline predicts scores for literacy screening tasks using extracted audio features. For access to the data used in this benchmark notebook, sign up for the competition here.
This notebook will:
- Load the dataset
- Perform exploratory data analysis
- Create feature representations
- Split the data into train and test sets
- Train an XGBoost model
- Predict and evaluate locally
- Prepare model code and assets for submission
Background
Literacy skills are critical to a child’s success in school and beyond, yet a significant portion of students in the US are struggling with reading abilities. Early intervention is crucial, but the current approach to literacy screening in classrooms relies heavily on teachers administering and manually scoring assessments—a process that can be time-consuming and sometimes inconsistent due to variations in scorer training and interpretation.
Reach Every Reader has developed a comprehensive literacy screening assessment. This assessment includes tasks designed to measure key language skills, such as phonological awareness and working memory. Specifically, tasks like deletion, blending, nonword repetition, and sentence repetition capture critical aspects of early literacy development. While the information gathered from these tasks is invaluable, the manual scoring process limits its potential impact.
This competition invites participants to develop machine learning models that can automatically and accurately score these audio-based literacy tasks. By building reliable models, competitors can help ease the administrative load on teachers, increase scoring accuracy, and ensure more consistent support for students at risk.
Let's get started!
# General-purpose utilities
import joblib
from pathlib import Path
# Audio processing libraries
import librosa
import librosa.display
import opensmile
import webrtcvad
# Machine learning and data handling
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss
from tqdm import tqdm
import xgboost as xgb
from xgboost import XGBClassifier
# Visualization
import matplotlib.pyplot as plt
Step 1: Load the Data
Ensure that the train_labels.csv file is available in the data/ folder before running this step. In this setup, the audio files are in a subfolder of data/ called audio/. You can change the paths to match your setup.
DATA_PATH = Path("data")
AUDIO_PATH = DATA_PATH / "audio"
labels = pd.read_csv(DATA_PATH / "train_labels.csv")
print(f"Train labels shape: {labels.shape}")
labels.head()
metadata = pd.read_csv(DATA_PATH / "train_metadata.csv")
print(f"Train metadata shape: {metadata.shape}")
metadata.head()
We'll join these datasets together to help with our exploratory data analysis.
df = labels.merge(metadata, on="filename", validate="1:1")
print(f"df shape: {df.shape}")
df.head()
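Before diving into the audio itself, it is worth checking how the examples are distributed across tasks and labels. Here is a minimal sketch of that check, assuming the merged dataframe has task and score columns (adjust the column names to match your download):
# Column names assumed for illustration; adjust to your data
print(df["task"].value_counts())
print(df["score"].value_counts(normalize=True))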
Step 2: Exploratory Data Analysis
We will now explore the dataset and visualize some features.
def plot_waveform(filepath):
    # Load the audio file
    audio_data, sr = librosa.load(filepath, sr=None)
    # Plot the waveform
    plt.figure(figsize=(10, 4))
    librosa.display.waveshow(audio_data, sr=sr)
    plt.title("Waveform")
    plt.xlabel("Time (s)")
    plt.ylabel("Amplitude")
    plt.show()
    return audio_data, sr
def plot_spectrogram(audio_data, sr):
    # Generate the spectrogram
    S = librosa.stft(audio_data)
    S_db = librosa.amplitude_to_db(np.abs(S), ref=np.max)
    # Plot the spectrogram
    plt.figure(figsize=(10, 4))
    librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="log")
    plt.colorbar(format="%+2.0f dB")
    plt.title("Spectrogram")
    plt.xlabel("Time (s)")
    plt.ylabel("Frequency (Hz)")
    plt.show()
def voice_activity_detection(filepath, aggressiveness=2):
    vad = webrtcvad.Vad(aggressiveness)  # Aggressiveness from 0 to 3
    audio_data, sr = librosa.load(filepath, sr=16000)  # webrtcvad expects 8/16/32/48 kHz audio
    audio_data = (audio_data * 32767).astype(np.int16)  # Scale to int16 for VAD
    frame_duration = 30  # Frame duration in ms (webrtcvad accepts 10, 20, or 30 ms)
    frame_length = int(sr * frame_duration / 1000)
    # Collect VAD results, skipping any partial frame at the end
    # (webrtcvad only accepts complete frames)
    vad_results = []
    for start in range(0, len(audio_data) - frame_length + 1, frame_length):
        frame = audio_data[start : start + frame_length].tobytes()
        vad_results.append(vad.is_speech(frame, sr))
    # Plot VAD output
    time_axis = np.linspace(0, len(audio_data) / sr, num=len(vad_results))
    plt.figure(figsize=(10, 2))
    plt.plot(time_axis, vad_results, label="VAD Output")
    plt.title("Voice Activity Detection (VAD) Output")
    plt.xlabel("Time (s)")
    plt.ylabel("Speech Detected")
    plt.ylim(-0.1, 1.1)
    plt.show()
def analyze_audio(filepath):
    print("Plotting waveform...")
    audio_data, sr = plot_waveform(filepath)
    print("Plotting spectrogram...")
    plot_spectrogram(audio_data, sr)
    print("Performing Voice Activity Detection...")
    voice_activity_detection(filepath)
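Before looking at individual examples, it also helps to get a rough sense of how long the clips are. Below is a quick sketch that loads a small random sample of files and reports their durations (the sample size here is arbitrary):
# Rough duration check on a small random sample of training clips
sample_files = df["filename"].sample(20, random_state=0)
durations = []
for f in tqdm(sample_files):
    y, sr = librosa.load(AUDIO_PATH / f, sr=None)
    durations.append(len(y) / sr)
print(f"Mean duration: {np.mean(durations):.1f}s, max: {np.max(durations):.1f}s")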
Let’s take a closer look at each task and its selected examples. For each task, we analyze paired examples — one correct and one incorrect — to better understand how variations in responses manifest in the dataset.
Deletion
Deletion evaluates a child’s phonological awareness by asking them to listen to a word and then delete part of it to form a new, real word. In this example, the child is prompted with “haircut without cut,” and the expected response is “hair.” In the audio file gpksml.wav, the child responds incorrectly with “cut,” failing to produce the target word “hair.”
incorrect_deletion = "gpksml.wav"
df[df.filename == incorrect_deletion]
analyze_audio(AUDIO_PATH / incorrect_deletion)
In contrast, the audio file faudzc.wav demonstrates a correct response to the same task. The child successfully identifies and removes the portion “cut” from “haircut” to produce “hair,” indicating strong phonological awareness and the ability to modify spoken words accurately.
correct_deletion = "faudzc.wav"
df[df.filename == correct_deletion]
analyze_audio(AUDIO_PATH / correct_deletion)
Sentence Repetition
Sentence repetition assesses the child’s ability to replicate a sentence verbatim, maintaining both structure and meaning. In this case, the sentence is “ring the bell on the desk to get her attention.” The response in khlzie.wav changes the wording slightly, saying “ring the bell on her desk to get her attention.” While the meaning remains largely intact, the deviation from the original wording marks the response as incorrect.
incorrect_sentrep = "khlzie.wav"
df[df.filename == incorrect_sentrep]
analyze_audio(AUDIO_PATH / incorrect_sentrep)
The correct response in loqrbr.wav accurately replicates the entire sentence without any alterations, demonstrating the child’s strong linguistic processing and auditory memory skills.
correct_sentrep = "loqrbr.wav"
df[df.filename == correct_sentrep]
analyze_audio(AUDIO_PATH / correct_sentrep)
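If you want to browse more paired examples on your own, a small helper along these lines can pull one correct and one incorrect response for a given task. This is only a sketch: it assumes the merged dataframe has task and score columns, with a score of 1 marking a correct response, and the task label string below is a placeholder (check df["task"].unique() for the actual values).
def get_paired_examples(df, task_name):
    # Hypothetical helper: return the filename of one correct (score == 1)
    # and one incorrect (score == 0) response for the given task
    task_df = df[df["task"] == task_name]
    correct = task_df[task_df["score"] == 1]["filename"].iloc[0]
    incorrect = task_df[task_df["score"] == 0]["filename"].iloc[0]
    return correct, incorrect

# Task name is a placeholder; substitute a value from df["task"].unique()
correct_example, incorrect_example = get_paired_examples(df, "sentence_repetition")
analyze_audio(AUDIO_PATH / correct_example)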