Welcome to the reference implementation notebook for the On Top of Pasketti: Children's Speech Recognition Challenge - Phonetic Track! If you are just getting started, we recommend reading the competition webpage first.
The goal of this tutorial is to:
- Demonstrate how to load and explore the data.
- Provide a basic framework for building a model.
- Walk through how to package your work correctly for submission.
We will be fine-tuning Wav2Vec2, a pretrained speech representation model for automatic speech recognition ("ASR"), using the Hugging Face Transformers library. Wav2Vec2 converts raw audio into contextual acoustic representations through a convolutional feature extractor followed by a Transformer encoder.
For this challenge, we adapt Wav2Vec2 to predict phonetic units (phones) represented by the International Phonetic Alphabet (IPA) rather than words or characters. While Wav2Vec2 is commonly fine-tuned for grapheme- or word-level ASR, its learned acoustic representations can also support phone-level prediction. By training the model with an IPA phone vocabulary and using Connectionist Temporal Classification (CTC), we can learn to predict phone sequences directly from audio without requiring manual time alignment.
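To make the CTC decoding rule concrete, here is a minimal pure-Python sketch (with a hypothetical three-token vocabulary) of the greedy collapse CTC performs at inference time: take the most likely token per frame, merge consecutive repeats, then drop the blank token.

```python
def ctc_collapse(frame_ids, blank_id=0):
    """Collapse a per-frame CTC prediction: merge consecutive repeats, drop blanks."""
    out = []
    prev = None
    for i in frame_ids:
        if i != prev and i != blank_id:
            out.append(i)
        prev = i
    return out

# Hypothetical vocabulary; frames: blank, t, t, blank, a, a -> "ta"
vocab = {0: "<blank>", 1: "t", 2: "a"}
ids = ctc_collapse([0, 1, 1, 0, 2, 2])
print("".join(vocab[i] for i in ids))  # -> ta
```

Note that a blank between two identical tokens keeps them distinct, which is how CTC can represent repeated phones.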
You can either expand on and improve the method in this reference implementation, or start with something completely different! Let's get started.
Background¶
Spoken language is a natural way for kids to learn, explore, and show what they know, yet today's ASR technology hardly understands them. Most ASR systems are built on adult speech, and struggle with the pitch, rhythm, and evolving articulation of young learners.
The On Top of Pasketti: Children’s Speech Recognition Challenge assembles pre-existing and newly labeled datasets to advance speech models that truly work for children. Your goal in the Phonetic Track is to develop models that accurately predict the speech sounds, or phones, spoken by children in audio clips. Phonetic models are critical for diagnostic applications like speech pathology screening.
This is a code execution challenge! Rather than submitting your predicted labels, you will package your trained model and the prediction code and submit that for containerized execution. See the code submission format webpage and the runtime repository for more information.
If you'd like to rerun this notebook, the notebook file can be downloaded from the reference implementation repository. That repository also includes all code imported into the notebook.
Step 0: Import packages¶
First, create your environment. We use uv as the package manager in this reference implementation repository.
- Create an environment:
  just create-environment
- Activate the environment:
  source ./.venv/bin/activate
- Install the requirements found in the TOML file into the environment:
  just requirements
Remember, the runtime repository's TOML file lists the packages that will be available for running inference using model submissions.
We'll be using PyTorch and Hugging Face Transformers to build our model along with standard data science Python libraries to explore and prepare the data. Because this is a code execution challenge, we'll also be testing our solutions locally before packaging our model and inference code for submission. To help us with scoring, we've imported some utility functions from the competition's runtime repository.
# Standard library
from dataclasses import dataclass
import json
from pathlib import Path
import random
from typing import Dict, List, Union, Optional
# Data Science & Utilities
from IPython.display import display
from loguru import logger
import numpy as np
import pandas as pd
import tqdm
import tqdm.auto
# Visualization
from matplotlib import ticker
import matplotlib.pyplot as plt
# Core ML & Audio Stack
from datasets import Dataset, Audio, Features, Value, load_from_disk
import torch
from transformers import (
Wav2Vec2CTCTokenizer,
Wav2Vec2FeatureExtractor,
Wav2Vec2Processor,
Wav2Vec2ForCTC,
TrainingArguments,
Trainer,
)
# Project Utilities
from asr_benchmark.config import DATA_ROOT, PROJECT_ROOT
from asr_benchmark.score import VALID_IPA_CHARS, score_ipa_cer
pd.options.display.max_rows = 200
pd.options.display.max_colwidth = 1200
# Force 'auto' to use the standard console tqdm
tqdm.auto.tqdm = tqdm.tqdm
Step 1: Load and explore the data¶
First, you'll likely want to set up your own repository for developing a solution. We recommend using Cookiecutter Data Science, which ensures an easy-to-navigate project structure.
We'll download all of the competition data to our "raw" folder. There are two training corpora that share the same schema but contain different data and are hosted in separate locations for participant access: one is hosted on the DrivenData platform, and the other is provided by TalkBank. See the Data Download page for access instructions.
Our local data structure after downloading all files to a raw data folder is:
childrens-speech-recognition-benchmark-pub/data/raw
├── drivendata
│ ├── audio.zip
│ └── train_phon_transcripts.jsonl
└── talkbank
├── audio.zip
└── train_phon_transcripts.jsonl
After unzipping the audio, we can start exploring the data!
For each of the two corpora, the file train_phon_transcripts.jsonl contains the following fields:
- utterance_id (str) - unique identifier for each utterance
- child_id (str) - unique, anonymized identifier for the speaker
- session_id (str) - unique identifier for the recording session; a single child_id may be associated with multiple session_ids
- audio_path (str) - path to the corresponding .flac audio file relative to the /audio directory, following the pattern audio/{utterance_id}.flac
- audio_duration_sec (float) - duration of the audio clip in seconds
- age_bucket (str) - age range of the child at the time of recording ("3-4", "5-7", "8-11", "12+", or "unknown")
- md5_hash (str) - MD5 checksum of the audio file, used for integrity verification
- filesize_bytes (int) - size of the audio file in bytes
- phonetic_text (str) - phonetic transcription of the utterance using the International Phonetic Alphabet (IPA)
Each line in the JSONL manifest corresponds to a single utterance and references exactly one associated audio file. The phonetic_text field contains a manually created, minimally normalized phonetic transcription that serves as the training label.
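As a sketch of what one manifest line looks like (all field values below are hypothetical placeholders, not taken from the real data), each line is a standalone JSON object that can be parsed with the standard library:

```python
import json

# A hypothetical manifest line with the fields described above
line = (
    '{"utterance_id": "U_example", "child_id": "C_example", '
    '"session_id": "S_example", "audio_path": "audio/U_example.flac", '
    '"audio_duration_sec": 2.4, "age_bucket": "5-7", '
    '"md5_hash": "abc123", "filesize_bytes": 76800, '
    '"phonetic_text": "həˈloʊ"}'
)
record = json.loads(line)
print(record["utterance_id"], record["age_bucket"])  # -> U_example 5-7
```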
Let's explore the metadata!¶
We will load the JSONL transcripts and explore some of the metadata. As a starting point, it is helpful to know how many utterances we have, how many unique children are present, the total audio time, the distribution of audio clip durations, and the distribution of child ages.
def read_transcripts(data_dir: Path) -> pd.DataFrame:
"""Read JSONL transcript file into a DataFrame and convert audio paths to absolute paths."""
transcript_path = data_dir / "train_phon_transcripts.jsonl"
df = pd.read_json(transcript_path, lines=True)
logger.info(f"Loaded {len(df)} utterance transcripts")
df["audio_relpath"] = df["audio_path"]
df["audio_path"] = df["audio_relpath"].map(lambda p: str(data_dir / p))
return df
df_dd = read_transcripts(DATA_ROOT / "raw" / "drivendata")
df_tb = read_transcripts(DATA_ROOT / "raw" / "talkbank")
df = pd.concat([df_dd, df_tb], ignore_index=True)
df.drop(columns=["audio_path"]).head()
df.utterance_id.nunique()
df.child_id.nunique()
round(df.audio_duration_sec.sum() / (60**2))
There are over 153,000 utterances in the training dataset, across 1,003 children, totaling 85 hours of audio data. Next, let's take a look at the distribution of audio clip durations.
bins = list(range(0, 21)) + [np.inf]
labels = [str(i) for i in range(0, 20)] + ["20+"]
binned = pd.cut(df.audio_duration_sec, bins=bins, labels=labels, right=False)
counts = binned.value_counts().sort_index()
counts.plot(kind="bar")
plt.xlabel("Audio Duration (sec)")
plt.ylabel("Number of Audio Clips")
plt.title("Distribution of Audio Durations")
plt.xticks(rotation=90)
plt.show()
Most audio clips are very short (1-3 seconds). Even though the audio has been clipped to the utterance level, we have some outliers over 20 seconds. Next, let's look at the distribution of utterances by child age.
df["age_bucket"] = pd.Categorical(
df["age_bucket"], categories=["unknown", "3-4", "5-7", "8-11", "12+"], ordered=True
)
fig, ax = plt.subplots()
df["age_bucket"].value_counts(normalize=True, sort=False).plot.barh(ax=ax)
ax.set_title("Utterances by Age Group")
ax.set_xlabel("Percent of Utterances")
ax.set_ylabel("Age Group")
ax.xaxis.set_major_formatter(ticker.PercentFormatter(1.0))
ax.bar_label(ax.containers[0], fmt=lambda x: f"{x * 100:.0f}%")
plt.show()
All age buckets are well represented:
- 24% of the utterances come from 3 to 4 year olds
- 16% of the utterances come from 5 to 7 year olds
- 38% of the utterances come from 8 to 11 year olds
- 18% of the utterances come from 12 year olds and older
- the remaining ~4% of the utterances have an unknown age bucket
Let's explore the utterances!¶
We will listen to an example utterance and explore its phonetic transcription.
Children’s speech often includes subtle pronunciation differences when compared to adult speech. In the Phonetic Track, models must learn to map pronunciations that vary by age, development, and region to target labels that reflect the phones each child actually produced.
df[df.utterance_id == "U_1c8757065e355c35"][
["utterance_id", "audio_duration_sec", "phonetic_text"]
]
The ground truth phonetic_text labels are normalized phonetic transcriptions of individual utterances using the International Phonetic Alphabet (IPA), with a one-to-one mapping between Unicode characters and phones. Each transcription captures the full sequence of speech sounds in the corresponding audio clip and may include substitutions, omissions, or non-standard productions that are typically ignored in word-level ASR.
All phonetic labels are restricted to the predefined IPA character set used during phonetic transcription. This set is provided in the scoring script in the runtime repository for local validation of predictions.
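As a sketch, a transcription can be checked against the allowed character set before training or submission. The character set below is a tiny hypothetical stand-in for the real VALID_IPA_CHARS imported from the scoring script:

```python
# Hypothetical stand-in for VALID_IPA_CHARS from the scoring script
VALID_CHARS = set("ptkbdgmnszaeiou ə")

def invalid_chars(text: str) -> set:
    """Return any characters in a transcription outside the allowed set."""
    return set(text) - VALID_CHARS

print(invalid_chars("pəto"))  # empty set: all characters allowed
print(invalid_chars("p@to"))  # {'@'}: not in the allowed set
```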
Step 2: Build the Model¶
A straightforward modeling option is to start from a strong pretrained ASR model, then fine-tune on our labeled phonetic child-speech training set. In this tutorial, we fine-tune Facebook's pretrained Wav2Vec2-base model using Hugging Face Transformers.
Wav2Vec2 is relatively simple and efficient to fine-tune, making it a reasonable starting point for this challenge. It uses a convolutional neural network ("CNN") feature extractor followed by a transformer encoder to learn audio representations.
- A CNN is a neural network that learns useful features from data by applying small pattern detectors called filters. These filters scan across the input (such as an image or audio signal) and learn to recognize important patterns.
- Wav2Vec2 is pretrained on unlabeled speech audio and later fine-tuned for ASR using a defined output vocabulary, typically consisting of characters. For this challenge, we instead define a vocabulary of IPA phone symbols, mapping each phonetic symbol to a unique integer ID.
We will freeze the feature extractor to preserve the robust pre-trained audio processing capabilities, then fine-tune the transformer encoder and a newly initialized CTC head configured for our phonetic character vocabulary. Hugging Face makes this process easier by providing model architectures, data processing utilities (tokenizers, feature extractors, data collators), and integrated training pipelines.
Key packages include:
- transformers for the Wav2Vec2 model + training utilities
- datasets for data loading and preprocessing
- torch for the training backend
1. Prepare Dataset¶
We need to process our dataframe containing the DrivenData and TalkBank datasets. We filter out clips longer than 25 seconds, which would strain memory during training. Competitors may want to split these long clips instead to avoid losing training data. We also remove one corrupted file before limiting the data to just the phonetic transcription and the audio filepath.
# Filter down to audio less than 25 seconds to reduce strain on memory
df = df[(df.audio_duration_sec <= 25)]
# Filter out corrupted file
df = df[df.utterance_id != "U_b8a4e8220e65219b"]
# For now, we only need the transcript and the audio path
df = df[["phonetic_text", "audio_path"]]
We then create a Hugging Face Dataset from our dataframe, which gives us a standard format that works cleanly with dataset transforms and the training pipeline. We also cast the audio column to 16 kHz so clips are decoded and resampled to the sampling rate expected by Wav2Vec2.
# Audio sampling rate
SR = 16000
# Enforce string types so that datasets can consume them properly
schema = Features(
{
"phonetic_text": Value("string"),
"audio_path": Value("string"),
}
)
dataset = Dataset.from_pandas(df.reset_index(drop=True), features=schema)
dataset = dataset.cast_column("audio_path", Audio(sampling_rate=SR))
2. Build Vocabulary and Tokenizer¶
Our model treats each IPA character as one token, so it is important that every valid character has a consistent ID. To create a character-to-ID mapping, we take the VALID_IPA_CHARS set imported from the scoring script and map each IPA character to an index for the tokenizer.
Special tokens are added to the set of IPA characters:
- | replaces spaces as the word delimiter
- [UNK] is a fallback token that maps any character not in VALID_IPA_CHARS to a single token ID
- [PAD] is the padding token used to pad sequences to equal length in batches
This mapping is saved so we can reuse the same token IDs when we initialize the tokenizer now and when we load the model for inference later.
# VALID_IPA_CHARS contains the following IPA characters:
print(*VALID_IPA_CHARS)
unk_tok = "[UNK]"
pad_tok = "[PAD]"
space_tok = "|"
all_toks = sorted([char for char in VALID_IPA_CHARS if char != " "]) + [
unk_tok,
pad_tok,
space_tok,
]
vocab_dict = {char: idx for idx, char in enumerate(all_toks)}
vocab_path = DATA_ROOT / "vocab/phonetic_vocab.json"
vocab_path.parent.mkdir(parents=True, exist_ok=True)
with vocab_path.open("w") as f:
json.dump(vocab_dict, f)
Next, we initialize a tokenizer to convert text labels to token IDs. The tokenizer reads the vocabulary mapping we just saved, so each IPA character and special token uses the same ID during training and inference. We also initialize the feature extractor, which converts raw 16 kHz waveforms into model-ready input values for Wav2Vec2.
Finally, the Wav2Vec2Processor is initialized. The processor combines the feature extractor and tokenizer in one object, so we can consistently preprocess audio and encode/decode labels.
tokenizer = Wav2Vec2CTCTokenizer(
str(vocab_path), unk_token=unk_tok, pad_token=pad_tok, word_delimiter_token=space_tok
)
# Create Wav2Vec2 Feature Extractor
feature_extractor = Wav2Vec2FeatureExtractor(
feature_size=1,
sampling_rate=SR,
padding_value=0.0,
do_normalize=True,
return_attention_mask=False,
)
# Create processor (combines tokenizer and feature extractor)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
3. Create Data Collator¶
Data collators prepare training batches from variable-length sequences. In our model, since audio clips have different lengths, the data collator adds padding to shorter clips so all clips in a batch have the same length. It marks padded positions in the labels with -100, which tells the training algorithm to ignore those positions when calculating loss.
@dataclass
class DataCollatorCTCWithPadding:
"""
Data collator that will dynamically pad the inputs received.
"""
processor: Wav2Vec2Processor
padding: Union[bool, str] = True
max_length: Optional[int] = None
max_length_labels: Optional[int] = None
pad_to_multiple_of: Optional[int] = None
pad_to_multiple_of_labels: Optional[int] = None
def __call__(
self, features: List[Dict[str, Union[List[int], torch.Tensor]]]
) -> Dict[str, torch.Tensor]:
input_features = [
{"input_values": feature["input_values"]} for feature in features
]
label_features = [{"input_ids": feature["labels"]} for feature in features]
batch = self.processor.pad(
input_features,
padding=self.padding,
max_length=self.max_length,
pad_to_multiple_of=self.pad_to_multiple_of,
return_tensors="pt",
)
labels_batch = self.processor.tokenizer.pad(
label_features,
padding=self.padding,
max_length=self.max_length_labels,
pad_to_multiple_of=self.pad_to_multiple_of_labels,
return_tensors="pt",
)
# replace padding with -100 to ignore loss correctly
labels = labels_batch["input_ids"].masked_fill(
labels_batch.attention_mask.ne(1), -100
)
batch["labels"] = labels
return batch
# Initialize data collator
data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)
4. Preprocess Audio¶
We define a preprocessing function that extracts features and tokenizes the phonetic text labels using the Wav2Vec2Processor object we already created. The function replaces spaces with word delimiters before tokenization to prepare labels for training.
def preprocess_batch(examples):
# Start by loading the audio and processing with the feature extractor
processed_batch = {
"input_values": [
processor(item["array"], sampling_rate=SR).input_values[0]
for item in examples["audio_path"]
]
}
# Replace spaces with word delimiter and tokenize for CTC
processed_batch["labels"] = [
processor(text=ex.replace(" ", "|")).input_ids for ex in examples["phonetic_text"]
]
return processed_batch
We apply this preprocessing across the dataset in parallel and save the results to disk for faster development iteration.
PROCESSED_DATASET_DIR = DATA_ROOT / "processed" / "phonetic_dataset"
if PROCESSED_DATASET_DIR.exists():
processed_dataset = load_from_disk(str(PROCESSED_DATASET_DIR))
print(
f"Loaded preprocessed dataset from {PROCESSED_DATASET_DIR.relative_to(PROJECT_ROOT)} ({len(processed_dataset)} examples)"
)
else:
processed_dataset = dataset.map(preprocess_batch, batched=True, num_proc=4)
processed_dataset.save_to_disk(str(PROCESSED_DATASET_DIR))
print(
f"Preprocessed and saved dataset to {PROCESSED_DATASET_DIR.relative_to(PROJECT_ROOT)}"
)
5. Model Configuration and Training Setup¶
We load the pretrained Wav2Vec2-base model with a CTC architecture and initialize a new CTC head configured for our phonetic vocabulary. CTC enables direct audio-to-phone prediction without requiring explicit alignment between audio frames and individual characters. This simplifies our setup because we don't have to label exactly when each phone occurs in the audio, rather we just have to specify the sequence of phones.
The feature extractor is frozen to preserve audio processing learned from the model pretraining; we lock those weights and don't update them during training. We will only fine-tune the transformer encoder (which learns acoustic patterns) and the CTC head (which converts these patterns into phone predictions).
# Load pretrained Wav2Vec2 model
model = Wav2Vec2ForCTC.from_pretrained(
"facebook/wav2vec2-base",
ctc_loss_reduction="mean",
ctc_zero_infinity=True, # Replace inf CTC loss with 0 to prevent NaN gradients
pad_token_id=processor.tokenizer.pad_token_id,
ignore_mismatched_sizes=True,
vocab_size=len(processor.tokenizer),
)
# Freeze feature extractor layers
model.freeze_feature_encoder()
We then filter out a few utterances that violate CTC's length constraint: their transcripts contain more tokens than the model can emit for the audio's duration. Wav2Vec2 downsamples audio by 320x, so the output sequence length must exceed the label length to avoid infinite loss values.
# Filter out samples that violate the CTC constraint:
# Wav2Vec2 downsamples audio by 320x, so input_length // 320 must be > label_length.
# Samples violating this produce inf CTC loss -> NaN gradients.
WAV2VEC2_DOWNSAMPLE = 320
before_filter = len(processed_dataset)
def is_valid_ctc_sample(example):
input_len = len(example["input_values"])
label_len = len(example["labels"])
# CTC requires: output_timesteps > label_length (including blanks)
output_timesteps = input_len // WAV2VEC2_DOWNSAMPLE
return output_timesteps > label_len and label_len > 0 and input_len > 0
processed_dataset = processed_dataset.filter(is_valid_ctc_sample, num_proc=4)
print(
f"CTC filter: {before_filter} -> {len(processed_dataset)} samples ({before_filter - len(processed_dataset)} removed)"
)
The Trainer expects separate training and validation datasets, so we can simply split our processed_dataset such that 90% goes to training and 10% goes to validation.
# Split dataset into train and validation
dataset_split = processed_dataset.train_test_split(test_size=0.1, shuffle=True, seed=42)
train_dataset = dataset_split["train"]
eval_dataset = dataset_split["test"]
While the loss function drives the actual training and weight updates, the final metric upon which model inferences will be evaluated is Character Error Rate ("CER"). CER measures the edit distance between predicted and reference phonetic sequences at the character level. In training, whenever we compute validation loss, we also calculate the CER so that we can monitor training progress on a human-interpretable metric and select the best model checkpoint.
To compute CER, we run the model on the validation dataset to generate phone predictions, then compare them to the ground truth labels. The score_ipa_cer function is taken directly from the runtime repository. Please note that the score_ipa_cer function normalizes the prediction and reference text before computing the CER.
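Under the hood, CER is the character-level edit distance divided by the reference length. Here is a minimal pure-Python sketch of that calculation (this is not the actual score_ipa_cer, which also normalizes the text first):

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance via single-row dynamic programming."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,        # deletion
                dp[j - 1] + 1,    # insertion
                prev + (r != h),  # substitution (or match)
            )
    return dp[-1]

def cer(ref: str, hyp: str) -> float:
    return edit_distance(ref, hyp) / len(ref)

print(cer("kæt", "kat"))  # one substitution over three characters -> ~0.33
```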
def compute_metrics(pred):
"""Compute Character Error Rate (CER) for phonetic transcription."""
pred_logits = pred.predictions
pred_ids = np.argmax(pred_logits, axis=-1)
# Replace -100 with pad_token_id for decoding
pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id
# Decode predictions and labels
pred_str = processor.batch_decode(pred_ids)
# Don't group tokens when computing metrics (important for CTC)
label_str = processor.batch_decode(pred.label_ids, group_tokens=False)
return {"cer": score_ipa_cer(label_str, pred_str)}
Finally, the training hyperparameters are configured with reasonable starting points, such as:
- learning rate of 5e-5
- batch size of 27 with gradient accumulation
- 20 epochs with a linear warmup and decay learning rate schedule
- evaluation every 1000 steps
The Hugging Face Trainer handles the training loop, checkpointing, and evaluation.
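As a quick sanity check on these settings, the effective batch size and the number of optimizer steps per epoch work out as follows (the training-set size here is a hypothetical round number, not the actual count):

```python
per_device_batch = 27
grad_accum = 2
effective_batch = per_device_batch * grad_accum  # examples per optimizer step

n_train = 135_000  # hypothetical training-set size after the 90/10 split
steps_per_epoch = n_train // effective_batch
print(effective_batch, steps_per_epoch)  # -> 54 2500
```

With evaluation every 1000 steps, this schedule evaluates a couple of times per epoch at that dataset size.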
# Define training arguments
output_dir = str(PROJECT_ROOT / "models" / "wav2vec2-phonetic")
training_args = TrainingArguments(
output_dir=output_dir,
group_by_length=False,
per_device_train_batch_size=27,
per_device_eval_batch_size=27,
gradient_accumulation_steps=2,
max_grad_norm=1.0,
learning_rate=5e-5,
num_train_epochs=20,
weight_decay=0.01,
eval_strategy="steps",
eval_steps=1000,
save_steps=1000,
logging_steps=100,
warmup_steps=500, # Shorter warmup for fewer epochs
lr_scheduler_type="linear",
bf16=True,
fp16=False,
gradient_checkpointing=False,
dataloader_num_workers=8,
dataloader_pin_memory=True,
save_total_limit=2,
metric_for_best_model="cer",
greater_is_better=False,
load_best_model_at_end=True,
report_to="none",
)
# Initialize trainer
trainer = Trainer(
model=model,
data_collator=data_collator,
args=training_args,
compute_metrics=compute_metrics,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
processing_class=processor,
)
6. Train the Model¶
Let's fine-tune the pre-trained model to predict phones from child speech!
trainer.train()
7. Evaluate and Test Inference¶
Now it's time to assess how well our final fine-tuned model performs. We evaluate on the validation set to compute the overall CER, then run inference on random samples to inspect individual predictions.
# Evaluate on validation set
eval_results = trainer.evaluate()
print("Evaluation Results:")
print(f" CER: {eval_results['eval_cer']:.4f}")
print(f" Loss: {eval_results['eval_loss']:.4f}")
Our model results in a CER of 0.33 on the validation set.
# Run inference on a few samples
model.eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# Get a few random samples
num_samples = 5
sample_indices = random.sample(range(len(eval_dataset)), num_samples)
print("Sample predictions:")
print("=" * 80)
for idx in sample_indices:
sample = eval_dataset[idx]
# Prepare input
input_values = torch.tensor(sample["input_values"]).unsqueeze(0).to(device)
# Get prediction
with torch.no_grad():
logits = model(input_values).logits
pred_ids = torch.argmax(logits, dim=-1)
# Decode
pred_str = processor.decode(pred_ids[0])
label_str = processor.decode(sample["labels"], group_tokens=False)
print(f"\nSample {idx}:")
print(f" Ground truth: {label_str}")
print(f" Prediction: {pred_str}")
# Calculate CER for this sample
cer = score_ipa_cer([label_str], [pred_str])
print(f" CER: {cer:.4f}")
print("\n" + "=" * 80)
Finally, we save the model, processor, and training configuration to disk.
# Save model + processor for reuse on another machine
save_dir = PROJECT_ROOT / "models" / "wav2vec2-phonetic-final"
save_dir.mkdir(parents=True, exist_ok=True)
# Trainer saves model + config + tokenizer config if provided
trainer.save_model(str(save_dir))
# Save processor artifacts (tokenizer + feature extractor)
processor.save_pretrained(str(save_dir))
processor.feature_extractor.save_pretrained(str(save_dir))
torch.save(training_args, save_dir / "training_args.pt")
Step 3: Make your submission¶
Since this is a code execution competition, we will submit our model weights and inference code rather than predictions. The platform runs your main.py in a container, which must load your model and output the file submission/submission.jsonl. See the code submission format webpage for more information.
The general steps to follow:
- Develop inference code
- Test your submission locally
- Package submission
- Make a smoke test submission
- Once you have successfully debugged your submission, submit it for scoring on the full test set!
Develop Inference Code¶
We need to set up a repository with a main.py Python script that performs inference in the competition execution environment and writes our predictions to the required output file. During code execution, our submission will be unzipped in the cloud compute cluster and the container will run our main.py script.
Our code must write a JSON Lines (JSONL) file containing one prediction per utterance.
Each line must include:
- utterance_id
- phonetic_text: UTF-8, International Phonetic Alphabet (IPA) transcription of the utterance.
The submission should be written to ./submission/submission.jsonl relative to the working directory.
See more details in the code submission format webpage and in the example submission.
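The required output format can be sketched with the standard library as follows (the predictions below are hypothetical placeholders, not real model output):

```python
import json
from pathlib import Path

# Hypothetical predictions keyed by utterance_id
predictions = {
    "U_example_1": "həˈloʊ",
    "U_example_2": "kæt",
}

out_path = Path("submission/submission.jsonl")
out_path.parent.mkdir(parents=True, exist_ok=True)
with out_path.open("w", encoding="utf-8") as f:
    for utterance_id, phonetic_text in predictions.items():
        record = {"utterance_id": utterance_id, "phonetic_text": phonetic_text}
        # ensure_ascii=False keeps IPA characters as raw UTF-8
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Writing with ensure_ascii=False keeps the IPA symbols human-readable in the output file.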
In our main.py, we load the fine-tuned Wav2Vec2 model and processor from the saved checkpoint directory, then run inference on all test utterances in batches. The script reads audio file paths from the test manifest, processes the audio through the model, decodes the model outputs to phonetic predictions, and writes the predicted phonetic transcriptions to the submission file in the required format. We batch utterances to improve GPU memory efficiency during processing.
See phonetic_submission/main.py for the details.
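The batching step can be sketched generically: group the manifest rows into fixed-size chunks before running the model on each chunk (the batch size and IDs below are arbitrary illustrations, not the values used in main.py):

```python
def batched(items, batch_size):
    """Yield successive fixed-size chunks of a list (last chunk may be smaller)."""
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]

utterance_ids = [f"U_{i}" for i in range(10)]
chunks = list(batched(utterance_ids, batch_size=4))
print([len(c) for c in chunks])  # -> [4, 4, 2]
```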
Test Submission Locally¶
You should first and foremost test your submission locally. This is a great way to work out any bugs and ensure that your model performs inference successfully. See the runtime repository's README for further instructions.
This repository provides a useful justfile command to run the trained model on a few sample files.
test-phonetic:
uv run phonetic_submission/main.py models/wav2vec2-phonetic-final/ data-demo/phonetic/utterance_metadata.jsonl
Package Submission¶
Now we will package up our model and inference code into a zip file for predicting on the test set in the runtime container. This repository provides a justfile command to do this. The command creates a zip file combining the trained Wav2Vec2 model with /phonetic_submission/main.py.
pack-phonetic:
rm -f phonetic_submission.zip && \
(cd phonetic_submission && zip -r ../phonetic_submission.zip main.py) && \
(cd models && zip -r ../phonetic_submission.zip wav2vec2-phonetic-final/)
Make a Smoke Test Submission¶
We provide a "smoke test" environment that replicates the test inference runtime but runs only on a small set of audio files. In the smoke test runtime, data/ contains 3,000 audio files from the training set.
Let's submit our phonetic_submission.zip to a smoke test on the platform.

After hitting "Submit" you can see the job in the queue—it will progress from "Uploading" to "Pending" to "Starting" to "Running" to "Scoring":

Once your submission reaches "Completed", head on over to the "Submissions" tab to see your smoke test score.
Submit!¶
After you've made sure a smoke test submission runs without error, you're ready to submit the real deal! This fine-tuned Wav2Vec2 model achieves a CER of 0.3460 on the full test set.

We encourage you to be mindful of the submission limit (at most 3 submissions every 7 days) and of others' code jobs. Canceled jobs do not count against the submission limit.
If you want to share any of your findings or have questions, feel free to post on the community forum.
Good luck!