
Open-source packages for using speech data in ML

Overview of key open-source packages for extracting features from voice data to support ML applications

Katie Wetstone
Senior Data Scientist

Unlike image processing, which has a well-established ecosystem, working with audio data often feels like navigating uncharted territory, especially when developing acoustic features for machine learning.

Image data are widely used in machine learning to support everything from wildlife research to cancer detection. These applications are all enabled by a strong ecosystem of open-source Python packages for working with image data. Packages like rasterio and pydicom make it possible for data scientists to contribute without becoming experts in satellites or medical imagery.

The possibilities for using audio data are similarly vast, but the open-source ecosystem for loading data and extracting features for machine learning is not as well developed. Recently, we at DrivenData worked with speech data to study Alzheimer's detection and student-teacher relationship building in the classroom. We found the available tools hard to navigate without subject matter expertise, and want to share what we learned.

In this post, we provide an overview of open-source Python packages for extracting features from speech audio data. If you want to use voice data for machine learning, but don't know the difference between the glottis and the larynx, this is for you! We hope to help other data scientists unlock the potential of acoustic data without becoming otolaryngologists.

TLDR

So what package should I use?

Based on what is available as of April 2025, we generally recommend openSMILE for extracting traditional acoustic features because it is easy to use and has good coverage. We recommend transformers for creating vector embeddings if model explainability and interpretation are not as important. If you are studying music rather than speech, your best bet may be librosa.

The packages covered in this post are:

  • openSMILE: Easily extracts a comprehensive set of features
  • librosa: Well-maintained package but more focused on music
  • transformers: Generates highly predictive embeddings using pre-trained models
  • senselab: Promising package under development with broad coverage of speech processing tasks
  • parselmouth: Incorporates significant subject matter expertise and coverage, but difficult to use in Python
  • Honorable mentions: torchaudio, DisVoice, and Speech_Analysis

Note that we focus only on packages for speech data feature extraction, rather than packages for working with audio data more generally.

Audio data basics

What does audio data actually look like?

Sound is generated when something vibrates and causes pressure variations in the air around us. In the case of human speech, sound is generated by the vibration of the vocal cords. Sound waves can be visually represented as amplitude over time. When we load audio data into Python, what we generally get is an array of amplitude values at different points in time. From this, we can derive things like frequency.

The audio data that we get in Python doesn't just depend on the sound itself, it also depends on the sampling rate used to create the recording and to load in the data. Sampling rate is the number of recorded values per unit of time. For example, a recording with a sampling rate of 32 kHz (kilohertz) has 32,000 samples per second. A higher sampling rate better approximates the actual real-world sound wave. We can load in sound at its original sampling rate, or downsample to a lower rate.
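
To make this concrete, here is a minimal sketch using librosa (covered in more detail below) to load a recording at its native sampling rate and again downsampled to 16 kHz; the filename is a placeholder. Note that librosa.load resamples to 22,050 Hz by default unless you pass sr=None or an explicit rate.

import librosa

# Keep the file's original sampling rate
audio_array, native_sr = librosa.load("audio_file.wav", sr = None)

# Downsample to 16 kHz, a common rate for speech models
audio_16k, sr_16k = librosa.load("audio_file.wav", sr = 16000)

print(native_sr, audio_array.shape)  # e.g. 44100, (n_samples,)
print(sr_16k, audio_16k.shape)       # 16000, with proportionally fewer samples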

Voice analysis crash course

By voice analysis, we mean studying not the actual words someone says, but how their body physically produces those sounds. This is also called phonetics or acoustic phonetics. Below is a quick intro to some key terms that are useful to know when navigating the options for extracting features from voice data.

  • Fundamental frequency (F0): Frequency that vocal folds vibrate to make sound. This relates to perceived pitch, or how high or low a voice sounds.
    • Jitter: Variation in fundamental frequency.
  • Intensity: Strength or amplitude of sound waves, which influences the perception of volume and emphasis. Also sometimes called activation.
    • Shimmer: Variation in signal amplitude.
  • Mel-frequency cepstrum coefficients (MFCCs): MFCCs help identify the different frequencies that combine to construct a more complex sound. They generally describe the "timbre" of someone's voice. MFCCs are calculated by performing a Fourier decomposition to break the waveform down into simpler frequency components, and then creating a "heatmap" that shows the contribution of each frequency at a given time.
  • Zero-crossing rate: A measure of how quickly a voice signal oscillates, which is perceived as "smoothness".
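
To make a couple of these terms concrete, here is a rough numpy sketch of zero-crossing rate and RMS intensity on a synthetic sine wave. This is only an illustration of the definitions, not how the packages below compute them.

import numpy as np

def zero_crossing_rate(signal):
    """Fraction of consecutive samples where the waveform changes sign."""
    signs = np.sign(signal)
    return float(np.mean(signs[:-1] != signs[1:]))

def rms_intensity(signal):
    """Root-mean-square amplitude, a simple proxy for intensity."""
    return float(np.sqrt(np.mean(signal ** 2)))

# Toy example: one second of a 440 Hz sine wave sampled at 16 kHz
sampling_rate = 16000
t = np.arange(0, 1.0, 1 / sampling_rate)
signal = 0.5 * np.sin(2 * np.pi * 440 * t)

zero_crossing_rate(signal)  # ~0.055, i.e. about 2 * 440 / 16000 sign changes per sample
rms_intensity(signal)       # ~0.354, i.e. 0.5 / sqrt(2)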

Now that we all know a little bit about speech waveforms, back to Python!

Open-source packages

While some of the packages below overlap with tools for upstream tasks like diarization and speech recognition, this list focuses on extracting features from speech that are useful for machine learning.

openSMILE

openSMILE is a Pythonic, easy-to-use toolkit that can generate useful sets of features. It is focused on analyzing both speech and music. Overall, we recommend openSMILE for general ML applications. Our only caution is that as of now it may not be regularly maintained.

Strengths:

  • Very easy to use and well documented. You can extract features with only a few lines of code.
  • Comes with reasonable defaults so you don't have to make decisions that require subject matter knowledge
  • Extracts a large number of features that are supported by the literature. openSMILE has a few feature sets it can extract, including the fairly comprehensive eGeMAPS (extended Geneva Minimalistic Acoustic Parameter Set for Voice Research and Affective Computing).

Weaknesses:

  • There is limited ability to customize how features are extracted.
  • The most recent commit was in 2023, so it's not clear whether the package is being continually maintained.
  • openSMILE is not available under an open license for all use cases. It can be used freely for private, research, and educational purposes, but not for commercial products.

Example code to extract the eGeMAPSv02 set of features:

import opensmile

smile = opensmile.Smile(
    feature_set = opensmile.FeatureSet.eGeMAPSv02,
    feature_level = opensmile.FeatureLevel.LowLevelDescriptors,
)
features_df = smile.process_file("audio_file.wav")

features_df.head()
file start end Loudness_sma3 F0semitoneFrom27.5Hz_sma3nz ... mfcc1_sma3 jitterLocal_sma3nz shimmerLocaldB_sma3nz
0 audio_file.wav 00:00:00.00 00:00:00.02 0.07 42.09 ... 20.96 0.00 1.91
1 audio_file.wav 00:00:00.01 00:00:00.03 0.08 42.17 ... 24.83 0.03 1.90
2 audio_file.wav 00:00:00.02 00:00:00.04 0.09 42.33 ... 27.19 0.03 1.09
3 audio_file.wav 00:00:00.03 00:00:00.05 0.11 42.52 ... 28.71 0.02 1.24
4 audio_file.wav 00:00:00.04 00:00:00.06 0.12 42.73 ... 29.48 0.02 1.04
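
The LowLevelDescriptors level returns one row per short time window. If you instead want a single row of summary statistics per recording, openSMILE also provides a Functionals feature level. A minimal sketch building on the example above:

# Functionals aggregate the low-level descriptors into one row per file
smile_functionals = opensmile.Smile(
    feature_set = opensmile.FeatureSet.eGeMAPSv02,
    feature_level = opensmile.FeatureLevel.Functionals,
)
summary_df = smile_functionals.process_file("audio_file.wav")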

librosa

librosa is a widely used and well-maintained open-source package for audio analysis. It is geared more towards music than speech, so it extracts a smaller set of speech-relevant features than tools like openSMILE and Parselmouth.

Strengths:

  • Well-maintained, well-documented, and relatively easy to use
  • Comes with reasonable defaults, while providing more customizability than openSMILE
  • Provides useful IO functionality for loading and working with audio files as numpy arrays

Weaknesses:

  • Lower coverage of features relevant to studying speech. Of the traditional acoustic features described above, it covers fundamental frequency, MFCCs, and zero-crossing rate.
  • openSMILE extracts a set of features all at once. In librosa, different features must be extracted separately and then aligned.

Example code to load audio and calculate fundamental frequency and MFCCs:

import librosa

# Load in sound as a numpy array
audio_array, sampling_rate_hz = librosa.load("audio_file.wav")

# Calculate fundamental frequency
# This returns an array with F0s at multiple time points
f0 = librosa.yin(
    audio_array,
    fmin = librosa.note_to_hz("C2"),
    fmax = librosa.note_to_hz("C7"),
    sr = sampling_rate_hz,
)
f0
>> array([2205., 2205., 2205., ..., 2205., 2205., 2205.])

# Extract first 3 MFCCs
mfccs = librosa.feature.mfcc(
    y = audio_array,
    sr = sampling_rate_hz,
    n_mfcc = 3
)
mfccs.shape
>> (3, 141320)

mfccs
>> array([[-491., -491., -491., ..., -491., -491., -491.],
          [   0.,    0.,    0., ...,    0.,    0.,    0.],
          [   0.,    0.,    0., ...,    0.,    0.,    0.]])

The code above uses the yin function to extract fundamental frequency limited to the range between C2 and C7. The pyin function is a very similar alternative, but returns the probability that each frame is voiced in addition to fundamental frequency. The librosa.feature module includes functionality for other relevant spectral features like MFCCs and zero crossing rate.
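
For example, a short sketch continuing from the audio_array and sampling_rate_hz variables above:

# Fundamental frequency plus per-frame voicing probabilities
f0_pyin, voiced_flag, voiced_prob = librosa.pyin(
    audio_array,
    fmin = librosa.note_to_hz("C2"),
    fmax = librosa.note_to_hz("C7"),
    sr = sampling_rate_hz,
)

# Zero-crossing rate for each frame, with shape (1, n_frames)
zcr = librosa.feature.zero_crossing_rate(audio_array)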

Transformers

More recently, people have experimented with using pre-trained transformer models to extract useful embeddings of audio data. Traditional acoustic features like pitch or MFCCs are manually engineered based on our understanding of speech. The transformers approach instead relies on algorithms to identify what features are useful for downstream modeling tasks.

The way this often works is to take only the encoder from a transformer's encoder-decoder architecture. A few popular open-source models are Whisper, wav2vec 2.0, and HuBERT. These models can be implemented using Hugging Face's transformers package.

Strengths:

  • Uses advanced machine learning to gain predictive power, and can often outperform prediction with traditional acoustic feature sets
  • Benefits from sophisticated pre-trained models that have already learned from lots and lots (and lots) of speech data
  • Also provides functionality for other parts of the speech processing pipeline, like speech recognition and diarization

Weaknesses:

  • Traditional acoustic features were developed to describe physical phenomena of speech (e.g., vocal fold movement). The embeddings generated by transformer models are often harder to connect back to real-world interpretations, making models harder to explain.

Example code to extract features from an audio recording using a Whisper model:

import librosa
import torch
from transformers import WhisperProcessor, WhisperModel
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load pre-trained model
model_checkpoint = 'openai/whisper-tiny'
processor = WhisperProcessor.from_pretrained(model_checkpoint)
model = WhisperModel.from_pretrained(model_checkpoint).to(device)

# Load and preprocess the audio
audio_array, sampling_rate = librosa.load("audio_file.wav", sr = 16000)
inputs = processor(
    audio_array,
    sampling_rate = sampling_rate,
    return_tensors = "pt", # return PyTorch tensors
    return_attention_mask = True,
)

# Extract features
# Disable gradient calculation since we are only doing inference
with torch.no_grad():
    encoder_outputs = model.encoder(
        inputs.input_features.to(device),
        attention_mask = inputs.attention_mask.to(device)
    )
    # Get the last hidden state from the encoder
    features = encoder_outputs.last_hidden_state

features.shape
>> torch.Size([1, 1500, 384])

features
>> tensor([[[ 0.2267, 0.1068, ..., 0.1747, 0.0353],
          [-0.5316, -1.1018, ..., 0.4451, 0.5680],
          ...,
          [-0.3217, -0.6013, ..., 0.3448, 0.2963],
          [ 0.2157, -0.1431, ..., -0.3734, -0.1372]]])

The code above returns features based on the last hidden state of the model encoder. Credit to DrivenData user avarshn for sharing the tips above in an excellent community code post!
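
The encoder output has one 384-dimensional vector per time frame (1,500 frames for whisper-tiny). Many downstream models expect a single fixed-length vector per recording; one common approach, though not the only one, is to mean-pool over the time dimension:

# Average over the time dimension to get one embedding per recording
pooled = features.mean(dim = 1)
pooled.shape
>> torch.Size([1, 384])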

senselab

senselab is still under development, but is a promising tool for many components of speech processing, from feature extraction to diarization to speech-to-text.

Strengths:

  • Connects to multiple other high-profile tools, and can generate features from openSMILE, parselmouth, and torchaudio in one command
  • One-stop shop for end-to-end speech processing, including upstream tasks like diarization, data augmentation, and speech-to-text, as well as downstream tasks like emotion recognition
  • Comes with a variety of useful tutorials to demonstrate different use cases

Weaknesses:

  • Currently under development, so may change significantly. This means that if you write a pipeline that relies on senselab, there's a higher risk of that pipeline breaking on future updates
  • Does not currently support macOS x86-64 due to a dependency on PyTorch 2.2.2+ (it does support macOS ARM64)

Parselmouth

Parselmouth is a Python wrapper for Praat, one of the leading pieces of software in speech analysis. Because of the complexity of Praat, the Python package is fairly difficult to work with. We only recommend using Parselmouth for highly specific use cases, and if you already have some subject matter expertise on hand and know what you want to do with the Praat software.

Strengths:

  • Highly customizable
  • Created by academic experts in phonetic science
  • Comprehensively covers all traditional acoustic features, and more

Weaknesses:

  • The documentation and code are not maintained like a standard open-source library and are extremely hard to navigate. Praat is still primarily a desktop application, and it is not as easy to interact with via Python.
  • Does not come with reasonable defaults
  • It is difficult, and sometimes not possible, to align different features to the same time scale.

Example code to extract pitch (fundamental frequency) and intensity:

import parselmouth

# Load the sound
sound = parselmouth.Sound("audio_file.wav")

pitch = sound.to_pitch(
    time_step = 0.01, pitch_floor = 80, pitch_ceiling = 500
)
# Pitch values and time of values
pitch.selected_array["frequency"], pitch.xs()
>> (array([0., 0., ..., 0., 0.]),
    array([0.02, 0.03, ..., 3281.41, 3281.42]))

intensity = sound.to_intensity(
    time_step = 0.01, minimum_pitch = 80
)
# Intensity values and time of values
intensity.values, intensity.xs()
>> (array([[-300., -300., ..., -300., -300.]]),
    array([0.04, 0.05, ..., 3281.39, 3281.40]))

The above code extracts pitch limited to the range of 80 to 500 Hz and intensity with a minimum pitch of 80 Hz, both at a time step of 0.01 seconds. Note that the time intervals do not always align between pitch and intensity.
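
If you do need both series on a common time grid, one workaround (not part of Parselmouth itself) is to interpolate one onto the other's timestamps, for example with numpy:

import numpy as np

# Interpolate intensity onto the pitch timestamps
pitch_times = pitch.xs()
intensity_at_pitch_times = np.interp(
    pitch_times, intensity.xs(), intensity.values[0]
)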

Honorable mentions

  • Both DisVoice and Speech_Analysis contain extremely relevant code with good coverage of traditional acoustic features. They are honorable mentions because they do not appear to be as actively maintained as the other packages here.

  • torchaudio does exactly what it sounds like: it applies PyTorch to audio data. It does not incorporate as much expertise from the field of phonetics, but can extract useful things like speech quality, intelligibility, and waveforms, as well as spectral features like MFCCs (see the sketch below).
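
As a quick illustration, here is a minimal sketch of extracting MFCCs with torchaudio's transforms module, using the same placeholder filename as above:

import torchaudio

# Load the waveform as a tensor of shape (channels, n_samples)
waveform, sample_rate = torchaudio.load("audio_file.wav")

# Compute 13 MFCCs per frame; the output has shape (channels, n_mfcc, n_frames)
mfcc_transform = torchaudio.transforms.MFCC(sample_rate = sample_rate, n_mfcc = 13)
mfccs = mfcc_transform(waveform)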


