by
Katie Wetstone
Unlike the well-established ecosystem for image processing, working with audio data often feels like navigating uncharted territory — especially when developing acoustic features for machine learning.
Image data are widely used in machine learning to support everything from wildlife research to cancer detection. These applications are all enabled by a strong ecosystem of open-source Python packages for working with image data. Packages like rasterio and pydicom make it possible for data scientists to contribute without becoming experts in satellites or medical imagery.
The possibilities for using audio data are similarly vast, but the open-source ecosystem for loading data and extracting features for machine learning is not as well developed. Recently, we at DrivenData worked with speech data to study Alzheimer's detection and student-teacher relationship building in the classroom. We found the available tools hard to navigate without subject matter expertise, and want to share what we learned.
In this post, we provide an overview of open-source Python packages for extracting features from speech audio data. If you want to use voice data for machine learning, but don't know the difference between the glottis and the larynx, this is for you! We hope to help other data scientists unlock the potential of acoustic data without becoming otolaryngologists.
TLDR¶
So what package should I use?
Based on what is available as of April 2025, we generally recommend openSMILE for extracting traditional acoustic features because it is easy to use and has good coverage. We recommend transformers for creating vector embeddings if model explainability and interpretation are not as important. If you are studying music rather than speech, your best bet may be librosa.
The packages covered in this post are:
- openSMILE: Easily extracts a comprehensive set of features
- librosa: Well-maintained package but more focused on music
- transformers: Generates highly predictive embeddings using pre-trained models
- senselab: Promising package under development with broad coverage of speech processing tasks
- parselmouth: Incorporates significant subject matter expertise and coverage, but difficult to use in Python
- Honorable mentions: torchaudio, DisVoice, and Speech_Analysis
Note that we focus only on packages for speech data feature extraction, rather than packages for working with audio data more generally.
Audio data basics¶
What does audio data actually look like?
Sound is generated when something vibrates and causes pressure variations in the air around us. In the case of human speech, sound is generated by the vibration of vocal cords. Sound waves can be visually represented as time versus amplitude. When we load audio data into Python, what we generally get is an array of amplitude at different points in time. From this, we can derive things like frequency.
The audio data that we get in Python doesn't just depend on the sound itself, it also depends on the sampling rate used to create the recording and to load in the data. Sampling rate is the number of recorded values per unit of time. For example, a recording with a sampling rate of 32 kHz (kilohertz) has 32,000 samples per second. A higher sampling rate better approximates the actual real-world sound wave. We can load in sound at its original sampling rate, or downsample to a lower rate.
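For example, here is a minimal sketch using librosa (one of the packages covered below) to load a recording at its native sampling rate and then again at a lower rate; the file name is just a placeholder.

import librosa

# sr=None keeps the file's original sampling rate
audio_array, native_sr = librosa.load("audio_file.wav", sr = None)

# Resample to 16 kHz while loading
audio_16k, sr_16k = librosa.load("audio_file.wav", sr = 16000)

# The array holds amplitude values, so duration = number of samples / sampling rate
duration_seconds = len(audio_array) / native_sr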
Voice analysis crash course¶
By voice analysis, we mean studying not the actual words that someone says, but how their body physically produces those sounds. This is also called phonetics or acoustic phonetics. Below is a quick intro to some key terms that are useful to know when navigating the options for extracting features from voice data.
- Fundamental frequency (F0): Frequency that vocal folds vibrate to make sound. This relates to perceived pitch, or how high or low a voice sounds.
- Jitter: Variation in fundamental frequency.
- Intensity: Strength or amplitude of sound waves, which influences the perception of volume and emphasis. Also sometimes called activation.
- Shimmer: Variation in signal amplitude.
- Mel-frequency cepstral coefficients (MFCCs): MFCCs help identify all of the different frequency components that combine to construct a more complex sound. They generally describe the "timbre" of someone's voice. MFCCs are calculated by performing a Fourier decomposition on a sound to break down the waveform into simpler frequency combinations, and then creating a "heatmap" showing the contribution of each frequency at a given time.
- MFCCs are the result of a lot of complicated calculations. If you want to dive deeper, some helpful resources are Aalto University, Emmanuel Deruty, and Tanveer Singh.
- Zero-crossing rate: A measure of how quickly a voice signal oscillates, which is perceived as "smoothness".
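To make the last term concrete, here is a rough numpy sketch of how zero-crossing rate could be computed by hand on a toy signal (in practice you would use one of the packages below):

import numpy as np

# Toy signal: one second of a 220 Hz sine wave sampled at 16 kHz
sampling_rate = 16000
t = np.linspace(0, 1, sampling_rate, endpoint = False)
audio_array = np.sin(2 * np.pi * 220 * t)

# A zero crossing happens wherever consecutive samples change sign
sign_changes = np.abs(np.diff(np.sign(audio_array))) > 0

# Zero-crossing rate: fraction of sample-to-sample transitions that cross zero
# A 220 Hz sine crosses zero about 440 times per second, so this is roughly 440 / 16000
zero_crossing_rate = np.mean(sign_changes)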
Now that we all know a little bit about speech waveforms, back to Python!
Open-source packages¶
While some of the packages below overlap with tools for upstream tasks like diarization and speech recognition, this list focuses on extracting features from speech that are useful for machine learning.
openSMILE¶
openSMILE is a Pythonic, easy-to-use toolkit that can generate useful sets of features. It is focused on analyzing both speech and music. Overall, we recommend openSMILE for general ML applications. Our only caution is that as of now it may not be regularly maintained.
Strengths:
- Very easy to use and well documented. You can extract features with only a few lines of code.
- Comes with reasonable defaults so you don't have to make decisions that require subject matter knowledge.
- Extracts a large number of features that are supported by literature. openSMILE has a few feature sets it can extract, including the fairly comprehensive eGeMAPS (Geneva Minimalistic Acoustic Parameter Set for Voice Research and Affective Computing).
Weaknesses:
- There is not significant ability to customize how features are extracted.
- The most recent commit was in 2023, so it's not clear whether the package is being continually maintained.
- openSMILE is not available under an open license for all use cases. It can be used freely for private, research, and educational purposes, but not for commercial products.
Example code to extract the eGeMAPSv02 set of features:
import opensmile
smile = opensmile.Smile(
feature_set = opensmile.FeatureSet.eGeMAPSv02,
feature_level = opensmile.FeatureLevel.LowLevelDescriptors,
)
features_df = smile.process_file("audio_file.wav")
features_df.head()
| | file | start | end | Loudness_sma3 | F0semitoneFrom27.5Hz_sma3nz | ... | mfcc1_sma3 | jitterLocal_sma3nz | shimmerLocaldB_sma3nz |
|---|---|---|---|---|---|---|---|---|---|
| 0 | audio_file.wav | 00:00:00.00 | 00:00:00.02 | 0.07 | 42.09 | ... | 20.96 | 0.00 | 1.91 |
| 1 | audio_file.wav | 00:00:00.01 | 00:00:00.03 | 0.08 | 42.17 | ... | 24.83 | 0.03 | 1.90 |
| 2 | audio_file.wav | 00:00:00.02 | 00:00:00.04 | 0.09 | 42.33 | ... | 27.19 | 0.03 | 1.09 |
| 3 | audio_file.wav | 00:00:00.03 | 00:00:00.05 | 0.11 | 42.52 | ... | 28.71 | 0.02 | 1.24 |
| 4 | audio_file.wav | 00:00:00.04 | 00:00:00.06 | 0.12 | 42.73 | ... | 29.48 | 0.02 | 1.04 |
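The LowLevelDescriptors feature level returns a time series of values for each file, as shown above. If you want a single row of summary statistics per recording instead (often a more convenient input for a model), openSMILE can compute functionals over the same feature set. A quick sketch:

smile_functionals = opensmile.Smile(
    feature_set = opensmile.FeatureSet.eGeMAPSv02,
    feature_level = opensmile.FeatureLevel.Functionals,
)

# One row per file, with summary statistics of each low-level descriptor
functionals_df = smile_functionals.process_file("audio_file.wav")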
librosa¶
librosa is a widely used and well-maintained open-source package for audio analysis. It is geared more towards music than speech analysis, so it extracts a smaller subset of useful features for speech compared to tools like openSMILE and Parselmouth.
Strengths:
- Well-maintained, well-documented, and relatively easy to use
- Comes with reasonable defaults, while providing more customizability than openSMILE
- Provides useful IO functionality for loading and working with audio files as numpy arrays
Weaknesses:
- Lower coverage of features relevant to studying speech. Of the set of traditional acoustic features, it covers MFCCs and zero-crossing rate.
- openSMILE extracts a set of features all at once. In librosa, different features must be extracted separately and then aligned.
Example code to load audio and calculate fundamental frequency and MFCCs:
import librosa
# Load in sound as a numpy array
audio_array, sampling_rate_hz = librosa.load("audio_file.wav")
# Calculate fundamental frequency
# This returns an array with F0s at multiple time points
f0 = librosa.yin(
audio_array,
fmin = librosa.note_to_hz("C2"),
fmax = librosa.note_to_hz("C7"),
sr = sampling_rate_hz,
)
f0
>> array([2205., 2205., 2205., ..., 2205., 2205., 2205.])
# Extract first 3 MFCCs
mfccs = librosa.feature.mfcc(
y = audio_array,
sr = sampling_rate_hz,
n_mfcc = 3
)
mfccs.shape
>> (3, 141320)
mfccs
>> array([[-491., -491., -491., ..., -491., -491., -491.],
[ 0., 0., 0., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 0., 0., 0.]])
The code above uses the yin function to extract fundamental frequency limited to the range between C2 and C7. The pyin function is a very similar alternative, but returns the probability that each frame is voiced in addition to fundamental frequency. The librosa.feature module includes functionality for other relevant spectral features like MFCCs and zero-crossing rate.
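One of the weaknesses noted above is that librosa features have to be extracted separately and then aligned. Here is a rough sketch (reusing audio_array, sampling_rate_hz, and f0 from the example above) that adds zero-crossing rate and uses librosa.times_like to put frame-level features on a shared time axis. With the default frame and hop lengths the frames should line up; if you customize those settings, you are responsible for aligning them yourself.

# Zero-crossing rate per frame (shape: 1 x number of frames)
zcr = librosa.feature.zero_crossing_rate(audio_array)

# Times in seconds corresponding to each frame, based on the default hop length
f0_times = librosa.times_like(f0, sr = sampling_rate_hz)
zcr_times = librosa.times_like(zcr, sr = sampling_rate_hz)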
Transformers¶
More recently, people have experimented with using pre-trained transformer models to extract useful embeddings of audio data. Traditional acoustic features like pitch or MFCCs are manually engineered based on our understanding of speech. The transformers approach instead relies on algorithms to identify what features are useful for downstream modeling tasks.
The way this often works is to take only the encoder layer from a transformer's encoder-decoder architecture. A few popular open-source models are Whisper, wav2vec 2.0, and HuBERT. These models can be implemented using Hugging Face's transformers package.
Strengths:
- Uses advanced machine learning to gain predictive power, and can often outperform prediction with traditional acoustic feature sets
- Benefits from sophisticated pre-trained models that have already learned from lots and lots (and lots) of speech data
- Also provides functionality for other parts of the speech processing pipeline, like speech recognition and diarization
Weaknesses:
- Traditional acoustic features were developed to describe physical phenomena of speech (e.g., vocal fold movement). The embeddings generated by transformer models are often harder to connect back to real-world interpretations, making models harder to explain.
Example code to extract features from an audio recording using a Whisper model:
import librosa
import torch
from transformers import WhisperProcessor, WhisperModel
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load pre-trained model
model_checkpoint = 'openai/whisper-tiny'
processor = WhisperProcessor.from_pretrained(model_checkpoint)
model = WhisperModel.from_pretrained(model_checkpoint).to(device)
# Load and preprocess the audio
audio_array, sampling_rate = librosa.load("audio_file.wav", sr = 16000)
inputs = processor(
audio_array,
sampling_rate = sampling_rate,
return_tensors = "pt", # return tensors
return_attention_mask = True,
)
# Extract features
# Disable gradient calculation since we are only doing inference
with torch.no_grad():
encoder_outputs = model.encoder(
inputs.input_features.to(device),
attention_mask = inputs.attention_mask.to(device)
)
# Get the last hidden state from the encoder
features = encoder_outputs.last_hidden_state
features.shape
>> torch.Size([1, 1500, 384])
features
>> tensor([[[ 0.2267, 0.1068, ..., 0.1747, 0.0353],
[-0.5316, -1.1018, ..., 0.4451, 0.5680],
...,
[-0.3217, -0.6013, ..., 0.3448, 0.2963],
[ 0.2157, -0.1431, ..., -0.3734, -0.1372]]])
The code above returns features based on the last hidden state of the model encoder. Credit to DrivenData user avarshn for sharing the tips above in an excellent community code post!
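The encoder output has one 384-dimensional vector per time step (1,500 steps for this model). One common way to turn that into a fixed-length feature vector for a downstream classifier, though not the only option, is to mean-pool over the time dimension:

# Average over the time dimension to get one fixed-length embedding per recording
pooled_embedding = features.mean(dim = 1)

pooled_embedding.shape
>> torch.Size([1, 384])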
senselab¶
senselab is still under development, but is a promising tool for many components of speech processing, from feature extraction to diarization to speech-to-text.
Strengths:
- Connects to multiple other high-profile tools, and can generate features from openSMILE, parselmouth, and torchaudio in one command
- One-stop shop for end-to-end speech processing, including upstream tasks like diarization, data augmentation, and speech-to-text, as well as downstream tasks like emotion recognition
- Comes with a variety of useful tutorials to demonstrate different use cases
Weaknesses:
- Currently under development, so may change significantly. This means that if you write a pipeline that relies on senselab, there's a higher risk of that pipeline breaking on future updates
- Does not currently support macOS x86-64 due to a dependency on PyTorch 2.2.2+ (does support macOS ARM64)
Parselmouth¶
Parselmouth is a Python wrapper for one of the leading pieces of software in speech analysis, Praat. Because of the complexity of Praat, the Python package is fairly difficult to work with. We only recommend Parselmouth for highly specific use cases where you already have some subject matter expertise on hand and know what you want to do with the Praat software.
Strengths:
- Highly customizable
- Created by academic experts in phonetic science
- Comprehensively covers all traditional acoustic features, and more
Weaknesses:
- The documentation and code are not maintained the way a standard open-source library's would be, and are extremely hard to navigate. Praat is still primarily a desktop application, and it is not as easy to interact with via Python.
- Does not come with reasonable defaults
- It is difficult, and sometimes not possible, to align different features to the same time scale.
Example code to extract pitch (fundamental frequency) and intensity:
import parselmouth
# Load the sound
sound = parselmouth.Sound("audio_file.wav")
pitch = sound.to_pitch(
time_step = 0.01, pitch_floor = 80, pitch_ceiling = 500
)
# Pitch values and time of values
pitch.selected_array["frequency"], pitch.xs()
>> (array([0., 0., ..., 0., 0.]),
array([0.02, 0.03, ..., 3281.41, 3281.42]))
intensity = sound.to_intensity(
time_step = 0.01, minimum_pitch = 80
)
# Intensity values and time of values
intensity.values, intensity.xs()
>> (array([[-300., -300., ..., -300., -300.]]),
array([0.04, 0.05, ..., 3281.39, 3281.40]))
The above code extracts pitch and intensity limited to the frequency range of 80 to 500 Hz at a time step of 0.01 seconds. Note that the time intervals do not always align between pitch and intensity.
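If you do need the two features on the same time grid, one workaround (a sketch using numpy, not a built-in Parselmouth feature) is to interpolate intensity onto the pitch timestamps:

import numpy as np

# Interpolate intensity values onto the pitch timestamps so both features share a time axis
intensity_at_pitch_times = np.interp(
    pitch.xs(), intensity.xs(), intensity.values[0]
)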
Honorable mentions¶
- Both DisVoice and Speech_Analysis contain extremely relevant code with good coverage of traditional acoustic features. They are honorable mentions because they do not appear to be as actively maintained as the other packages here.
- torchaudio does exactly what it sounds like: it applies PyTorch to audio data. It does not incorporate as much expertise from the field of phonetics, but can extract useful things like speech quality, intelligibility, and waveforms.
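As a minimal sketch of the basics, loading (and, if needed, resampling) a waveform with torchaudio looks roughly like this:

import torchaudio

# Load the waveform as a tensor of shape (channels, samples) along with its sampling rate
waveform, sample_rate = torchaudio.load("audio_file.wav")

# Resample to 16 kHz if a downstream model expects a specific rate
waveform_16k = torchaudio.functional.resample(waveform, sample_rate, 16000)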
Image credit: Hałas, Magdalena & Maj, Michal & Guz, Ewa & Stencel, Marcin & Cieplak, Tomasz. (2024). Advanced emotion analysis: harnessing facial image processing and speech recognition through deep learning. Journal of Modern Science. 57. 388-401. 10.13166/jms/191163.