What question did this study set out to answer?

The aim is to explore the use of audio data in political science and its potential for multimodal analysis.

February 2, 2026 | Open Access

Potential and Pitfalls of Audio as Data for Political Research: Alignment, Features, and Classification Models

Key Points

Analyzed audio data from all televised U.S. presidential debates from 1960 to 2020.
Examined low-level descriptors (LLDs), MFCCs, and audio embeddings for classification.
Developed forced alignment and emotion recognition models.
Provided guidelines for applying audio features to political research questions.
Successfully applied forced alignment using MFCCs for transcript time-stamping.
Characterized speeches using low-level descriptors such as pitch and energy.
Created effective classification models with audio embeddings and MFCCs.
Achieved emotion recognition with Wav2Vec, classifying discrete emotions and their valence-arousal dominance.
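
To make the idea of aligning two feature sequences concrete, here is a minimal sketch using dynamic time warping (DTW). This is an illustrative stand-in, not the authors' pipeline: the paper summary only states that forced alignment used MFCCs, and production aligners often rely on acoustic models instead. The function names and the toy one-dimensional "feature frames" are assumptions made for the example; in practice each frame would be an MFCC vector.

```python
import math

def dtw_path(ref, query):
    """Align two sequences of feature vectors; return (total_cost, path).

    path is a list of (ref_index, query_index) pairs in time order.
    """
    n, m = len(ref), len(query)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(ref[i - 1], query[j - 1])  # Euclidean frame distance
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end, always stepping to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path

def first_query_frame(path, ref_frame):
    """Return the first query frame aligned to a given reference frame."""
    return next(q for r, q in path if r == ref_frame)

# Toy frames (hypothetical): a reference rendering and a slower recording.
ref = [[0.0], [0.0], [1.0], [1.0], [2.0]]
query = [[0.0], [1.0], [1.0], [1.0], [2.0], [2.0]]
total_cost, path = dtw_path(ref, query)
```

If a word is known to start at reference frame 2, `first_query_frame(path, 2)` gives the matching frame in the recording, which can be multiplied by the hop duration to obtain a timestamp.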

Abstract

Political science is a field rich in multimodal information sources, from televised debates to parliamentary briefings. This paper bridges a gap between computer and political science in multimodal data analysis using audio. The adoption of multimodal analyses in political science (e.g., video/audio with text-as-data approaches) has been relatively slow due to unequal distribution of computational power and skills needed. We provide solutions to challenges encountered when analyzing audio, advancing the potential for multimodal data analysis in political science. Using a dataset of all televised U.S. presidential debates from 1960 to 2020, we focus on three features encountered when analyzing audio data: low-level descriptors (LLDs), such as pitch or energy; Mel-frequency cepstral coefficients (MFCCs); and audio embeddings/encodings, like Wav2Vec. We showcase four applications: (a) forced alignment of audio text using MFCCs, time-stamping transcripts, and speaker information; (b) speech characterization using LLDs; (c) custom-made classification models with audio embeddings and MFCCs; and (d) emotional recognition models using Wav2Vec for classification of discrete emotions and their valence-arousal dominance. We provide explanations to help understand how these features can be applied for different political research questions and advice on vigilance to naive interpretation, for both experienced researchers and those who want to start working with audio.
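
As a rough illustration of what low-level descriptors such as pitch and energy look like in practice, the sketch below computes frame-wise RMS energy and an autocorrelation-based pitch estimate using only the Python standard library. It is not the authors' code: the function names, frame sizes, pitch search range (75 to 400 Hz), and the synthetic 200 Hz tone standing in for speech are all assumptions chosen for the example. Real analyses would typically use a dedicated audio toolkit and more robust pitch trackers.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a list of samples into overlapping fixed-length frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def rms_energy(frame):
    """Root-mean-square amplitude of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def estimate_pitch(frame, sample_rate, fmin=75.0, fmax=400.0):
    """Pitch in Hz from the autocorrelation peak; 0.0 if no pitch is found."""
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    if max_lag <= min_lag or not any(frame):
        return 0.0
    best_lag, best_corr = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        corr = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return sample_rate / best_lag if best_lag else 0.0

def describe(samples, sample_rate, frame_len=400, hop=200):
    """Return a list of (energy, pitch_hz) tuples, one per frame."""
    return [(rms_energy(f), estimate_pitch(f, sample_rate))
            for f in frame_signal(samples, frame_len, hop)]

# Synthetic stand-in for a voiced segment: a 200 Hz tone sampled at 8 kHz.
SAMPLE_RATE = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(1600)]
features = describe(tone, SAMPLE_RATE)
```

Each tuple in `features` describes one 50 ms frame. Raw values like these are exactly where the abstract's warning about naive interpretation applies: energy depends on microphone distance and recording gain, and pitch ranges differ across speakers, so comparisons usually need per-speaker normalization.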

Cite This Study

Mestre et al. studied this question.

synapsesocial.com/papers/6980fe9bc1c9540dea810d95 https://doi.org/10.1017/pan.2025.10031
