What question did this study set out to answer?

The aim is to assess if voice discrimination cues remain consistent when comparing words and sentences.

June 5, 2026Open Access

Dominant role of temporal–prosodic cues in sentence-level voice discrimination: Insights from interpretable machine learning

Key Points

The aim is to assess if voice discrimination cues remain consistent when comparing words and sentences.
Mandarin-speaking listeners judged speaker identity for controlled word and sentence pairs.
Interpretable machine learning classifiers were used to analyze acoustic differences between speech pairs.
The study measured how well models trained on one level (words or sentences) performed when tested on the other.
Model performance was strong within each linguistic level but declined when generalizing between levels.
Sentence-trained models overproduced different-speaker responses on word data; word-trained models favored same-speaker responses on sentence data.
Key features included spectral balance and temporal-prosodic cues, indicating their significant role in voice perception.

Abstract

Voice perception has often been modeled within a multidimensional acoustic space, yet the stability of its underlying dimensions across linguistic levels (e.g., syllables, words, sentences) remains unclear. This study investigated whether the acoustic basis of voice discrimination remains stable across the linguistic levels of words and sentences. Mandarin-speaking listeners judged speaker identity (same/different) for word and sentence pairs with controlled similarity in a classic voice space. Interpretable machine learning classifiers were trained on acoustic differences between speech pairs to predict listeners’ judgments. Models performed well within levels, but generalization across levels declined and was asymmetric: sentence-trained models tended to overproduce different-speaker responses when tested on word data, whereas word-trained models showed a bias toward same-speaker responses when tested on sentence data. Feature importance rankings converged on a shared acoustic scaffold (spectral balance, formant structure, higher harmonics, and pitch) but highlighted the dominant role of temporal–prosodic cues (speech rate) and the growing relevance of variability-based cues for sentences compared with words. These findings refine prototype-based models of voice perception by demonstrating adaptive cue weighting across linguistic structure and offer implications for voice perception in naturalistic contexts. Methodologically, this study addressed an issue in the literature concerning the computation of variability measures for harmonic spectral features.

Dominant role of temporal–prosodic cues in sentence-level voice discrimination: Insights from interpretable machine learning

Key Points

Abstract

Cite This Study

Also Consider

Also Consider