Voice perception has often been modeled within a multidimensional acoustic space, yet the stability of its underlying dimensions across linguistic levels (e.g., syllables, words, sentences) remains unclear. This study investigated whether the acoustic basis of voice discrimination remains stable across the linguistic levels of words and sentences. Mandarin-speaking listeners judged speaker identity (same/different) for word and sentence pairs with controlled similarity in a classic voice space. Interpretable machine learning classifiers were trained on acoustic differences between speech pairs to predict listeners’ judgments. Models performed well within levels, but generalization across levels declined and was asymmetric: sentence-trained models tended to overproduce different-speaker responses when tested on word data, whereas word-trained models showed a bias toward same-speaker responses when tested on sentence data. Feature importance rankings converged on a shared acoustic scaffold (spectral balance, formant structure, higher harmonics, and pitch) but highlighted the dominant role of temporal–prosodic cues (speech rate) and the growing relevance of variability-based cues for sentences compared with words. These findings refine prototype-based models of voice perception by demonstrating adaptive cue weighting across linguistic structure and offer implications for voice perception in naturalistic contexts. Methodologically, this study addressed an issue in the literature concerning the computation of variability measures for harmonic spectral features.
Xu et al. (Tue,) studied this question.