Traditional speech and speaker recognition systems are typically trained on neutrally phonated speech, and their performance degrades significantly when speech deviates from this neutral state. Variations in vocal effort, ranging from whisper to shout, represent a critical but underexplored challenge for developing robust speech systems. While prior work on vocal effort classification has relied on traditional acoustic features and limited datasets, our study focuses on leveraging recent self-supervised models for categorical vocal effort recognition. Despite increasing attention, current models remain unreliable at vocal effort classification, underscoring the need for improved modeling approaches. Because labeled data that adequately covers the full range of vocal effort is scarce, data augmentation is needed to improve model generalization and approach state-of-the-art performance. We therefore investigate how data augmentation strategies affect robustness in vocal effort classification, applying a range of augmentation techniques and assessing their impact on classification performance across vocal effort categories. Our experiments use two vocal effort corpora, VocalEffort-1&2 (VE-1,2), developed by CRSS-UTDallas, and the AVID corpus, both spanning diverse vocal efforts. This study aims to uncover the limitations and potential of augmentation techniques.
Omidi et al. studied this question.