The extraction of speaker-related features through speaker representations, or utterance embeddings, has been studied extensively for years. Convolutional Neural Networks (CNNs) are the most commonly used backbone for feature extraction, and deep Residual Networks (ResNets), which add residual connections to CNNs, achieve effective frame-level feature extraction within the same input samples. ResNets have demonstrated impressive performance as speaker embedding extractors, and ResNet and its advanced extensions, such as Res2Net, have emerged as the dominant backbone networks in speaker recognition. Despite this widespread adoption, a significant gap remains in current speaker embeddings: these successful backbones are taken from the visual domain unmodified, without considering the specific adaptation requirements of the speech field. Tailoring the backbone to account for the unique characteristics of speech representations ensures effective learning in Automatic Speaker Verification (ASV) systems. This thesis concentrates on the backbone network, the fundamental and most critical component of Deep Neural Networks (DNNs), which is responsible for extracting features from input data and mapping them into distinct feature representations. One of the key questions addressed in the thesis is the following: which backbone is most robust and yields the most generalized speaker representation, with the potential to be applied to many existing ASV systems? Through a strategic evaluation, the study analyzed the performance of integrating different residual-based backbones into the two most popular ResNet-based baselines, ResNet-34 and ECAPA-TDNN, which differ greatly in model complexity and CNN stem representation. To this end, the proposed architectures were evaluated on the publicly available VoxCeleb dataset. This thesis introduces an innovative approach to speaker modeling by incorporating scale and cardinality dimensions for the first time.
Experimental findings reveal that utilizing multi-scale and multi-branch aggregated residual transformations within the Res2NeXt backbone yields considerable performance improvements. Notably, the 6s8g and 8s8g variants of Res2NeXt prove particularly beneficial in improving the representational capability of the learned features for speaker embedding. The consistent superiority, compatibility, and robustness of Res2NeXt compared to the currently popular ResNet and Res2Net backbones, across various architectures and conditions, demonstrate its broad applicability and effectiveness as a backbone in speaker modeling. To further evaluate the Robustness and Generalization (R&G) capabilities of the proposed speaker embeddings, the thesis expanded its scope to a cross-domain application, Speech Emotion Recognition (SER), one of the most challenging downstream tasks. A comparative analysis of the proposed speaker embeddings was conducted using two benchmark datasets, IEMOCAP and CREMA-D. The overarching goal of the thesis is to contribute to the development of general-purpose, universal speech representations that generalize effectively across multiple speech-related tasks. By examining the alignment and divergence of embeddings for ASV and SER, the study seeks to answer a fundamental question: can a single representation framework cater to both speaker verification and emotion recognition, or are task-specific embeddings inevitable? The results confirm that speaker embeddings derived from the Res2NeXt backbone are also highly effective for the SER task, maintaining their performance across diverse evaluation protocols and data variations. This demonstrates the generalizability of Res2NeXt embeddings to the cross-domain task.
The study attributes this success to the ability of Res2NeXt-based models to capture rich speech features, including tone, pitch, and phonation patterns, during speaker recognition training. These features are also vital for emotion recognition.
Razieh Khamsehashari (Thu,) studied this question.
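As an illustration of the scale and cardinality dimensions mentioned in the abstract, the following is a minimal, library-free sketch of how a Res2NeXt-style block could route features. It rests on several assumptions that the abstract does not state: that a label such as 6s8g denotes a scale of 6 and 8 groups (cardinality), that the hierarchical aggregation follows the published Res2Net formulation, and that group-wise averaging is an acceptable stand-in for a grouped convolution. All names (Res2NeXtConfig, parse_variant, grouped_transform, res2next_block) are hypothetical and are not taken from the thesis.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Res2NeXtConfig:
    scale: int        # number of feature splits inside one residual block
    cardinality: int  # number of groups in each grouped transformation


def parse_variant(label):
    """Parse a label such as '6s8g' (assumed format: <scale>s<groups>g)."""
    match = re.fullmatch(r"(\d+)s(\d+)g", label)
    if match is None:
        raise ValueError(f"unrecognised variant label: {label!r}")
    return Res2NeXtConfig(scale=int(match.group(1)),
                          cardinality=int(match.group(2)))


def grouped_transform(values, groups):
    """Toy stand-in for a grouped convolution.

    Values are mixed (here: averaged) only within their own group,
    never across groups, which is the defining property of cardinality.
    """
    if len(values) % groups != 0:
        raise ValueError("split width must be divisible by the group count")
    size = len(values) // groups
    out = []
    for g in range(groups):
        chunk = values[g * size:(g + 1) * size]
        out.extend([sum(chunk) / size] * size)
    return out


def res2next_block(channels, config):
    """Res2Net-style hierarchical aggregation with grouped transforms.

    y_1 = x_1, y_2 = K(x_2), y_i = K(x_i + y_{i-1}) for i > 2,
    where K is the grouped transformation above.
    """
    if len(channels) % config.scale != 0:
        raise ValueError("channel count must be divisible by the scale")
    width = len(channels) // config.scale
    splits = [channels[i * width:(i + 1) * width] for i in range(config.scale)]

    outputs = [splits[0]]  # the first split is passed through unchanged
    previous = None
    for x in splits[1:]:
        # Later splits also receive the output of the preceding split.
        inp = x if previous is None else [a + b for a, b in zip(x, previous)]
        previous = grouped_transform(inp, config.cardinality)
        outputs.append(previous)

    # Concatenate the per-split outputs back into one channel vector.
    return [v for split in outputs for v in split]
```

For example, `res2next_block(list_of_48_values, parse_variant("6s8g"))` would divide the 48 values into 6 splits of width 8 and process each split in 8 groups. In a real model the averaging would be replaced by a learned grouped convolution and the block would be wrapped in the usual residual connection.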