June 21, 2024 · Open Access

Indian Cross Corpus Speech Emotion Recognition Using Multiple Spectral-Temporal-Voice Quality Acoustic Features and Deep Convolution Neural Network

Abstract

Speech Emotion Recognition (SER) is crucial for enriching next-generation human-machine interaction (HMI) with emotional intelligence by extracting emotions from words and voice. However, current SER techniques are developed within experimental boundaries and face major challenges, such as a lack of robustness across languages, cultures, age groups, and speaker gender. Very little SER work has been carried out for Indian corpora, which show high diversity, a large number of dialects, and wide variation due to regional and geographical factors. India has one of the largest user bases for HMI systems, social networking sites, and the internet, so SER research that focuses on Indian corpora is crucial. This paper presents cross-corpus SER (CCSER) for Indian corpora using multiple acoustic features (MAF) and a deep convolution neural network (DCNN) to improve the robustness of SER. The MAF consist of various spectral, temporal, and voice quality features. Further, a Fire Hawk based optimization (FHO) technique is used for salient feature selection. The FHO selects the important features from the MAF to minimize computational complexity and to improve feature distinctiveness based on the inter-class and intra-class variance of the features. The DCNN provides better correlation, richer feature representation, a better description of variation in timbre, intonation, and pitch, and stronger connectivity between global and local features of the speech signal to characterize the corpus. The proposed DCNN-based SER is evaluated on the Indo-Aryan language family (Hindi and Urdu) and the Dravidian language family (Telugu and Kannada). The proposed scheme improves accuracy across the various cross-corpus and multilingual SER settings and outperforms traditional techniques. It provides an accuracy of 58.83%, 61.75%, 69.75%, and 45.51% for Hindi, Urdu, Telugu, and Kannada, respectively, under multilingual training.
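
The abstract says feature selection is driven by how the features vary between and within emotion classes. The sketch below illustrates one simple criterion of that kind, a Fisher-style ratio of between-class to within-class variance computed per feature. It is not the Fire Hawk optimizer, whose objective and search procedure are not given in the abstract. The function names `fisher_scores` and `select_top_k` are hypothetical and chosen only for illustration.

```python
import numpy as np


def fisher_scores(X, y):
    """Per-feature ratio of between-class to within-class scatter.

    X: (n_samples, n_features) acoustic feature matrix (assumed layout).
    y: (n_samples,) emotion labels.
    A higher score means the feature separates the classes more clearly.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for label in np.unique(y):
        Xc = X[y == label]
        class_mean = Xc.mean(axis=0)
        # Scatter of the class mean around the overall mean, weighted by class size.
        between += len(Xc) * (class_mean - overall_mean) ** 2
        # Scatter of the samples around their own class mean.
        within += ((Xc - class_mean) ** 2).sum(axis=0)
    # A small constant avoids division by zero for constant features.
    return between / (within + 1e-12)


def select_top_k(X, y, k):
    """Return the sorted column indices of the k highest-scoring features."""
    scores = fisher_scores(X, y)
    top = np.argsort(scores)[::-1][:k]
    return np.sort(top)
```

A filter like this scores each feature independently. A metaheuristic such as FHO would instead search over feature subsets, so the two should not be expected to select the same features.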
