Developing an Automatic Speech Recognition (ASR) system for children is difficult because physical traits, articulation patterns, and speaking mannerisms vary substantially from one child to another. Moreover, large quantities of children's speech data are scarce, a limitation that may be linked to variations in vocal-tract geometry arising from anatomical and physiological factors. The present study addresses these issues by investigating a speech recognition system for children under low-resource conditions. It uses novel methods for extracting heterogeneous features from the input audio signal, based on raw as well as central moments. To mitigate the limited availability of data, it uses different training systems developed with perturbation methods. The modeling parameters are additionally optimized to make these models more effective. Together, these efforts significantly improve the performance of the system. A hybrid Deep Neural Network-Hidden Markov Model (DNN-HMM) system trained on fused front-end features achieves a relative improvement of 21.36% over the baseline systems.
Bawa et al. studied this question.
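The abstract mentions features based on raw and central moments without giving the exact formulation. As a rough illustration only, the sketch below computes the standard statistical raw moments (the mean of x^k) and central moments (the mean of (x - mean)^k) for each frame of a sampled signal. The function names, the frame length, the hop size, and the choice of moment orders are all assumptions made for this example and are not taken from the paper.

```python
def frame_signal(samples, frame_len, hop):
    """Split a sequence of samples into overlapping frames (hypothetical helper)."""
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    last_start = len(samples) - frame_len
    # Frames that would run past the end of the signal are dropped.
    return [list(samples[s:s + frame_len]) for s in range(0, last_start + 1, hop)]


def raw_moment(frame, k):
    """k-th raw moment: the mean of x**k over the frame."""
    return sum(x ** k for x in frame) / len(frame)


def central_moment(frame, k):
    """k-th central moment: the mean of (x - mean)**k over the frame."""
    mean = sum(frame) / len(frame)
    return sum((x - mean) ** k for x in frame) / len(frame)


def moment_features(frame, max_order=4):
    """Concatenate raw moments (orders 1..max_order) with central moments
    (orders 2..max_order). The first central moment is always zero, so it
    is skipped. The maximum order of 4 is an assumption, not the paper's setting.
    """
    raw = [raw_moment(frame, k) for k in range(1, max_order + 1)]
    central = [central_moment(frame, k) for k in range(2, max_order + 1)]
    return raw + central


if __name__ == "__main__":
    signal = [float(n) for n in range(10)]  # toy stand-in for audio samples
    for frame in frame_signal(signal, frame_len=4, hop=2):
        print(moment_features(frame))
```

In a real front end, such per-frame moment vectors would presumably be combined with other features before acoustic modeling; how the paper performs that fusion is not described in the abstract.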