Audio-visual speech recognition (AVSR) is a dynamic field that has emerged at the intersection of computer vision and voice processing. This paper examines in depth the challenges, recent advancements, and potential applications of AVSR technology. To obtain more reliable and accurate speech recognition, researchers seek to understand spoken language by leveraging both visual and aural cues. The first part of the paper examines the fundamentals of AVSR, including research datasets, recognition models, and feature extraction methods for both the visual and aural modalities. The investigation considers both state-of-the-art deep learning methods, such as Transformer-based models, and traditional methods, such as Hidden Markov Models, and impartially evaluates the merits and limitations of the different recognition models. A comparative analysis of assessment metrics clarifies which metrics are best suited to assessing how successful an AVSR system is. Furthermore, the persistent challenges in AVSR, such as speaker variability and noisy environments, are examined and highlighted, emphasising the need for more research in this field.
Kuriakose et al. studied this question.