February 22, 2024

An Extensive Investigation of Deep Learning Techniques for Audio-Visual Speech Recognition

Abstract

Audio-visual speech recognition (AVSR) is a dynamic field that has emerged at the intersection of computer vision and voice processing. This paper examines in depth the challenges, recent advancements, and potential applications of AVSR technology. To obtain more reliable and accurate speech recognition, researchers are trying to understand spoken language by leveraging both visual and aural cues. The first part of the paper examines the fundamentals of AVSR, including research datasets, several recognition models, and feature extraction methods for both the visual and aural modalities. The investigation considers both state-of-the-art deep learning methods, such as Transformer-based models, and traditional methods, such as Hidden Markov Models, and impartially evaluates the merits and limitations of the different recognition models. A comparative analysis of assessment metrics clarifies which metrics are best suited for assessing how successful an AVSR system is. Furthermore, the persistent challenges in AVSR, such as speaker variability and noisy environments, are examined and highlighted, emphasising the need for more research in this field.

Cite This Study

Kuriakose et al. studied this question.

synapsesocial.com/papers/68e78323b6db6435876f5b29 https://doi.org/10.1109/ic-etite58242.2024.10493813
