March 18, 2024 | Open Access

Multi-Modality Speech Recognition Driven by Background Visual Scenes

Abstract

Visual information is often used as a complementary cue for automatic speech recognition in noisy environments. Most previous studies use visual information about the target speaker (e.g., lip movements) to improve the recognition performance of audio-visual speech recognition (AVSR) models. However, it remains unclear whether visual information associated with background sound can also benefit automatic speech recognition. Our study addresses this question by constructing a new audio-visual dataset and devising an AVSR model. The new dataset, Audio-Visual Natural Scenes (AVNS), consists of 11 types of natural scenes (around 31.3 hours) recorded with professional recording devices. It provides audio and visual signals of common background noises in natural acoustic scenes. The AVSR model is based on AV-HuBERT, a representation learning framework that fuses representations of the audio and visual modalities for automatic speech recognition. We combined the AVNS dataset (providing background sound) with LRS3, the largest benchmark dataset (providing target speech), to create adverse noise conditions for the AVSR model. The results showed that incorporating visual information synchronized with the background noise greatly improved model performance in noisy environments, reducing word error rate (WER) by up to 4.9%. These findings demonstrate that noise-related visual information can contribute to model performance in automatic speech recognition.
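The abstract reports its main result as a reduction in word error rate (WER). For readers unfamiliar with the metric, the following is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this is not the authors' evaluation code, and the abstract does not state whether the 4.9% figure is an absolute or a relative reduction.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the word-level edit distance divided by the reference length.

    Hypothetical helper for illustration; tokenization is plain whitespace
    splitting, which is an assumption and may differ from the paper's setup.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, comparing the reference "the cat sat on the mat" with the hypothesis "the cat sit on mat" counts one substitution and one deletion over six reference words.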
