September 1, 2024

Speech Recognition Models are Strong Lip-readers

Abstract

In this work, we show that a large pre-trained ASR model can be adapted to perform lip-reading. Our method enables an ASR model like Whisper to interpret lip movements in a video and output text transcriptions. We achieve this by learning a cross-modal mapping from a lip sequence to a speech sequence, allowing a pre-trained ASR model to directly perform lip-reading. The mapping can be learnt simply by backpropagating the cross-entropy loss on the text labels through the pre-trained, frozen ASR model. We achieve an impressive gain of 5.7 WER in the low data regime on the LRS3 benchmark over previous lip-reading methods. Finally, we demonstrate that the same strategy can be extended to other visual speech tasks, such as identifying the spoken language in silent videos.
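The training idea in the abstract (learn only a lip-to-speech mapping, with the text cross-entropy loss backpropagated through a frozen ASR model) can be illustrated with a toy sketch. This is not the authors' architecture. Everything named here (FROZEN_ASR, TOY_DATA, train_step, train) is hypothetical, the "ASR model" is a fixed 3x3 linear layer standing in for a model like Whisper, and the mapper is a single linear map acting on one feature vector rather than on a sequence. The sketch only shows that gradients pass through the frozen weights while updates are applied to the mapper alone.

```python
import math

LIP_DIM, SPEECH_DIM, N_TOKENS = 4, 3, 3

# Assumption: a tiny fixed linear layer (speech features -> token logits)
# stands in for the pre-trained ASR model. It is never updated below.
FROZEN_ASR = [[1.0, 0.0, -0.5],
              [-0.5, 1.0, 0.0],
              [0.0, -0.5, 1.0]]

# Assumption: toy (lip feature, token label) pairs, one per token.
TOY_DATA = [([1.0, 0.0, 0.0, 0.0], 0),
            ([0.0, 1.0, 0.0, 0.0], 1),
            ([0.0, 0.0, 1.0, 0.0], 2)]


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]


def train_step(mapper, lip, label, lr=0.5):
    """One SGD step on the mapper only; returns the loss before the update."""
    speech = matvec(mapper, lip)                 # lip -> speech-like features
    probs = softmax(matvec(FROZEN_ASR, speech))  # frozen ASR -> token probabilities
    loss = -math.log(probs[label])               # cross-entropy on the text label
    # Gradient of cross-entropy w.r.t. the logits is (probs - one_hot).
    d_logits = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    # Backpropagate through the frozen layer: d_speech = FROZEN_ASR^T @ d_logits.
    d_speech = [sum(FROZEN_ASR[k][i] * d_logits[k] for k in range(N_TOKENS))
                for i in range(SPEECH_DIM)]
    # Only the mapper is updated: dL/dmapper[i][j] = d_speech[i] * lip[j].
    for i in range(SPEECH_DIM):
        for j in range(LIP_DIM):
            mapper[i][j] -= lr * d_speech[i] * lip[j]
    return loss


def train(mapper, epochs=50):
    """Train the mapper in place; returns (mean loss in first epoch, in last epoch)."""
    first = last = None
    for _ in range(epochs):
        losses = [train_step(mapper, lip, label) for lip, label in TOY_DATA]
        last = sum(losses) / len(losses)
        if first is None:
            first = last
    return first, last


if __name__ == "__main__":
    mapper = [[0.0] * LIP_DIM for _ in range(SPEECH_DIM)]
    first, last = train(mapper)
    print(f"mean loss, first epoch: {first:.4f}; last epoch: {last:.4f}")
```

In a realistic setting the frozen component would be the full pre-trained ASR network with its parameters excluded from the optimizer, and the mapper would be a sequence model over lip-region video frames; the gradient flow would follow the same pattern.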

Cite This Study

Prajwal et al. (September 1, 2024) studied this question.

synapsesocial.com/papers/68e59d79b6db6435875378f4 https://doi.org/10.21437/interspeech.2024-2290