December 21, 2018 · Open Access

Deep Audio-Visual Speech Recognition

Abstract

The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio. Unlike previous works that have focussed on recognising a limited number of words or phrases, we tackle lip reading as an open-world problem: unconstrained natural language sentences and in-the-wild videos. Our key contributions are: (1) we compare two models for lip reading, one using a CTC loss and the other using a sequence-to-sequence loss, both built on top of the transformer self-attention architecture; (2) we investigate to what extent lip reading is complementary to audio speech recognition, especially when the audio signal is noisy; (3) we introduce and publicly release a new dataset for audio-visual speech recognition, LRS2-BBC, consisting of thousands of natural sentences from British television. The models that we train surpass the performance of all previous work on a lip reading benchmark dataset by a significant margin.
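
For readers unfamiliar with the CTC formulation mentioned in the abstract, the following is a minimal sketch of the standard CTC collapsing rule (merge repeated labels, then drop blanks) applied to a per-frame label sequence. It is not the authors' implementation: the function name `ctc_greedy_collapse`, the blank symbol `-`, and the example input are assumptions made for illustration only.

```python
from itertools import groupby

# Hypothetical blank symbol; real systems reserve a dedicated class index.
BLANK = "-"


def ctc_greedy_collapse(frame_labels, blank=BLANK):
    """Collapse per-frame labels: merge consecutive repeats, then drop blanks."""
    # groupby yields one key per run of identical consecutive labels.
    merged = [label for label, _ in groupby(frame_labels)]
    # Blanks only separate runs; they carry no output character.
    return "".join(label for label in merged if label != blank)


if __name__ == "__main__":
    # One label per video frame; the blank between the two "l" runs
    # is what lets a doubled letter survive the merge step.
    print(ctc_greedy_collapse("hh-e-ll-lloo-"))
```

A sequence-to-sequence model, by contrast, emits one output token per decoding step conditioned on the tokens already produced, so it needs no collapsing rule of this kind.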

Authors

Triantafyllos Afouras

Meta (United States)

Joon Son Chung

Korea Advanced Institute of Science and Technology

Andrew Senior

Google (United States)

Journals

IEEE Transactions on Pattern Analysis and Machine Intelligence

Institutions

University of Oxford

DeepMind (United Kingdom)

Google (United Kingdom)

Cite this study

Afouras et al. (2018) studied this question.

synapsesocial.com/papers/69dd5cf52f737f012599bcfe — DOI: https://doi.org/10.1109/tpami.2018.2889052

Also consider

Synapse has enriched 4 closely related papers on similar questions. Consider them for comparative context:

Very Deep Convolutional Networks for Large-Scale Image Recognition· 2014 · 75,538 citations
RELIABLE TRANSITION DETECTION IN VIDEOS: A SURVEY AND PRACTITIONER'S GUIDE· 2001 · 275 citations
Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups· 2012 · 10,299 citations
Speaker identification on the SCOTUS corpus· 2008 · 609 citations
