Key points are not available for this paper at this time.
Focuses on an affine-invariant lipreading method, and its optimal combination with an audio subsystem to implement an audio-visual automatic speech recognition (AV-ASR) system. The lipreading method is based on outer lip contour description which is transformed to the Fourier domain and normalized there to eliminate dependencies on the affine transformation (translation, rotation, scaling, and shear) and on the starting point. The optimal combination algorithm incorporates a signal-to-noise ratio (SNR) based weight selection rule which leads to a more accurate global likelihood ratio test. Experimental results are presented for an isolated word recognition task for eight different noise types from the NOISEX data base for several SNR values.
Gurbuz et al. (Wed,) studied this question.