Visual Speech Recognition (VSR) is crucial in multimodal human-computer interaction, enabling speech interpretation in noisy environments and for users with hearing impairments. Traditional VSR models struggle with temporal unpredictability, speaker-dependent lip movements, and contextual ambiguity. To address these issues, this paper presents the Visual Attention LipNet Network (V-LipNet), which uses Particle Swarm Optimization (PSO) to tune learning rates, convolutional filter sizes, and attention weights. V-LipNet dynamically focuses on lip-movement features and captures long-range temporal associations through a self-attention mechanism combined with spatiotemporal convolutional layers. Performance was evaluated using word error rate (WER) and SA on the GRID, TCD-TIMIT, and LRS2 benchmark datasets. The PSO-optimized V-LipNet achieves a lower WER than the traditional LipNet and LSTM baselines, indicating that it can generalize to unseen speakers and noisy conditions. These results suggest that metaheuristic optimization and attention mechanisms work effectively together. V-LipNet thus offers a powerful and adaptable multimodal interface solution for assistive technology, language learning, communication networks, and human-robot interaction.
Ghazal et al. (Thu,) studied this question.
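To make the PSO-based tuning described in the abstract concrete, the following is a minimal sketch of a basic global-best PSO loop over two hyperparameters. It is an illustration under stated assumptions, not the authors' implementation: the function names (`pso_minimize`, `toy_validation_wer`), the inertia and acceleration coefficients, the search bounds, and above all the objective are hypothetical. In practice the objective would train V-LipNet with the candidate hyperparameters and return its validation WER; here a simple quadratic surrogate stands in for that step.

```python
import random


def pso_minimize(objective, bounds, n_particles=20, n_iters=60,
                 inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `objective` over a box using basic global-best PSO.

    `bounds` is a list of (low, high) pairs, one per dimension.
    The coefficient defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    dim = len(bounds)

    # Random initial positions inside the box, zero initial velocities.
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]

    # Each particle remembers its own best position; the swarm shares one.
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity: inertia + pull toward personal and global bests.
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                # Move, then clamp the position back into the search box.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def toy_validation_wer(params):
    """Hypothetical surrogate for 'train the model, return validation WER'.

    Its minimum (0.05) sits at log10(learning rate) = -3, kernel size = 5;
    these values are made up purely for the demonstration.
    """
    log_lr, kernel = params
    return 0.05 + (log_lr + 3.0) ** 2 + 0.01 * (kernel - 5.0) ** 2


# Search over log10(learning rate) in [-5, -1] and kernel size in [1, 9].
bounds = [(-5.0, -1.0), (1.0, 9.0)]
best_params, best_wer = pso_minimize(toy_validation_wer, bounds)
print("best params:", best_params, "surrogate WER:", best_wer)
```

Searching the learning rate on a log scale is a common design choice because useful values span several orders of magnitude. A discrete hyperparameter such as the filter size would need to be rounded to an integer before being passed to a real model.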