What question did this study set out to answer?

The research aims to improve visual speech recognition performance in noisy environments using a new network architecture.

March 23, 2026 | Open Access

Attention-Based LipNet Architectures for Robust Visual Speech Recognition in Multimodal Interfaces

Key Points

Developed the Visual Attention LipNet Network (V-LipNet) for robust performance.
Utilized Particle Swarm Optimization to optimize learning rates and filter sizes.
Incorporated a self-attention mechanism to focus on relevant lip movement features.
Evaluated performance using word error rate (WER) and speaker accuracy (SA) on multiple benchmark datasets.
V-LipNet significantly reduced word error rates compared to traditional LipNet and LSTM models.
Demonstrated enhanced generalization to unseen speakers and background noise.
Metaheuristic optimization and attention processes were effective in improving visual speech recognition.
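The word error rate named in the key points is conventionally computed as the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The sketch below illustrates that standard definition only; the function name and the sample sentences are assumptions for illustration, not the paper's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example sentences: one substitution and one deletion
# against a six-word reference.
example_wer = word_error_rate("place red at g nine please", "place red in g nine")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.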

Abstract

Visual Speech Recognition (VSR) is crucial in multimodal human-computer interaction, where it supports speech interpretation in noisy environments and for users with hearing impairments. Traditional VSR models struggle with temporal unpredictability, speaker-dependent lip movements, and contextual ambiguity. To address these issues, this paper presents the Visual Attention LipNet Network (V-LipNet), which uses Particle Swarm Optimization (PSO) to tune learning rates, convolutional filter sizes, and attention weights. V-LipNet dynamically focuses on lip movement features and captures long-range temporal dependencies through a self-attention mechanism and spatiotemporal convolutional layers. Performance was evaluated with word error rate (WER) and speaker accuracy (SA) on benchmark datasets including GRID, TCD-TIMIT, and LRS2. The findings show that PSO-optimized V-LipNet achieves a lower WER than traditional LipNet and LSTM models, indicating that it can generalize to unseen speakers and noise. The results also show that metaheuristic optimization and attention mechanisms are effective for VSR. Ultimately, V-LipNet provides a powerful and adaptable multimodal interface solution for assistive technology, language learning, communication networks, and human-robot interaction.
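To make the hyperparameter search concrete, here is a minimal sketch of a textbook particle swarm optimizer applied to two of the quantities the abstract mentions, a learning rate and a convolutional filter size. Everything specific in it is an assumption for illustration: the inertia and acceleration coefficients, the search bounds, and above all `toy_validation_loss`, which is a hypothetical stand-in for training the network and measuring validation WER. It is not the authors' configuration.

```python
import random


def pso(objective, bounds, n_particles=20, n_iters=60,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `objective` over the box `bounds` with a basic global-best PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                  # best position seen by each particle
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # best position seen by the swarm

    for _ in range(n_iters):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward personal best + pull toward global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Clamp the new position to the search box.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def toy_validation_loss(params):
    """Hypothetical surrogate: lowest at log10(lr) = -3 and filter size = 5."""
    log_lr, filter_size = params
    return (log_lr + 3.0) ** 2 + 0.05 * (filter_size - 5.0) ** 2


# Assumed search space: log10(learning rate) in [-5, -1], filter size in [3, 9].
BOUNDS = [(-5.0, -1.0), (3.0, 9.0)]
best_params, best_loss = pso(toy_validation_loss, BOUNDS)
```

In a real search the filter size would need to be rounded to a valid integer before building the network, and each objective evaluation would involve a training run, so the swarm size and iteration count are usually kept small.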


Cite This Study

Ghazal et al. studied this question.

synapsesocial.com/papers/69c0df0bfddb9876e79c164e https://doi.org/10.1016/j.procs.2026.01.104
