In real-world environments, background noise significantly degrades the intelligibility and clarity of human speech. Existing audio-visual speech enhancement (AVSE) techniques often struggle in dynamic and noisy conditions. This study examines the inclusion of emotional features as a novel contextual cue within the AVSE framework. We investigate whether incorporating emotional understanding derived from facial landmarks improves speech enhancement performance. We propose a deep learning–based emotion-aware audio-visual speech enhancement (EAVSE) system that uses auditory, visual, and emotional information. The proposed EAVSE system extracts emotional features from facial landmarks and combines them with the audio and visual modalities. The enriched multimodal data are processed by a UNet-based encoder–decoder network for joint learning and optimization. The network iteratively refines the feature representation, guided by a distortion-inspired loss function. We train and evaluate the model on the Carnegie Mellon University Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset, known for its diverse audio-visual recordings with annotated emotions. Compared with an AVSE benchmark and audio-only speech enhancement systems, the proposed model achieves significant improvements in both objective metrics, namely Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI), and subjective speech quality metrics. In particular, the scale-invariant signal-to-distortion ratio (SI-SDR) loss function yields superior performance. These results suggest the usefulness of emotional contextual cues for AVSE. The experimental findings demonstrate the effectiveness of the proposed approach, particularly in challenging noisy environments with a signal-to-noise ratio (SNR) ≤ −7.5 dB. The proposed model achieved a Δ STOI of 7.32%, a Δ PESQ of 0.33, and a Δ S-SNR of 7.8 dB over the noisy benchmark at 0 dB SNR.
Hussain et al. (Thu,) studied this question.
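
The abstract names a scale-invariant signal-to-distortion ratio loss without giving its form. As a rough illustration only, the sketch below follows the commonly used SI-SDR definition: the estimate is projected onto the clean reference, and the energy of that projection is compared with the energy of the residual. The function names `si_sdr` and `si_sdr_loss`, the mean removal step, and the `eps` stabilizer are assumptions made for this example, and the authors' implementation may differ (for instance, it would normally operate on batched tensors rather than Python lists).

```python
import math


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB for two equal-length sample sequences.

    Illustrative sketch of the standard definition, not the paper's code.
    """
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    n = len(reference)

    # Remove the mean of each signal (a common convention, assumed here).
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]

    # Project the estimate onto the reference to obtain the scaled target.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    target = [scale * r for r in ref]

    # Whatever is not explained by the scaled target counts as distortion.
    residual = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(v * v for v in residual)

    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


def si_sdr_loss(estimate, reference):
    """Negative SI-SDR, so that minimizing the loss maximizes SI-SDR."""
    return -si_sdr(estimate, reference)


# Toy signals: a sinusoid standing in for clean speech, plus additive distortion.
clean = [math.sin(0.1 * i) for i in range(200)]
mild = [c + 0.1 * math.cos(0.37 * i) for i, c in enumerate(clean)]
heavy = [c + 0.5 * math.cos(0.37 * i) for i, c in enumerate(clean)]
```

Because the target is rescaled to match the estimate, multiplying the enhanced signal by a constant gain leaves the score essentially unchanged, which is the property that distinguishes this measure from a plain signal-to-distortion ratio.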