What question did this study set out to answer?

The aim is to enhance speech clarity and intelligibility in noisy environments using emotional cues.

January 18, 2026

Audio-visual speech enhancement in noisy environments using emotion-based contextual cues

Key Points

The aim is to enhance speech clarity and intelligibility in noisy environments using emotional cues.
Proposed a deep learning-based emotion-aware audio-visual speech enhancement system (EAVSE).
Extracted emotional features from facial landmarks to integrate with audio and visual data.
Used a UNet-based encoder-decoder network to jointly learn from the multimodal data.
Evaluated model performance using the Carnegie Mellon University dataset with emotional annotations.
Achieved significant improvements in objective and subjective speech quality metrics.
Demonstrated Δ STOI of 7.32%, Δ PESQ of 0.33, and Δ S-SNR of 7.8 dB over the noisy benchmark at 0 dB SNR.
Highlighted the effectiveness of emotional contextual cues in enhancing speech in noisy environments.

Abstract

In real-world environments, background noise significantly degrades the intelligibility and clarity of human speech. Existing audio-visual speech enhancement (AVSE) techniques often struggle in dynamic and noisy conditions. This study examines emotional features as a novel contextual cue within the AVSE framework. We investigate whether incorporating emotional understanding derived from facial landmarks improves speech enhancement performance. We propose a deep learning-based emotion-aware audio-visual speech enhancement system (EAVSE) that uses auditory, visual, and emotional information. The proposed EAVSE extracts emotional features from facial landmarks and combines them with the audio and visual modalities. The enriched multimodal data are processed by a UNet-based encoder-decoder network for joint learning and optimization. The network iteratively refines the feature representation, guided by a distortion-inspired loss function. We train and evaluate the model on the Carnegie Mellon University Multimodal Opinion Sentiment and Emotion Intensity dataset, known for its diverse audio-visual recordings with annotated emotions. Compared to the AVSE benchmark and audio-only speech enhancement systems, the proposed model achieves significant improvements in both objective metrics, namely Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI), and subjective speech quality metrics. In particular, training with the scale-invariant signal-to-distortion ratio (SI-SDR) loss function yields superior performance. These results suggest that emotional contextual cues are useful for AVSE. The experimental findings demonstrate the effectiveness of the proposed system, particularly in challenging noisy environments with a signal-to-noise ratio (SNR) ≤ −7.5 dB. At 0 dB SNR, the proposed model achieves a Δ STOI of 7.32%, a Δ PESQ of 0.33, and a Δ S-SNR of 7.8 dB over the noisy benchmark.
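
For readers unfamiliar with the scale-invariant signal-to-distortion ratio mentioned in the abstract, the following is a minimal NumPy sketch of its commonly used definition. It is not taken from the study: the function names `si_sdr` and `si_sdr_loss`, the zero-mean normalization, and the `eps` stabilizer are illustrative assumptions, and the authors' actual training code may differ.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D waveforms (illustrative sketch)."""
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)

    # Remove the DC offset so the measure ignores constant shifts (a common convention).
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()

    # Project the estimate onto the reference: this is the "target" component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference

    # Everything that is not explained by the reference counts as distortion.
    residual = estimate - target

    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)


def si_sdr_loss(estimate, reference):
    """Negative SI-SDR, so that minimizing the loss maximizes SI-SDR."""
    return -si_sdr(estimate, reference)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.standard_normal(16000)   # stand-in for one second of speech
    noise = rng.standard_normal(16000)   # stand-in for background noise
    enhanced = clean + 0.1 * noise       # hypothetical enhanced output

    print("SI-SDR (dB):", si_sdr(enhanced, clean))
    print("SI-SDR after rescaling (dB):", si_sdr(3.0 * enhanced, clean))
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the enhanced signal leaves the score essentially unchanged, which is the property that distinguishes this measure from a plain signal-to-noise ratio.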
