Background: Emotion recognition relies on the integration of multiple affective cues. In everyday contexts, however, facial expressions, vocal prosody, and semantic content may convey incongruent emotional information, generating emotional conflict and increasing cognitive demands. Objective: The present study examined how multimodal emotional conflict affects emotion recognition during video viewing, focusing on short videos in which a single actor simultaneously conveyed incongruent emotional cues across facial, vocal, and semantic channels. Methods: Forty-seven undergraduate students completed a gaze-based response task in which, after each short video, they provided a single judgment of the overall emotion conveyed by the stimulus. The videos depicted either congruent or incongruent combinations of semantic content, facial expressions, and vocal prosody across six basic emotions and a neutral condition. Data were analyzed using repeated-measures ANOVAs and generalized linear mixed-effects models. Results: Accuracy was consistently higher for congruent than for incongruent stimuli across all domains, indicating a robust emotional interference effect. Critically, the magnitude of this effect differed by domain. Semantic content showed the largest performance reduction under incongruence, followed by facial expression and vocal prosody. Mixed-effects models confirmed these effects while accounting for participant- and item-level variability and revealed a significant Congruency × Domain interaction. Conclusions: In a gaze-based response task requiring a single overall emotion judgment, emotional conflict disrupted recognition in a domain-specific manner, with semantic information being particularly vulnerable to multimodal interference.
Santis et al. (Mon,) studied this question.