March 23, 2021

Multimodal and Context-Aware Emotion Perception Model With Multiplicative Fusion

Abstract

We present a learning model for multimodal context-aware emotion recognition. Our approach combines multiple co-occurring human modalities (such as facial, audio, textual, and pose/gait cues) and two interpretations of context. For the first context interpretation, we use a self-attention-based CNN to gather and encode background semantic information from the input image/video. Similarly, for the second context interpretation, we use depth maps to model the sociodynamic interactions among people in the input image/video. We use multiplicative fusion to combine the modality and context channels; the fusion learns to focus on the more informative input channels and suppress the others for every incoming datapoint. We demonstrate the efficiency of our model on four benchmark emotion recognition datasets (IEMOCAP, CMU-MOSEI, EMOTIC, and GroupWalk). Our model outperforms state-of-the-art (SOTA) learning methods with an average 5%-9% increase over all the datasets. We also perform ablation studies to motivate the importance of multimodality, context, and multiplicative fusion.
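
The abstract does not give the fusion loss itself, so the following is only a minimal sketch of one multiplicative-combination formulation from the multimodal learning literature, not necessarily the one used in the paper. It assumes each channel (modality or context) produces its own probability for the correct class, and it scales each channel's log-loss by how poorly the other channels did, so a datapoint already handled well by one channel contributes little loss through the remaining ones. The function name, the `beta` parameter, and the `eps` clamp are illustrative assumptions.

```python
import math


def multiplicative_fusion_loss(probs, beta=2.0, eps=1e-12):
    """Hypothetical multiplicative-combination loss for one datapoint.

    probs: per-channel probabilities assigned to the correct class,
           one entry per modality/context channel (assumed input format).
    beta:  assumed hyperparameter controlling how strongly a confident
           channel suppresses the loss of the other channels.
    """
    m = len(probs)
    if m < 2:
        raise ValueError("need at least two channels to fuse")

    loss = 0.0
    for i, p_i in enumerate(probs):
        # Product of (1 - p_j) over all *other* channels: close to 0 when
        # some other channel is already confident about the correct class.
        others = math.prod(1.0 - p_j for j, p_j in enumerate(probs) if j != i)
        weight = others ** (beta / (m - 1))
        # Clamp to avoid log(0); the weight scales this channel's log-loss.
        loss += -weight * math.log(max(p_i, eps))
    return loss


if __name__ == "__main__":
    # One confident channel (0.99) and two weak ones (0.3 each).
    fused = multiplicative_fusion_loss([0.99, 0.3, 0.3])
    additive = sum(-math.log(p) for p in [0.99, 0.3, 0.3])
    print(f"multiplicative: {fused:.4f}  additive: {additive:.4f}")
```

With `beta=0` every weight equals 1 and the sketch reduces to a plain sum of per-channel log-losses; larger `beta` values suppress the weaker channels more aggressively whenever another channel is confident.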

Authors

Trisha Mittal

Dolby (Netherlands)

Aniket Bera

Purdue University West Lafayette

Dinesh Manocha

University of North Carolina at Chapel Hill

Journals

IEEE Multimedia

Institutions

University of Maryland, College Park
