We present a learning model for multimodal, context-aware emotion recognition. Our approach combines multiple co-occurring human modalities (such as facial, audio, textual, and pose/gait) with two interpretations of context. For the first interpretation, we use a self-attention-based CNN to gather and encode background semantic information from the input image/video. For the second, we use depth maps to model the sociodynamic interactions among people in the input image/video. We combine the modality and context channels using multiplicative fusion, which learns to focus on the more informative input channels and suppress the others for every incoming datapoint. We demonstrate the efficiency of our model on four benchmark emotion recognition datasets (IEMOCAP, CMU-MOSEI, EMOTIC, and GroupWalk). Our model outperforms state-of-the-art (SOTA) learning methods, with an average increase of 5%-9% over all the datasets. We also perform ablation studies to motivate the importance of multimodality, context, and multiplicative fusion.
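The multiplicative fusion idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the loss itself, so the code below assumes one commonly used multiplicative-combination formulation, in which each channel's cross-entropy term is scaled by how poorly the other channels predict the true class. The function name `multiplicative_fusion_loss`, the parameter `beta`, and the example probabilities are all hypothetical and are not taken from the paper.

```python
import math


def multiplicative_fusion_loss(true_class_probs, beta=2.0):
    """Sketch of a multiplicative-combination loss (assumed form, not the paper's).

    true_class_probs: one probability per channel (e.g. face, audio, context),
        each being that channel's predicted probability of the correct label.
    beta: hypothetical hyperparameter controlling how strongly channels are
        down-weighted when the other channels are already confident.
    Returns (loss, weights), where weights[i] scales channel i's log-loss.
    """
    m = len(true_class_probs)
    if m < 2:
        raise ValueError("need at least two channels to fuse")

    eps = 1e-12  # avoid log(0)
    p = [min(max(q, eps), 1.0) for q in true_class_probs]

    weights = []
    for i in range(m):
        others = p[:i] + p[i + 1:]
        # Weight is small when the other channels already predict the
        # true class well, so this channel's error is penalised less.
        w = math.prod(1.0 - q for q in others) ** (beta / (m - 1))
        weights.append(w)

    loss = -sum(w * math.log(q) for w, q in zip(weights, p))
    return loss, weights


# Three hypothetical channels: one confident, one unsure, one wrong.
loss, weights = multiplicative_fusion_loss([0.9, 0.5, 0.2], beta=2.0)
print(loss, weights)
```

In this sketch, the channel that predicts the true class poorly (0.2) gets the smallest weight because the other channels already cover that datapoint, which is one way to obtain the "focus on informative channels, suppress others" behaviour the abstract describes.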
Trisha Mittal
Dolby (Netherlands)
Aniket Bera
Purdue University West Lafayette
Dinesh Manocha
University of North Carolina at Chapel Hill
IEEE Multimedia
University of Maryland, College Park
Mittal et al. (Tue,) studied this question.
synapsesocial.com/papers/69e648185cb6e92637e7088e — DOI: https://doi.org/10.1109/mmul.2021.3068387