May 21, 2021

A Unimodal Representation Learning and Recurrent Decomposition Fusion Structure for Utterance-Level Multimodal Embedding Learning

Abstract

Learning a unified embedding for utterance-level video has attracted significant attention recently due to the rapid development of social media and its broad applications. An utterance normally contains not only spoken language but also nonverbal behaviors such as facial expressions and vocal patterns. Instead of directly learning an utterance embedding from low-level features, we first explore high-level representations for each modality separately via a unimodal representation learning gyroscope structure. In this way, the learnt unimodal representations are more representative and contain more abstract semantic information. In the gyroscope structure, we introduce multi-scale kernel learning, ‘channel expansion’ and ‘channel fusion’ operations to explore high-level features both spatially and channelwise. Another insight of our method is that we fuse the representations of all modalities into a unified embedding by interpreting the fusion procedure as a flow of inter-modality information between modalities, which makes both the information to be fused and the fusion process more specialized. Specifically, considering that each modality carries both modality-specific and cross-modality interactions, we decompose the unimodal representations into intra- and inter-modality dynamics using a gating mechanism, and then fuse the inter-modality dynamics by passing them from the preceding modalities to the following one using a recurrent neural fusion architecture. Extensive experiments demonstrate that our method achieves state-of-the-art performance on multiple benchmark datasets.
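The decomposition and fusion idea described in the abstract can be illustrated with a rough NumPy sketch. This is not the authors' implementation: the abstract gives no equations, so the sigmoid gate, the complementary split into intra- and inter-modality parts, the plain tanh recurrence, the modality order, the final concatenation, and all names (`decompose`, `recurrent_fuse`, the weight matrices) are assumptions made purely for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decompose(h, W_gate, b_gate):
    """Split one unimodal representation with a soft gate (assumed form)."""
    g = sigmoid(W_gate @ h + b_gate)  # gate values in (0, 1)
    inter = g * h                     # assumed: part shared with other modalities
    intra = (1.0 - g) * h             # assumed: modality-specific remainder
    return intra, inter

def recurrent_fuse(inter_parts, W_in, W_state, b):
    """Pass inter-modality parts through a simple tanh recurrence,
    one modality per step (a stand-in for the paper's fusion network)."""
    state = np.zeros(W_state.shape[0])
    for x in inter_parts:
        state = np.tanh(W_in @ x + W_state @ state + b)
    return state

rng = np.random.default_rng(0)
d = 8  # hypothetical representation size
names = ("language", "acoustic", "visual")  # assumed modality order

# Random stand-ins for the learnt high-level unimodal representations.
reps = {n: rng.standard_normal(d) for n in names}
# One hypothetical gate (weights, bias) per modality.
gates = {n: (0.1 * rng.standard_normal((d, d)), np.zeros(d)) for n in names}
W_in = 0.1 * rng.standard_normal((d, d))
W_state = 0.1 * rng.standard_normal((d, d))
b = np.zeros(d)

intra_parts, inter_parts = [], []
for n in names:
    intra, inter = decompose(reps[n], *gates[n])
    intra_parts.append(intra)
    inter_parts.append(inter)

fused = recurrent_fuse(inter_parts, W_in, W_state, b)
# Assumed final step: concatenate the intra parts with the fused state.
embedding = np.concatenate(intra_parts + [fused])
print(embedding.shape)  # (32,)
```

With a complementary gate, the intra and inter parts always sum back to the original representation, so the split only routes information and discards none of it. Whether the paper's gate has this property is not stated in the abstract.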

Cite This Study

Mai et al. (2021) studied this question.

synapsesocial.com/papers/69e648185cb6e92637e7088d https://doi.org/10.1109/tmm.2021.3082398
