With the continuous emergence of social media platforms on which people express their emotions in daily life, the multi-modal sarcasm detection (MSD) task has attracted growing attention. However, owing to the nature of sarcasm, two main challenges stand in the way of robust MSD: 1) existing mainstream methods often overlook the weak correlation between modalities and thus ignore the sarcasm information carried by each modality on its own; 2) modeling cross-modal interactions in unaligned multi-modal data is inefficient. This paper therefore proposes a multi-task jointly trained aggregation network (MTAN), which adopts a network suited to each modality according to its processing task. Specifically, we design a multi-task CLIP framework comprising a uni-modal text task, a uni-modal image task, and a cross-modal interaction task, so that sentiment cues from multiple tasks can be used for multi-modal sarcasm detection. In addition, we design a global-local cross-modal interaction learning method that uses discourse-level representations from each modality as the global multi-modal context to interact with local uni-modal features. This not only avoids the quadratic scaling cost of previous local-local cross-modal interaction methods but also allows the global multi-modal context and the local uni-modal features to reinforce each other and improve progressively as layers are stacked. Extensive experiments and in-depth analysis show that our model achieves state-of-the-art performance in multi-modal sarcasm detection.
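The global-local interaction described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract gives no architectural details, so the mean-pooled global vectors, the projection-free single-head attention, the residual updates, and all function names (`attend`, `global_local_layer`, `global_local_interaction`, `score_counts`) are assumptions made for illustration. The point is only that each local token attends to a handful of global vectors instead of every token of the other modality, so the number of attention scores grows with the sum of the sequence lengths rather than their product.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(queries, keys_values):
    # Single-head scaled dot-product attention with no learned projections
    # (an illustrative simplification, not the paper's layer).
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values


def global_local_layer(text_local, image_local, global_ctx):
    # 1) The global context gathers information from all local tokens.
    locals_all = np.concatenate([text_local, image_local], axis=0)
    global_ctx = global_ctx + attend(global_ctx, locals_all)
    # 2) Each modality's local tokens are refined by the updated global context.
    text_local = text_local + attend(text_local, global_ctx)
    image_local = image_local + attend(image_local, global_ctx)
    return text_local, image_local, global_ctx


def global_local_interaction(text_local, image_local, num_layers=2):
    # Assumption: one global vector per modality, obtained by mean pooling.
    global_ctx = np.stack([text_local.mean(axis=0), image_local.mean(axis=0)])
    for _ in range(num_layers):
        text_local, image_local, global_ctx = global_local_layer(
            text_local, image_local, global_ctx
        )
    return text_local, image_local, global_ctx


def score_counts(n_text, n_image, n_global=2):
    # Attention scores computed per layer by each scheme.
    local_local = 2 * n_text * n_image               # text->image and image->text
    global_local = 2 * n_global * (n_text + n_image)  # global->local and local->global
    return local_local, global_local
```

Under these assumptions, 32 text tokens and 50 image patches need 3200 attention scores per layer for bidirectional local-local attention, against 328 for the global-local scheme with two global vectors.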
Ou et al. studied this question.