What question did this study set out to answer?

The aim is to improve sarcasm detection in social media using a multimodal framework that integrates contextual cues.

May 16, 2026 | Open Access

A Contrastive Multimodal Representation Learning for Sarcasm Detection with Commonsense Integration

Key Points

The aim is to improve sarcasm detection in social media using a multimodal framework that integrates contextual cues.
SAGE-Net, the proposed multimodal framework, uses a cross-gating mechanism to filter misaligned features.
The model combines a domain-adapted BERT encoder for text, Hash-BERT for stylistic cues, and ResNet-152 for visual information.
A hierarchical attention mechanism adaptively assesses the relevance of each modality.
SAGE-Net achieved a high F1-score and outperformed existing baseline models across multiple metrics.
The model was robust to ambiguous and noisy inputs, supporting its effectiveness in real-world scenarios.
An ablation study confirmed the importance of components such as the gating threshold and hierarchical attention.

Abstract

• Proposes SAGE-Net, a multimodal framework that models sarcasm as latent semantic incongruity.
• Introduces a cross-gating mechanism using cosine similarity to filter semantically misaligned modal features.
• Employs a hierarchical co-attention fusion strategy inspired by latent factor decomposition.
• Integrates stylistic, visual, and textual cues to capture implicit sarcastic signals beyond surface meaning.
• Enhances modality-level interpretability through gated and attention-based representations.

Detecting sarcasm in social media is critical for enhancing sentiment analysis, content moderation, and online discourse understanding. This study proposes SAGE-Net (Sarcasm-Aware Gated Encoding Network). Unlike prior multimodal models that indiscriminately fuse all inputs, SAGE-Net introduces three innovations tailored for sarcasm: (1) a semantic gating mechanism that filters visually inconsistent text-image pairs before fusion; (2) a dedicated stylistic encoder (Hash-BERT) that treats hashtags and emojis as a separate modality; and (3) a hierarchical attention module that produces interpretable modality importance scores. Leveraging a publicly available multimodal Twitter dataset, the model extracts contextual text features using a domain-adapted BERT encoder, stylistic cues via a dedicated Hash-BERT model, and visual information through a ResNet-152 backbone. To address modality-level inconsistencies, a cross-gating mechanism evaluates alignment between text and image modalities, filtering noisy features using cosine similarity. These filtered embeddings are fused using cross- and co-attention modules, followed by a hierarchical attention mechanism that adaptively weighs fused modalities based on relevance. This fusion strategy aligns with latent factor decomposition principles to minimize semantic redundancy and amplify sarcastic cues.
Extensive experiments demonstrate that SAGE-Net surpasses existing baseline models across multiple metrics, achieving a high F1-score while maintaining robustness to ambiguous and noisy inputs. Furthermore, an ablation study validates the contribution of each component—including the gating threshold, stylistic encoder, and hierarchical attention module—to overall model performance. The proposed approach provides a scalable and interpretable solution for sarcasm detection in real-world multimodal communication scenarios.
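
The abstract describes two mechanisms that a small example may help make concrete: a cosine-similarity gate between text and image features, and an attention step that assigns each modality a relevance weight. The following is a minimal illustrative sketch, not the authors' implementation. The function names, the hard 0/1 gate, the default threshold of 0.5, the dot-product scoring, and the toy three-dimensional vectors are all assumptions made for illustration; the paper's actual gate form, threshold value, and attention architecture are not given in the abstract.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as unaligned
    return dot / (norm_u * norm_v)


def cross_gate(text_vec, image_vec, threshold=0.5):
    """Hypothetical hard gate: keep image features only when they align with text.

    The threshold value and the hard 0/1 gating are assumptions; a learned or
    soft gate would be an equally plausible reading of the abstract.
    """
    sim = cosine_similarity(text_vec, image_vec)
    gate = 1.0 if sim >= threshold else 0.0
    gated_image = [gate * x for x in image_vec]
    return gated_image, sim


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def modality_attention(modality_vecs, query):
    """Weigh modality vectors by the softmax of their dot product with a query.

    Returns the weighted sum and a per-modality weight dictionary, which plays
    the role of the interpretable importance scores mentioned in the abstract.
    """
    names = list(modality_vecs)
    scores = [
        sum(q * x for q, x in zip(query, modality_vecs[name])) for name in names
    ]
    weights = softmax(scores)
    fused = [
        sum(w * modality_vecs[name][i] for w, name in zip(weights, names))
        for i in range(len(query))
    ]
    return fused, dict(zip(names, weights))


if __name__ == "__main__":
    text = [1.0, 0.0, 1.0]
    aligned_image = [0.9, 0.1, 0.8]
    misaligned_image = [-1.0, 0.5, -1.0]

    kept, sim_kept = cross_gate(text, aligned_image)
    dropped, sim_dropped = cross_gate(text, misaligned_image)
    print("aligned image kept:", kept, "similarity:", round(sim_kept, 3))
    print("misaligned image gated:", dropped, "similarity:", round(sim_dropped, 3))

    modalities = {"text": text, "style": [0.0, 1.0, 0.0], "image": kept}
    fused, weights = modality_attention(modalities, query=text)
    print("modality weights:", weights)
```

In this sketch the gate zeroes out image features whose similarity to the text falls below the threshold, and the attention weights sum to one, so they can be read directly as relative modality importance.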
