• Proposes SAGE-Net, a multimodal framework that models sarcasm as latent semantic incongruity.
• Introduces a cross-gating mechanism using cosine similarity to filter semantically misaligned modal features.
• Employs a hierarchical co-attention fusion strategy inspired by latent factor decomposition.
• Integrates stylistic, visual, and textual cues to capture implicit sarcastic signals beyond surface meaning.
• Enhances modality-level interpretability through gated and attention-based representations.

Detecting sarcasm in social media is critical for enhancing sentiment analysis, content moderation, and online discourse understanding. This study proposes SAGE-Net (Sarcasm-Aware Gated Encoding Network). Unlike prior multimodal models that indiscriminately fuse all inputs, SAGE-Net introduces three innovations tailored for sarcasm: (1) a semantic gating mechanism that filters visually inconsistent text-image pairs before fusion; (2) a dedicated stylistic encoder (Hash-BERT) that treats hashtags and emojis as a separate modality; and (3) a hierarchical attention module that produces interpretable modality importance scores. Leveraging a publicly available multimodal Twitter dataset, the model extracts contextual text features using a domain-adapted BERT encoder, stylistic cues via a dedicated Hash-BERT model, and visual information through a ResNet-152 backbone. To address modality-level inconsistencies, a cross-gating mechanism evaluates alignment between text and image modalities, filtering noisy features using cosine similarity. These filtered embeddings are fused using cross- and co-attention modules, followed by a hierarchical attention mechanism that adaptively weighs fused modalities based on relevance. This fusion strategy aligns with latent factor decomposition principles to minimize semantic redundancy and amplify sarcastic cues.
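As a purely illustrative sketch of the cross-gating idea described above, the following Python snippet scores text-image alignment with cosine similarity and suppresses image features that fall below a threshold. The function names, the hard 0/1 gate, and the threshold value of 0.5 are assumptions made for illustration; the abstract does not specify whether SAGE-Net uses a hard or a soft gate, nor which threshold it selects.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cross_gate(
    text_vec: Sequence[float],
    image_vec: Sequence[float],
    threshold: float = 0.5,  # hypothetical value, not taken from the paper
) -> Tuple[List[float], float]:
    """Hard gate (an assumption): pass image features only when they align with the text.

    Returns the gated image features and the similarity score that drove the gate.
    """
    sim = cosine_similarity(text_vec, image_vec)
    gate = 1.0 if sim >= threshold else 0.0
    return [gate * v for v in image_vec], sim


if __name__ == "__main__":
    aligned, sim_a = cross_gate([1.0, 0.0], [1.0, 0.0])
    misaligned, sim_m = cross_gate([1.0, 0.0], [0.0, 1.0])
    print(aligned, sim_a)
    print(misaligned, sim_m)
```

In this sketch the embeddings are assumed to live in a shared space of equal dimension; in practice the BERT and ResNet-152 outputs would first need to be projected to a common size before the similarity is computed.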
Extensive experiments demonstrate that SAGE-Net surpasses existing baseline models across multiple metrics, achieving a high F1-score while maintaining robustness to ambiguous and noisy inputs. Furthermore, an ablation study validates the contribution of each component—including the gating threshold, stylistic encoder, and hierarchical attention module—to overall model performance. The proposed approach provides a scalable and interpretable solution for sarcasm detection in real-world multimodal communication scenarios.
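To make the notion of interpretable modality importance scores concrete, the sketch below shows one simple way such scores could be produced: each modality embedding is scored against a query vector, the scores are normalized with a softmax, and the fused representation is the weighted sum. This covers only a single modality-level attention step and is not the paper's full hierarchical module; the `modality_attention` function, the dot-product scoring, and the fixed query vector are hypothetical (in a trained model the query would be learned).

```python
import math
from typing import Dict, List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def modality_attention(
    modalities: Dict[str, Sequence[float]],
    query: Sequence[float],  # hypothetical stand-in for a learned context vector
) -> Tuple[Dict[str, float], List[float]]:
    """Return per-modality importance weights and the attention-weighted fused vector."""
    names = list(modalities)
    # Dot-product relevance of each modality embedding to the query (an assumption).
    scores = [sum(q * v for q, v in zip(query, modalities[n])) for n in names]
    weights = softmax(scores)
    dim = len(query)
    fused = [
        sum(w * modalities[n][i] for w, n in zip(weights, names))
        for i in range(dim)
    ]
    return dict(zip(names, weights)), fused


if __name__ == "__main__":
    feats = {
        "text": [2.0, 0.0],
        "style": [0.0, 0.0],
        "image": [0.0, 0.0],
    }
    importance, fused = modality_attention(feats, query=[1.0, 0.0])
    print(importance)
    print(fused)
```

Because the weights sum to one, they can be read directly as relative modality importance for a given input, which is the kind of modality-level interpretability the abstract refers to.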
Mazroa et al. (Fri,) studied this question.