March 3, 2026

Clue and Context Fusion for Sarcasm Detection with Large Multimodal Models

Key Points

Sarcasm detection accuracy reached 87.92% on MMSD2.0 with a multimodal model that treats sarcasm as a pragmatic contradiction problem.
The SCARF framework equips large multimodal models with explicit sarcasm cues and context-sensitive retrieval.
Ablation studies indicate that sarcasm-clue fusion is the main driver of the performance gains.
The findings suggest that sarcasm in social media calls for contradiction-aware methods rather than models biased toward modality alignment.

Abstract

Detecting sarcasm in social media is fundamentally different from general VLM benchmarks: it is a pragmatic contradiction problem in which the literal signal in one modality is intentionally misaligned with the intended meaning, while dominant pretraining (e.g., CLIP-style contrastive agreement) biases models toward modality alignment rather than incongruity detection. We present SCARF, a contradiction-aware framework that equips large multimodal models with explicit sarcasm cues and context-sensitive retrieval. SCARF constructs coarse scene cues and fine localized evidence via tag-constrained QA, then distills them with visual tokens into a FUSION control vector for the LLM; a label-contrastive retriever supplies type- and context-matched exemplars, and a local multi-view encoder surfaces micro-cues. With the same backbone and training data, SCARF attains 87.92% Acc / 86.67% F1 on MMSD2.0 and 77.14% Acc / 76.44% F1 zero-shot on XDMSD, outperforming a comparably fine-tuned LLaVA-1.5. Ablations show sarcasm-clue fusion is the main driver of gains, and tag-constrained QA improves rationale grounding and reduces hallucinations.
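
The abstract does not spell out how the label-contrastive retriever is trained or queried. As a rough illustration only, the sketch below uses a generic supervised contrastive loss (same-label items act as positives) and a plain cosine-similarity lookup. It is not SCARF's actual objective or code: the names `label_contrastive_loss` and `retrieve`, the temperature value, and the toy 2-D embeddings are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def label_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss (hypothetical stand-in, not the paper's).

    For each anchor, items sharing its label are positives and every other
    item in the batch is part of the softmax denominator.
    """
    n = len(embeddings)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no same-label partner contributes nothing
        logits = {
            j: cosine(embeddings[i], embeddings[j]) / temperature
            for j in range(n)
            if j != i
        }
        log_denom = math.log(sum(math.exp(s) for s in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0


def retrieve(query, bank, k=2):
    """Return the k bank entries whose embeddings are most similar to the query."""
    ranked = sorted(bank, key=lambda item: cosine(query, item["embedding"]), reverse=True)
    return ranked[:k]


# Toy 2-D embeddings: two tight clusters (assumed data, for illustration only).
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_aligned = label_contrastive_loss(toy_embeddings, [1, 1, 0, 0])
loss_mixed = label_contrastive_loss(toy_embeddings, [1, 0, 1, 0])

toy_bank = [
    {"id": "a", "label": 1, "embedding": [0.9, 0.1]},
    {"id": "b", "label": 0, "embedding": [0.0, 1.0]},
    {"id": "c", "label": 0, "embedding": [0.1, 0.9]},
]
top = retrieve([1.0, 0.0], toy_bank, k=1)
```

In this toy setup the loss is lower when labels agree with the embedding clusters than when they cut across them, which is the property a retriever trained this way would exploit to return label-matched exemplars.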

Cite This Study

Li et al. studied this question.

synapsesocial.com/papers/69a75ae2c6e9836116a21501 https://doi.org/10.1145/3793680
