Detecting sarcasm in social media differs fundamentally from the tasks covered by general VLM benchmarks: it is a pragmatic contradiction problem in which the literal signal in one modality is intentionally misaligned with the intended meaning. Dominant pretraining objectives (e.g., CLIP-style contrastive agreement), by contrast, bias models toward modality alignment rather than incongruity detection. We present SCARF, a contradiction-aware framework that equips large multimodal models with explicit sarcasm cues and context-sensitive retrieval. SCARF constructs coarse scene cues and fine-grained localized evidence via tag-constrained QA, then distills them together with visual tokens into a FUSION control vector for the LLM; a label-contrastive retriever supplies type- and context-matched exemplars, and a local multi-view encoder surfaces micro-cues. With the same backbone and training data, SCARF attains 87.92% Acc / 86.67% F1 on MMSD2.0 and 77.14% Acc / 76.44% F1 zero-shot on XDMSD, outperforming a comparably fine-tuned LLaVA-1.5. Ablations show that sarcasm-clue fusion is the main driver of the gains and that tag-constrained QA improves rationale grounding and reduces hallucinations.
Li et al. (Tue,) studied this question.
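To make the idea of a label-contrastive retriever more concrete, the following is a minimal sketch and not SCARF's actual implementation. It assumes a supervised-contrastive (SupCon-style) objective as a stand-in for the retriever's training loss, and plain cosine-similarity top-k lookup as a stand-in for exemplar retrieval. The function names `label_contrastive_loss` and `retrieve`, the temperature value, and the toy two-dimensional embeddings are all hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def label_contrastive_loss(embeddings, labels, temperature=0.1):
    """SupCon-style loss: same-label pairs are positives, the rest negatives.

    Assumption: this is an illustrative stand-in objective, not the loss
    used by SCARF, which the abstract does not specify.
    """
    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without a same-label partner is skipped
        # Temperature-scaled similarities to every other sample.
        logits = {j: cosine(embeddings[i], embeddings[j]) / temperature
                  for j in range(n) if j != i}
        log_denom = math.log(sum(math.exp(z) for z in logits.values()))
        # Average negative log-probability of picking each positive.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def retrieve(query, bank, k=2):
    """Return indices of the k bank embeddings most similar to the query."""
    ranked = sorted(range(len(bank)),
                    key=lambda j: cosine(query, bank[j]), reverse=True)
    return ranked[:k]


# Toy exemplar bank (hypothetical): two clusters in 2-D embedding space.
bank = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
labels = [1, 1, 0, 0]  # 1 = sarcastic, 0 = non-sarcastic (illustrative)

loss = label_contrastive_loss(bank, labels)
neighbours = retrieve([1.0, 0.0], bank, k=2)
```

Under this objective the loss is low when same-label embeddings already cluster together and high when labels cut across clusters, so training pulls label-matched exemplars toward each other. At inference time, nearest-neighbour lookup then tends to return exemplars whose label matches the query's.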