What question did this study set out to answer?

The study aims to evaluate the effectiveness of multimodal versus unimodal models for detecting hateful memes.

February 24, 2026 (Open Access)

Why So Meme? A Comparative and Explainable Analysis of Multimodal Hateful Meme Detection

Key Points

Comparative analysis of multimodal (RoBERViT) and unimodal (RoBERTa, ViT) frameworks.
Evaluation across two datasets: Innopolis Hateful Memes and Facebook Hateful Meme.
Application of explainable AI techniques for qualitative insights into model reasoning.
The multimodal model achieved a toxic-class F1-score of 0.6439 on the Innopolis dataset, outperforming the text-only score of 0.5794.
Text-only models remained competitive on the Facebook dataset, indicating a challenge with benign confounders.
Qualitative analysis revealed a reliance on surface-level keywords, with text dominating the reasoning process.
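
As a reminder of what the toxic-class F1-score above measures, here is a minimal sketch in Python. The function name `f1_for_class` and the toy labels are illustrative assumptions, not the study's evaluation code or data. F1 is the harmonic mean of precision and recall, computed here with the toxic class (label 1) treated as the positive class.

```python
def f1_for_class(y_true, y_pred, positive=1):
    """Return the F1-score for one class, treating `positive` as the target label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy labels, invented for illustration only (1 = toxic, 0 = benign).
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 1, 0, 1, 0]
toy_f1 = f1_for_class(toy_true, toy_pred)
print(f"toxic-class F1 on toy data: {toy_f1:.4f}")
```

Reporting F1 for the toxic class alone, rather than overall accuracy, is a common choice when the dataset is imbalanced, since a model that labels everything benign can still score high accuracy.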

Abstract

The rise of toxic content, particularly in the form of hateful memes, poses a significant challenge to social media platforms. This paper presents an empirical comparative study of unimodal and multimodal architectures for toxic content detection. Rather than proposing a novel architecture, the study evaluates the efficacy of a modular Late Fusion framework (RoBERViT) against specialized unimodal baselines (RoBERTa and ViT) and a generalist Large Multimodal Model (LLaVA). Both unimodal and multimodal configurations were explored across two distinct benchmarks: the imbalanced Innopolis Hateful Memes dataset and the confounder-driven Facebook Hateful Meme dataset. Beyond quantitative metrics, this study conducts a qualitative analysis using Explainable AI (LIME) and a Large Multimodal Model (LLaVA) to investigate model reasoning. Results demonstrate that the multimodal fusion model consistently outperformed its unimodal counterparts on the Innopolis Hateful Memes dataset, achieving a toxic class F1-score of 0.6439 compared to the text-only score of 0.5794. However, on the Facebook Hateful Meme dataset, text-only models remain competitive, highlighting the “benign confounder” challenge. The qualitative analysis reveals that text remains the dominant modality, with models often relying on surface-level keywords. Notably, the Vision Transformer frequently uses text overlays as a visual proxy for hate, while the LLaVA model struggles with hallucinated toxicity in benign confounder contexts. These findings underscore the persistent challenge of achieving true multimodal understanding in hate speech detection.
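
To make the idea of late fusion concrete, the sketch below shows one generic, decision-level variant in Python: each modality produces its own toxicity probability, and the two are combined only at the end. The abstract does not specify how RoBERViT combines its text and image branches, so the weighted average, the function names, and the default values here are all assumptions for illustration, not the authors' method.

```python
def late_fusion(p_text, p_image, w_text=0.5):
    """Combine per-modality toxicity probabilities with a weighted average.

    p_text and p_image stand in for the outputs of a text model and an
    image model; w_text is a hypothetical weight on the text branch.
    """
    if not 0.0 <= w_text <= 1.0:
        raise ValueError("w_text must be in [0, 1]")
    return w_text * p_text + (1.0 - w_text) * p_image


def classify(p_text, p_image, w_text=0.5, threshold=0.5):
    """Return 1 (toxic) if the fused probability reaches the threshold, else 0."""
    return int(late_fusion(p_text, p_image, w_text) >= threshold)


# Invented probabilities, for illustration only.
fused = late_fusion(p_text=0.9, p_image=0.2)
label = classify(p_text=0.9, p_image=0.2)
print(f"fused probability: {fused:.2f}, predicted label: {label}")
```

Because each branch is scored independently before fusion, a confident text branch can dominate the final decision, which is one way the text-heavy reliance described in the qualitative analysis could arise in a setup like this.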


Cite This Study

Azmi et al. studied this question.

synapsesocial.com/papers/699d401ade8e28729cf6518b https://doi.org/10.3390/make8020050
