Memes have become a dominant medium of online expression, blending humor, satire, and cultural commentary through visual and textual elements. While often used for entertainment and community building, memes can also propagate hate speech in subtle and implicit ways, which makes automatic detection particularly challenging. This study introduces the Indonesian Multimodal Meme Dataset (INDOMEME), the first expert-annotated multimodal dataset for hateful meme detection in the Indonesian language. The dataset contains 5,023 memes collected from Facebook and annotated under three complementary schemes: hatefulness, appropriateness, and topical focus. Each meme is further enriched with optical character recognition (OCR) text and machine-generated captions, providing a comprehensive resource for multimodal analysis. Using this dataset, the study conducts extensive experiments addressing four research questions. First, unimodal models (text-only and image-only) are benchmarked against multimodal fusion models, showing that multimodal approaches outperform unimodal baselines; the best multimodal model (IndoBERTweet + Vision Transformer (ViT)) achieves a macro-F1 of 0.820 on hate speech detection and 0.809 on appropriateness classification. Second, several state-of-the-art multimodal large language models (MLLMs), including GPT-4o, Gemini 2.5 Flash, and Gemma 3 27B, are evaluated in zero-shot settings. GPT-4o reaches a macro-F1 of 0.772 for appropriateness detection, but MLLMs remain less effective than supervised approaches for hatefulness classification. Finally, multitask learning is explored by jointly modeling appropriateness and hatefulness with a dual-head architecture, which yields consistent performance gains across text-only models. These findings underscore the benefit of multimodal resources and multitask architectures in advancing Indonesian meme hate speech detection.
Pamungkas et al. (Fri,) studied this question.
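All scores reported above are macro-F1 values, that is, the unweighted mean of the per-class F1 scores, so a minority class such as hateful memes counts as much as the majority class. The following is a minimal sketch of that metric in plain Python, not the authors' evaluation code; the function name `macro_f1` and the label strings in the usage note are illustrative assumptions.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores.

    Illustrative helper (hypothetical name); classes are taken from the
    union of labels seen in y_true and y_pred.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class received a false positive
            fn[truth] += 1   # true class received a false negative

    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

For example, with the hypothetical labels `y_true = ["hate", "hate", "not", "not", "not"]` and `y_pred = ["hate", "not", "not", "not", "hate"]`, the per-class F1 scores are 0.5 for `hate` and 2/3 for `not`, so `macro_f1(y_true, y_pred)` is about 0.583.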