Hateful memes are internet memes that overlay short text on images and spread virally, often carrying offensive content that targets groups on the basis of gender, religion, race, or other characteristics. Because they spread quickly and cause real harm, detecting them specifically is critically important. Multimodal models, which process images and text jointly, are well suited to identifying hateful content in memes. This paper analyzes the image-text fusion methods, optimization strategies, and evaluation metrics used by multimodal models for hateful meme detection. The analysis shows that incorporating cross-attention in the image-text fusion stage captures complementary information between the two modalities and thereby improves downstream task performance. Optimization techniques such as multi-task learning and adversarial training further improve robustness and detection accuracy, while model distillation speeds up detection with minimal accuracy loss, which helps identify newly released hateful memes promptly. The paper concludes that multimodal models hold significant potential for mitigating the spread of online hate, and it offers theoretical and practical reference points for related research.
Mengqi Yan (Wed,) studied this question.
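To make the cross-attention fusion described in the abstract concrete, the following is a minimal sketch in NumPy, not the specific architecture the paper analyzes. It assumes text token features act as queries and image region features act as keys and values; the function names (`init_params`, `cross_attention`, `fuse`), the feature dimensions, the random weights, and the mean-pooled sigmoid classifier head are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(rng, d_text, d_image, d_model):
    """Random projection weights (hypothetical; a real model would learn these)."""
    return {
        "w_q": rng.normal(scale=0.1, size=(d_text, d_model)),
        "w_k": rng.normal(scale=0.1, size=(d_image, d_model)),
        "w_v": rng.normal(scale=0.1, size=(d_image, d_model)),
        "w_out": rng.normal(scale=0.1, size=(d_text + d_model,)),
        "b_out": 0.0,
    }


def cross_attention(text_feats, image_feats, params):
    """Text tokens (queries) attend over image regions (keys and values)."""
    q = text_feats @ params["w_q"]            # (T, d_model)
    k = image_feats @ params["w_k"]           # (R, d_model)
    v = image_feats @ params["w_v"]           # (R, d_model)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T, R) scaled dot products
    weights = softmax(scores, axis=-1)        # each row sums to 1 over regions
    return weights @ v, weights               # (T, d_model), (T, R)


def fuse(text_feats, image_feats, params):
    """Concatenate text features with attended image context, then classify."""
    attended, weights = cross_attention(text_feats, image_feats, params)
    fused = np.concatenate([text_feats, attended], axis=-1).mean(axis=0)
    logit = fused @ params["w_out"] + params["b_out"]
    prob_hateful = 1.0 / (1.0 + np.exp(-logit))  # sigmoid over a single logit
    return float(prob_hateful), weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = init_params(rng, d_text=8, d_image=6, d_model=4)
    text = rng.normal(size=(5, 8))    # 5 text tokens (stand-in features)
    image = rng.normal(size=(3, 6))   # 3 image regions (stand-in features)
    prob, attn = fuse(text, image, params)
    print("attention shape:", attn.shape)
    print("probability:", prob)
```

Each row of the attention matrix shows how strongly one text token draws on each image region, which is one way the complementary information between modalities can be inspected. With untrained random weights the output probability carries no meaning; the sketch only illustrates the data flow.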