December 13, 2025 (Open Access)

Guardians of digital safety: benchmarking large language models in the fight against online toxicity

Key Points

This research assesses how effectively large language models (LLMs) moderate toxic online content.
It evaluates LLM-based moderation solutions, including the OpenAI moderation model and Llama-Guard-3, and their detection accuracy.
It investigates performance under adversarial attacks such as input perturbation and prompt injection.
The evaluation uses textual datasets, including X (Twitter) posts, Amazon product reviews, and news articles.
LLM-based approaches showed high accuracy in detecting harmful content.
The models exhibited low rates of false positives and false negatives.
The findings indicate that the models remain robust under adversarial conditions.

Abstract

The extensive spread of harmful content, including hate speech, harassment, and violent and adult material, across online platforms and media channels poses significant challenges and has raised considerable concern within various societal groups. Government bodies, educators, and parents frequently find themselves in disagreement with media providers over the best methods to regulate and restrict such content. Automated content moderation technologies have become critical tools in addressing these challenges, particularly through natural language processing (NLP) techniques that can automatically detect and filter sensitive textual content, such as offensive language, violence, and adult material, enabling platforms to enforce moderation policies on a large scale. Despite their widespread use, current moderation technologies struggle with detection accuracy, often producing substantial numbers of false positives and false negatives. Enhancing content moderation systems requires more advanced algorithms capable of accurately interpreting textual context. In this study, we assess current large language model (LLM)-based moderation solutions, specifically the OpenAI moderation model and Llama-Guard-3, examining their effectiveness in detecting sensitive content. Additionally, we investigate the capabilities of contemporary LLMs, including OpenAI generative pre-trained transformer (GPT), Google Gemini, Meta Llama, and Anthropic Claude, as well as small language models (SLMs) such as Google Gemma, in recognizing inappropriate content from diverse media sources. We also study the performance of these models under adversarial attacks such as input perturbation and prompt injection. Our evaluation and comparative analysis use various textual datasets, such as X (Twitter) posts, Amazon product reviews, and news articles. The findings indicate that LLM-based approaches achieve high accuracy and low rates of false positives and false negatives. They also indicate that the models are robust under various adversarial attacks. These results underscore the considerable potential for integrating advanced LLMs into websites and social media platforms, thereby enhancing content regulation and moderation effectiveness.
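
To make the reported quantities concrete, the following is a minimal Python sketch of how accuracy, false positive rate, and false negative rate could be computed for a binary moderation task, together with a simple character-substitution perturbation. It is not the authors' evaluation code: the names `score_moderation`, `ModerationMetrics`, `perturb`, and `LEET_MAP` are hypothetical, the substitution table is only one assumed form of input perturbation, and the abstract does not specify which perturbations or metric definitions the study used.

```python
from dataclasses import dataclass


@dataclass
class ModerationMetrics:
    accuracy: float
    false_positive_rate: float
    false_negative_rate: float


def score_moderation(labels, predictions):
    """Compare binary predictions (True = flagged as harmful) with gold labels."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("need at least one example")
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)          # harmful, flagged
    tn = sum(1 for y, p in pairs if not y and not p)  # benign, not flagged
    fp = sum(1 for y, p in pairs if not y and p)      # benign, flagged
    fn = sum(1 for y, p in pairs if y and not p)      # harmful, missed
    return ModerationMetrics(
        accuracy=(tp + tn) / len(pairs),
        false_positive_rate=fp / (fp + tn) if (fp + tn) else 0.0,
        false_negative_rate=fn / (fn + tp) if (fn + tp) else 0.0,
    )


# Assumed, illustrative substitutions; the study's actual perturbations
# are not described in the abstract.
LEET_MAP = str.maketrans({"a": "@", "e": "3", "i": "1", "o": "0", "s": "$"})


def perturb(text):
    """Return the text with look-alike characters swapped in."""
    return text.translate(LEET_MAP)
```

Under these assumptions, a robustness check would score a model once on the original texts and once on `perturb(text)` for each text, then compare the two `ModerationMetrics` results. A model whose false negative rate barely changes would count as robust to this particular perturbation.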
