Abstract The widespread circulation of harmful content, including hate speech, harassment, and violent and adult material, across online platforms and media channels poses significant challenges and has raised considerable concern among various societal groups. Government bodies, educators, and parents frequently disagree with media providers over the best methods to regulate and restrict such content. Automated content moderation technologies have become critical tools for addressing these challenges, particularly natural language processing (NLP) techniques that automatically detect and filter sensitive textual content, such as offensive language, violence, and adult material, enabling platforms to enforce moderation policies at scale. Despite their widespread use, current moderation technologies struggle with detection accuracy, often producing substantial false positives and false negatives. Improving content moderation systems requires more advanced algorithms capable of accurately interpreting textual context. In this study, we assess current large language model (LLM)-based moderation solutions, specifically the OpenAI moderation model and Llama-Guard-3, examining their effectiveness in detecting sensitive content. Additionally, we investigate the capabilities of contemporary LLMs, including OpenAI generative pre-trained transformer (GPT), Google Gemini, Meta Llama, and Anthropic Claude, as well as small language models (SLMs) such as Google Gemma, in recognizing inappropriate content from diverse media sources. We also study the performance of these models under adversarial attacks such as input perturbation and prompt injection. Our evaluation and comparative analysis use various textual datasets, such as X (Twitter) posts, Amazon product reviews, and news articles. The findings indicate that LLM-based approaches achieve high accuracy with low false positive and false negative rates. They also indicate that the models are robust under various adversarial attacks. These results underscore the considerable potential for integrating advanced LLMs into websites and social media platforms, thereby enhancing content regulation and moderation effectiveness.
Nouar AlDahoul
Myles Joshua Toledo Tan
Harishwar Reddy Kasireddy
Journal Of Big Data
University of Florida
AlDahoul et al. studied this question.
www.synapsesocial.com/papers/694018f82d562116f28f5e14 — DOI: https://doi.org/10.1186/s40537-025-01336-x