March 26, 2026 · Open Access

Machine Learning-Based Detection of Offensive Content in YouTube Comments

Key Points

The study aims to develop an efficient system for detecting offensive comments in Brazilian Portuguese using machine learning.
The dataset comprises 4,139 YouTube comments in Brazilian Portuguese, manually labeled for classification.
Four classical text classification algorithms are compared: Naive Bayes, SVM, Random Forest, and GBM.
CountVectorizer and TF-IDF vectorization transform the comments into numerical representations.
The Random Forest model with CountVectorizer achieved the highest accuracy, 86%.
The results demonstrate the feasibility of classical machine learning techniques for content moderation in Brazilian Portuguese.
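
To make the vectorization step concrete, the sketch below builds count vectors and TF-IDF vectors by hand using only the Python standard library. It is an illustration under stated assumptions, not the study's code: the three toy comments are invented English stand-ins for the Brazilian Portuguese data, the whitespace tokenizer is a simplification, and the helper names (`tokenize`, `count_vector`, `idf`, `tfidf_vector`) are hypothetical. The smoothed IDF with L2 normalization shown here is one common variant; the exact weighting used in the paper is not stated in this summary.

```python
import math
from collections import Counter

# Toy English stand-ins (assumption); the study's comments are in
# Brazilian Portuguese and are not reproduced here.
comments = [
    "you are an idiot",
    "great video thank you",
    "what an idiot idiot",
]

def tokenize(text):
    # Lowercase whitespace tokenization; library vectorizers use richer rules.
    return text.lower().split()

# A sorted vocabulary gives every comment the same column order.
vocab = sorted({tok for c in comments for tok in tokenize(c)})

def count_vector(text):
    # Bag-of-words: how many times each vocabulary term occurs in the text.
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocab]

def idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    n = len(comments)
    df = sum(1 for c in comments if term in tokenize(c))
    return math.log((1 + n) / (1 + df)) + 1

def tfidf_vector(text):
    # Weight each raw count by its term's IDF, then L2-normalize the vector.
    weights = [tf * idf(term) for tf, term in zip(count_vector(text), vocab)]
    norm = math.sqrt(sum(w * w for w in weights)) or 1.0
    return [w / norm for w in weights]

print(vocab)
print(count_vector(comments[2]))
print([round(w, 3) for w in tfidf_vector(comments[2])])
```

Raw counts treat every term equally, while TF-IDF down-weights terms that appear in many comments, which is why the two representations can lead the same classifier to different results.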

Abstract

Offensive comments and hate speech have become a challenge for content moderation on online social networks, and research on automated moderation techniques for Brazilian Portuguese is still limited. In this context, this study aims to contribute to the development of an efficient system for automatically detecting and classifying offensive comments in Brazilian Portuguese using natural language processing and machine learning techniques. The approach relies on a novel dataset of 4,139 comments in Brazilian Portuguese, extracted from YouTube and manually labeled. Four classical text classification algorithms (Naive Bayes, support vector machine (SVM), Random Forest, and gradient boosting machine (GBM)) were compared, each paired with the CountVectorizer and TF-IDF vectorizers. The Random Forest model combined with CountVectorizer performed best, achieving 86% accuracy. This result highlights the feasibility of using classical machine learning methods for content moderation in Brazilian Portuguese. The study also contributes a specialized dataset and makes it available, providing a useful resource for developing models focused on the Portuguese language and promoting advances in automated moderation. The work thus reinforces the potential of machine learning to promote safer and more inclusive online environments.
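
As a rough illustration of the train-and-score workflow the abstract describes, the following sketch implements one of the four compared algorithms, multinomial Naive Bayes over word counts, from scratch and reports accuracy on held-out examples. Everything here is an assumption made for illustration: the class `TinyNaiveBayes`, the `accuracy` helper, and the tiny English training and test sets are hypothetical, and the study presumably relied on library implementations and its own 4,139-comment dataset rather than anything like this.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase whitespace tokenization (a simplification).
    return text.lower().split()

class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)        # documents per class
        self.word_counts = defaultdict(Counter)    # token counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus the log likelihood of each known token.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for tok in tokenize(text):
                if tok not in self.vocab:
                    continue  # skip words never seen during training
                count = self.word_counts[label][tok]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical toy data, not taken from the study's dataset.
train_texts = ["you are an idiot", "shut up stupid fool",
               "great video thanks", "nice explanation thanks a lot"]
train_labels = ["offensive", "offensive", "ok", "ok"]
test_texts = ["stupid idiot", "thanks great video"]
test_labels = ["offensive", "ok"]

model = TinyNaiveBayes().fit(train_texts, train_labels)
predictions = [model.predict(t) for t in test_texts]
print(predictions, accuracy(test_labels, predictions))
```

Accuracy on a handful of toy sentences says nothing about real performance; the point is only to show how a classifier trained on labeled comments is scored against held-out labels, which is the kind of comparison behind the reported 86% figure.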
