What type of study is this?

This is a quantitative study.

September 29, 2025 | Open Access

Ensemble Learning Approaches for SMS Spam Detection: A Comparative Study of Text Classification Models

Key Points

Ensemble methods (Gradient Boosting, XGBoost, and LightGBM) consistently delivered superior spam detection performance, with XGBoost and LightGBM reaching F1 scores of 0.99 for both classes under imbalanced data conditions.
The Relevance Vector Machine demonstrated the best performance with an F1 score of 0.975, validating the model's effectiveness.
TF-IDF was used for transforming messages into numerical representations, emphasizing uncommon keywords to enhance classification.
Alternative methods like Logistic Regression also provided strong baselines, affirming their reliability for spam detection.
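
As a rough illustration of the TF-IDF point above, the following standard-library sketch shows how a term that appears in few messages receives a larger weight than one that appears in every message. The function name `tfidf`, the whitespace tokenizer, the smoothed IDF formula, and the three toy messages are all assumptions made for this example; the study's actual vectorizer settings are not given in the summary.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumed weighting: relative term frequency times a smoothed IDF,
    ln((1 + N) / (1 + df)) + 1. Other TF-IDF variants exist.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of messages that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return weights


# Toy messages, invented for illustration only.
messages = [
    "win a free prize now",
    "are you free for lunch now",
    "call me now",
]
weights = tfidf(messages)
```

In the first message, "prize" occurs in only one message while "now" occurs in all three, so `weights[0]["prize"]` is larger than `weights[0]["now"]`.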

Abstract

For users who rely on single-use mobile phones, unwanted marketing messages sent by SMS remain a significant global problem. In recent years, machine learning and deep learning approaches have been widely explored to address this challenge. To improve predictive accuracy, the outputs of multiple models were combined using a majority-voting strategy. This work presents a comparative analysis of several text classification techniques and highlights the importance of reliably identifying and labeling spam SMS messages. After data preprocessing, messages were transformed into numerical representations using TF-IDF, which emphasizes uncommon but informative terms over frequent ones. Among the tested methods, the Relevance Vector Machine (RVM) achieved the strongest performance on the data, reaching an F1 score of 0.975176. In addition, this study examined alternative spam detection algorithms, including Logistic Regression, XGBoost, and LightGBM. The preprocessing pipeline included duplicate removal, text normalization with spaCy, label encoding, and TF-IDF vectorization. Two experimental conditions were evaluated: one without handling class imbalance and another with imbalance adjustment. Results showed that ensemble-based methods, particularly Gradient Boosting, XGBoost, and LightGBM, consistently delivered superior performance. Under imbalanced data conditions, both XGBoost and LightGBM achieved F1 scores of 0.99 for both the majority and minority classes. When class imbalance was corrected, their performance remained strong, with F1 scores of 0.98 for all classes. Logistic Regression also demonstrated robust results, confirming its role as a reliable baseline. Overall, the findings indicate that the proposed RVM framework is effective for SMS spam classification and has practical applicability in real-world scenarios.
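
The majority-voting strategy and the per-class F1 metric mentioned in the abstract can be sketched as follows. This is a minimal standard-library illustration, not the study's implementation: the helper names `majority_vote` and `f1_for_class`, the three hypothetical model outputs, and the toy ground-truth labels are all invented for the example.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one label per message.

    `predictions` holds one list of labels per model, all of equal length.
    With an odd number of models and two classes, ties cannot occur.
    """
    return [
        Counter(votes).most_common(1)[0][0]
        for votes in zip(*predictions)
    ]


def f1_for_class(y_true, y_pred, label):
    """F1 score treating `label` as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical outputs of three classifiers on four messages.
model_outputs = [
    ["spam", "ham", "ham", "spam"],
    ["spam", "ham", "spam", "ham"],
    ["ham", "ham", "spam", "spam"],
]
final = majority_vote(model_outputs)

# Toy ground truth, invented for illustration only.
y_true = ["spam", "ham", "spam", "ham"]
spam_f1 = f1_for_class(y_true, final, "spam")
ham_f1 = f1_for_class(y_true, final, "ham")
```

Reporting F1 separately for the spam and ham classes, as the abstract does, matters when classes are imbalanced, because a single accuracy figure can hide poor performance on the minority class.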
