Spam messages are unwanted, irrelevant, or potentially harmful messages sent in bulk to large numbers of recipients via email, SMS, or social media. These messages pose a threat of spam to individual users and commercial companies. They threaten digital communication platforms by enabling phishing, malware distribution, service disruption, and unsolicited advertisements. Several mechanisms have been used in the literature to detect spam over digital communication systems. This includes rule-based filtering, Bayesian filtering, heuristic analysis, and machine learning (ML) techniques. Traditional rule-based and heuristic analyses were insufficient to cope with evolving attack patterns. Meanwhile, ML models can present modern, dynamic, appropriate, and efficient solutions in this manner. This study aims to evaluate and compare several basic ML models for spam detection, considering popular benchmark datasets on several communication platforms as a comprehensive comparative study. The experimental results demonstrate that the tested models achieve good accuracy, precision, recall, and F1-score on each investigated benchmark dataset. However, the performance of all models has decreased drastically when the trained models are tested on an unseen dataset. Recommendations for future required enhancements to handle this reduction in the performance of ML techniques for unseen datasets are provided. Finally, extra experimental tests have shown the positive impact of applying some of these recommendations.
Younes et al. (Tue,) studied this question.