What question did this study set out to answer?

This research aims to evaluate the effectiveness of small language models for detecting phishing websites and analyze their operational implications.

March 7, 2026 | Open Access

Small Language Models for Phishing Website Detection: Cost, Performance, and Privacy Trade-Offs

Key Points

Developed a detection pipeline for malicious websites using small language models (SLMs).
Systematically evaluated 15 SLMs, ranging from 1 billion to 70 billion parameters.
Assessed classification accuracy, computational requirements, and cost-efficiency of the models.
The best SLM (Llama3.3:70B) achieved an F1-score of 0.893, compared to 0.929 for the state-of-the-art proprietary model GPT-5.2.
SLMs underperform state-of-the-art proprietary LLMs, but the gap in detection performance is moderate.
SLMs allow deployment on local infrastructure, enhancing data control and potentially reducing operational costs.
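
For readers unfamiliar with the metric quoted above, the F1-score is the harmonic mean of precision and recall. The snippet below is a minimal sketch of that standard definition. The function name and the counts are hypothetical illustrations, not code or data from the study.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return F1 = 2*TP / (2*TP + FP + FN), or 0.0 when the score is undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical confusion counts for illustration only (not from the paper):
# 90 phishing pages caught, 10 false alarms, 10 phishing pages missed.
example = f1_score(tp=90, fp=10, fn=10)
print(f"F1 = {example:.3f}")
```

Because F1 ignores true negatives, it stays informative when legitimate pages far outnumber phishing pages.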

Abstract

Phishing websites pose a major cybersecurity threat, exploiting unsuspecting users and causing significant financial and organisational harm. Traditional machine learning approaches for phishing detection often require extensive feature engineering, continuous retraining, and costly infrastructure maintenance. At the same time, proprietary large language models (LLMs) have demonstrated strong performance in phishing-related classification tasks, but their operational costs and reliance on external providers limit their practical adoption in many business environments. This paper presents a detection pipeline for malicious websites and investigates the feasibility of Small Language Models (SLMs) using raw HTML code and URLs. A key advantage of these models is that they can be deployed on local infrastructure, providing organisations with greater control over data and operations. We systematically evaluate 15 commonly used SLMs, ranging from 1 billion to 70 billion parameters, benchmarking their classification accuracy, computational requirements, and cost-efficiency. Our results highlight the trade-offs between detection performance and resource consumption. While SLMs underperform compared to state-of-the-art proprietary LLMs, the gap is moderate: the best SLM achieves an F1-score of 0.893 (Llama3.3:70B), compared to 0.929 for GPT-5.2, indicating that open-source models can provide a viable and scalable alternative to external LLM services.
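
The abstract describes classifying a site from its URL and raw HTML with a locally hosted model. One way such a step could be structured is sketched below. The prompt wording, the truncation budget, the label vocabulary, and all function names are assumptions made for illustration; the summary does not specify the paper's actual prompt or pipeline. The model is passed in as a plain callable so that any local inference backend could be substituted.

```python
from typing import Callable, Optional

MAX_HTML_CHARS = 4000  # assumed truncation budget, not a value from the paper


def build_prompt(url: str, html: str, max_chars: int = MAX_HTML_CHARS) -> str:
    """Combine the URL and a truncated HTML snippet into one classification prompt."""
    snippet = html[:max_chars]
    return (
        "Decide whether the website below is phishing or legitimate. "
        "Answer with one word, in capitals.\n\n"
        f"URL: {url}\n\nHTML:\n{snippet}\n"
    )


def parse_label(response: str) -> Optional[bool]:
    """Map a free-text answer to True (phishing), False (legitimate), or None (unclear)."""
    text = response.strip().upper()
    has_phish = "PHISHING" in text
    has_legit = "LEGITIMATE" in text
    if has_phish and not has_legit:
        return True
    if has_legit and not has_phish:
        return False
    return None  # ambiguous or off-format answers are left undecided


def classify(url: str, html: str, model: Callable[[str], str]) -> Optional[bool]:
    """Run one URL/HTML pair through the supplied model callable."""
    return parse_label(model(build_prompt(url, html)))


def stub_model(prompt: str) -> str:
    """Stand-in for a real local SLM call; flags one hard-coded lookalike domain."""
    return "PHISHING" if "paypa1" in prompt else "LEGITIMATE"


print(classify("http://paypa1-login.example", "<html></html>", stub_model))
```

Returning `None` for unparseable answers keeps malformed model output from being silently counted as either class, which matters when comparing models that differ in how reliably they follow the output format.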

Cite This Study

Goldenits et al. studied this question.

synapsesocial.com/papers/69abc2355af8044f7a4eb88a https://doi.org/10.3390/jcp6020048
