What question did this study set out to answer?

This review aims to synthesize evidence on the use of large language models in web scraping and crawling techniques.

May 16, 2026 · Open Access

LLMs applied to web scraping and web crawling: a systematic review

Key Points

The authors conducted a systematic literature review following PRISMA guidelines across Scopus, Web of Science, ACM, and IEEE (2021–2025).
They screened 976 records and selected 91 high-quality studies after duplicate removal, screening, and AI-powered quality assessment.
The review analyzed the tools, models, challenges, evaluation methods, and applications reported in the publications.
84% of publications appeared in 2024–2025 (36 in 2024, 40 in 2025), highlighting rapid growth in the field.
86 of 91 studies utilized transformer-based models, showing their dominance in web scraping and crawling tasks.
Identified challenges include HTML complexity, computational costs, token limits, data biases, and legal risks.
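To make the HTML complexity and cost challenges concrete, here is a minimal sketch of one common mitigation: reducing a page to its visible text before passing it to a model. It is not taken from the reviewed studies. The names `VisibleTextExtractor` and `reduce_html` are hypothetical, and the character budget `max_chars` is only a crude stand-in for a real token limit.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects visible text and skips script/style/noscript content."""

    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def reduce_html(html, max_chars=2000):
    """Return visible text only, truncated to a character budget.

    Assumption: max_chars approximates a model's input limit; a real
    pipeline would count tokens with the model's own tokenizer.
    """
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return " ".join(parser.chunks)[:max_chars]


sample = (
    "<html><head><style>p{color:red}</style>"
    "<script>var x = 1;</script></head>"
    "<body><h1>Price</h1><p>19.99 USD</p></body></html>"
)
reduced = reduce_html(sample)
```

Stripping markup this way trades structural cues (tables, attributes) for a smaller input, so it suits text-heavy pages better than pages where layout carries the meaning.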

Abstract

The integration of Large Language Models (LLMs) with web scraping and crawling techniques is transforming automated web data extraction by enabling semantic understanding and adaptability. This Systematic Literature Review (SLR) synthesizes evidence regarding this integration, focusing on tools, models, challenges, evaluation methods, trends, and applications. Following PRISMA guidelines, we conducted a rigorous search across Scopus, Web of Science, ACM, and IEEE databases (2021–2025). From 976 screened records, 91 high-quality studies (53 conference papers and 38 journal articles) were selected after duplicate removal, screening, and AI-powered quality assessment. The field has experienced explosive growth, with 84% of publications appearing in 2024–2025 alone (36 in 2024, 40 in 2025). Key tools include Scrapy, BeautifulSoup, and Selenium, with emerging LLM-augmented tools like Scrapeghost, Crawl4AI, and ScrapeGraphAI. While transformer-based models dominate (86 of 91 papers), the landscape is diversifying: the BERT family appears in 23 studies, the GPT family in 34, and other LLMs (Llama, Mistral, Claude, Gemini) in 44. Major challenges involve HTML complexity, computational costs, token limits, data biases, and legal risks. Evaluation relies on hybrid frameworks combining task-specific metrics (F1, BLEU, RAGAS), human validation, and operational efficiency measures. Applications span Cybersecurity, Healthcare, Education, E-commerce, Media, Technology, and Finance/Legal, with high thematic specialization. A notable trend is the shift toward efficient Small Language Models (SLMs) for resource-constrained, domain-specific tasks. The findings suggest that LLMs are enabling a decisive transition from rule-based to semantic, agentic approaches in web extraction. Challenges in robustness and efficiency persist, but trends point toward intelligent, domain-specialized, and ethically aware systems. Future work should explore SLM implementation, hybrid pipelines, and standardized evaluation benchmarks.
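As an illustration of the task-specific metrics mentioned above, the following sketch computes precision, recall, and F1 for a structured extraction result. It is an assumption about how such a score might be set up, not the protocol of any reviewed study: the function name `field_f1`, the exact-match rule on (field, value) pairs, and the sample records are all hypothetical.

```python
def field_f1(predicted, expected):
    """Score an extracted record against a reference record.

    Both arguments are dicts mapping field names to hashable values.
    A (field, value) pair counts as correct only on an exact match
    (an assumed rule; real evaluations may normalize or fuzzy-match).
    Returns a (precision, recall, f1) tuple.
    """
    pred_pairs = set(predicted.items())
    gold_pairs = set(expected.items())
    true_pos = len(pred_pairs & gold_pairs)

    precision = true_pos / len(pred_pairs) if pred_pairs else 0.0
    recall = true_pos / len(gold_pairs) if gold_pairs else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical product record: the extractor gets the price wrong.
expected = {"title": "Widget", "price": "19.99", "currency": "USD"}
predicted = {"title": "Widget", "price": "19.90", "currency": "USD"}
precision, recall, f1 = field_f1(predicted, expected)
```

Here two of three predicted pairs match the reference, so precision, recall, and F1 all equal 2/3. Exact matching is deliberately strict; it penalizes formatting differences that a human validator might accept, which is one reason hybrid evaluation with human review is used alongside automatic metrics.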

Cite This Study

Landeta-López et al. studied this question.

synapsesocial.com/papers/6a0809bea487c87a6a40b81b https://doi.org/10.1007/s00607-026-01666-5
