Abstract

The integration of Large Language Models (LLMs) with web scraping and crawling techniques is transforming automated web data extraction by enabling semantic understanding and adaptability. This Systematic Literature Review (SLR) synthesizes the evidence on this integration, focusing on tools, models, challenges, evaluation methods, trends, and applications. Following PRISMA guidelines, we searched the Scopus, Web of Science, ACM, and IEEE databases for publications from 2021 to 2025. From 976 screened records, 91 high-quality studies (53 conference papers and 38 journal articles) were selected after duplicate removal, screening, and AI-powered quality assessment. The field has grown rapidly: 84% of the selected publications appeared in 2024–2025 alone (36 in 2024, 40 in 2025). Key tools include Scrapy, BeautifulSoup, and Selenium, alongside emerging LLM-augmented tools such as Scrapeghost, Crawl4AI, and ScrapeGraphAI. While transformer-based models dominate (86 of 91 papers), the landscape is diversifying: the BERT family appears in 23 studies, the GPT family in 34, and other LLMs (Llama, Mistral, Claude, Gemini) in 44. Major challenges include HTML complexity, computational cost, token limits, data bias, and legal risk. Evaluation relies on hybrid frameworks that combine task-specific metrics (F1, BLEU, RAGAS), human validation, and operational efficiency measures. Applications span cybersecurity, healthcare, education, e-commerce, media, technology, and finance/legal, with high thematic specialization. A notable trend is the shift toward efficient Small Language Models (SLMs) for resource-constrained, domain-specific tasks. The findings suggest that LLMs are enabling a decisive transition from rule-based to semantic, agentic approaches in web extraction. Challenges in robustness and efficiency persist, but trends point toward intelligent, domain-specialized, and ethically aware systems. Future work should explore SLM implementation, hybrid pipelines, and standardized evaluation benchmarks.
Landeta-López et al. studied this question.