The rapid growth of heterogeneous web sources has created significant challenges for the automated extraction and structuring of critical domain-specific information, particularly in real-time seismic monitoring scenarios. Despite the existence of official governmental reporting systems, relevant earthquake-related data are often distributed across diverse online platforms with highly variable and dynamically evolving HTML (HyperText Markup Language) structures, leading to incomplete, delayed, or inconsistent information retrieval. Existing rule-based and semi-automated approaches lack scalability and robustness under such conditions. To address this gap, this study proposes a self-adaptive framework based on large language models (LLMs) for the automated extraction and structuring of earthquake-related web content. The proposed approach integrates transformer-based schema generation, repository-guided schema matching, and an iterative refinement mechanism, enabling the system to dynamically adapt to heterogeneous document structures. A formal utility-based decision mechanism is introduced to optimize schema selection and reuse, while embedding-based similarity modeling facilitates efficient transfer of extraction patterns across structurally related webpages. The experimental evaluation was conducted on a heterogeneous benchmark dataset comprising multiple web domains with diverse structural characteristics. The results demonstrate that the proposed framework achieves a success rate of 85% across all evaluated models, with the best-performing configuration reaching an extraction accuracy of 96.5% and a final composite score of 84.26. Additional analysis reveals significant improvements in extraction completeness, reduction in false positives and false negatives, and effective reuse of a compact set of robust schemas. 
Error analysis indicates that the primary challenges are associated with noisy HTML structures and incorrect DOM (Document Object Model) element selection, rather than deficiencies in textual content. The findings confirm that combining lightweight transformer models with adaptive memory and schema reuse mechanisms enables the development of scalable, robust, and high-performance web extraction systems. The proposed approach is particularly suitable for real-time information retrieval in safety-critical domains, where timely and accurate data aggregation from heterogeneous sources is essential.
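As a rough illustration of the repository-guided schema matching and utility-based reuse decision described above, the following minimal sketch scores stored schemas against a new page. The abstract does not give the utility formula, so everything here is an assumption made for illustration: the names `StoredSchema` and `select_schema`, the utility defined as cosine similarity multiplied by a past success rate, and the `threshold` value are hypothetical and are not the authors' formulation.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class StoredSchema:
    """Hypothetical repository entry; field names are illustrative only."""
    name: str
    embedding: List[float]  # embedding of the page the schema was derived from
    success_rate: float     # fraction of past extractions that succeeded


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_schema(
    page_embedding: List[float],
    repository: List[StoredSchema],
    threshold: float = 0.6,
) -> Tuple[Optional[StoredSchema], float]:
    """Return the stored schema with the highest assumed utility, or None.

    Assumed utility = similarity * past success rate. A None result means
    no stored schema is good enough, so the caller would fall back to
    generating a new schema with the LLM.
    """
    best, best_utility = None, 0.0
    for schema in repository:
        utility = cosine(page_embedding, schema.embedding) * schema.success_rate
        if utility > best_utility:
            best, best_utility = schema, utility
    if best is not None and best_utility >= threshold:
        return best, best_utility
    return None, best_utility


repository = [
    StoredSchema("news_a", [1.0, 0.0], 0.9),
    StoredSchema("agency_b", [0.0, 1.0], 0.8),
]
schema, utility = select_schema([1.0, 0.0], repository)
print(schema.name if schema else "generate new schema", round(utility, 3))
```

In this sketch, a page whose embedding is close to that of a previously seen page reuses the stored schema, while a structurally unfamiliar page falls below the threshold and triggers new schema generation. A real system would obtain the embeddings from an embedding model rather than the hand-written two-dimensional vectors used here.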
Assem Turarbek
Diana Rakhimova
Yeldos Adetbekov
Computers
Al-Farabi Kazakh National University
Satbayev University
Turarbek et al. studied this question.
www.synapsesocial.com/papers/69fbf004164b5133a91a4297 — DOI: https://doi.org/10.3390/computers15050294