What question did this study set out to answer?

May 7, 2026 | Open Access

A Self-Adaptive LLM-Based Framework for Automated Extraction and Structuring of Earthquake Information from Heterogeneous Web Sources

Key points

To develop a framework that automates the extraction and structuring of earthquake-related information from diverse web sources.
Proposed a self-adaptive framework based on large language models for information extraction.
Integrated transformer-based schema generation and repository-guided schema matching.
Utilized an iterative refinement mechanism to adapt to varying document structures.
Achieved a success rate of 85% across evaluated models.
Best-performing configuration reached an extraction accuracy of 96.5%.
Significantly reduced false positives and false negatives in information extraction.
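The iterative refinement idea in the key points can be illustrated with a minimal sketch. It is not the authors' implementation: the field names, the regex-based schema format, and the `stub_proposer` function (standing in for the LLM call) are all assumptions made for illustration. The loop proposes a schema, applies it, and feeds the list of missing fields back into the next proposal.

```python
import re
from typing import Callable

# Hypothetical target fields; the paper's actual schema is not given in this summary.
REQUIRED_FIELDS = ("magnitude", "location", "time")

Schema = dict[str, str]  # field name -> regex with one capture group (illustrative format)


def apply_schema(html: str, schema: Schema) -> dict[str, str]:
    """Run every field pattern against the page and keep the fields that matched."""
    record = {}
    for field, pattern in schema.items():
        match = re.search(pattern, html)
        if match:
            record[field] = match.group(1).strip()
    return record


def missing_fields(record: dict[str, str]) -> list[str]:
    """List the required fields that the record does not contain."""
    return [f for f in REQUIRED_FIELDS if not record.get(f)]


def extract_with_refinement(
    html: str,
    propose_schema: Callable[[str, list[str]], Schema],
    max_rounds: int = 3,
) -> tuple[dict[str, str], int]:
    """Propose, apply, and re-propose a schema until no required field is missing."""
    feedback: list[str] = []
    best: dict[str, str] = {}
    for round_no in range(1, max_rounds + 1):
        schema = propose_schema(html, feedback)
        record = apply_schema(html, schema)
        # Keep the most complete record seen so far.
        if len(missing_fields(record)) < len(missing_fields(best)):
            best = record
        feedback = missing_fields(record)
        if not feedback:
            return record, round_no
    return best, max_rounds


def stub_proposer(html: str, feedback: list[str]) -> Schema:
    """Stand-in for the LLM call: starts with one field, adds those reported missing."""
    known = {
        "magnitude": r'class="mag">([^<]+)<',
        "location": r'class="place">([^<]+)<',
        "time": r"<time>([^<]+)</time>",
    }
    wanted = ["magnitude"] + feedback
    return {field: known[field] for field in wanted}


# Made-up page fragment, used only to exercise the loop.
SAMPLE_HTML = (
    '<div><span class="mag">5.8</span>'
    '<span class="place">Example City</span>'
    "<time>2026-01-01T00:00Z</time></div>"
)
```

Calling `extract_with_refinement(SAMPLE_HTML, stub_proposer)` needs two rounds here: the first schema only covers the magnitude, and the feedback list prompts the second proposal to add the remaining fields.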

Abstract

The rapid growth of heterogeneous web sources has created significant challenges for the automated extraction and structuring of critical domain-specific information, particularly in real-time seismic monitoring scenarios. Despite the existence of official governmental reporting systems, relevant earthquake-related data are often distributed across diverse online platforms with highly variable and dynamically evolving HTML (HyperText Markup Language) structures, leading to incomplete, delayed, or inconsistent information retrieval. Existing rule-based and semi-automated approaches lack scalability and robustness under such conditions. To address this gap, this study proposes a self-adaptive framework based on large language models (LLMs) for the automated extraction and structuring of earthquake-related web content. The proposed approach integrates transformer-based schema generation, repository-guided schema matching, and an iterative refinement mechanism, enabling the system to dynamically adapt to heterogeneous document structures. A formal utility-based decision mechanism is introduced to optimize schema selection and reuse, while embedding-based similarity modeling facilitates efficient transfer of extraction patterns across structurally related webpages. The experimental evaluation was conducted on a heterogeneous benchmark dataset comprising multiple web domains with diverse structural characteristics. The results demonstrate that the proposed framework achieves a success rate of 85% across all evaluated models, with the best-performing configuration reaching an extraction accuracy of 96.5% and a final composite score of 84.26. Additional analysis reveals significant improvements in extraction completeness, reduction in false positives and false negatives, and effective reuse of a compact set of robust schemas. 
Error analysis indicates that the primary challenges are associated with noisy HTML structures and incorrect DOM (Document Object Model) element selection, rather than deficiencies in textual content. The findings confirm that combining lightweight transformer models with adaptive memory and schema reuse mechanisms enables the development of scalable, robust, and high-performance web extraction systems. The proposed approach is particularly suitable for real-time information retrieval in safety-critical domains, where timely and accurate data aggregation from heterogeneous sources is essential.
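To make the schema selection and reuse step more concrete, here is a minimal sketch of how a stored schema might be chosen for a new page. The abstract does not state the utility function, so the weighted sum of embedding similarity and past success rate used below is an assumption, as are the names `StoredSchema`, `utility`, and `select_schema`, the weights, and the threshold. Returning `None` represents the case where no stored schema is good enough and a new one would have to be generated.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoredSchema:
    """A repository entry (hypothetical layout, not taken from the paper)."""
    name: str
    embedding: list[float]  # embedding of the page structure the schema was built for
    successes: int          # extractions that passed validation
    attempts: int           # total extractions tried with this schema


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def utility(schema: StoredSchema, page_embedding: list[float],
            w_sim: float = 0.7, w_success: float = 0.3) -> float:
    """Assumed utility: weighted sum of structural similarity and past success rate."""
    success_rate = schema.successes / schema.attempts if schema.attempts else 0.0
    return w_sim * cosine(schema.embedding, page_embedding) + w_success * success_rate


def select_schema(repo: list[StoredSchema], page_embedding: list[float],
                  threshold: float = 0.75) -> Optional[StoredSchema]:
    """Return the highest-utility schema, or None if none reaches the threshold."""
    if not repo:
        return None
    best = max(repo, key=lambda s: utility(s, page_embedding))
    return best if utility(best, page_embedding) >= threshold else None


# Toy two-dimensional embeddings, used only to exercise the selection logic.
REPO = [
    StoredSchema("news_portal", [1.0, 0.0], successes=9, attempts=10),
    StoredSchema("agency_table", [0.0, 1.0], successes=5, attempts=10),
]
```

With the toy repository, `select_schema(REPO, [0.9, 0.1])` picks the `news_portal` entry, while a page embedding that resembles neither entry falls below the threshold and yields `None`. Folding the success rate into the score is one way to favor a compact set of schemas that have proven robust, which is the behavior the abstract reports.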

Cite This Study

Turarbek et al. studied this question.

synapsesocial.com/papers/69fbf004164b5133a91a4297 https://doi.org/10.3390/computers15050294
