Borehole log reports contain critical information for foundation design and seismic risk assessment. However, automating data extraction from site investigation reports remains a challenge in geotechnical construction because report formats vary and the data are semi-structured. Inconsistent table layouts and translation-induced variations in headings add further complexity and often require manual interpretation. To address this problem, we propose a cross-lingual pipeline that (i) translates non-English reports into English using a GPT-based API, (ii) represents all tokens with domain-optimized word embeddings distilled from selected SCI articles, and (iii) applies a lightweight 1-D CNN and machine learning algorithms to recognize key headings despite layout inconsistencies and translation-induced synonymy. The results show that the proposed image processing and classifier achieved an accuracy of 0.997. This study therefore presents a practical solution for cross-lingual automation of semi-structured site investigation data, alleviating a long-standing bottleneck in geotechnical construction workflows by leveraging accumulated scholarly knowledge.
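To make step (iii) concrete, the following is a minimal sketch of how a 1-D convolution with max-over-time pooling can turn a tokenized heading into a fixed-length feature vector. It is an illustration under stated assumptions, not the authors' architecture: the function names, the embedding dimension, the filter width, and the randomly initialized toy embeddings and filters are all hypothetical, and the final classification layer and training are omitted.

```python
import random

def embed(tokens, table, dim):
    """Look up each token; unknown tokens map to a zero vector."""
    return [table.get(t.lower(), [0.0] * dim) for t in tokens]

def conv1d_relu(seq, kernel, bias):
    """Valid 1-D convolution along the token axis, followed by ReLU.

    seq:    list of T embedding vectors (each of length D)
    kernel: list of K weight vectors (each of length D)
    """
    k = len(kernel)
    out = []
    for start in range(len(seq) - k + 1):
        total = bias
        for offset in range(k):
            total += sum(x * w for x, w in zip(seq[start + offset], kernel[offset]))
        out.append(max(0.0, total))
    return out

def heading_features(tokens, table, dim, filters):
    """One max-over-time pooled activation per filter."""
    seq = embed(tokens, table, dim)
    width = max(len(kernel) for kernel, _ in filters)
    while len(seq) < width:  # zero-pad headings shorter than the filter width
        seq.append([0.0] * dim)
    return [max(conv1d_relu(seq, kernel, bias)) for kernel, bias in filters]

# Toy setup (assumption): random vectors stand in for the domain embeddings.
rng = random.Random(0)
DIM, WIDTH, N_FILTERS = 4, 2, 3
vocab = ["depth", "spt", "n", "value", "soil", "description"]
table = {w: [rng.uniform(-1, 1) for _ in range(DIM)] for w in vocab}
filters = [
    ([[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(WIDTH)], 0.0)
    for _ in range(N_FILTERS)
]

features = heading_features(["SPT", "N", "value"], table, DIM, filters)
short = heading_features(["Depth"], table, DIM, filters)
```

In a trained model, the pooled features would feed a classifier over heading categories. Because the pooling takes the maximum over positions, the feature vector has the same length regardless of how many tokens a heading contains, which is one reason this design tolerates variation in heading wording.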
Yang et al. studied this question.