March 10, 2026 · Open Access

Leveraging large language models for automated knowledge extraction from geological reports

Key Points

The aim is to evaluate the effectiveness of large language models in extracting knowledge from geological reports for engineering applications.
Evaluated eight state-of-the-art large language models for knowledge graph construction and question answering tasks.
Utilized prompt engineering techniques including in-context learning, chain-of-thought, and knowledge-injected strategies.
Conducted human evaluations to assess factual consistency across models.
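In the zero-shot setting, DeepSeek-V3 excelled in knowledge graph construction, while DeepSeek-R1 was superior in question answering.
Prompt engineering methods varied in effectiveness: in-context learning improved overall knowledge graph performance and QA-factoid precision, knowledge injection improved exact match in knowledge graph construction, and chain-of-thought improved question answering precision.
Human evaluation confirmed high factual consistency in models like GPT-4, while models such as GPT-3.5 showed limitations.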

Abstract

Geological reports contain abundant domain-specific knowledge and unstructured textual data, presenting challenges in extracting meaningful information for engineering decision-making. Recent advancements in large language models (LLMs) offer promising solutions. This study benchmarks eight state-of-the-art LLMs on two key tasks—knowledge graph (KG) construction and question answering (QA)—which are crucial for extracting and structuring information from extensive unstructured geological text, thereby supporting risk assessment. We conduct a thorough evaluation of both proprietary and open-source models, utilizing advanced prompt engineering techniques such as in-context learning (ICL), chain-of-thought (CoT), and the proposed knowledge-injected (KI) strategies. The results indicate that, in the zero-shot setting, DeepSeek-V3 excels in KG construction, while DeepSeek-R1 outperforms other models in QA tasks. Prompt engineering exhibited varying impacts: ICL enhanced the overall performance of KG tasks and the precision score of QA-factoid tasks; KI improved the exact match in KG but did not significantly affect the matching score based on semantic similarity, and CoT boosted QA precision through step-by-step reasoning. Human evaluation confirms high factual consistency in models like GPT-4, while others, such as GPT-3.5, exhibit limitations. To enhance practical applicability, we have developed an open-source, interactive platform that integrates all benchmarked LLMs and prompt strategies, facilitating real-time analysis of unstructured geological texts for researchers. Despite these advancements, challenges such as hallucinations and domain-specific comprehension remain. Our findings emphasize the potential of LLMs in geological text analysis while also highlighting the need for further refinement to ensure their reliability in geological risk management applications.
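
To make the distinction between exact match and semantic-similarity scoring concrete, the following is a minimal sketch of how exact-match scores for extracted KG triples could be computed. It is an illustration under stated assumptions, not the paper's evaluation code: the triple format, the `normalize` and `exact_match_scores` helpers, and the sample triples are all hypothetical.

```python
from typing import Dict, Iterable, Tuple

# Assumption: a KG triple is (head entity, relation, tail entity) as plain strings.
Triple = Tuple[str, str, str]


def normalize(triple: Triple) -> Triple:
    """Lowercase each element and collapse repeated whitespace."""
    return tuple(" ".join(part.lower().split()) for part in triple)


def exact_match_scores(predicted: Iterable[Triple], gold: Iterable[Triple]) -> Dict[str, float]:
    """Precision, recall, and F1 where a triple counts only if all three
    normalized elements are identical to a gold triple."""
    pred = {normalize(t) for t in predicted}
    ref = {normalize(t) for t in gold}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical triples, invented for illustration only.
gold = [
    ("Fault F1", "cuts", "sandstone layer"),
    ("sandstone layer", "has_property", "high permeability"),
]
predicted = [
    ("fault f1", "cuts", "Sandstone  layer"),          # matches after normalization
    ("sandstone layer", "has_property", "permeable"),  # close in meaning, not exact
]

scores = exact_match_scores(predicted, gold)
print(scores)
```

In this sketch the second predicted triple is close in meaning to the gold triple but earns no exact-match credit. That strictness is one plausible reason a prompt strategy could change exact match without noticeably changing a similarity-based score.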
