Abstract Hydrocarbon exploration and carbon capture and storage (CCS) evaluation are inherently multi-disciplinary tasks that require the integration of knowledge from multiple data types set in a historical and geological context. Subsurface data are diverse: they combine direct and indirect measurements, interpretations and observations, documented in multi-dimensional datasets, as images, and in written reports. The fidelity and quality of these images and reports vary enormously, which leaves explorationists with the challenge of determining the value of each source of information while combining sources across large spatiotemporal contexts. Modern search engines can search not only document text but also images. These capabilities have improved our ability to find well-known concepts based on short phrases or keywords, combined with significant metadata. While these types of search engines have certainly benefited practitioners, combining information from multiple data sources, data modalities and languages remains an open problem. With the advent of conversational large language model (LLM) systems such as ChatGPT (Achiam et al. 2023), tools that provide coherent textual information informed by their training data have become a reality. While ChatGPT has certainly taken many industries and their disciplines by storm, the tool is not without its shortcomings. For industry applications, the information necessary to provide answers will in many cases be highly proprietary, not shared with third parties and not part of the training data of the popular LLMs. Furthermore, due to their probabilistic nature, LLMs suffer from so-called hallucinations, where the model provides a confident answer based on the user-provided input that is non-factual and often nonsensical.
To answer a given user query factually, it is important to provide relevant information as context to the LLM. Lewis et al. (2020) propose combining two systems: an information retrieval system that provides relevant information to answer a given question or solve a specific task, and an LLM that is supplemented with the retrieved information as context to answer the user's question. This pattern, so-called retrieval-augmented generation (RAG), has become highly popular in the last year due to the strong conversational capabilities of systems like ChatGPT, accessible developer APIs for interfacing with LLMs, open-source software to orchestrate RAG systems, and the rapid development of open-source LLMs (Touvron et al. 2023). Moreover, since the RAG pattern does not require fine-tuning or re-training a language model, it remains one of the most accessible ways to tailor LLMs to proprietary knowledge bases.
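The two-stage pattern described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the system described in this article: the three example passages in `DOCUMENTS` are invented, the retriever is a toy bag-of-words cosine similarity standing in for a real search index or embedding model, and `generate` is a hypothetical callable standing in for whatever LLM interface is used.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge base; the passages are invented for illustration.
DOCUMENTS = [
    "Well A-1 encountered a sandstone reservoir with 22 percent porosity.",
    "The seal above the storage formation is a thick marine shale.",
    "Seismic survey S-3 was acquired in 1998 and reprocessed in 2015.",
]

# Small illustrative stop-word list so common words do not dominate the score.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "was", "what", "with"}


def tokenize(text):
    """Lowercase the text, split on non-alphanumerics, drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, k=2):
    """Stage 1: return the k passages most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(
        documents,
        key=lambda d: cosine(q, Counter(tokenize(d))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, passages):
    """Place the retrieved passages in the prompt as context for the LLM."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\n"
    )


def answer(query, documents, generate, k=2):
    """Stage 2: pass the context-augmented prompt to a language model.

    `generate` is a placeholder: any callable mapping a prompt string
    to a response string.
    """
    return generate(build_prompt(query, retrieve(query, documents, k)))
```

Because the language model only ever sees the assembled prompt, the knowledge base can be swapped or updated without touching the model itself, which is the property that makes the pattern attractive for proprietary data.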
Mosser et al. studied this question.