April 4, 2026 · Open Access

AI-Driven methodology for mining scientific literature and extracting databases: A case study on a chemical process

Key points

The aim is to automate the creation of domain-specific databases from unstructured scientific literature using AI.
Developed an AI-driven methodology with three LLM-based stages: literature identification, paper classification, and data extraction.
Validated the approach through a case study on the CO hydrogenation process.
Extracted and structured data from over 1,000 scientific papers into a database.
Achieved an F1 score of >99% for extracting data from text and tables.
Achieved an 85.8% F1 score for extracting data from visual figures.
Constructed a tabular database with over 9,600 entries related to catalytic performance.
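
For readers unfamiliar with the metric, the F1 scores quoted above combine precision and recall into a single number (their harmonic mean). The following is a minimal sketch of that calculation, not the authors' evaluation code; the function name and the counts are hypothetical and chosen only for illustration.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return F1 = 2PR / (P + R), or 0.0 when there are no true positives."""
    if true_positives == 0:
        return 0.0
    # Precision: share of extracted values that are correct.
    precision = true_positives / (true_positives + false_positives)
    # Recall: share of values present in the paper that were extracted.
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only (not taken from the paper).
example = f1_score(true_positives=80, false_positives=20, false_negatives=20)
```

Because F1 penalizes both wrong extractions and missed values, a high score requires the extraction step to be accurate and complete at the same time.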

Abstract

• Introduced a generic AI-driven methodology for constructing scientific databases.
• LLM-based pipeline converts unstructured papers into structured data.
• The methodology was demonstrated and validated in a case study on a chemical process.
• Extraction accuracy: >99% F1 score for text and tables, 85.8% for visual figures.

Scientific papers are a primary source of knowledge for research and development, yet the information they contain is highly unstructured, spread across text, tables, and figures in inconsistent formats. Structuring this dispersed information into a coherent database is a central informatics challenge. In this work, we present a generic artificial intelligence (AI)-driven methodology to automate the construction of domain-specific databases from scientific literature. The methodology includes three Large Language Model (LLM)-based stages: identifying domain-specific literature, classifying papers by relevance, and extracting the scientific data they contain. We demonstrate and validate this approach in a case study on chemical reactions associated with environmental issues by constructing a catalytic performance database for the CO hydrogenation process. Data from over 1,000 disparate papers were extracted and transformed into a tabular database containing over 9,600 entries. The extraction proved highly effective, achieving an F1 score of >99% for text and tables and 85.8% for figures. This methodology can be easily adapted and implemented in other fields.
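
To make the three-stage structure concrete, here is a minimal sketch of how such a pipeline could be wired together. It is an illustration under stated assumptions, not the authors' implementation: the `Paper` type, the prompts, the function names, and the yes/no and JSON response formats are all hypothetical, and the language model is represented by a plain callable that a real system would replace with an actual LLM client.

```python
import json
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface: a prompt goes in, the model's text reply comes out.
LLM = Callable[[str], str]


@dataclass
class Paper:
    title: str
    text: str


def says_yes(reply: str) -> bool:
    return reply.strip().lower().startswith("yes")


def identify(papers: list[Paper], llm: LLM, domain: str) -> list[Paper]:
    """Stage 1: keep papers the model judges to belong to the domain."""
    question = f"Is this paper about {domain}? Answer yes or no.\nTitle: "
    return [p for p in papers if says_yes(llm(question + p.title))]


def classify(papers: list[Paper], llm: LLM) -> list[Paper]:
    """Stage 2: keep relevant papers (assumed here: those reporting data)."""
    question = "Does this paper report quantitative data? Answer yes or no.\nText: "
    return [p for p in papers if says_yes(llm(question + p.text))]


def extract(paper: Paper, llm: LLM) -> list[dict]:
    """Stage 3: request records as a JSON list and drop malformed replies."""
    reply = llm("Extract the reported data as a JSON list of objects.\nText: " + paper.text)
    try:
        records = json.loads(reply)
    except json.JSONDecodeError:
        return []
    if not isinstance(records, list):
        return []
    return [r for r in records if isinstance(r, dict)]


def build_database(papers: list[Paper], llm: LLM, domain: str) -> list[dict]:
    """Run the three stages and flatten the records into table rows."""
    rows = []
    for paper in classify(identify(papers, llm, domain), llm):
        for record in extract(paper, llm):
            rows.append({"source": paper.title, **record})
    return rows


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model so the sketch runs offline (dummy logic)."""
    if prompt.startswith("Is this paper about"):
        title = prompt.split("Title: ", 1)[1]
        return "yes" if "hydrogenation" in title.lower() else "no"
    if prompt.startswith("Does this paper report"):
        text = prompt.split("Text: ", 1)[1]
        return "yes" if "%" in text else "no"
    return '[{"catalyst": "placeholder", "conversion_percent": 12.5}]'


# Dummy inputs for illustration only; none of these values come from the paper.
papers = [
    Paper("CO hydrogenation over a placeholder catalyst", "Conversion reached 12.5%."),
    Paper("A review of CO hydrogenation mechanisms", "No measurements are reported."),
    Paper("Battery electrode degradation", "Capacity fell by 3%."),
]
rows = build_database(papers, fake_llm, domain="CO hydrogenation")
```

Only the first dummy paper passes both filters, so `rows` holds a single entry tagged with its source title. Keeping each stage as a separate function means the prompts or the model can be swapped per stage, which is one plausible way to adapt the same structure to another field.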
