• Introduced a generic AI-driven methodology for constructing scientific databases.
• LLM-based pipeline converts unstructured papers into structured data.
• The methodology was demonstrated and validated in a case study of a chemical process.
• Extraction performance: F1 score of >99% for text and tables, and 85.8% for figures.

Scientific papers are a primary source of knowledge for research and development, yet the information they contain is highly unstructured because it is spread across text, tables, and figures in inconsistent formats. Structuring this dispersed information into a coherent database is a central informatics challenge. In this work, we present a generic artificial intelligence (AI)-driven methodology to automate the construction of domain-specific databases from scientific literature. The methodology comprises three Large Language Model (LLM)-based stages: identifying domain-specific literature, classifying papers by relevance, and extracting the scientific data they contain. We demonstrate and validate this approach in a case study on chemical reactions associated with environmental issues, constructing a catalytic performance database for the CO hydrogenation process. Data from over 1,000 disparate papers were extracted and transformed into a tabular database containing over 9,600 entries. The extraction proved highly effective, achieving an F1 score of >99% for text and tables and 85.8% for figures. This methodology can be easily adapted and implemented in other fields.
Sror et al. (Wed,) studied this question.
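
To make the reported F1 scores concrete, the following minimal sketch shows one common way such a score could be computed for an extraction task. It is an illustration only: the abstract does not state how extracted values were matched against ground truth, so the exact-match rule, the `f1_score` helper, the record layout, and all sample values here are assumptions invented for the example.

```python
def f1_score(extracted: set, reference: set) -> float:
    """F1 between extracted and reference items (assumed exact-match rule)."""
    true_pos = len(extracted & reference)  # items that appear in both sets
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(extracted)  # share of extracted items that are correct
    recall = true_pos / len(reference)     # share of reference items that were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical (paper id, field, value) triples; not data from the study.
reference = {
    ("paper-1", "temperature_C", "250"),
    ("paper-1", "pressure_bar", "20"),
    ("paper-1", "co_conversion_pct", "45.2"),
    ("paper-2", "temperature_C", "300"),
}
extracted = {
    ("paper-1", "temperature_C", "250"),
    ("paper-1", "pressure_bar", "20"),
    ("paper-1", "co_conversion_pct", "54.2"),  # simulated transcription error
    ("paper-2", "temperature_C", "300"),
}

score = f1_score(extracted, reference)
print(f"F1 = {score:.2f}")
```

In this toy case three of the four extracted triples match the reference, so precision and recall are both 0.75 and the F1 score is 0.75. Under exact matching a single wrong value counts as both a false positive and a missed reference item, which is one reason scores for figure-derived values could plausibly fall below those for text and tables.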