What question did this study set out to answer?

The review examines how methods for extracting chemical information from the literature have evolved and where they currently stand, with a focus on automation.

February 25, 2026

Use of Machine Learning and Large Language Models in Chemical Information Extraction

Key Points

Review of recent advances in chemical information extraction from literature.
Examination of both text and image modalities in chemical data.
Comparison of rule-based, machine learning, and large language model approaches.
Identification of core tasks such as optical chemical structure recognition and named entity recognition.
Discussion of practical challenges including multimodal integration and data annotation.
Highlighting key trends and limitations in current extraction frameworks.

Abstract

Rich information in the chemical literature presents unprecedented opportunities for accelerating discovery and optimization in chemistry through data-driven approaches. Nevertheless, converting raw information in the literature into structured databases relies primarily on manual curation, which is time-consuming and costly. In this review, we comprehensively examine recent advances in automatic chemical information extraction from the literature, focusing on image and text modalities. We trace the evolution from early rule-based and machine learning approaches to state-of-the-art methods leveraging large language models (LLMs) and vision language models. We discuss core tasks such as optical chemical structure recognition, reaction diagram parsing, named entity recognition, and experimental procedure extraction, highlighting representative methods, benchmark data sets, and practical challenges such as multimodal integration and data annotation. By systematically comparing these approaches, we identify key trends and persistent limitations and outline promising future directions toward robust, scalable, and automated chemical information extraction frameworks. This review aims to provide a practical guide for researchers seeking to harness machine learning and LLM technologies to accelerate the digital transformation of chemical science.
