Rich information in the chemical literature presents unprecedented opportunities for accelerating discovery and optimization in chemistry through data-driven approaches. Nevertheless, converting raw information in the literature into structured databases still relies primarily on manual curation, which is time-consuming and costly. In this review, we comprehensively examine recent advances in automatic chemical information extraction from the literature, focusing on image and text modalities. We trace the evolution from early rule-based and machine learning approaches to state-of-the-art methods leveraging large language models (LLMs) and vision-language models. We discuss core tasks such as optical chemical structure recognition, reaction diagram parsing, named entity recognition, and experimental procedure extraction, highlighting representative methods, benchmark data sets, and practical challenges such as multimodal integration and data annotation. By systematically comparing these approaches, we identify key trends and persistent limitations and outline promising future directions toward robust, scalable, and automated chemical information extraction frameworks. This review aims to provide a practical guide for researchers seeking to harness machine learning and LLM technologies to accelerate the digital transformation of chemical science.
Chen et al. (Mon,) studied this question.