What question did this study set out to answer?

The aim is to provide a comprehensive dataset for molecular representation learning to improve predictions of molecular properties.

May 28, 2026 | Open Access

A dataset of 1.2 million molecules with DFT-level quantum chemical annotations for molecular representation learning

Key Points

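Introduces qcMol, a dataset of 1.2 million molecules with DFT-level quantum chemical annotations.
Annotates each compound with quantum descriptors computed at the B3LYP-D3/def2-SV(P)//GFN2-xTB level, followed by wave function post-analysis.
Organizes the features into multiple formats for flexible integration into different molecular representation learning frameworks.
Can serve as both a pre-training resource and a benchmark test set for machine learning models in in silico drug discovery.
Covers 247,448 kinds of scaffolds and a broad spectrum of molecular sizes, spanning drug-like compounds, metabolites and molecules with matched experimental data.
Addresses the scarcity of large-scale datasets with detailed physicochemical data, which limits molecular representation learning.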

Abstract

An informative molecular representation is a prerequisite for the accurate prediction of molecular properties by machine learning, but learning one effectively demands large-scale data enriched with detailed physicochemical information. Here, we introduce qcMol, a dataset consisting of 1.2 million molecules with DFT-level quantum chemical annotations, to facilitate molecular representation learning. Chemicals in this dataset include drug-like compounds, metabolites and molecules with matched experimental data, covering 247,448 kinds of scaffolds and a broad spectrum of molecular sizes. Each compound in qcMol is annotated with multiple quantum descriptors, obtained through quantum chemical calculations at the B3LYP-D3/def2-SV(P)//GFN2-xTB level and follow-up wave function post-analysis. These features are organized into multiple formats, allowing flexible integration into diverse molecular representation learning frameworks. qcMol can serve not only as a pre-training resource but also as a benchmark test set for machine learning models, benefiting practical in silico drug discovery.

Advancements in deep learning have revolutionized drug and material design, yet the scarcity of large-scale datasets with detailed physicochemical data limits molecular representation learning. Here, the authors introduce qcMol, a dataset of 1.2 million molecules with DFT-level quantum chemical annotations and flexible data formats, serving as both a pre-training resource and a benchmark test set for machine learning models.
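The level-of-theory label B3LYP-D3/def2-SV(P)//GFN2-xTB follows a common computational chemistry convention: the part before the double slash names the method and basis set used for the final electronic-structure calculation, and the part after it names the method used to obtain the geometry. The following is a minimal Python sketch of how such a label could be split into its components. The `LevelOfTheory` class and `parse_level_of_theory` function are hypothetical names introduced here for illustration; they are not part of qcMol or of any quantum chemistry package.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class LevelOfTheory:
    """Hypothetical container for a 'method/basis//geometry' label."""
    energy_method: str
    energy_basis: Optional[str]
    geometry_method: str
    geometry_basis: Optional[str]


def _split_method_basis(part: str) -> Tuple[str, Optional[str]]:
    # "B3LYP-D3/def2-SV(P)" -> ("B3LYP-D3", "def2-SV(P)")
    # "GFN2-xTB"            -> ("GFN2-xTB", None), since no basis set is named
    method, sep, basis = part.partition("/")
    return method, (basis if sep else None)


def parse_level_of_theory(label: str) -> LevelOfTheory:
    # Text before "//" describes the final calculation; text after it
    # describes how the geometry was obtained. If there is no "//",
    # assume the same level was used for both (an assumption of this sketch).
    energy_part, sep, geometry_part = label.partition("//")
    if not sep:
        geometry_part = energy_part
    energy_method, energy_basis = _split_method_basis(energy_part)
    geometry_method, geometry_basis = _split_method_basis(geometry_part)
    return LevelOfTheory(energy_method, energy_basis,
                         geometry_method, geometry_basis)


level = parse_level_of_theory("B3LYP-D3/def2-SV(P)//GFN2-xTB")
print(level)
```

Read this way, the label indicates DFT calculations with the B3LYP functional, D3 dispersion correction and the def2-SV(P) basis set, performed on geometries from the semiempirical GFN2-xTB method.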
