An informative molecular representation is a prerequisite for accurate machine-learning prediction of molecular properties, but learning such a representation effectively demands large-scale data enriched with detailed physicochemical information. Here, we introduce qcMol, a dataset of 1.2 million molecules with DFT-level quantum chemical annotations, to facilitate molecular representation learning. The dataset comprises drug-like compounds, metabolites, and molecules with matched experimental data, covering 247,448 distinct scaffolds and a broad spectrum of molecular sizes. Each compound in qcMol is annotated with multiple quantum descriptors obtained from reliable quantum chemical calculations at the B3LYP-D3/def2-SV(P)//GFN2-xTB level and from subsequent wave function post-analysis. These features are organized into multiple formats, allowing flexible integration into diverse molecular representation learning frameworks. qcMol can serve both as a pre-training resource and as a benchmark test set for machine learning models, benefiting practical in silico drug discovery.

Advancements in deep learning have revolutionized drug and material design, yet the scarcity of large-scale datasets with detailed physicochemical data limits molecular representation learning. Here, the authors introduce qcMol, a dataset of 1.2 million molecules with DFT-level quantum chemical annotations and flexible data formats, serving as both a pre-training resource and a benchmark test set for machine learning models.
Wang et al. (Mon,) studied this question.