Molecular property prediction is crucial for drug discovery in biopharmaceuticals because it helps identify promising compounds, improving the efficiency of developing new therapies. Despite its importance, existing deep learning-based methods for this task are often inconsistent with fundamental chemical properties. Here we show that an unsupervised pretraining approach, Molecular Motif Learning (MotiL), learns molecular representations that preserve both whole-molecule structure and motif-level information directly from native molecular graphs. MotiL produces representations that group small molecules sharing a common core structure (i.e., scaffold) and proteins with related three-dimensional structures and functions. We evaluated MotiL on at least 16 molecule benchmarks and found that it learns similar graph representations not only for small molecules with the same scaffold but also for protein macromolecules with similar structures and overlapping chemical functions, such as tRNA binding. These informative representations enable MotiL to surpass the accuracy of state-of-the-art contrastive or predictive methods in predicting molecular properties such as blood-brain barrier permeability.
Wang et al. (Thu,) studied this question.