Supervised machine learning (SML) has become integral to materials discovery because it offers a computationally cheaper way to correlate the structure of a material with its properties, a task that would otherwise require high-fidelity, resource-intensive first-principles calculations. The performance of SML models is strongly influenced by the quantity of available training data: in general, more training data yields higher model accuracy. When adequately trained, SML models act as effective low-fidelity surrogates that accelerate materials discovery, in line with the broader objective of computational materials science, namely the identification of high-performing materials for a variety of target applications. In this work, we recognize the importance of the accuracy of data-driven models and introduce a framework for constructing SML models aimed at identifying top-performing materials for gas separation applications. Our approach addresses data scarcity directly: it seeks to discover as many high-performing candidates as possible while relying on minimal training data. We demonstrate that our iterative framework for building SML models reduces the required training dataset to only 5%–10% of the total data while successfully identifying up to 97 of the top 100 best-performing materials. Furthermore, we show that the framework depends only weakly on the choice of SML model and on the specific target property under investigation. Using this approach, we identify top-performing candidates for three industry-relevant gas separations in multiple metal-organic framework databases, thereby highlighting the robustness and general applicability of our workflow.
Daoo et al. studied this question.
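The abstract does not spell out the iterative procedure, so the following is only a minimal sketch of one common way such a loop could be organized, not the authors' method. It assumes a greedy strategy: fit a surrogate on the labeled materials, evaluate the candidates the surrogate ranks highest, and repeat until a labeling budget (here 10% of the data) is spent. The toy property function, the k-nearest-neighbour surrogate, the batch size, and all function names are hypothetical choices made for illustration.

```python
import random


def true_property(x):
    """Hypothetical stand-in for an expensive simulation of one material."""
    return -((x[0] - 0.5) ** 2 + (x[1] - 0.5) ** 2)


def knn_predict(x, train_X, train_y, k=5):
    """Inverse-distance-weighted k-nearest-neighbour regression (assumed surrogate)."""
    nearest = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, tx)) ** 0.5, ty)
        for tx, ty in zip(train_X, train_y)
    )[:k]
    weights = [1.0 / (d + 1e-9) for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)


def top_k_recall(labeled, true_values, k=100):
    """Count how many of the true top-k items appear in the labeled set."""
    order = sorted(range(len(true_values)), key=lambda i: true_values[i], reverse=True)
    return len(set(order[:k]) & set(labeled))


def iterative_search(features, oracle, budget_fraction=0.10,
                     init_fraction=0.01, batch=10, seed=0):
    """Greedy loop: retrain, label the best-predicted candidates, repeat."""
    rng = random.Random(seed)
    n = len(features)
    budget = round(budget_fraction * n)
    # Start from a small random sample of labeled materials.
    labeled = rng.sample(range(n), max(1, round(init_fraction * n)))
    values = {i: oracle(features[i]) for i in labeled}
    while len(labeled) < budget:
        train_X = [features[i] for i in labeled]
        train_y = [values[i] for i in labeled]
        seen = set(labeled)
        pool = [i for i in range(n) if i not in seen]
        # Rank unlabeled candidates by predicted property, best first.
        pool.sort(key=lambda i: knn_predict(features[i], train_X, train_y),
                  reverse=True)
        for i in pool[: min(batch, budget - len(labeled))]:
            values[i] = oracle(features[i])  # the "expensive" evaluation
            labeled.append(i)
    return labeled


# Synthetic database of 1000 candidates with two made-up descriptors each.
data_rng = random.Random(42)
features = [(data_rng.random(), data_rng.random()) for _ in range(1000)]
true_values = [true_property(x) for x in features]

labeled = iterative_search(features, true_property)
found = top_k_recall(labeled, true_values, k=100)
print(f"labeled {len(labeled)} of {len(features)}; found {found} of the true top 100")
```

The quantity reported by `top_k_recall` mirrors the abstract's headline metric (how many of the top 100 materials are recovered for a given fraction of labeled data). In a real study the toy oracle would be replaced by simulation results and the surrogate by whichever SML model is under test.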