Summary
Data quality, feature engineering, and model generalization are key challenges in applying machine learning to high-performance materials design. Here, we report a framework addressing these challenges using thermoelectric materials as a case study. Data are collected and curated from the Starrydata2 database, followed by multi-step feature engineering (construction, selection, and optimization) to obtain a physically meaningful feature subset. In benchmarks against mainstream regression models, supported by independent external dataset testing and model interpretability analysis, a modified tabular prior-data fitted network (TabPFN) model (model I) demonstrates superior accuracy and generalization in predicting the thermoelectric figure of merit (ZT). Model I is then applied to halide double perovskites from the Materials Project database, identifying candidates including Rb2CuSbCl6, Cs2AgAuCl6, and Rb2CuBiCl6. First-principles calculations validate their thermoelectric properties, with n-type Cs2AgAuCl6 achieving ZTmax = 1.64 at 800 K. These results highlight the potential of combining data-driven screening with first-principles computation for discovering high-performance thermoelectric materials.
Sun et al. (Sun,) studied this question.
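For readers unfamiliar with the prediction target, the figure of merit is conventionally defined as ZT = S²σT/κ, where S is the Seebeck coefficient, σ the electrical conductivity, κ the total thermal conductivity, and T the absolute temperature. The following minimal Python sketch evaluates this standard definition. The function name `figure_of_merit` and the input values are illustrative assumptions and are not taken from the paper or its dataset.

```python
def figure_of_merit(seebeck, sigma, kappa, temperature):
    """Return the dimensionless thermoelectric figure of merit ZT.

    seebeck     -- Seebeck coefficient S in V/K
    sigma       -- electrical conductivity in S/m
    kappa       -- total thermal conductivity in W/(m*K)
    temperature -- absolute temperature T in K
    """
    if kappa <= 0 or temperature <= 0:
        raise ValueError("kappa and temperature must be positive")
    # Standard definition: ZT = S^2 * sigma * T / kappa
    return seebeck ** 2 * sigma * temperature / kappa


# Hypothetical inputs chosen only to illustrate the arithmetic.
zt = figure_of_merit(seebeck=200e-6, sigma=1.0e5, kappa=1.5, temperature=300.0)
print(f"ZT = {zt:.2f}")  # prints ZT = 0.80
```

Because ZT depends on S squared, errors in the Seebeck coefficient have a disproportionate effect on the result, which is one reason data curation matters for models trained to predict ZT.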