In our previous work, we developed a CCSD(T)-level range-separated water force field that combines the strengths of physics-driven and machine learning models. However, the high cost of CCSD(T)/CBS calculations limits the amount of available QM data and leaves the training set without force labels, and both factors lead to training instability. Bulk properties show large variations that cannot be resolved simply by reducing the fitting error on a small-cluster QM dataset. Such instability in bulk-phase simulation is a universal problem in the training of machine learning potentials and is particularly severe at the CCSD(T) level of theory. In this work, using our range-separated water model as an example, we aim to overcome these limitations by developing a new training workflow. The workflow combines several techniques, including (1) an active learning protocol that ensures more thorough sampling across temperatures and densities, (2) an intermediate force label technique employing a machine learning density functional, and (3) an ensemble knowledge distillation method. Together, these techniques significantly stabilize the resulting water model, which consistently achieves sub-chemical accuracy in both cluster energies and experimental properties. Benchmarks on densities, radial distribution functions, dielectric constants, diffusivity, and infrared spectra all show state-of-the-art performance and demonstrate the effectiveness of the training protocol.
Gao et al. (Thu,) studied this question.
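The ensemble knowledge distillation idea named in item (3) can be illustrated with a minimal toy sketch. It is not the workflow used for the water model: the 1D target function, the polynomial "models", the bootstrap resampling, and all names (`train_teachers`, `distill_student`, `degree`, and so on) are assumptions made purely for illustration. Several teachers are fitted to resampled copies of a small noisy dataset, their predictions are averaged on a dense set of unlabeled points, and a single student is then fitted to those averaged labels.

```python
import numpy as np
from numpy.polynomial import polynomial as P


def train_teachers(x, y, n_teachers=8, degree=3, seed=0):
    """Fit one polynomial teacher per bootstrap resample of (x, y)."""
    rng = np.random.default_rng(seed)
    teachers = []
    for _ in range(n_teachers):
        idx = rng.integers(0, len(x), size=len(x))  # sample with replacement
        teachers.append(P.polyfit(x[idx], y[idx], degree))
    return np.stack(teachers)  # shape: (n_teachers, degree + 1)


def distill_student(teachers, x_unlabeled, degree=3):
    """Fit a single student to the ensemble-averaged teacher predictions."""
    preds = np.stack([P.polyval(x_unlabeled, c) for c in teachers])
    soft_labels = preds.mean(axis=0)  # ensemble mean used as training target
    spread = preds.std(axis=0)        # teacher disagreement at each point
    student = P.polyfit(x_unlabeled, soft_labels, degree)
    return student, soft_labels, spread


# Toy data: a small, noisy labeled set (a stand-in for scarce reference data).
rng = np.random.default_rng(42)
x_lab = np.linspace(-1.0, 1.0, 20)
y_lab = np.sin(2.0 * x_lab) + 0.05 * rng.normal(size=x_lab.size)

# Dense unlabeled points on which the teachers generate labels.
x_unl = np.linspace(-1.0, 1.0, 200)

teachers = train_teachers(x_lab, y_lab)
student, soft_labels, spread = distill_student(teachers, x_unl)
```

Because the student here has the same functional form as the teachers, it simply recovers the averaged teacher coefficients. In a realistic setting the student would be a separate model trained on teacher-labeled configurations, and the `spread` array hints at how teacher disagreement could flag regions that need more sampling.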