March 3, 2026

In Silico Prediction of Multicomponent Functional Material Formulations via Machine Learning Coupled with Molecular Simulation: A Case Study on Cleansing Foam Formulations

Key Points

Adding modeling and simulation descriptors raised mean R² from 0.665 to 0.716 for random held-out formulations (Points Out), with larger gains for novel mixtures and novel ingredients.
Models were benchmarked on 430 historical recipes using nested cross-validation under three generalization regimes: Points Out, Mixtures Out, and Compounds Out.
The method couples machine learning with molecular-simulation descriptors, so prescreening predictions retain a traceable physical rationale.
Tree-based ensembles outperformed linear, kernel, and neural baselines, consistent with nonlinear composition–property relations.
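
As a rough illustration of how the three generalization regimes differ, the sketch below builds train/test partitions for each from a handful of made-up recipes. The recipe contents, the function names, and the grouping rules are assumptions made for illustration, not the study's implementation. The rules assumed here are one group per recipe for Points Out, one group per ingredient set for Mixtures Out, and holding out every recipe that contains a chosen ingredient for Compounds Out.

```python
from collections import defaultdict

# Hypothetical toy recipes: ingredient name -> weight fraction.
recipes = [
    {"water": 0.70, "surfactant_A": 0.20, "glycerin": 0.10},
    {"water": 0.65, "surfactant_A": 0.25, "glycerin": 0.10},
    {"water": 0.70, "surfactant_B": 0.20, "glycerin": 0.10},
    {"water": 0.70, "surfactant_A": 0.20, "sorbitol": 0.10},
    {"water": 0.60, "surfactant_B": 0.25, "sorbitol": 0.15},
]


def points_out_groups(recipes):
    """Points Out (assumed rule): every recipe is its own group."""
    return list(range(len(recipes)))


def mixtures_out_groups(recipes):
    """Mixtures Out (assumed rule): recipes with the same ingredient set share a group."""
    return [frozenset(r) for r in recipes]


def compounds_out_split(recipes, held_out):
    """Compounds Out (assumed rule): test on every recipe containing `held_out`."""
    train = [i for i, r in enumerate(recipes) if held_out not in r]
    test = [i for i, r in enumerate(recipes) if held_out in r]
    return train, test


def group_k_fold(groups, k):
    """Assign whole groups to k folds round-robin and yield (train, test) index lists."""
    members = defaultdict(list)
    for idx, g in enumerate(groups):
        members[g].append(idx)
    folds = [[] for _ in range(k)]
    for j, idxs in enumerate(members.values()):
        folds[j % k].extend(idxs)
    for test in folds:
        test_set = set(test)
        train = [i for i in range(len(groups)) if i not in test_set]
        yield train, sorted(test)


# Materialize the partitions once so the same folds can be reused for every
# feature condition, which is what makes paired comparisons possible.
mixture_folds = list(group_k_fold(mixtures_out_groups(recipes), k=2))
```

In practice a library splitter such as scikit-learn's GroupKFold would serve the same purpose. The point of the sketch is only that whole groups stay on one side of each split, and that any imputation or scaling would then be fit on the training indices alone.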

Abstract

Multi-ingredient cleansing foams pose a combinatorial design challenge because many component ratios must be experimentally screened. We integrate descriptors derived from dissipative particle dynamics (DPD) with machine learning to enable prescreening and prioritization of formulations, thereby reducing exploratory batches and accelerating design cycles while maintaining a traceable physical rationale. The modeling descriptors, namely the hydrophilic fraction (fHI) and the solubility parameter contrast (Δδ) defined relative to the polyethylene glycol thresholds, probe amphiphilicity, while DPD-derived potential energy and pressure summarize mesoscale self-assembly. Using 430 historical recipes, we benchmark nested cross-validation under three generalization regimes: Points Out (random formulations), Mixtures Out (novel combinations of known ingredients), and Compounds Out (novel raw ingredients; polyols, including humectants and amphiphilic derivatives, in this data set). To prevent leakage, all preprocessing (imputation and scaling) is fit on training folds only, and identical outer-CV partitions are held across feature conditions to enable paired comparisons. Incorporating modeling and simulation descriptors improves mean R² from 0.665 to 0.716 (Points Out), from 0.420 to 0.573 (Mixtures Out), and from 0.023 to 0.341 (Compounds Out). Paired difference tests with HC3-robust OLS and Holm correction confirm statistically significant gains: small to moderate for Points Out and moderate to large for Mixtures Out and Compounds Out. Among algorithms, tree-based ensembles outperform linear, kernel, and neural baselines, reflecting nonlinear composition–property relations. This workflow operationalizes AI-assisted formulation design by triaging candidate recipes prior to wet-lab screening, enabling faster decision-making and tangible experimental savings while retaining physical interpretability via DPD-derived descriptors.
Compounds Out results apply only to polyols in the present data set; generalization beyond polyols is out of scope and will require larger, more diverse data sets and transfer learning.
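
The Holm correction named in the abstract is a step-down adjustment for multiple comparisons: the smallest p-value is multiplied by the number of tests, the next smallest by one fewer, and so on, with the adjusted values forced to be non-decreasing and capped at 1. A minimal sketch follows. The function name and the example p-values are illustrative and are not results from the study.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The rank-th smallest p-value is scaled by the number of tests remaining.
        candidate = min(1.0, (m - rank) * p_values[i])
        # Enforce monotonicity so a larger raw p-value never gets a smaller adjusted one.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


# Made-up p-values, one per hypothetical paired comparison.
example = [0.01, 0.04, 0.03]
adjusted = holm_adjust(example)
significant = [p <= 0.05 for p in adjusted]
```

With these made-up inputs only the first comparison remains below 0.05 after adjustment, even though all three raw p-values were below it.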
