Multi-ingredient cleansing foams pose a combinatorial design challenge because many component ratios must be experimentally screened. We integrate DPD-derived descriptors with machine learning to enable prescreening and prioritization of formulations, thereby reducing exploratory batches and accelerating design cycles while maintaining a traceable physical rationale. The modeling descriptors, namely the hydrophilic fraction (fHI) and the solubility parameter contrast (Δδ) defined relative to the polyethylene glycol thresholds, probe amphiphilicity, while DPD-derived potential energy and pressure summarize mesoscale self-assembly. Using 430 historical recipes, we benchmark nested cross-validation under three generalization regimes: Points Out (random formulations), Mixtures Out (novel combinations of known ingredients), and Compounds Out (novel raw ingredients; polyols, including humectants and amphiphilic derivatives, in this data set). To prevent leakage, all preprocessing (imputation and scaling) is fit strictly on training folds only, and identical outer-CV partitions are held across feature conditions to enable paired comparisons. Incorporating modeling and simulation descriptors improves mean R² from 0.665 to 0.716 (Points Out) and from 0.420 to 0.573 (Mixtures Out), and raises Compounds Out R² from 0.023 to 0.341. Paired difference tests with HC3-robust OLS and Holm correction confirm statistically significant gains: small to moderate for Points Out and moderate to large for Mixtures and Compounds Out. Among algorithms, tree-based ensembles outperform linear, kernel, and neural baselines, reflecting nonlinear composition–property relations. This workflow operationalizes AI-assisted formulation design by triaging candidate recipes prior to wet-lab screening, enabling faster decision-making and tangible experimental savings while retaining physical interpretability via DPD-derived descriptors.
Compounds Out results apply only to polyols in the present data set; generalization beyond polyols is out of scope and will require larger, more diverse data sets and transfer learning.
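The leakage control and the grouped splits described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the data are synthetic stand-ins, `mixture_id` is a hypothetical label for the ingredient combination of each recipe, the random forest is just one example of a tree-based ensemble, and only the outer cross-validation loop is shown (a nested setup would additionally tune hyperparameters inside each training fold). It assumes scikit-learn and NumPy are available.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GroupKFold, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the 430 recipes from the study).
n_samples, n_features = 120, 6
X = rng.normal(size=(n_samples, n_features))
y = X[:, 0] - 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=n_samples)
X[rng.random(X.shape) < 0.05] = np.nan  # a few missing entries to impute

# Hypothetical group label: which ingredient combination a recipe uses.
mixture_id = np.arange(n_samples) % 12

# Imputation and scaling sit inside the pipeline, so every CV split
# refits them on the training rows only (no statistics leak from test rows).
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    RandomForestRegressor(n_estimators=50, random_state=0),
)

# "Points Out" analogue: individual recipes are held out at random.
points_cv = KFold(n_splits=5, shuffle=True, random_state=0)
points_r2 = cross_val_score(model, X, y, cv=points_cv, scoring="r2")

# "Mixtures Out" analogue: whole groups are held out together.
mixtures_cv = GroupKFold(n_splits=5)
mixtures_r2 = cross_val_score(
    model, X, y, cv=mixtures_cv, groups=mixture_id, scoring="r2"
)

# Check that no group appears on both sides of any split.
groups_disjoint = all(
    set(mixture_id[train_idx]).isdisjoint(mixture_id[test_idx])
    for train_idx, test_idx in mixtures_cv.split(X, y, groups=mixture_id)
)

print("Points Out   mean R2:", points_r2.mean())
print("Mixtures Out mean R2:", mixtures_r2.mean())
print("Groups disjoint across folds:", groups_disjoint)
```

A Compounds Out analogue would follow the same pattern with the group label set to the held-out raw ingredient instead of the ingredient combination. Reusing the same splitter objects across feature conditions keeps the outer partitions identical, which is what makes paired comparisons possible.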
Hamaguchi et al. (Wed,) studied this question.