What question did this study set out to answer?

February 13, 2026 · Open Access

Data-driven discovery of digital twins in biomedical research

Key Points

The study aims to explore methodologies for automatically inferring digital twins from biological datasets and to identify challenges in the process.
Reviewed 177 methodologies for inferring digital twins from biological time series.
Evaluated algorithms based on biological and methodological challenges.
Explored the application of sparse regression and symbolic regression techniques.
Sparse regression generally outperformed symbolic regression, especially with Bayesian frameworks.
Deep learning and large language models show promise but require improved reliability.
No single method addresses all challenges; hybrid frameworks are recommended for future development.

Abstract

Recent technological advances have expanded the availability of high-throughput biological datasets, opening the way to the reliable design of digital twins of biomedical systems or patients. Such computational tools represent key chemical reaction networks driving perturbation or drug response and can profoundly guide drug discovery and personalized therapeutics. Yet, their development still depends on laborious data integration by the human modeler, so that automated approaches are critically needed. The successes of data-driven system discovery in Physics, rooted in clean datasets and well-defined governing laws, have fueled interest in applying similar techniques in Biology, which presents unique challenges. Here, we reviewed 177 methodologies for automatically inferring digital twins from biological time series, which mostly involved symbolic or sparse regression, and recapitulated them in a Shiny app. We evaluated algorithms according to eight biological and methodological challenges, associated with integrating noisy/incomplete data, multiple conditions, prior knowledge, latent variables, or dealing with high dimensionality, unobserved variable derivatives, candidate library design, and uncertainty quantification. Upon these criteria, sparse regression generally outperformed symbolic regression, particularly when using Bayesian frameworks. Next, deep learning and large language models further emerge as innovative tools to integrate prior knowledge, although their reliability and consistency need to be improved. While no single method addresses all challenges, we argue that progress in learning digital twins will come from hybrid and modular frameworks combining chemical reaction network-based mechanistic grounding, Bayesian uncertainty quantification, and the generative and knowledge integration capacities of deep learning.
To support their development, we further highlight key components required for future benchmark development to evaluate methods across all challenges.
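To make the idea of sparse regression over a candidate library concrete, here is a minimal sketch using sequentially thresholded least squares, one common sparse-regression scheme. It is not one of the reviewed implementations. The toy production-degradation model dx/dt = s - d·x, the parameter values, the library [1, x, x²], the threshold, and every function name below are assumptions chosen for illustration. The data are synthetic and noise-free, so the sketch sidesteps most of the challenges listed in the abstract (noise, latent variables, high dimensionality, uncertainty quantification); derivatives are estimated by central finite differences because they are not observed directly.

```python
import math


def solve_linear(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - rest) / m[i][i]
    return x


def least_squares(theta, y, active):
    """Fit y on the active library columns via the normal equations."""
    ata = [[sum(row[i] * row[j] for row in theta) for j in active] for i in active]
    aty = [sum(row[i] * yk for row, yk in zip(theta, y)) for i in active]
    return solve_linear(ata, aty)


def stlsq(theta, y, threshold=0.1, max_iter=10):
    """Sequentially thresholded least squares: fit, prune small terms, refit."""
    n_terms = len(theta[0])
    active = list(range(n_terms))
    coefs = [0.0] * n_terms
    for _ in range(max_iter):
        if not active:
            break
        fit = least_squares(theta, y, active)
        coefs = [0.0] * n_terms
        for idx, c in zip(active, fit):
            coefs[idx] = c
        kept = [i for i in active if abs(coefs[i]) >= threshold]
        for i in active:
            if i not in kept:
                coefs[i] = 0.0  # pruned terms are reported as exactly zero
        if kept == active:
            break
        active = kept
    return coefs


# Synthetic time series from the assumed model dx/dt = s - d*x with x(0) = 0.
s_true, d_true, h = 2.0, 0.5, 0.01
t = [k * h for k in range(1001)]
x = [(s_true / d_true) * (1.0 - math.exp(-d_true * tk)) for tk in t]

# Derivatives are unobserved, so estimate them at interior points.
dxdt = [(x[k + 1] - x[k - 1]) / (2 * h) for k in range(1, len(x) - 1)]

# Candidate library evaluated at the same interior points: [1, x, x^2].
theta = [[1.0, xk, xk * xk] for xk in x[1:-1]]

coefs = stlsq(theta, dxdt)
for name, c in zip(["1", "x", "x^2"], coefs):
    print(f"{name}: {c:.4f}")
```

On this clean toy problem the fit should keep the constant and linear terms, with coefficients close to s and -d, and prune the quadratic term. With noisy or sparsely sampled data the finite-difference step and the fixed threshold become the weak points.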

Cite This Study

Métayer et al. studied this question.

synapsesocial.com/papers/698ebf9785a1ff6a93016f4f https://doi.org/10.1093/bib/bbaf722
