Abstract This article examines how principles of data readiness for AI apply to large scientific datasets used to train foundation models. We analyze archetypal workflows across four representative domains (climate, nuclear fusion, life sciences, and materials) to identify common preprocessing patterns and domain‐specific constraints. We introduce a two‐dimensional readiness model that combines canonical preprocessing patterns with a five‐level operational readiness scale, both tailored to high‐performance computing (HPC) environments. This model helps outline the key challenges in transforming large‐scale scientific data into formats suitable for scalable AI training. Together, the two dimensions form a conceptual maturity matrix that characterizes scientific data readiness and guides infrastructure development toward standardized, cross‐domain support for scalable and reproducible AI for science. We then evaluate this maturity matrix through case studies including ClimaX (climate), AFLOW (materials), OpenFold (proteomics), and DIII‐D fusion disruption‐prediction workflows, from which we distill lessons learned and recommendations to guide practitioners in developing robust AI‐readiness pipelines. Finally, we discuss cross‐cutting challenges that persist across scientific domains.
Brewer et al. studied this question.