Machine Learning is increasingly embedded in critical software systems, making their quality assurance a matter of growing concern. While the research community has proposed several techniques for testing ML-enabled systems, there is limited empirical evidence on whether these techniques are adopted in practice or align with developers’ testing workflows. This paper presents a two-step empirical investigation aimed at characterizing the current landscape of ML testing in real-world development. Our goal is to understand how developers approach testing, whether proposed techniques are adopted, and what barriers hinder their implementation. We designed a mixed-method study that triangulates insights from two complementary sources: (1) a mining study of 398 open-source repositories to analyze implemented testing strategies and tool usage; and (2) a survey of 100 practitioners to capture perceptions, motivations, and practical challenges. Our findings reveal that developers rely heavily on foundational strategies like Smoke Testing and Rule-Based Checking, implemented through custom testing logic built on general-purpose libraries (e.g., PyTest, NumPy). Conversely, we identified a critical adoption gap in specialized tools and advanced techniques such as Metamorphic Testing, which are rarely implemented despite their academic prominence. Our survey indicates that this gap is driven by practical barriers, including high integration costs and a poor fit with existing developer workflows. These findings suggest that future research and tooling must prioritize usability, integration, and a clearer alignment with the pragmatic needs of developers.

• Large-scale mixed-method investigation of ML testing practices in real-world development.
• Triangulated insights from 398 open-source repositories (2,018 test files) and 100 practitioners.
• Practitioners rely on foundational strategies like Smoke Testing, implemented via custom solutions.
• Critical adoption gap for specialized tools and advanced techniques due to workflow integration barriers.
• Released datasets, analysis scripts, and a technical report to enable replication.
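
To make the three strategies named above concrete, the following is a minimal illustrative sketch and is not taken from the study or its dataset. It assumes a hypothetical `predict_proba` function standing in for a trained classifier (a fixed linear layer followed by softmax, with made-up weights), and writes the checks as plain PyTest-style functions using NumPy assertions.

```python
import numpy as np

# Hypothetical stand-in for a trained model: a fixed linear layer plus softmax.
# The weights are illustrative values, not from any real system.
RNG = np.random.default_rng(0)
WEIGHTS = RNG.normal(size=(4, 3))  # 4 input features, 3 classes


def predict_proba(features: np.ndarray) -> np.ndarray:
    """Return class probabilities for each row of `features`."""
    logits = features @ WEIGHTS
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)


def test_smoke():
    # Smoke test: the model runs end to end and returns the expected shape.
    batch = RNG.normal(size=(8, 4))
    assert predict_proba(batch).shape == (8, 3)


def test_rule_based():
    # Rule-based check: outputs must be finite, valid probability distributions.
    probs = predict_proba(RNG.normal(size=(8, 4)))
    assert np.all(np.isfinite(probs))
    assert np.all((probs >= 0.0) & (probs <= 1.0))
    np.testing.assert_allclose(probs.sum(axis=1), 1.0, rtol=1e-9)


def test_metamorphic_row_permutation():
    # Metamorphic relation: reordering the input rows must reorder the
    # predictions identically, because each row is scored independently.
    batch = RNG.normal(size=(8, 4))
    order = RNG.permutation(8)
    np.testing.assert_allclose(
        predict_proba(batch[order]), predict_proba(batch)[order]
    )
```

The first two checks need only an expected shape or a fixed rule, which may help explain why they are easy to write as custom logic. The metamorphic check needs no ground-truth labels either, but it does require identifying a relation that the model is expected to satisfy.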
Cannavale et al. (Fri,) studied this question.