• Four predictability classes identified with distinct sample size and performance
• Using informative wavenumbers improves accuracy and reduces required sample size
• Reaching optimal performance requires data from more than ten farms
• Milk trait models reach a performance ceiling where more samples add no accuracy
• We provide four regression trees to guide model development

Machine learning has been used to predict multiple traits and milk components directly from individual cow milk samples using Fourier transform infrared (FTIR) spectroscopy. Over the years, numerous models have been developed under varying configurations, differing in the number of farms, the number of samples per farm, the total sample size, and the spectral pre-processing method. Evaluating how these choices affect model performance is critical for building robust models. In this study, we propose a framework to guide the selection of optimal configurations for FTIR-based machine learning (FTML) models. Using a dataset of 407,632 individual milk samples from 3,408 farms, we tested 144 training configurations that varied in the number of farms (2 to 100) and the number of samples per farm (10 to 100), resulting in training set sizes ranging from 20 to 10,000 samples. We further evaluated six spectral pre-processing strategies, including the use of the first derivative and informative wavenumber selection, across 17 target variables (milk traits such as fat, protein, fatty acids, and somatic cell count (SCC), plus a noisy negative control). This yielded 14,688 FTML models, all tested on a common independent test set of 800 samples from 20 unseen farms. Our results revealed four distinct predictability classes of target variables, each with different optimal sample size requirements and performance. Selecting informative wavenumbers consistently improved R² by 0.10, 0.28, 0.32, and 0.05 in classes 1 to 4, respectively, and reduced the number of samples needed to reach maximal performance in all classes.
We provide four regression trees to support FTML model development and show that using data from more than ten farms is necessary to achieve optimal performance.
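The experimental design described above can be illustrated with a minimal Python sketch. It checks that the stated counts multiply to the reported number of models and shows one plausible way to draw a nested training set (farms first, then samples within each farm). The helper `draw_training_set`, the toy farm and sample identifiers, and the eligibility rule are assumptions made for illustration; the abstract does not list the individual grid levels or describe the sampling procedure.

```python
import random

# Counts quoted in the abstract; the individual grid levels are not listed there.
N_TRAINING_CONFIGS = 144  # farm-count x samples-per-farm combinations
N_PREPROCESSING = 6       # spectral pre-processing strategies
N_TARGETS = 17            # target variables, including the negative control
N_MODELS = N_TRAINING_CONFIGS * N_PREPROCESSING * N_TARGETS  # 14,688


def draw_training_set(samples_by_farm, n_farms, n_per_farm, seed=0):
    """Hypothetical helper: pick n_farms farms, then n_per_farm samples from each."""
    rng = random.Random(seed)
    # Only farms with enough samples are eligible (an assumption of this sketch).
    eligible = sorted(f for f, s in samples_by_farm.items() if len(s) >= n_per_farm)
    farms = rng.sample(eligible, n_farms)
    return {farm: rng.sample(samples_by_farm[farm], n_per_farm) for farm in farms}


# Toy data: 5 farms with 12 sample IDs each (invented for illustration).
toy = {f"farm_{i}": [f"farm_{i}_s{j}" for j in range(12)] for i in range(5)}

# Smallest configuration in the abstract: 2 farms x 10 samples per farm.
smallest = draw_training_set(toy, n_farms=2, n_per_farm=10)
n_smallest = sum(len(v) for v in smallest.values())
```

With the smallest configuration (2 farms, 10 samples per farm) the drawn set holds 20 samples, matching the lower bound of the training set sizes; the largest configuration (100 farms, 100 samples per farm) gives the upper bound of 10,000.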
Touil et al. (Sun,) studied this question.