What question did this study set out to answer?

The study aims to optimize FTIR-based machine learning models for predicting milk traits by assessing how sample size, number of farms, and spectral pre-processing affect performance.

March 27, 2026 | Open Access

Optimizing FTIR-Based Machine Learning Models from Cow Milk: A New Framework to Evaluate Sample Size, Farm Number, and Spectral Pre-Processing

Key points

Analyzed 407,632 individual milk samples from 3,408 farms.
Tested 144 training configurations that varied the number of farms (2 to 100) and the number of samples per farm (10 to 100).
Evaluated six spectral pre-processing methods across 17 target variables.
Developed four regression trees to guide model development.
Identified four distinct predictability classes, each with unique sample size requirements.
Found that selecting informative wavenumbers improved R² by up to 0.32.
Showed that optimal performance requires data from more than ten farms.
Found a performance ceiling beyond which additional samples yield no further gain in accuracy.
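
As an illustration of what one training configuration might look like, the sketch below draws a fixed number of farms and a fixed number of samples per farm from a pool of records. The function name `draw_training_set`, the toy farm and sample identifiers, and the uniform random sampling without replacement are all assumptions made for this example; the summary does not describe how the study actually drew its training sets.

```python
import random
from collections import defaultdict


def draw_training_set(records, n_farms, samples_per_farm, seed=0):
    """Pick n_farms farms, then samples_per_farm samples from each one.

    records is a list of (farm_id, sample_id) pairs. Uniform random
    sampling without replacement is an assumption of this sketch.
    """
    rng = random.Random(seed)

    # Group sample identifiers by farm.
    by_farm = defaultdict(list)
    for farm_id, sample_id in records:
        by_farm[farm_id].append(sample_id)

    # Only farms that hold enough samples can take part.
    eligible = sorted(f for f, s in by_farm.items() if len(s) >= samples_per_farm)
    if len(eligible) < n_farms:
        raise ValueError("not enough farms with the requested number of samples")

    training = []
    for farm_id in rng.sample(eligible, n_farms):
        for sample_id in rng.sample(by_farm[farm_id], samples_per_farm):
            training.append((farm_id, sample_id))
    return training


# Toy pool: 30 hypothetical farms with 120 samples each.
toy_records = [(f"farm{f}", f"farm{f}-s{s}") for f in range(30) for s in range(120)]
training = draw_training_set(toy_records, n_farms=10, samples_per_farm=100)
print(len(training))  # 1000
```

Fixing the seed keeps a given configuration reproducible, which matters when many configurations are compared against the same test set.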

Abstract

• Four predictability classes identified, each with distinct sample size requirements and performance
• Using informative wavenumbers improves accuracy and reduces required sample size
• Reaching optimal performance requires data from more than ten farms
• Milk trait models reach a performance ceiling where more samples add no accuracy
• We provide four regression trees to guide model development

Machine learning has been used to predict multiple traits and milk components directly from individual cow milk samples using Fourier transform infrared (FTIR) spectroscopy. Over the years, numerous models have been developed using various configurations, including the number of farms, number of samples per farm, total sample size, and different spectral pre-processing methods. Evaluating how these choices impact model performance is critical for building robust models. In this study, we propose a framework to guide the selection of optimal configurations for FTIR-based machine learning (FTML) models. Using a dataset of 407,632 individual milk samples from 3,408 farms, we tested 144 training configurations that varied in the number of farms (2 to 100) and the number of samples per farm (10 to 100), resulting in training set sizes ranging from 20 to 10,000 samples. We further evaluated six spectral pre-processing strategies, including the use of the first derivative and informative wavenumber selection, across 17 target variables (e.g., fat, protein, fatty acids, and SCC, plus a noisy negative control). This yielded 14,688 FTML models, all tested on a common independent test set of 800 samples from 20 unseen farms. Our results revealed four distinct predictability classes of target variables, each with different optimal sample size requirements and performance. Selecting informative wavenumbers consistently improved R² by 0.10, 0.28, 0.32, and 0.05 in classes 1 to 4, respectively, and reduced the number of samples needed to reach maximal performance in all classes. We provide four regression trees to support FTML model development and show that using data from more than ten farms is necessary to achieve optimal performance.
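
The counts quoted in the abstract can be checked with a few lines of arithmetic. The sketch below only multiplies figures stated above; the constant and function names are chosen for this example and do not come from the study.

```python
# Figures quoted in the abstract.
N_TRAINING_CONFIGS = 144  # combinations of farm count and samples per farm
N_PREPROCESSING = 6       # spectral pre-processing strategies
N_TARGETS = 17            # target variables, including the negative control


def total_models(n_configs, n_preprocessing, n_targets):
    """One model per (configuration, pre-processing, target) combination."""
    return n_configs * n_preprocessing * n_targets


def training_set_size(n_farms, samples_per_farm):
    """Total samples in a configuration with equal samples per farm."""
    return n_farms * samples_per_farm


n_models = total_models(N_TRAINING_CONFIGS, N_PREPROCESSING, N_TARGETS)
smallest = training_set_size(2, 10)
largest = training_set_size(100, 100)
print(n_models, smallest, largest)  # 14688 20 10000
```

The product matches the 14,688 models reported, and the two extremes match the stated training set range of 20 to 10,000 samples.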

Cite This Study

Touil et al. studied this question.

synapsesocial.com/papers/69c61fa915a0a509bde18239
https://doi.org/10.1016/j.atech.2026.102043
