What question did this study set out to answer?

The aim is to assess the effects of feature selection and various cross-validation strategies on prediction accuracy for ultrafine particles.

February 8, 2026

Bridging the Gap Between Data Reproduction and Prediction: The Impact of Feature Selection and Cross-Validation Strategies on Prediction of Ambient Ultrafine Particles Collected with Mobile Monitoring

Key Points

Used mobile ultrafine particle data collected in Toronto
Compared land-use regression models with different cross-validation strategies
Applied forward feature selection and tuned model hyperparameters within each cross-validation scheme
Evaluated model performance against a hold-out test set and stationary backyard measurements
Spatiotemporal cross-validation with feature selection reduced average percentage error to ∼79%, compared with ∼217% for a model tuned with random cross-validation
Models with random cross-validation showed overfitting and poor performance on independent samples
Proper model alignment with data structure improved reliability and prediction accuracy
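
The average percentage error quoted above can be illustrated with a short sketch. The summary does not give the study's exact formula, so this assumes the common definition: the mean of |predicted − observed| / observed, expressed in percent. The function name and the concentrations are hypothetical and are not taken from the study.

```python
def average_percentage_error(observed, predicted):
    """Mean of |predicted - observed| / observed, in percent.

    Assumed definition; the study may compute APE differently.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("at least one observation is required")
    if any(o <= 0 for o in observed):
        raise ValueError("observed concentrations must be positive")
    relative_errors = [abs(p - o) / o for o, p in zip(observed, predicted)]
    return 100.0 * sum(relative_errors) / len(relative_errors)


# Hypothetical UFP concentrations (particles per cm^3), for illustration only.
observed = [10000.0, 20000.0, 40000.0]
predicted = [15000.0, 10000.0, 40000.0]
ape = average_percentage_error(observed, predicted)
```

Under this definition, a value above 100% means the typical prediction misses the observed concentration by more than the concentration itself.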

Abstract

Reliable exposure assessment is vital for epidemiological research, but weaknesses in land-use regression (LUR) models undermine its validity. Using mobile ultrafine particle (UFP) data from Toronto, we compared LUR models trained under random, spatial, temporal, and spatiotemporal cross-validation (CV), with and without forward feature selection (FFS). Model hyperparameters and feature subsets were optimized within each CV scheme. Spatial CV folds were designed at fine scales to reflect UFP autocorrelation. Each approach was evaluated on a hold-out test set, across CV schemes, and against independent stationary backyard measurements. Models based on spatiotemporal CV coupled with FFS reduced overfitting, improved generalization, and produced stable exposure surfaces. These surfaces avoided the spatial artifacts and exaggerated variable effects typically seen in models trained with random CV. Models tuned with random CV overfit, performed poorly on independent samples, and were sensitive to outliers. The average percentage error (APE) decreased from ∼217% for a model tuned with random CV to ∼79% for a model tuned with spatiotemporal CV and FFS. Our findings demonstrate that proper alignment of model design with the data's spatiotemporal structure and modeling objective ensures reliability, minimizes data reproduction, and enables true prediction.
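
As a rough illustration of the approach the abstract describes, the sketch below combines spatiotemporal CV folds with forward feature selection. It is not the study's implementation. It rests on several assumptions: folds are built by holding out both a set of grid cells and a set of sampling days, ordinary least squares stands in for the LUR model, selection stops when CV RMSE no longer improves, and the data are synthetic. All function and variable names are hypothetical.

```python
import numpy as np


def assign_folds(ids, k, rng):
    """Give every unique id (grid cell or day) one of k fold labels."""
    _, inverse = np.unique(np.asarray(ids).ravel(), return_inverse=True)
    inverse = inverse.ravel()
    fold_of_unique = rng.permutation(inverse.max() + 1) % k
    return fold_of_unique[inverse]


def spatiotemporal_folds(cell_ids, day_ids, k, rng):
    """Test on held-out cells AND days; train only on the other cells and days."""
    space_fold = assign_folds(cell_ids, k, rng)
    time_fold = assign_folds(day_ids, k, rng)
    folds = []
    for i in range(k):
        test = np.where((space_fold == i) & (time_fold == i))[0]
        train = np.where((space_fold != i) & (time_fold != i))[0]
        if len(test) and len(train):
            folds.append((train, test))
    return folds


def cv_rmse(X, y, folds):
    """Pooled out-of-fold RMSE of a least-squares fit with an intercept."""
    squared_errors = []
    for train, test in folds:
        A_train = np.column_stack([np.ones(len(train)), X[train]])
        A_test = np.column_stack([np.ones(len(test)), X[test]])
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        squared_errors.append((A_test @ coef - y[test]) ** 2)
    return float(np.sqrt(np.mean(np.concatenate(squared_errors))))


def forward_feature_selection(X, y, folds):
    """Greedily add the feature that most lowers CV RMSE; stop when none helps."""
    selected, best = [], np.inf
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = {j: cv_rmse(X[:, selected + [j]], y, folds) for j in remaining}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best:
            break
        selected.append(j_best)
        remaining.remove(j_best)
        best = scores[j_best]
    return selected, best


# Synthetic data: features 0 and 1 drive the response, 2 and 3 are pure noise.
rng = np.random.default_rng(0)
n = 600
cell_ids = rng.integers(0, 30, n)   # hypothetical grid-cell index per sample
day_ids = rng.integers(0, 10, n)    # hypothetical sampling-day index per sample
X = rng.normal(size=(n, 4))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

folds = spatiotemporal_folds(cell_ids, day_ids, k=5, rng=rng)
selected, best_rmse = forward_feature_selection(X, y, folds)
```

Because each candidate feature is scored only on cells and days absent from training, a predictor that merely helps reproduce nearby or same-day measurements gains little, which is the behavior the abstract attributes to spatiotemporal CV with FFS.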
