February 28, 2024 · Open Access

Data leakage inflates prediction performance in connectome-based machine learning models

Abstract

Predictive modeling is a central technique in neuroimaging to identify brain-behavior relationships and test their generalizability to unseen data. However, data leakage undermines the validity of predictive models by breaching the separation between training and test data. Leakage is always an incorrect practice but still pervasive in machine learning. Understanding its effects on neuroimaging predictive models can inform how leakage affects existing literature. Here, we investigate the effects of five forms of leakage (involving feature selection, covariate correction, and dependence between subjects) on functional and structural connectome-based machine learning models across four datasets and three phenotypes. Leakage via feature selection and repeated subjects drastically inflates prediction performance, whereas other forms of leakage have minor effects. Furthermore, small datasets exacerbate the effects of leakage. Overall, our results illustrate the variable effects of leakage and underscore the importance of avoiding data leakage to improve the validity and reproducibility of predictive modeling.
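To make the feature-selection form of leakage concrete, the following is a minimal toy sketch, not the authors' pipeline. It assumes synthetic Gaussian noise in place of connectome features, correlation-based feature selection, and a small hand-written ridge regression; all function names (`top_k_features`, `ridge_fit_predict`, `cv_correlation`) and parameter values are hypothetical choices made for illustration. Because the target is pure noise, any positive cross-validated correlation in the leaky variant comes from selecting features on the full dataset before splitting.

```python
import numpy as np


def top_k_features(X, y, k):
    """Return indices of the k features with the largest |Pearson r| with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
    r = (Xc * yc[:, None]).sum(axis=0) / denom
    return np.argsort(-np.abs(r))[:k]


def ridge_fit_predict(X_train, y_train, X_test, alpha=1.0):
    """Fit ridge regression on the training data and predict the test data."""
    mu, y_mean = X_train.mean(axis=0), y_train.mean()
    A = X_train - mu
    w = np.linalg.solve(A.T @ A + alpha * np.eye(A.shape[1]),
                        A.T @ (y_train - y_mean))
    return (X_test - mu) @ w + y_mean


def cv_correlation(X, y, k=20, n_folds=5, leaky=False, seed=0):
    """K-fold CV; returns Pearson r between out-of-fold predictions and y.

    leaky=True  selects features once on ALL subjects (test data included).
    leaky=False selects features inside each training fold only.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    preds = np.empty(len(y))
    global_idx = top_k_features(X, y, k) if leaky else None
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        idx = global_idx if leaky else top_k_features(X[train], y[train], k)
        preds[test] = ridge_fit_predict(X[train][:, idx], y[train],
                                        X[test][:, idx])
    return np.corrcoef(preds, y)[0, 1]


# Assumed toy data: 100 "subjects", 5000 noise features, noise target.
rng = np.random.default_rng(42)
X = rng.standard_normal((100, 5000))
y = rng.standard_normal(100)  # no true brain-behavior relationship exists

r_leaky = cv_correlation(X, y, leaky=True)
r_clean = cv_correlation(X, y, leaky=False)
print(f"leaky selection:  r = {r_leaky:.2f}")
print(f"nested selection: r = {r_clean:.2f}")
```

In this sketch the only difference between the two runs is where `top_k_features` is called. Keeping every data-dependent step (selection, scaling, covariate correction) inside the training fold is the general pattern for avoiding this kind of leakage.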

Authors

Matthew Rosenblatt

Yale University

Link Tejavibulya

Cornell University

Rongtao Jiang

Beijing Normal University

Journals

Nature Communications

Institutions

Yale University
