January 1, 2021 · Open Access

SPlit: An Optimal Method for Data Splitting

Abstract

In this article, we propose an optimal method referred to as SPlit for splitting a dataset into training and testing sets. SPlit is based on the method of support points (SP), which was initially developed for finding the optimal representative points of a continuous distribution. We adapt SP for subsampling from a dataset using a sequential nearest neighbor algorithm. We also extend SP to deal with categorical variables so that SPlit can be applied to both regression and classification problems. The implementation of SPlit on real datasets shows substantial improvement in the worst-case testing performance for several modeling methods compared to the commonly used random splitting procedure.
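The two ingredients named in the abstract, representative points of a distribution and a sequential nearest neighbor assignment, can be illustrated with a rough Python sketch. This is not the authors' implementation. It assumes numerical, already standardized data, approximates support points with a simple fixed-point iteration on the stationarity condition of the energy distance, and does not cover the categorical extension. All function names (`approx_support_points`, `sequential_nearest_neighbor`, `sp_style_split`) and parameters are hypothetical.

```python
import numpy as np

def pairwise_dist(a, b):
    """Euclidean distances between rows of a (n, p) and rows of b (m, p)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def energy_distance(sub, full):
    """Sample energy distance between a subset and the full dataset."""
    d_sf = pairwise_dist(sub, full).mean()
    d_ss = pairwise_dist(sub, sub).mean()
    d_ff = pairwise_dist(full, full).mean()
    return 2.0 * d_sf - d_ss - d_ff

def approx_support_points(data, n_points, n_iter=50, seed=0):
    """Assumed simplification: fixed-point updates that pull points toward
    the data and push them away from each other."""
    rng = np.random.default_rng(seed)
    N = data.shape[0]
    pts = data[rng.choice(N, size=n_points, replace=False)].copy()
    pts += 1e-3 * rng.standard_normal(pts.shape)  # jitter off the data rows
    eps = 1e-12
    for _ in range(n_iter):
        d_pd = pairwise_dist(pts, data) + eps      # point-to-data distances
        d_pp = pairwise_dist(pts, pts) + eps       # point-to-point distances
        np.fill_diagonal(d_pp, np.inf)             # drop self-interaction
        w = 1.0 / d_pd
        attract = w @ data
        diff = pts[:, None, :] - pts[None, :, :]
        repel = (diff / d_pp[:, :, None]).sum(axis=1) * (N / n_points)
        pts = (attract + repel) / w.sum(axis=1, keepdims=True)
    return pts

def sequential_nearest_neighbor(points, data):
    """Match each point, in order, to its nearest not-yet-used data row."""
    available = np.ones(data.shape[0], dtype=bool)
    chosen = []
    for p in points:
        d = np.linalg.norm(data - p, axis=1)
        d[~available] = np.inf                     # rows already taken
        idx = int(np.argmin(d))
        chosen.append(idx)
        available[idx] = False
    return np.array(chosen)

def sp_style_split(data, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) for standardized numerical data."""
    n_test = max(1, int(round(test_fraction * data.shape[0])))
    pts = approx_support_points(data, n_test, seed=seed)
    test_idx = sequential_nearest_neighbor(pts, data)
    train_idx = np.setdiff1d(np.arange(data.shape[0]), test_idx)
    return train_idx, test_idx
```

A call such as `train_idx, test_idx = sp_style_split(X, test_fraction=0.2)` on a standardized array `X` returns disjoint index sets. Comparing `energy_distance(X[test_idx], X)` with the same quantity for a random subset is one way to check how representative the test set is.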

Cite this study

Joseph and Vakayil (2021). SPlit: An Optimal Method for Data Splitting.

synapsesocial.com/papers/6a130f5bc031bb6829a7cc25 — DOI: https://doi.org/10.6084/m9.figshare.14501694.v1

Also consider

Synapse lists 5 closely related papers. Consider them for comparative context:

Classification and Regression by randomForest · 2007 · 18,407 citations
Knowledge-Based Systems · 1990 · 17 citations
Projected support points: a new method for high-dimensional data reduction · 2017 · 2 citations
Validation of Regression Models: Methods and Examples · 1977 · 190 citations
Random Number Generation and Quasi-Monte Carlo Methods · 1992 · 2,726 citations

Authors

V. Roshan Joseph

Georgia Institute of Technology

Akhil Vakayil

Georgia Institute of Technology

Institutions

Georgia Institute of Technology
