Cone Penetration Test (CPT) data are widely used to predict shear wave velocity (Vₛ). This study compares Vₛ predictions from Machine Learning (ML) models with those from conventional regression for a global archive of seismic piezocone data. Prediction strategies are evaluated under grouped, nested cross-validation to emulate deployment to new soundings and sites. The database is curated into five practice-reflective feature regimes and analyzed with decision trees (DT), random forest (RF), gradient boosting (GB), support vector regression (SVR), and a shallow neural network (NN). Corrected tip resistance (qₜ) emerges as the backbone predictor, while pore pressure (u₂) is the highest-value complement in clays. Tree ensembles attain the highest test accuracy in d+qₜ+u₂ regimes, and SVR and NN are competitive when u₂ is present. Predictions improved steeply as clay-rich records were added, then plateaued beyond a few hundred samples. Relative to empirical correlations, flexible models improve R² by roughly 0.04 to 0.1 for feature sets that capture drainage or stress-history interactions, while regressions remain competitive in the austere qₜ-only regime. We conclude with practical recommendations: prioritize qₜ+u₂ acquisition where feasible, expand archives when sensors are limited, and apply group-driven validation to secure transferable, bias-controlled estimates of Vₛ.
Hu et al. (Thu,) studied this question.
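The grouped, nested cross-validation named in the abstract can be illustrated with a minimal sketch. It assumes scikit-learn, a random forest as the example learner, and sounding IDs as the grouping variable. The helper name grouped_nested_cv, the hyperparameter grid, and the synthetic data are hypothetical illustrations, not the study's pipeline or data. The outer loop holds out whole groups for testing, and the inner loop tunes hyperparameters using only the remaining groups.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, GroupKFold


def grouped_nested_cv(X, y, groups, param_grid, n_outer=5, n_inner=3, seed=0):
    """Return one held-out R^2 per outer fold; no group spans train and test."""
    outer = GroupKFold(n_splits=n_outer)
    scores = []
    for train_idx, test_idx in outer.split(X, y, groups):
        # Inner loop: tune hyperparameters on the outer-training groups only.
        search = GridSearchCV(
            RandomForestRegressor(random_state=seed),
            param_grid,
            cv=GroupKFold(n_splits=n_inner),
            scoring="r2",
        )
        search.fit(X[train_idx], y[train_idx], groups=groups[train_idx])
        # Outer loop: score the refit best model on groups it has never seen.
        y_pred = search.predict(X[test_idx])
        scores.append(r2_score(y[test_idx], y_pred))
    return np.array(scores)


# Synthetic stand-in data (assumption, NOT from the study).
# Columns loosely mimic depth, q_t, and u_2; groups mimic sounding IDs.
rng = np.random.default_rng(0)
groups = np.repeat(np.arange(20), 15)  # 20 hypothetical soundings, 15 rows each
X = rng.uniform(size=(groups.size, 3))
y = 100 + 150 * X[:, 1] + 40 * X[:, 0] + rng.normal(scale=10, size=groups.size)

param_grid = {"max_depth": [3, None], "n_estimators": [50]}  # illustrative grid
scores = grouped_nested_cv(X, y, groups, param_grid)
print("Held-out R^2 per outer fold:", scores.round(3))
```

Splitting by group rather than by row keeps all records from one sounding on the same side of each split, which is what lets the held-out score approximate performance on a new sounding or site.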