The Tabular Prior-Data Fitted Network (TabPFN) is a foundation model: a pretrained, transformer-based neural network designed for prediction tasks on tabular data. Although TabPFN has demonstrated strong performance relative to state-of-the-art baselines, its generalisability to soil spectral datasets of varying sizes remains unclear. This study evaluates the performance of TabPFN for soil spectral analysis using mid-infrared (MIR) spectroscopy and compares it with partial least squares regression (PLSR), Cubist, and a convolutional neural network (CNN). Soil samples from the Kellogg Soil Survey Laboratory were used to predict three soil properties: total carbon (TC), pH, and Olsen method extractable phosphorus (Olsen-P), representing high, medium, and low predictability, respectively. An internal dataset from Texas (N = 620) and an external dataset from eastern Australia (N = 387) were used for testing. Models were trained using datasets of varying sizes and spectral similarity to the test sets. TabPFN achieved higher accuracy than all the baseline models in most cases, with an average RMSE reduction of 74% relative to PLSR and 39% relative to Cubist when predicting TC. Performance gains were particularly pronounced when TabPFN was trained on medium-sized datasets, and TabPFN also generalised better than the baselines when the training and test data were spectrally distinct. Soil property predictability influenced the performance of all models, with higher accuracy for TC than for pH and Olsen-P. For uncertainty quantification, TabPFN produced prediction intervals that closely matched the expected coverage in the external test set, indicating reasonable uncertainty generalisation, although quantile calibration was less reliable. Shapley additive explanations (SHAP) revealed that the wavenumbers contributing to the predictions corresponded to known spectral signatures of soil organic and inorganic carbon, supporting the interpretability of TabPFN.
Overall, TabPFN demonstrated high predictive accuracy, improved generalisability compared to conventional methods, and useful uncertainty estimates, highlighting its potential for application to both large and small soil spectral libraries.
Huang et al. studied this question.
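The two quantities the abstract reports, the relative RMSE reduction against a baseline and the coverage of prediction intervals, can be illustrated with a minimal sketch. The function names below are hypothetical and are not taken from the study's code, and the sketch assumes that coverage is measured as the fraction of observations falling inside their interval; the abstract does not state the exact formulation used.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def rmse_reduction_pct(rmse_baseline, rmse_model):
    """Percentage by which rmse_model is lower than rmse_baseline.

    Assumed definition: 100 * (baseline - model) / baseline.
    """
    return 100.0 * (rmse_baseline - rmse_model) / rmse_baseline


def interval_coverage(y_true, lower, upper):
    """Fraction of observations that fall inside [lower, upper].

    For a nominal 90% prediction interval, a well-calibrated model
    should give a value close to 0.90.
    """
    y_true = np.asarray(y_true, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    inside = (y_true >= lower) & (y_true <= upper)
    return float(np.mean(inside))
```

As a usage example with made-up numbers (not results from the study), `rmse_reduction_pct(2.0, 0.5)` gives a 75% reduction, and comparing `interval_coverage(y, lo, hi)` with the nominal level of the interval shows whether the intervals are too narrow or too wide on a given test set.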