The accuracy of macrobiological community predictions largely depends on the taxonomic scale considered. Extending such predictions to soil microbial communities remains an important challenge. This is due not only to the lack of reliable benchmark data, but also to the greater diversity of soil microorganisms compared with those of other environments. In this study, we use six traditional machine learning regression models and one deep learning regressor to predict the relative frequencies of bacterial and fungal communities within the soil microbiome from environmental factors. We analyze two publicly available soil microbiome datasets: (1) data collected by Averill and co-authors and analyzed in a recent Nature Ecology and Evolution article, and (2) data extracted from the NEON database. With these datasets, we estimate the composition of bacterial and fungal communities at the functional scale (i.e. the functional group level) and at the taxonomic scales (i.e. the phylum, class, order, family, and genus levels). Our findings suggest a general pattern across the observed taxonomic scales, according to which the predictability of the soil microbiome increases with taxonomic scale. A notable exception occurs when the machine learning models are applied to predict bacterial communities at the functional group level for the data of Averill et al.: in that case, all of the models fail to provide accurate predictions. The best overall results include a coefficient of determination of R²=0.57 at the phylum level, provided by the Gradient boosting model for the bacterial dataset collected by Averill et al., and R²=0.45 at the functional group level, provided by both the Random forest and Gradient boosting models for the fungal dataset extracted from NEON. The best overall predictions were obtained using the Random forest and k-NN models.
Random forest yielded the best average results at the phylum, class, and order taxonomic levels, while k-NN was particularly effective at lower taxonomic levels, including family and genus. Moreover, both of these traditional machine learning models usually outperformed the deep learning-based Multilayer perceptron regressor. This is probably due to the relatively limited number of samples available for model training in the two public datasets analyzed in our study. The data and code needed to reproduce the presented results are available in our GitHub repository at: https://github.com/Vincent-Therrien/micropyome.
Aouabed et al. studied this question.
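To make the evaluation setup described above more concrete, the following is a minimal sketch of how relative frequencies of several taxa might be predicted from environmental factors with a k-NN regressor and scored with the coefficient of determination R². It is not the authors' implementation and does not use the micropyome code: the data are synthetic, the two "environmental factors" and three "taxa" are hypothetical, and the helper names `r2_score`, `knn_predict`, and `make_sample` are introduced here for illustration only.

```python
import math
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def knn_predict(train_x, train_y, query, k=5):
    """Average the abundance vectors of the k nearest training samples."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    nearest = order[:k]
    n_taxa = len(train_y[0])
    return [sum(train_y[i][j] for i in nearest) / k for j in range(n_taxa)]


def make_sample(rng):
    """Synthetic sample (assumption): two scaled environmental factors
    and the relative frequencies of three hypothetical taxa."""
    x = [rng.random(), rng.random()]
    scores = [2.0 * x[0], 2.0 * x[1], 1.0]
    weights = [math.exp(s + rng.gauss(0.0, 0.1)) for s in scores]
    total = sum(weights)
    return x, [w / total for w in weights]  # frequencies sum to 1


rng = random.Random(0)
samples = [make_sample(rng) for _ in range(200)]
train, test = samples[:150], samples[150:]
train_x = [x for x, _ in train]
train_y = [y for _, y in train]

# Predict the full abundance vector for each held-out sample.
preds = [knn_predict(train_x, train_y, x) for x, _ in test]

# Score each taxon separately, then average across taxa.
n_taxa = len(train_y[0])
r2_per_taxon = [
    r2_score([y[j] for _, y in test], [p[j] for p in preds])
    for j in range(n_taxa)
]
mean_r2 = sum(r2_per_taxon) / n_taxa
print("R2 per taxon:", [round(v, 3) for v in r2_per_taxon])
print("Mean R2:", round(mean_r2, 3))
```

Because k-NN averages vectors that each sum to 1, the predicted frequencies also sum to 1, which is one reason such a model is a convenient baseline for compositional targets. Averaging R² across taxa is only one possible aggregation choice, assumed here for simplicity.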