Abstract. Understanding the nature of the data and dealing with outliers and redundant information are key issues when designing a proper metric for clustering and classification. Distance-based generalized linear models are prediction tools that can be applied to any kind of data, provided a distance measure can be computed between units. In this work, robust ad hoc metrics are proposed for use in the predictors’ space of these models, making the tool more flexible. Their performance is evaluated by means of an extensive simulation study and compared with that of metrics based on Gower’s and Euclidean distances, using several multivariate heterogeneous data sets that contain anomalous observations. Accuracy, precision, recall, F1 score, AUC-ROC and log loss are used to evaluate how well the responses are predicted. Applications to real data illustrate the predictive power of these models, which appear to be competitive with state-of-the-art machine learning approaches such as random forests and neural networks. Computations are made using the package for .
Boj et al. studied this question.