Predicting whether a vehicle repossession will succeed can save lenders time and money, yet data and analytical models for this task are limited. This study develops and evaluates machine learning models that estimate the likelihood that a repossession assignment will result in vehicle recovery. Using the Automobile Repossession Dataset (ARD), a proprietary, anonymized, multi-year collection of assignment records from a regional repossession company, we assess model performance under alternative evaluation and data-splitting strategies. In addition to standard cross-validation on the full ARD, we evaluate three classification models when the data are split by repossession client (to assess generalization across clients and related fairness concerns) and when they are split by year (to assess robustness to temporal distribution shift). Results and statistical analyses indicate that the models perform similarly under standard cross-validation, but that the alternative evaluation strategies affect each model to a different degree. For example, CatBoost achieves the highest area under the receiver operating characteristic curve (AUC-ROC), 0.692, under standard cross-validation, whereas logistic regression, a simpler and faster model, performs competitively when evaluated on future data. These findings highlight that robust validation is essential for operational machine learning on imbalanced datasets and provide the first benchmark for repossession prediction. The study offers new insight for lenders and recovery agencies seeking data-driven efficiency improvements.
Sinclair et al. studied this question.
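The three evaluation strategies named in the abstract (standard cross-validation, splitting by client, and splitting by year) differ only in how records are assigned to training and test sets. The following is a minimal sketch of that difference, not the study's actual pipeline. The record fields (`client`, `year`, `recovered`), the synthetic data, the fold counts, and the choice to train on all earlier years for the temporal split are assumptions made for illustration; none of them come from the ARD.

```python
import random
from collections import defaultdict


def make_records(n=200, seed=0):
    """Synthetic stand-in records; field names and values are hypothetical."""
    rng = random.Random(seed)
    return [
        {
            "client": rng.randrange(8),        # hypothetical client id
            "year": 2018 + rng.randrange(5),   # hypothetical assignment year
            "recovered": rng.random() < 0.3,   # hypothetical imbalanced label
        }
        for _ in range(n)
    ]


def kfold_splits(records, k=5, seed=0):
    """Standard k-fold: shuffle record indices, then deal them into k folds."""
    idx = list(range(len(records)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


def client_splits(records, k=4):
    """Split by client: all records of a client fall on the same side."""
    by_client = defaultdict(list)
    for i, r in enumerate(records):
        by_client[r["client"]].append(i)
    clients = sorted(by_client)
    for f in range(k):
        test_clients = set(clients[f::k])
        if not test_clients:  # fewer clients than folds
            continue
        test = [i for c in sorted(test_clients) for i in by_client[c]]
        train = [i for c in clients if c not in test_clients
                 for i in by_client[c]]
        yield train, test


def year_splits(records):
    """Split by year: train on every earlier year, test on the next one."""
    years = sorted({r["year"] for r in records})
    for y in years[1:]:
        train = [i for i, r in enumerate(records) if r["year"] < y]
        test = [i for i, r in enumerate(records) if r["year"] == y]
        yield y, train, test


records = make_records()
for y, train, test in year_splits(records):
    print(f"test year {y}: {len(train)} train, {len(test)} test")
```

Under the client split, a model is never tested on a client it has seen during training. Under the year split, it is never tested on data older than its training data. Either constraint can lower measured performance relative to shuffled cross-validation.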