Offline evaluation is the predominant method for scientific research in recommender systems, enabling the comparison of alternative recommendation approaches using pre-collected datasets and computational metrics without involving human participants. However, recent reproducibility studies reveal that many offline evaluations in the literature lack scientific rigor or adopt research practices that cast doubt on the validity of their findings. A particularly common and consequential flaw is the comparison of newly proposed machine learning models against untuned or poorly tuned baseline models. Combined with limited reproducibility, such practices raise serious concerns about the true progress achieved by increasingly complex recommendation algorithms. In this editorial, we argue for stronger methodological standards and summarize essential guidance and best practices for conducting rigorous offline evaluations of recommender systems. Accordingly, ACM Transactions on Recommender Systems will place increased emphasis on methodological rigor in all future submissions, with particular priority given to work that provides comprehensive reproducibility materials and clearly documents the tuning procedures used for baseline models.
Jannach et al. studied this question.