Abstract
Accurate and reproducible electroencephalography (EEG)-based classification of dementia remains a key challenge in computational neurodiagnostics. The open-access AHEPA dataset has become the most commonly used benchmark for Alzheimer’s disease (AD) and frontotemporal dementia (FTD) classification, yet reported results vary widely due to methodological inconsistencies. This study presents the first systematic and quantitative benchmark review of all published machine learning approaches applied to the AHEPA dataset. Forty-six studies were reviewed and stratified into three validity tiers according to their evaluation rigor, with Validity 1 representing the highest methodological rigor and Validity 3 the lowest: (1) subject-level validation (e.g., Leave-One-Subject-Out cross-validation, LOSO-CV), (2) subject-level train/test splits, and (3) epoch-level k-fold cross-validation. Performance metrics were normalized across classification problems. The analysis revealed that methodological rigor is inversely correlated with reported accuracy: for AD versus cognitively normal controls, mean accuracy decreased from 90.81% across all studies to 82.11% in Validity-1 studies; for FTD versus controls, accuracy dropped from 86.53% to 75.18%. Linear regression analyses demonstrated that weaker validation protocols were associated with systematic increases of 7–10 percentage points in reported accuracy, explaining more than half of the observed performance variance. Deep and hybrid models reported the highest nominal accuracies, but under proper validation, traditional algorithms performed comparably, indicating that data leakage often drives apparent improvements. The review also highlights the lack of cross-configuration generalization and the urgent need for adaptive, montage-independent methodologies.
Overall, this benchmark establishes the first reproducible reference framework for EEG-based dementia classification on the AHEPA dataset, providing quantitative baselines and validity criteria against which all future studies should be evaluated.
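The difference between the highest and lowest validity tiers can be illustrated with a small sketch. It is not taken from any of the reviewed studies: the toy layout (6 subjects with 20 epochs each) and the helper names `loso_splits`, `epoch_kfold_splits`, and `leaked_subjects` are assumptions made purely for illustration. The sketch shows that holding out whole subjects keeps every subject on one side of the split, whereas a shuffled k-fold over epochs places epochs from the same subject in both the training and test sets, which is the kind of leakage the abstract associates with inflated accuracy.

```python
import random


def loso_splits(subject_ids):
    """Yield (train_idx, test_idx), holding out all epochs of one subject at a time."""
    for held_out in sorted(set(subject_ids)):
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        yield train, test


def epoch_kfold_splits(n_epochs, k, seed=0):
    """Yield (train_idx, test_idx) from a shuffled k-fold over individual epochs."""
    idx = list(range(n_epochs))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    for f in range(k):
        test = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, test


def leaked_subjects(subject_ids, train, test):
    """Return the subjects that contribute epochs to both train and test."""
    return {subject_ids[i] for i in train} & {subject_ids[i] for i in test}


# Hypothetical toy layout (an assumption): 6 subjects, 20 epochs each.
subject_ids = [s for s in range(6) for _ in range(20)]

loso_leaks = [
    leaked_subjects(subject_ids, tr, te) for tr, te in loso_splits(subject_ids)
]
kfold_leaks = [
    leaked_subjects(subject_ids, tr, te)
    for tr, te in epoch_kfold_splits(len(subject_ids), k=5)
]

print("LOSO folds with leaked subjects:", sum(bool(x) for x in loso_leaks))
print("Epoch-level k-fold folds with leaked subjects:", sum(bool(x) for x in kfold_leaks))
```

In this toy setting every epoch-level fold shares subjects with its training set, while no LOSO fold does. In practice, a library grouping splitter (for example a leave-one-group-out scheme keyed on subject ID) would play the role of `loso_splits`.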
Miltiadous et al. (Mon,) studied this question.