Breast density strongly affects how clearly masses appear in mammograms, and an uneven density distribution can bias automatic localization systems. Despite advances in deep learning-based detection, most studies report only overall performance without explicitly accounting for density-related variability. More broadly, mammography-based breast cancer diagnosis is strongly influenced by dataset composition, annotation variability, and breast density distribution, factors that are rarely controlled in current AI evaluations. We introduce Mass-Bench, a clinically balanced and harmonized multi-dataset benchmark that integrates CBIS-DDSM, INbreast, VinDr-Mammo, and DMID under a unified canonical schema with standardized ACR density and BI-RADS encoding. Using a leakage-controlled, distribution-aware evaluation protocol, we assessed density-stratified mass detection and classification of lesion-centered regions of interest (ROIs) across datasets. YOLO-based detection models achieved area under the curve (AUC) values of up to 0.943; however, performance degraded systematically with increasing ACR density, revealing limitations that imbalanced evaluations often mask. By enforcing clinically representative density distributions, Mass-Bench provides a more reliable estimate of localization performance, which directly affects downstream clinical tasks. In the classification tasks, binary ACR classification achieved F1-scores of up to 0.976, and binary BI-RADS discrimination reached accuracies of up to 0.93. Multi-class classification remained more challenging and was more sensitive to dataset heterogeneity and contextual information. These findings show that conventional evaluations may overestimate robustness, particularly in dense breast categories, and they highlight the importance of density-aware benchmarking for developing reliable, clinically applicable AI systems in mammography.
Zepeda-Reyes et al. studied this question.
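To make the idea of a leakage-controlled, density-stratified evaluation concrete, the following is a minimal sketch and not the Mass-Bench protocol itself. It assumes each annotated mass is a record with hypothetical fields `patient_id`, `acr_density` (A to D), and `detected`; the toy records are invented for illustration and do not come from the benchmark. Splitting by patient keeps all images of one patient on the same side of the split, and reporting recall per density class exposes degradation that a pooled figure can hide.

```python
import random
from collections import defaultdict


def split_by_patient(records, test_fraction=0.2, seed=0):
    """Assign whole patients to train or test so no patient appears in both."""
    patients = sorted({r["patient_id"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))
    test_ids = set(patients[:n_test])
    train = [r for r in records if r["patient_id"] not in test_ids]
    test = [r for r in records if r["patient_id"] in test_ids]
    return train, test


def recall_by_density(records):
    """Detection recall (detected masses / all masses) per ACR density class."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        totals[r["acr_density"]] += 1
        hits[r["acr_density"]] += int(r["detected"])
    return {d: hits[d] / totals[d] for d in sorted(totals)}


def macro_recall(per_density):
    """Unweighted mean over density classes, so rare dense classes count equally."""
    return sum(per_density.values()) / len(per_density)


# Toy, invented records (hypothetical schema; not benchmark data).
toy = [
    {"patient_id": "p1", "acr_density": "A", "detected": True},
    {"patient_id": "p1", "acr_density": "A", "detected": True},
    {"patient_id": "p2", "acr_density": "B", "detected": True},
    {"patient_id": "p3", "acr_density": "C", "detected": False},
    {"patient_id": "p3", "acr_density": "C", "detected": True},
    {"patient_id": "p4", "acr_density": "D", "detected": False},
]

per_density = recall_by_density(toy)
pooled = sum(r["detected"] for r in toy) / len(toy)
print(per_density, macro_recall(per_density), pooled)
```

In this toy case the pooled recall is higher than the density-balanced (macro) recall because the easier, low-density classes dominate the sample, which is the kind of masking effect the abstract describes.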