What question did this study set out to answer?

This research aims to evaluate deep learning methods for mammographic mass detection while considering breast density variations. It investigates how density affects detection performance across multiple datasets.

June 12, 2026 | Open Access

Density-Aware Multi-Dataset Evaluation of Deep Learning for Mammographic Mass Detection and BI-RADS Classification

Key Points

Introduced Mass-Bench, a unified multi-dataset benchmark incorporating CBIS-DDSM, INBREAST, VINDr-Mammo, and DMID with standardized density categorizations.
Evaluated YOLO-based detection models with a leakage-controlled protocol and density-stratified assessments.
Performed binary ACR density classification, binary BI-RADS discrimination, and multi-class classification, with attention to sensitivity to dataset variation.
Achieved AUC values of up to 0.943 for mass detection; performance decreased as ACR density increased.
Binary ACR classification reached F1-scores of up to 0.976, and binary BI-RADS discrimination reached accuracies of up to 0.93.
Multi-class classification showed increased sensitivity to dataset diversity, highlighting challenges in dense breast categories.
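
To make the idea of density-stratified assessment concrete, the following is a minimal sketch, not the study's evaluation code. It assumes each ground-truth mass is recorded as a pair of ACR density category and whether the detector found it; the function name `recall_by_density` and the toy records are hypothetical and carry no real results.

```python
from collections import defaultdict


def recall_by_density(records):
    """Compute detection recall separately for each ACR density category.

    `records` is an iterable of (acr_density, detected) pairs, one per
    ground-truth mass. This record layout is an assumption for illustration.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for density, detected in records:
        totals[density] += 1
        hits[density] += int(bool(detected))
    return {density: hits[density] / totals[density] for density in sorted(totals)}


# Made-up, deliberately imbalanced toy data: many fatty (A) cases, few dense (D) cases.
toy_records = [("A", True)] * 6 + [("D", False)] * 2

per_density = recall_by_density(toy_records)
pooled = sum(detected for _, detected in toy_records) / len(toy_records)

print(per_density)  # recall per density category
print(pooled)       # single pooled recall over all masses
```

In this toy case the pooled figure looks acceptable only because category A dominates the sample, which is the kind of masking that a per-category report is meant to expose.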

Abstract

Breast density strongly affects how clearly masses appear in mammography, and it can bias automatic localization systems when density distributions are uneven. Despite advances in deep learning-based detection, most studies report overall performance without explicitly accounting for density-related variability. Breast cancer diagnosis from mammography is also strongly influenced by dataset composition, annotation variability, and breast density distribution, factors that are rarely controlled in current AI evaluations. We introduce Mass-Bench, a clinically balanced and harmonized multi-dataset benchmark that integrates CBIS-DDSM, INBREAST, VINDr-Mammo, and DMID under a unified canonical schema with standardized ACR density and BI-RADS encoding. Using a leakage-controlled, distribution-aware evaluation protocol, we assessed density-stratified mass detection and lesion-centered region-of-interest (ROI) classification across datasets. YOLO-based detection models achieved area under the curve (AUC) values of up to 0.943; however, performance degraded systematically with increasing ACR density, revealing limitations that are often masked in imbalanced evaluations. By enforcing clinically representative density distributions, Mass-Bench provides a more reliable estimate of localization performance, which directly affects downstream clinical tasks. In this context, binary ACR classification achieved F1-scores of up to 0.976, and binary BI-RADS discrimination reached accuracies of up to 0.93. Multi-class classification remained more challenging, showing greater sensitivity to dataset heterogeneity and contextual information. These findings demonstrate that conventional evaluations may overestimate robustness, particularly in dense breast categories, and highlight the importance of density-aware benchmarking for developing reliable and clinically applicable AI systems in mammography.
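
The abstract does not say how leakage is controlled. One common approach, shown here purely as an assumption, is to split at the patient level so that all views of one patient fall in the same partition. The sketch below uses only the Python standard library; the field name `patient_id`, the helper names, and the 20% test fraction are hypothetical.

```python
import hashlib


def assign_split(patient_id, test_fraction=0.2):
    """Deterministically map a patient ID to 'train' or 'test'.

    Hashing the ID (rather than the image) guarantees that every image
    of the same patient receives the same split label.
    """
    digest = hashlib.sha256(patient_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"


def split_by_patient(images, test_fraction=0.2):
    """Partition image records so that no patient appears in both splits."""
    splits = {"train": [], "test": []}
    for image in images:
        label = assign_split(image["patient_id"], test_fraction)
        splits[label].append(image)
    return splits


# Made-up records: two views (CC and MLO) for each of 50 hypothetical patients.
images = [
    {"patient_id": f"P{i:03d}", "view": view}
    for i in range(50)
    for view in ("CC", "MLO")
]
splits = split_by_patient(images)
train_ids = {image["patient_id"] for image in splits["train"]}
test_ids = {image["patient_id"] for image in splits["test"]}
print(len(train_ids), len(test_ids), train_ids & test_ids)
```

A hash-based assignment keeps the split stable when new images are added. A library utility such as scikit-learn's `GroupShuffleSplit` would serve the same purpose when randomized, seed-controlled splits are preferred.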
