Screening mammography presents complementary craniocaudal and mediolateral oblique views whose joint interpretation hinges on view-invariance for the same breast and sensitivity to contralateral asymmetry. We propose a self-supervised, anatomy-aware contrastive learning framework with attention fusion (SCL-AF) that couples three components: contrastive pretraining with cross-view positives and contralateral hard negatives, a lesion-guided tokenization that distills high-resolution images into a compact set of clinically meaningful tokens, and a geometry-biased, bidirectional attention fusion that reconciles evidence across views. Supervised fine-tuning uses a class-imbalance-aware objective together with view-consistency and contralateral-symmetry regularizers. Evaluated on the public CBIS-DDSM dataset, SCL-AF achieves ROC-AUC 0.942, PR-AUC 0.692, and sensitivity (SEN) 0.631, outperforming strong baselines. Gains concentrate in the clinically relevant high-specificity regime, with particularly large improvements on calcification-dominant breasts. Ablations show that removing cross-view positives or contralateral negatives substantially degrades high-specificity sensitivity and calibration, that lesion-guided tokens with diversity priors outperform global or randomly sampled tokens, and that two layers of bidirectional attention offer the best accuracy-latency trade-off. These results suggest that encoding mammographic anatomy directly into representation learning and fusion yields significant improvements at operating points suitable for screening triage.
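To make the pretraining idea concrete, the pairing of cross-view positives with contralateral hard negatives can be sketched with a standard InfoNCE-style loss. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the loss form, so InfoNCE is swapped in here as a common contrastive objective, and the function names, the temperature of 0.1, and the toy three-dimensional embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for a single anchor embedding (assumed loss form).

    anchor    -- embedding of one view, e.g. left-breast CC
    positive  -- embedding of the other view of the SAME breast (left MLO)
    negatives -- embeddings of the contralateral breast (right CC, right MLO)
    """
    pos_logit = cosine(anchor, positive) / temperature
    logits = [pos_logit] + [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over the positive and all negatives.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denom - pos_logit


# Hypothetical toy embeddings: the two left-breast views point in a similar
# direction, while the right-breast views point elsewhere.
left_cc = [1.0, 0.0, 0.2]
left_mlo = [0.9, 0.1, 0.3]
right_cc = [0.1, 1.0, 0.0]
right_mlo = [0.0, 0.9, 0.2]

# Anatomy-aware pairing: same-breast view is the positive.
loss_aligned = info_nce(left_cc, left_mlo, [right_cc, right_mlo])
# Wrong pairing: the contralateral view is treated as the positive.
loss_swapped = info_nce(left_cc, right_cc, [left_mlo, right_mlo])

print(f"aligned pairing loss: {loss_aligned:.4f}")
print(f"swapped pairing loss: {loss_swapped:.4f}")
```

With these toy vectors the anatomy-aware pairing gives a much lower loss than the swapped pairing, which is the behaviour the objective is meant to reward: embeddings of the same breast are pulled together across views, while contralateral embeddings are pushed apart.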
Lyu et al. (Thu,) studied this question.