Multi-label chest X-ray classification faces three critical challenges: (i) inadequate modeling of inter-pathology dependencies despite clinical co-occurrence patterns, (ii) severe class imbalance (11.2–47.6%) causing minority-class underperformance, and (iii) limited interpretability hindering clinical trust. Existing methods address these challenges independently; no current framework jointly models pathology dependencies, imbalance-aware training, and interpretable attention. We propose a Hierarchical Pathology-aware Vision Transformer (HP-ViT), which jointly addresses these limitations in a unified architecture by employing: Hierarchical Pathology-Aware Attention (HPAA) for explicit disease co-occurrence modeling through two-stage token refinement, Multi-Scale Feature Aggregation (MSFA) for detecting localized and diffuse abnormalities across four hierarchical scales, and Balanced Adaptive Focal Loss (BAFL) implementing curriculum-scheduled focal modulation that progressively transitions from class-balanced to difficulty-focused training. Evaluated on COVIDx, ChestX-ray14, and BIMCV-COVID19+ (N=36,904 images), HP-ViT achieves macro-F1 of 0.924, exact match ratio of 0.842, and PPV of 0.925, representing 1.76%, 1.32%, and 1.5% improvements over state-of-the-art, with statistical significance (p<0.001, McNemar’s test on per-sample exact-match correctness). HP-ViT requires only 12.6M parameters (85% reduction vs. ViT-B/16) with 29.8 ms inference time, enabling real-time clinical deployment. Interpretability evaluation yields 83.7% mean SSIM between attention maps and radiologist annotations, confirming pathology-aligned localization.
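To make the reported evaluation quantities concrete, the following is a minimal sketch of how exact match ratio, macro-F1, and a McNemar test on per-sample exact-match correctness can be computed for multi-label predictions. It is not the evaluation code used for HP-ViT: the function names and the toy labels are illustrative assumptions, and because the text does not state which McNemar variant was used, the sketch assumes the two-sided exact (binomial) form.

```python
from math import comb


def exact_match_ratio(y_true, y_pred):
    """Fraction of samples whose entire label vector is predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 (a label with no TP, FP, or FN scores 0)."""
    n_labels = len(y_true[0])
    scores = []
    for j in range(n_labels):
        tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_labels


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar p-value on per-sample exact-match correctness.

    Only discordant samples matter: those where exactly one of the two
    models reproduces the full label vector.
    """
    a_only = sum(a == t and b != t for t, a, b in zip(y_true, pred_a, pred_b))
    b_only = sum(a != t and b == t for t, a, b in zip(y_true, pred_a, pred_b))
    n = a_only + b_only
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(a_only, b_only)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    p_value = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, p_value)


# Toy data (hypothetical): 4 samples, 3 pathology labels each.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
model_a = [[1, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
model_b = [[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0]]

emr_a = exact_match_ratio(y_true, model_a)
f1_a = macro_f1(y_true, model_a)
p_ab = mcnemar_exact(y_true, model_a, model_b)
```

Exact match ratio is the strictest of these measures, since a single wrong label makes the whole sample count as incorrect; that is why it pairs naturally with McNemar's test, which needs one binary correct/incorrect outcome per sample for each model.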
Khan et al. (Mon,) studied this question.