Medical text classification is an important task in biomedical natural language processing, but prediction errors remain problematic in high-stakes settings where reliability matters in addition to accuracy. To address this challenge, this paper proposes a learned selective deferral framework for biomedical sentence classification that defers uncertain predictions to review under a constrained review budget. The framework combines a transformer-based classifier with uncertainty estimation, temperature scaling, and a learned deferral policy that predicts the likelihood of model error from multiple signals, including confidence, entropy, calibration-aware features, and Monte Carlo Dropout descriptors. Deferral decisions are applied under a fixed budget so that limited review capacity is spent on the cases most likely to be misclassified. Experiments on the PubMed 200k RCT dataset show that budget-constrained deferral reduces system-level risk. With PubMedBERT as the primary backbone, deferring the 20% of cases with the highest estimated risk reduces system risk from 0.1108 to 0.0360. Compared with a calibrated confidence-threshold baseline, the learned policy provides modest but generally favorable improvements, with statistical significance observed at the 20% budget. Additional experiments across PubMedBERT, BioBERT, and SciBERT suggest that the framework transfers across biomedical transformer backbones, that calibration improves the reliability of confidence estimates, and that learned policies outperform random deferral.
Albalawi et al. (Sat,) studied this question.
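To make the budget-constrained deferral step concrete, the following is a minimal sketch in Python and not the authors' implementation. It assumes that each example already has a scalar risk score (for instance, one minus the temperature-scaled maximum probability, or the output of a learned deferral policy), that the budget is a fraction of the evaluation set, and that deferred cases are resolved correctly on review, so that system risk counts only errors on retained predictions divided by the total number of examples. The function names `softmax_with_temperature`, `defer_under_budget`, and `system_risk` are hypothetical, as is the toy data.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Temperature-scaled softmax for a single example (T > 1 softens the output)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def defer_under_budget(risk_scores, budget):
    """Return the indices of the floor(budget * n) examples with the highest risk."""
    n_defer = math.floor(budget * len(risk_scores) + 1e-9)
    ranked = sorted(range(len(risk_scores)),
                    key=lambda i: risk_scores[i], reverse=True)
    return set(ranked[:n_defer])


def system_risk(predictions, labels, deferred):
    """Errors on retained examples divided by all examples.

    Assumption: deferred cases are resolved correctly by the reviewer.
    """
    errors = sum(
        1
        for i, (pred, gold) in enumerate(zip(predictions, labels))
        if i not in deferred and pred != gold
    )
    return errors / len(labels)


# Toy data (illustrative only, not from the paper).
labels      = [0, 1, 2, 1, 0, 2, 1, 0, 2, 1]
predictions = [0, 1, 2, 0, 0, 2, 1, 2, 2, 1]
risk_scores = [0.05, 0.10, 0.08, 0.60, 0.02, 0.30, 0.12, 0.55, 0.07, 0.20]

deferred = defer_under_budget(risk_scores, budget=0.2)
risk_before = system_risk(predictions, labels, deferred=set())
risk_after = system_risk(predictions, labels, deferred=deferred)
```

In this toy setting the two misclassified examples also carry the highest risk scores, so a 20% budget removes both errors. On real data the reduction depends on how well the risk score ranks errors above correct predictions, which is the property a learned policy is meant to improve over a plain confidence threshold.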