What question did this study set out to answer?

The study asks whether a learned selective deferral framework can make medical text classification more reliable by deferring the predictions most likely to be wrong, under a constrained review budget.

June 17, 2026 | Open Access

Learning Selective Deferral Policies for Reliable Medical Text Classification

Key Points

Developed a learned selective deferral framework that combines a transformer-based classifier with uncertainty estimation.
Applied temperature scaling to calibrate confidence estimates and used Monte Carlo Dropout descriptors as inputs to the deferral policy.
Conducted experiments on the PubMed 200k RCT dataset using budget-constrained deferral strategies.
With PubMedBERT as the primary backbone, deferring 20% of the highest-risk cases reduced system risk from 0.1108 to 0.0360.
The learned policy showed modest but generally favorable improvements over a calibrated confidence-threshold baseline, with statistical significance observed at the 20% budget.
Additional experiments suggest that the framework transfers across biomedical transformer backbones (PubMedBERT, BioBERT, and SciBERT).
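
As an illustration of two of the signals named above, the following minimal sketch shows how temperature scaling changes softmax confidence and predictive entropy. It is not the study's implementation: the function names, the example logits, and the temperature value of 2.0 are all hypothetical, and in practice the temperature would be fitted on held-out data.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max before exponentiating to avoid overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def confidence_and_entropy(probs):
    """Return the top-class probability and the Shannon entropy (in nats)."""
    confidence = max(probs)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return confidence, entropy

# Hypothetical logits for one sentence; the values are made up for illustration.
logits = [4.0, 1.0, 0.5, 0.2, 0.1]

raw_probs = softmax_with_temperature(logits, temperature=1.0)
scaled_probs = softmax_with_temperature(logits, temperature=2.0)  # assumed temperature

raw_conf, raw_entropy = confidence_and_entropy(raw_probs)
scaled_conf, scaled_entropy = confidence_and_entropy(scaled_probs)
```

A temperature above 1 flattens the distribution, so the top-class confidence drops and the entropy rises while the predicted class stays the same.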

Abstract

Medical text classification is an important task in biomedical natural language processing, but prediction errors remain problematic in high-stakes settings where reliability matters in addition to accuracy. To address this challenge, this paper proposes a learned selective deferral framework for biomedical sentence classification that allows uncertain predictions to be deferred under constrained review budgets. The framework combines a transformer-based classifier with uncertainty estimation, temperature scaling, and a learned deferral policy that predicts the likelihood of model error from multiple signals, including confidence, entropy, calibration-aware features, and Monte Carlo Dropout descriptors. Deferral decisions are applied under fixed budgets to improve the use of limited review capacity. Experiments on the PubMed 200k RCT dataset show that budget-constrained deferral reduces system-level risk. Using PubMedBERT as the primary backbone, deferring 20% of the highest-risk cases reduces system risk from 0.1108 to 0.0360. Compared with a calibrated confidence-threshold baseline, the learned policy provides modest but generally favorable improvements, with statistical significance observed at the 20% budget. Additional experiments across PubMedBERT, BioBERT, and SciBERT suggest that the framework transfers across biomedical transformer backbones, while calibration improves the reliability of confidence estimates and learned policies outperform random deferral.
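
The budget-constrained deferral step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's code: the function names and toy data are hypothetical, the risk scores stand in for the learned policy's predicted error likelihoods, and system risk is assumed here to be the fraction of all cases that remain misclassified after deferral, with deferred cases treated as resolved correctly by a reviewer.

```python
def defer_under_budget(risk_scores, budget):
    """Return the indices of the highest-risk cases, limited to a fraction `budget`."""
    if not 0.0 <= budget <= 1.0:
        raise ValueError("budget must be between 0 and 1")
    n = len(risk_scores)
    k = round(budget * n)  # number of cases the review budget allows
    ranked = sorted(range(n), key=lambda i: risk_scores[i], reverse=True)
    return set(ranked[:k])

def system_risk(is_error, deferred):
    """Fraction of all cases that are still wrong after deferral.

    Assumption: a deferred case is resolved correctly by the reviewer,
    so only errors on non-deferred cases count toward the risk.
    """
    remaining = sum(1 for i, err in enumerate(is_error) if err and i not in deferred)
    return remaining / len(is_error)

# Toy data: predicted error likelihoods and whether the classifier was actually wrong.
risk_scores = [0.05, 0.90, 0.10, 0.60, 0.20, 0.02, 0.15, 0.08, 0.30, 0.01]
is_error = [False, True, False, False, True, False, False, False, False, False]

deferred = defer_under_budget(risk_scores, budget=0.2)
risk_without_deferral = system_risk(is_error, set())
risk_with_deferral = system_risk(is_error, deferred)
```

In this toy example the 20% budget covers two of the ten cases. One of the two deferred cases was a real error, so the remaining risk is halved; the other deferral spent review capacity on a correct prediction, which is the inefficiency a better risk score aims to reduce.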
