Deep learning classifiers for fine-grained visual recognition provide no per-prediction reliability estimate, yet selective prediction methods that allow classifiers to abstain remain evaluated only on standard benchmarks, untested in domains where visual degradation drives failure patterns. We present AquaSelect, a post-hoc selective prediction framework that learns when to abstain rather than risk a misclassification. AquaSelect trains a lightweight binary selection head of 213K parameters on a frozen backbone to predict classifier correctness, fusing this with temperature-calibrated confidence and image quality features via interpretable logistic regression. Because the backbone remains frozen, the selection head can be retrained for new environments without touching the base classifier. Evaluated on two underwater species datasets, AQUA20 with 8,171 images across 20 classes and Sea Animals with 13,711 images across 23 classes, using ConvNeXt-Tiny and DeiT-Small backbones across three seeds, AquaSelect outperforms Softmax Response and Monte Carlo Dropout on all six seed-backbone evaluations on AQUA20 and improves mean coverage metrics on Sea Animals. At 80% coverage, accuracy rises from 87.3% to 94.8% and Macro F1 from 81.5% to 88.6%, surpassing the benchmark full-data accuracy of 90.69% despite using 15% less training data. We also report that RAPS conformal prediction sets averaging 3.7 to 5.0 classes are impractical for single-label classification, and fusing set sizes with learned scores degrades selection quality. Ablation identifies the learned selection head as the dominant component. The framework runs at 149 FPS, 2.8 times faster than Deep Ensembles, and applies to any classification system where errors carry asymmetric costs.

• First selective prediction study for visually degraded classification.
• Frozen backbone design allows redeployment by retraining 213K parameters.
• Learned score fusion improves accuracy from 87.3% to 94.8% at 80% coverage.
• Fusing conformal set sizes with learned scores degrades selection quality.
• Cross-dataset validation on two marine benchmarks confirms generalizability.
Daga et al. (Fri,) studied this question.
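
The abstract reports accuracy at a fixed coverage level (for example, accuracy at 80% coverage). As a minimal sketch of how such a figure is typically computed, and not the authors' implementation, the snippet below ranks predictions by a selection score and measures accuracy on the retained fraction. The function name `accuracy_at_coverage` and the toy scores and labels are hypothetical and chosen only for illustration; they are not data from the paper.

```python
import numpy as np


def accuracy_at_coverage(scores, correct, coverage):
    """Accuracy on the `coverage` fraction of samples with the highest scores.

    scores   -- per-sample selection scores (higher means "more likely correct")
    correct  -- per-sample 0/1 flags: was the classifier's prediction right?
    coverage -- fraction of samples to keep, in (0, 1]
    """
    scores = np.asarray(scores, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    # Number of predictions the system answers; the rest are abstentions.
    n_keep = max(1, int(round(coverage * len(scores))))
    # Rank samples from highest to lowest score (stable for ties).
    order = np.argsort(-scores, kind="stable")
    kept = order[:n_keep]
    return float(correct[kept].mean())


# Hypothetical toy data: ten predictions with made-up scores.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
correct = [1, 1, 1, 1, 0, 1, 1, 0, 1, 0]

full_accuracy = accuracy_at_coverage(scores, correct, 1.0)
selective_accuracy = accuracy_at_coverage(scores, correct, 0.8)
print(f"accuracy at 100% coverage: {full_accuracy:.3f}")
print(f"accuracy at  80% coverage: {selective_accuracy:.3f}")
```

In this toy case, abstaining on the two lowest-scoring samples removes one error and one correct prediction, so accuracy on the retained set rises. The size of the gain depends entirely on how well the selection score separates correct from incorrect predictions.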