Medical image classification still struggles with three major issues: limited training data, severe class imbalance, and fuzzy decision boundaries between disease categories. Deep learning models now perform as well as human experts on many tasks, but there has been surprisingly little work on how best to combine different architectures. In this study, I evaluate ensemble learning across four datasets from the MedMNIST v2 collection (BloodMNIST, BreastMNIST, DermaMNIST, and OrganAMNIST), each representing a different clinical imaging challenge. I built an ensemble from four modern architectures: ConvNeXt-Base, Vision Transformer (ViT-Base), EfficientNetV2-M, and InceptionResNetV2. The results show that modern backbones consistently beat the official ResNet baselines on every task. More interestingly, I identified what I call “Validation Starvation”, a critical threshold in available validation data that determines which ensemble method works best. When there is enough validation data, Rigorous Stacking (a meta-learning approach) wins by learning to correct systematic errors across models. This delivered state-of-the-art accuracy on BloodMNIST (99.33%) and BreastMNIST (93.59%, +3.5% over baseline). But when classes are extremely imbalanced or rare, simple Soft Voting works better: it achieved state-of-the-art accuracy on DermaMNIST (91.97%), a 15.2% jump over previous benchmarks. I also checked that these ensembles are safe for clinical use, using calibration analysis and entropy-based uncertainty quantification to show that they can reliably flag ambiguous cases for human review. These findings offer a practical, reproducible strategy for deploying high-performance diagnostic AI in resource-limited medical settings.
Hrushikesh Sanap (Sat,) studied this question.
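To make two of the ideas in the abstract concrete, here is a minimal sketch of Soft Voting followed by entropy-based referral for a single image. It is an illustration under stated assumptions, not the study's implementation: the function names, the example probabilities, and the entropy threshold of 0.8 nats are all hypothetical, and Rigorous Stacking (which would replace the plain average with a trained meta-learner) is not shown.

```python
import math


def soft_vote(model_probs):
    """Average per-class probabilities across models for one sample.

    model_probs: list of probability vectors, one per model, same length.
    """
    n_models = len(model_probs)
    n_classes = len(model_probs[0])
    return [sum(p[c] for p in model_probs) / n_models for c in range(n_classes)]


def predictive_entropy(probs):
    """Shannon entropy in nats; terms with zero probability contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def classify_with_referral(model_probs, entropy_threshold):
    """Return (predicted class index, entropy, refer-to-human flag)."""
    avg = soft_vote(model_probs)
    label = max(range(len(avg)), key=avg.__getitem__)
    entropy = predictive_entropy(avg)
    return label, entropy, entropy > entropy_threshold


# Hypothetical outputs of four models on a 3-class problem.
confident_case = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.85, 0.10, 0.05],
    [0.95, 0.03, 0.02],
]
ambiguous_case = [
    [0.50, 0.40, 0.10],
    [0.30, 0.60, 0.10],
    [0.45, 0.45, 0.10],
    [0.40, 0.50, 0.10],
]

THRESHOLD = 0.8  # assumed value; in practice it would be tuned on held-out data

for name, case in [("confident", confident_case), ("ambiguous", ambiguous_case)]:
    label, entropy, refer = classify_with_referral(case, THRESHOLD)
    print(f"{name}: class={label} entropy={entropy:.3f} refer={refer}")
```

Soft Voting has no trainable parameters, which may be why it holds up when validation data are too scarce to fit a meta-learner. The entropy threshold is the one quantity in this sketch that does need held-out data to choose.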