What does this research mean for the field?

Ensemble deep learning models outperform traditional ResNet baselines in medical image classification tasks, achieving state-of-the-art accuracy across diverse modalities and datasets.

What question did this study set out to answer?

The aim is to evaluate the effectiveness of ensemble deep learning architectures for medical image classification.

March 2, 2026 · Open Access

Ensemble Deep Learning for Medical Image Classification Across Diverse Modalities: A Multi-Architecture Evaluation With Uncertainty Quantification on MedMNIST

Key Points

The aim is to evaluate the effectiveness of ensemble deep learning architectures for medical image classification.
Ensemble learning was evaluated across four datasets: BloodMNIST, BreastMNIST, DermaMNIST, and OrganAMNIST.
Implemented four architectures: ConvNeXt-Base, Vision Transformer, EfficientNetV2-M, and InceptionResNetV2.
Conducted calibration analysis and entropy-based uncertainty quantification.
Modern architectures outperformed ResNet baselines in all tasks.
Rigorous Stacking achieved state-of-the-art accuracy on BloodMNIST (99.33%) and BreastMNIST (93.59%).
Simple Soft Voting excelled on DermaMNIST with a 15.2% accuracy improvement over previous benchmarks.
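
As a rough illustration of the Soft Voting idea named above, the sketch below averages the class-probability vectors of several models and picks the class with the highest mean. It assumes each model emits a softmax vector over the same classes; the function name `soft_vote` and the probability values are hypothetical and are not taken from the paper.

```python
def soft_vote(prob_vectors):
    """Average per-model class probabilities; return (predicted class index, averaged probabilities)."""
    n_models = len(prob_vectors)
    n_classes = len(prob_vectors[0])
    # Mean probability for each class across all models
    avg = [sum(p[c] for p in prob_vectors) / n_models for c in range(n_classes)]
    # Predicted class is the one with the highest mean probability
    predicted = max(range(n_classes), key=lambda c: avg[c])
    return predicted, avg


# Hypothetical softmax outputs from four models for one image, three classes
model_probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.5, 0.4, 0.1],
    [0.5, 0.3, 0.2],
]
label, avg = soft_vote(model_probs)
print(label, avg)
```

Stacking would replace the fixed average with a meta-learner fitted on held-out validation predictions, which is why it depends on having enough validation data.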

Abstract

Medical image classification still struggles with three major issues: limited training data, severe class imbalance, and fuzzy decision boundaries between disease categories. Deep learning models now perform as well as human experts in many tasks, but there’s been surprisingly little work on how best to combine different architectures. In this study, I evaluate ensemble learning across four datasets from the MedMNIST v2 collection (BloodMNIST, BreastMNIST, DermaMNIST, and OrganAMNIST), each representing a different clinical imaging challenge. I built an ensemble using four modern architectures: ConvNeXt-Base, Vision Transformer (ViT-Base), EfficientNetV2-M, and InceptionResNetV2. The results show that modern backbones consistently beat the official ResNet baselines on every task. More interestingly, I discovered what I call “Validation Starvation”, a critical threshold that determines which ensemble method works best. When there’s enough validation data, Rigorous Stacking (a meta-learning approach) wins by learning to correct systematic errors between models. This delivered state-of-the-art accuracy on BloodMNIST (99.33%) and BreastMNIST (93.59%, which is +3.5% over baseline). But when classes are extremely imbalanced or rare, simple Soft Voting actually works better: it achieved state-of-the-art accuracy on DermaMNIST (91.97%), a 15.2% jump over previous benchmarks. I also validated that these ensembles are safe for clinical use through calibration analysis and entropy-based uncertainty quantification, showing they can reliably flag ambiguous cases for human review. These findings give us a practical, reproducible strategy for deploying high-performance diagnostic AI in resource-limited medical settings.
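
The abstract does not spell out how entropy is used to flag ambiguous cases, so the following is only a minimal sketch of one common approach: compute the Shannon entropy of the ensemble's averaged probability vector and refer the case to a human when it exceeds a chosen fraction of the maximum possible entropy. The function names, the `threshold_fraction` value of 0.5, and the example probabilities are all assumptions made for illustration.

```python
import math


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector."""
    # Zero-probability classes contribute nothing and are skipped to avoid log(0)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def flag_for_review(probs, threshold_fraction=0.5):
    """Return True when entropy exceeds a fraction of the maximum entropy log(K)."""
    max_entropy = math.log(len(probs))  # entropy of a uniform distribution over K classes
    return predictive_entropy(probs) > threshold_fraction * max_entropy


# Hypothetical ensemble outputs for two images, three classes each
confident = [0.97, 0.02, 0.01]
ambiguous = [0.40, 0.35, 0.25]

print(flag_for_review(confident), flag_for_review(ambiguous))
```

In practice the threshold would be tuned on validation data, trading off how many cases are referred against how many errors slip through.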

Cite This Study

Hrushikesh Sanap studied this question.

synapsesocial.com/papers/69a52e45f1e85e5c73bf1c3f https://doi.org/10.5281/zenodo.18813204
