What question did this study set out to answer?

This research aims to assess how distribution shifts in medical imaging affect artificial intelligence performance.

February 2, 2026 · Open Access

Complexity-Driven Adversarial Validation for Corrupted Medical Imaging Data

Key Points

Developed a framework for evaluating distribution shifts in medical images.
Used the Cumulative Spectral Gradient score to measure classification complexity.
Simulated motion blur, additive noise, brightness and contrast variation, and sharpness variation, each at three severity levels.
Analyzed twelve 2D medical imaging benchmarks from the MedMNIST collection.
The CSG score remains generally stable under noise and focus distortions.
The CSG score is highly sensitive to variations in brightness and contrast.
Comparison with Cleanlab’s Non-IID score on RetinaMNIST highlights correlation and class-wise discrepancies.

Abstract

Distribution shifts commonly arise in real-world machine learning scenarios in which the fundamental assumption that training and test data are independent and identically distributed is violated. In medical data, such shifts often occur during data acquisition and pose a significant challenge to the robustness and reliability of artificial intelligence systems in clinical practice. Quantifying these shifts without training a model remains a key open problem. This paper proposes a methodological framework for evaluating the impact of such shifts on medical image datasets under artificial transformations that simulate acquisition variations, using the Cumulative Spectral Gradient (CSG) score as a measure of the multiclass classification complexity induced by distributional changes. Building on prior work, the approach is extended to twelve 2D medical imaging benchmarks from the MedMNIST collection, covering both binary and multiclass tasks as well as grayscale and RGB modalities. We evaluate the metric by analyzing its robustness to clinically inspired distribution shifts, systematically simulated through motion blur, additive noise, brightness and contrast variation, and sharpness variation, each applied at three severity levels. The result is a large-scale benchmark that enables a detailed analysis of how dataset characteristics, transformation types, and distortion severity influence distribution shifts. The findings show that while the metric remains generally stable under noise and focus distortions, it is highly sensitive to variations in brightness and contrast. In addition, the proposed methodology is compared against Cleanlab’s widely used Non-IID score on the RetinaMNIST dataset using a pre-trained ResNet-50 model, including both a class-wise analysis and a correlation assessment between the two metrics. Finally, interpretability is incorporated through class activation map analysis on BloodMNIST and its corrupted variants to support and contextualize the quantitative findings.
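To make the idea of corruptions "applied at three severity levels" concrete, the following is a minimal sketch in Python with NumPy. It is not the authors' implementation: the function name `corrupt`, the `SEVERITY` table, and every numeric strength value are assumptions chosen for illustration, and only three of the corruption families named in the abstract (brightness, contrast, additive noise) are covered.

```python
import numpy as np

# Hypothetical severity settings (assumptions, not the paper's parameters).
SEVERITY = {
    "brightness": [0.1, 0.2, 0.3],  # additive offset on a [0, 1] intensity scale
    "contrast": [0.8, 0.6, 0.4],    # scale factor applied around the image mean
    "noise": [0.02, 0.05, 0.10],    # standard deviation of additive Gaussian noise
}


def corrupt(image, kind, level, rng=None):
    """Apply one corruption at severity level 1, 2, or 3 to a float image in [0, 1]."""
    if kind not in SEVERITY:
        raise ValueError(f"unknown corruption: {kind}")
    if level not in (1, 2, 3):
        raise ValueError("level must be 1, 2, or 3")
    strength = SEVERITY[kind][level - 1]

    if kind == "brightness":
        out = image + strength
    elif kind == "contrast":
        mean = image.mean()
        out = (image - mean) * strength + mean
    else:  # additive Gaussian noise
        if rng is None:
            rng = np.random.default_rng(0)
        out = image + rng.normal(0.0, strength, size=image.shape)

    # Keep intensities in the valid range after the transformation.
    return np.clip(out, 0.0, 1.0)


# Build every (corruption, severity) variant of one synthetic 28x28 image.
image = np.random.default_rng(42).random((28, 28))
variants = {
    (kind, level): corrupt(image, kind, level)
    for kind in SEVERITY
    for level in (1, 2, 3)
}
```

In a benchmark of the kind the abstract describes, each corrupted variant of a dataset would then be scored with the complexity metric and compared against the score of the clean data; that scoring step is not shown here.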
