What type of study is this?

This is a quantitative study.

October 8, 2025 · Open Access

Pre-trained Vision Transformer With Masked Autoencoder for Automated Diabetic Macular Edema Detection from Optical Coherence Tomography Images

Key Points

MAE_ViT achieved an AU-ROC of 0.999 and a sensitivity of 99.6%, significantly outperforming all comparison models (p<0.001).
MAE_ViT reached an accuracy of 98.5%, and its F1-score of 0.971 indicates an excellent balance between precision and recall.
Self-supervised MAE pre-training with Vision Transformer reduces reliance on labeled data, addressing dataset scarcity.
The study used the publicly available Kermany dataset (109,312 OCT images), framed as binary classification between 11,559 DME and 97,753 non-DME images.
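
The figures in the key points can be cross-checked with a little arithmetic. The sketch below is illustrative only: it computes the DME share of the dataset and the precision implied by the reported F1-score and sensitivity, using the identity F1 = 2PR / (P + R). The variable names are assumptions, and because the inputs are rounded reported values, the implied precision is approximate and applies only to the evaluation set the authors used.

```python
# Figures reported in the key points above.
n_dme, n_non_dme = 11_559, 97_753
f1, recall = 0.971, 0.996  # recall is the reported sensitivity

# Class balance of the full dataset.
total = n_dme + n_non_dme
dme_fraction = n_dme / total

# Rearranging F1 = 2PR / (P + R) for precision gives P = F1 * R / (2R - F1).
implied_precision = f1 * recall / (2 * recall - f1)

print(f"total images: {total}")
print(f"DME fraction: {dme_fraction:.3f}")
print(f"implied precision: {implied_precision:.3f}")
```

DME images make up roughly a tenth of the dataset, which is one reason the F1-score is reported alongside accuracy: on imbalanced data, accuracy alone can look strong even when the minority class is handled poorly.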

Abstract

Background: Diabetic macular edema (DME) affects 6.81% of diabetic patients worldwide and represents a leading cause of vision loss in working-age populations. While optical coherence tomography (OCT) provides high-resolution imaging for DME diagnosis, current automated detection systems rely on supervised learning approaches that require extensive labeled datasets. This study investigates whether self-supervised learning using Masked Autoencoder (MAE) pre-training combined with Vision Transformer (ViT) architecture can achieve superior diagnostic performance while reducing dependence on labeled data.

Methods: We developed MAE_ViT, a two-stage approach combining self-supervised MAE pre-training with supervised fine-tuning for DME detection. Using the publicly available Kermany dataset (109,312 OCT images), we performed binary classification between DME (11,559 images) and non-DME cases (97,753 images). During pre-training, our ViT-Base model learned to reconstruct OCT images with 75% of patches randomly masked for 1,000 epochs. Subsequently, we fine-tuned the encoder with a classification head for 30 epochs. Performance was compared against ResNet18, VGG19bn, EfficientNetV2, and standard ViT without pre-training. Evaluation metrics included accuracy, sensitivity, specificity, F1-score, and AU-ROC with 95% confidence intervals calculated using bootstrap methods.

Results: MAE_ViT achieved exceptional diagnostic performance with AU-ROC 0.999 (95% CI: 0.999-1.000), accuracy 98.5% (95% CI: 97.7-99.2%), sensitivity 99.6% (95% CI: 98.7-100%), and specificity 98.1% (95% CI: 97.2-99.1%). This significantly outperformed all comparison models (p<0.001), including VGG19bn (AU-ROC 0.997), standard ViT (AU-ROC 0.995), and EfficientNetV2 (AU-ROC 0.993). ResNet18 showed poor performance with AU-ROC 0.902 despite achieving perfect sensitivity, demonstrating severe overfitting. The F1-score of 0.971 for MAE_ViT indicated excellent balance between precision and recall.

Conclusions: Self-supervised MAE pre-training with Vision Transformer architecture demonstrates superior performance for automated DME detection from OCT images, surpassing traditional supervised CNN approaches. This method's ability to learn robust representations from unlabeled data addresses critical challenges in medical image analysis where annotated datasets are scarce and expensive. Our findings suggest MAE_ViT as a promising approach for developing scalable, accurate DME screening systems.
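
The pre-training step described in the Methods, where 75% of image patches are randomly masked and the model learns to reconstruct them, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the 224x224 input size, the 16-pixel patch size, and the helper names `patchify` and `random_patch_mask` are assumptions chosen for illustration, and a random array stands in for an OCT scan.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W) image into flattened, non-overlapping square patches."""
    h, w = image.shape
    gh, gw = h // patch_size, w // patch_size
    grid = image[:gh * patch_size, :gw * patch_size]
    grid = grid.reshape(gh, patch_size, gw, patch_size)
    return grid.transpose(0, 2, 1, 3).reshape(gh * gw, patch_size * patch_size)

def random_patch_mask(num_patches, mask_ratio=0.75, rng=None):
    """Return (keep_idx, mask); mask is True where a patch is hidden."""
    rng = np.random.default_rng() if rng is None else rng
    num_keep = int(num_patches * (1 - mask_ratio))
    keep_idx = np.sort(rng.permutation(num_patches)[:num_keep])
    mask = np.ones(num_patches, dtype=bool)
    mask[keep_idx] = False
    return keep_idx, mask

rng = np.random.default_rng(0)
image = rng.random((224, 224))   # stand-in for a grayscale OCT scan (assumed size)
patches = patchify(image, 16)    # 14 x 14 grid of patches (assumed patch size)
keep_idx, mask = random_patch_mask(len(patches), mask_ratio=0.75, rng=rng)
visible = patches[keep_idx]      # only these patches would reach the encoder

print(patches.shape, visible.shape, int(mask.sum()))
```

In a masked autoencoder, the encoder sees only the visible patches and a lightweight decoder is trained to reconstruct the hidden ones, so no diagnostic labels are needed at this stage. Labels enter only during the later fine-tuning of the encoder with a classification head.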
