Background: Diabetic macular edema (DME) affects 6.81% of diabetic patients worldwide and is a leading cause of vision loss in working-age populations. Optical coherence tomography (OCT) provides high-resolution imaging for DME diagnosis, but current automated detection systems rely on supervised learning approaches that require extensive labeled datasets. This study investigates whether self-supervised learning, using Masked Autoencoder (MAE) pre-training combined with a Vision Transformer (ViT) architecture, can achieve superior diagnostic performance while reducing dependence on labeled data.

Methods: We developed MAEViT, a two-stage approach that combines self-supervised MAE pre-training with supervised fine-tuning for DME detection. Using the publicly available Kermany dataset (109,312 OCT images), we performed binary classification between DME (11,559 images) and non-DME cases (97,753 images). During pre-training, our ViT-Base model was trained for 1,000 epochs to reconstruct OCT images in which 75% of patches were randomly masked. We then fine-tuned the encoder with a classification head for 30 epochs. Performance was compared against ResNet18, VGG19bn, EfficientNetV2, and a standard ViT without pre-training. Evaluation metrics included accuracy, sensitivity, specificity, F1-score, and AU-ROC, with 95% confidence intervals calculated using bootstrap methods.

Results: MAEViT achieved an AU-ROC of 0.999 (95% CI: 0.999-1.000), accuracy of 98.5% (95% CI: 97.7-99.2%), sensitivity of 99.6% (95% CI: 98.7-100%), and specificity of 98.1% (95% CI: 97.2-99.1%). It significantly outperformed all comparison models (p<0.001), including VGG19bn (AU-ROC 0.997), the standard ViT (AU-ROC 0.995), and EfficientNetV2 (AU-ROC 0.993). ResNet18 performed poorly, with an AU-ROC of 0.902 despite achieving perfect sensitivity, indicating severe overfitting. The F1-score of 0.971 for MAEViT indicated an excellent balance between precision and recall.

Conclusions: Self-supervised MAE pre-training with a Vision Transformer architecture demonstrates superior performance for automated DME detection from OCT images, surpassing traditional supervised CNN approaches. The method's ability to learn robust representations from unlabeled data addresses a critical challenge in medical image analysis, where annotated datasets are scarce and expensive. Our findings suggest that MAEViT is a promising approach for developing scalable, accurate DME screening systems.
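
The abstract reports sensitivity, specificity, accuracy, and F1-score with 95% confidence intervals from "bootstrap methods" but does not say which bootstrap variant was used. The following is a minimal sketch of how such intervals could be obtained, assuming a percentile bootstrap over per-image predictions. The function names, the resample count, and the toy labels are all hypothetical and are not the study's code or data.

```python
import random


def metrics(y_true, y_pred):
    """Binary classification metrics from 0/1 labels (1 = DME, 0 = non-DME)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # NaN when a resample contains no cases of the relevant class.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),   # recall on DME images
        "specificity": ratio(tn, tn + fp),   # recall on non-DME images
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


def bootstrap_ci(y_true, y_pred, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI (assumed variant): resample images with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        value = metrics([y_true[i] for i in idx], [y_pred[i] for i in idx])[metric]
        if value == value:  # drop NaN resamples
            stats.append(value)
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper


# Toy labels only: 20 DME and 80 non-DME images, 1 missed DME, 2 false alarms.
y_true = [1] * 20 + [0] * 80
y_pred = [1] * 19 + [0] * 1 + [0] * 78 + [1] * 2

point = metrics(y_true, y_pred)
sens_ci = bootstrap_ci(y_true, y_pred, "sensitivity")
print(point, sens_ci)
```

AU-ROC would be bootstrapped the same way, but from predicted probabilities rather than thresholded labels; that step is left out here to keep the sketch dependency-free.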
Takinami et al. (Tue,) studied this question.