What question did this study set out to answer?

This analysis aims to compare the classification accuracy of general-purpose versus domain-specific multimodal models in detecting diabetic retinopathy from fundus images.

May 17, 2026 | Open Access

Comparative Analysis of General-Purpose vs. Domain-Specific Multimodal Models for Diabetic Retinopathy Classification

Key points

Evaluated general-purpose models (Gemini 3 Flash, GPT-5.2, Pixtral-Large) and domain-specific models (MedGemma-1.5, RETFound, EyeCLIP, MedSigLIP) for accuracy.
Applied zero-/few-shot prompting to the conversational models, and linear probing and fine-tuning to the domain-specific models.
Compared models on their accuracy in classifying diabetic retinopathy versus normal fundus images.
Accuracy was highest for fine-tuned MedSigLIP at 94.8% and lowest for zero-shot Pixtral-Large at 70.7%.
Linear probing improved RETFound’s accuracy by 9.7% over fine-tuning, while few-shot gains were most substantial for Pixtral-Large (+7.4%).
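Overall, general-purpose models such as Gemini 3 and GPT-5.2 performed comparably to the ophthalmology-specific models (RETFound, EyeCLIP) but remained below MedSigLIP and MedGemma-1.5; domain-specific models offered higher accuracy and stability.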
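Linear probing, one of the techniques named in the key points, trains only a small linear classifier on top of embeddings from a frozen encoder. The sketch below illustrates the idea under loud assumptions: the embeddings are synthetic Gaussian vectors standing in for encoder outputs (no RETFound, EyeCLIP, or MedSigLIP model is loaded), the labels 0/1 stand for normal vs. diabetic retinopathy, and the helper names (`make_synthetic_embeddings`, `train_linear_probe`, `accuracy`) are hypothetical. It is not the study's pipeline, and its accuracy says nothing about the reported results.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def make_synthetic_embeddings(n_per_class, dim, rng):
    """Stand-in for frozen-encoder outputs: two Gaussian clusters (assumption)."""
    data = []
    for label, center in ((0, -1.0), (1, 1.0)):
        for _ in range(n_per_class):
            data.append(([rng.gauss(center, 1.0) for _ in range(dim)], label))
    rng.shuffle(data)
    return [x for x, _ in data], [y for _, y in data]

def train_linear_probe(embeddings, labels, lr=0.1, epochs=200):
    """Fit a logistic-regression head with batch gradient descent.

    Only the weights and bias are learned; the embeddings stay fixed,
    which is what distinguishes a linear probe from full fine-tuning.
    """
    dim = len(embeddings[0])
    n = len(embeddings)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(embeddings, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def accuracy(weights, bias, embeddings, labels):
    """Fraction of samples whose thresholded prediction matches the label."""
    correct = 0
    for x, y in zip(embeddings, labels):
        score = sum(w * xi for w, xi in zip(weights, x)) + bias
        correct += int((score >= 0.0) == (y == 1))
    return correct / len(labels)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
train_x, train_y = make_synthetic_embeddings(100, 8, rng)
test_x, test_y = make_synthetic_embeddings(50, 8, rng)
probe_w, probe_b = train_linear_probe(train_x, train_y)
test_accuracy = accuracy(probe_w, probe_b, test_x, test_y)
print(f"held-out accuracy on synthetic embeddings: {test_accuracy:.3f}")
```

In a real setting the synthetic vectors would be replaced by embeddings extracted once from the frozen encoder, which is why linear probing is much cheaper than fine-tuning the whole model.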

Abstract

Background/Objectives: General-purpose and domain-specific multimodal foundation models show considerable promise in medical image analysis. In this study, we evaluated accuracy in classifying diabetic retinopathy vs. normal fundus images using general-purpose conversational models (Gemini 3 Flash, GPT-5.2, and Pixtral-Large), a medical conversational model (MedGemma-1.5) and its image encoder (MedSigLIP), as well as ophthalmology-specific models (RETFound and EyeCLIP). Methods: We applied zero-/few-shot prompting to the conversational models, and linear probing and fine-tuning to the domain-specific models. Results: The accuracies of zero-shot Pixtral-Large (70.7%) and fine-tuned RETFound (77.1%) were comparable but lower than those of GPT-5.2 (77.9%), MedGemma-1.5 (88.2%), and Gemini 3 (88.5%), as well as those of fine-tuned EyeCLIP (85.8%) and MedSigLIP (94.8%). The accuracy gain from few-shot prompting was substantial for Pixtral-Large (+7.4%) and limited for GPT-5.2 (+3.6%), while accuracy decreased for Gemini 3 (−3.4%) and MedGemma-1.5 (−1.1%). Embedding-based linear probing further improved accuracy over fine-tuning for RETFound (+9.7%), yielded only a marginal gain for EyeCLIP (+2.3%), and did not benefit MedSigLIP (−0.8%). Overall, with minimal prompting enhancement, general-purpose conversational models such as Gemini 3 and GPT-5.2 achieved performance comparable to ophthalmology-specific models that were either fine-tuned or enhanced via embedding-based linear probing, but remained inferior to MedSigLIP and its conversational counterpart, MedGemma-1.5. Conclusions: The findings highlight a trade-off between specialization and flexibility: domain-specific models provide higher accuracy and stability, while general-purpose multimodal models offer greater accessibility, adaptability, and interactive reasoning, so the two serve as complementary tools for retinal disease screening and clinical decision support.
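The abstract reports baseline accuracies and changes but not the resulting figures. The sketch below adds them up as a reading aid, assuming that each reported change is an absolute percentage-point difference from the stated baseline (zero-shot for the conversational models, fine-tuned for the encoders). The abstract does not confirm this interpretation, so the derived values are illustrative rather than reported results; the variable names are hypothetical.

```python
# Baseline accuracy (%) and reported change, copied from the abstract.
# ASSUMPTION: each change is an absolute percentage-point difference.
few_shot = {  # zero-shot baseline, change with few-shot prompting
    "Pixtral-Large": (70.7, +7.4),
    "GPT-5.2": (77.9, +3.6),
    "Gemini 3": (88.5, -3.4),
    "MedGemma-1.5": (88.2, -1.1),
}
linear_probe = {  # fine-tuned baseline, change with linear probing
    "RETFound": (77.1, +9.7),
    "EyeCLIP": (85.8, +2.3),
    "MedSigLIP": (94.8, -0.8),
}

def derive(table):
    """Return baseline + change for each model, rounded to one decimal."""
    return {name: round(base + delta, 1) for name, (base, delta) in table.items()}

derived_few_shot = derive(few_shot)
derived_linear_probe = derive(linear_probe)

for name, value in {**derived_few_shot, **derived_linear_probe}.items():
    print(f"{name}: {value:.1f}%")
```

Under that assumption, few-shot GPT-5.2 and Gemini 3 would land at 81.5% and 85.1%, close to linear-probed RETFound (86.8%) and EyeCLIP (88.1%), and below MedSigLIP, which is consistent with the comparison drawn in the abstract.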
