This study compared the performance of ChatGPT-4o, ChatGPT-5, and Gemini 2.5 Flash on the 2025 Iranian internal medicine subspecialty board examinations. A total of 650 multiple-choice questions from six subspecialties were tested; image-based items were excluded. Each question was presented in Persian, and responses were evaluated against the official answer key. Accuracy was 68.9% for ChatGPT-4o, 74.5% for ChatGPT-5, and 79.9% for Gemini 2.5 Flash, with Gemini performing significantly better than both ChatGPT versions. ChatGPT-5 also showed a significant improvement over ChatGPT-4o, consistent with rapid progress in model development. Subspecialty analysis revealed stronger results in rheumatology and respiratory medicine than in nephrology, while question type and length had no significant effect on outcomes. An artificial neural network that combined the outputs of all three models reached 81.6% accuracy, slightly exceeding Gemini alone. These findings identify Gemini 2.5 Flash as the most reliable of the three models on this high-stakes internal medicine exam. The results support the growing role of advanced AI systems as assistants in medical education and clinical practice. However, further research is needed to assess their use in multimodal and real-world clinical tasks.
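To give a feel for the precision behind the reported accuracy rates, the following minimal sketch converts each percentage into an approximate number of correct answers out of 650 and attaches a 95% Wilson score interval. The correct-answer counts are back-calculated from the rounded percentages and are therefore assumptions, not figures from the study; the helper names (`approx_correct`, `wilson_interval`, `summarize`) are hypothetical. The abstract does not state which significance test the authors used, and because all models answered the same questions, a paired test would be needed to reproduce their comparisons. These independent intervals are only a rough illustration.

```python
import math

N_QUESTIONS = 650  # total questions reported in the abstract

# Accuracy rates (percent) as reported in the abstract.
REPORTED_ACCURACY = {
    "ChatGPT-4o": 68.9,
    "ChatGPT-5": 74.5,
    "Gemini 2.5 Flash": 79.9,
    "ANN ensemble": 81.6,
}


def approx_correct(accuracy_pct, n=N_QUESTIONS):
    """Back-calculate an approximate correct-answer count (an assumption)."""
    return round(accuracy_pct / 100 * n)


def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = correct / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def summarize():
    """Map each model to (approximate correct count, (lower, upper))."""
    rows = {}
    for name, pct in REPORTED_ACCURACY.items():
        k = approx_correct(pct)
        rows[name] = (k, wilson_interval(k, N_QUESTIONS))
    return rows


if __name__ == "__main__":
    for name, (k, (lo, hi)) in summarize().items():
        print(f"{name}: ~{k}/{N_QUESTIONS} correct, 95% CI {lo:.1%} to {hi:.1%}")
```

Overlap between two of these intervals does not by itself contradict the reported significant differences, since a paired analysis on the same 650 questions can detect smaller gaps than independent intervals suggest.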
Shahab Sheikhalishahi
Shahid Sadoughi University of Medical Sciences and Health Services
Alireza Haddadi
Saina Sadeghipour
Shahid Sadoughi University of Medical Sciences and Health Services
Scientific Reports
Sheikhalishahi et al. studied this question.
synapsesocial.com/papers/694025912d562116f28fe93e — DOI: https://doi.org/10.1038/s41598-025-31251-3