Artificial intelligence and large language models have significantly influenced medical education by enhancing learning experiences. While previous studies have assessed ChatGPT's performance on anatomy-related questions, a notable gap remains in understanding how its accuracy changes over time. This longitudinal study evaluated the progression of ChatGPT's accuracy using 120 five-option multiple-choice questions covering anatomical systems, written by anatomy faculty members. Incorrect responses were categorized as informational, logical, or combined errors, while correct responses were classified as genuine or guessed (the correct option was selected despite incorrect explanatory content). Performance was further evaluated in relation to question characteristics and cognitive levels according to Bloom's Taxonomy. Following a deliberate interval of approximately 2 years, the same question set was administered to ChatGPT-3.5, ChatGPT-4o, and ChatGPT-5 without providing feedback. Differences in accuracy and reasoning across versions were analyzed using Cochran's Q test. Correct-response rates increased significantly across versions, from 45.8% (ChatGPT-3.5) to 73.3% (ChatGPT-4o) and 86.7% (ChatGPT-5) (Q = 55.433, p < 0.001), accompanied by reductions in combined errors and guessed correct responses. Informational errors remained predominant, accounting for 47/65 (72.3%), 26/32 (81.3%), and 12/16 (75.0%) of incorrect responses in ChatGPT-3.5, ChatGPT-4o, and ChatGPT-5, respectively. Performance improved significantly across successive versions at both lower and higher cognitive levels (Q = 40.478, p < 0.001; Q = 16.095, p < 0.001, respectively). The 2-year evolution of ChatGPT's anatomical knowledge highlights its potential as a supplementary resource for anatomy education and individualized learning.
This study provides modest evidence of progressive improvements in anatomical accuracy and reasoning across successive model versions, suggesting that continued refinement may further enhance reliability.
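For readers unfamiliar with Cochran's Q test, which the study applies to paired correct/incorrect outcomes across the three model versions, the following is a minimal sketch of the statistic in plain Python. The function names (`cochrans_q`, `chi2_sf_df2`) and the six-question toy data are hypothetical and chosen only for illustration; the study's per-question responses are not given here, so this does not reproduce the reported Q values. The p-value helper assumes exactly three versions (2 degrees of freedom), where the chi-square survival function reduces to exp(-Q/2).

```python
import math


def cochrans_q(rows):
    """Cochran's Q for paired binary outcomes.

    rows: one tuple per question, holding 0 (incorrect) or 1 (correct)
    for each model version, in the same version order for every question.
    Returns (Q, degrees_of_freedom).
    """
    k = len(rows[0])  # number of versions compared
    if k < 2 or any(len(r) != k for r in rows):
        raise ValueError("every row needs the same number (>= 2) of outcomes")

    col_totals = [sum(r[j] for r in rows) for j in range(k)]  # correct per version
    row_totals = [sum(r) for r in rows]                       # correct per question
    grand_total = sum(row_totals)

    denom = k * grand_total - sum(t * t for t in row_totals)
    if denom == 0:
        # Happens when every question is all-correct or all-incorrect.
        raise ValueError("Q is undefined: no question discriminates between versions")

    numer = (k - 1) * (k * sum(c * c for c in col_totals) - grand_total ** 2)
    return numer / denom, k - 1


def chi2_sf_df2(q):
    """Upper-tail chi-square probability for 2 degrees of freedom only."""
    return math.exp(-q / 2.0)


# Hypothetical toy data: 6 questions x 3 versions (NOT the study's data).
toy = [
    (0, 0, 1),
    (0, 1, 1),
    (0, 1, 1),
    (1, 1, 1),
    (0, 0, 0),
    (0, 1, 1),
]

q_stat, df = cochrans_q(toy)
p_value = chi2_sf_df2(q_stat)  # valid here because df == 2
print(f"Q = {q_stat:.3f}, df = {df}, p = {p_value:.4f}")
```

Questions answered identically by all versions (all correct or all incorrect) contribute nothing to the statistic, which is why the test suits repeated administration of one fixed question set. For more than three versions, a general chi-square survival function from a statistics library would be needed in place of `chi2_sf_df2`.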
Paslı et al. studied this question.