What question did this study set out to answer?

The study asked whether ChatGPT's accuracy on anatomy questions improved over a two-year period spanning three of its free public versions.

May 16, 2026

Is artificial intelligence getting better at anatomy? A two‐year review of ChatGPT's free public versions

Key Points

The study evaluates how ChatGPT's accuracy on anatomy questions changed over a two-year period.
The same 120 five-option multiple-choice anatomy questions were administered to three versions of ChatGPT (3.5, 4o, and 5).
Incorrect responses were categorized as informational, logical, or combined errors.
Cochran's Q test was used to identify differences in accuracy across versions.
Correct-response rates improved from 45.8% (ChatGPT-3.5) to 73.3% (ChatGPT-4o) and 86.7% (ChatGPT-5) (Q = 55.433, p < 0.001).
Combined errors and guessed-correct responses declined across versions.
Informational errors remained predominant, accounting for more than 70% of incorrect responses in every version.
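
Cochran's Q test, named in the key points above, compares binary outcomes (here, correct or incorrect) for the same items across three or more conditions. The sketch below is a minimal illustration of how the statistic is computed. It is not the authors' analysis code, and the six-question data set is invented for demonstration because the per-question results are not given in this summary. The names `cochran_q`, `chi2_sf_df2`, and `TOY_RESULTS` are hypothetical. The p-value helper uses the closed form of the chi-square survival function, which holds only for 2 degrees of freedom (three versions).

```python
import math


def cochran_q(rows):
    """Return (Q, degrees of freedom) for a table of 0/1 outcomes.

    rows: one tuple per question, one 0/1 entry per model version.
    """
    k = len(rows[0])                       # number of versions compared
    col_totals = [sum(r[j] for r in rows) for j in range(k)]
    row_totals = [sum(r) for r in rows]
    n = sum(row_totals)                    # total number of correct answers
    denom = k * n - sum(t * t for t in row_totals)
    if denom == 0:
        raise ValueError("Q is undefined: no question differs across versions")
    q = (k - 1) * (k * sum(c * c for c in col_totals) - n * n) / denom
    return q, k - 1


def chi2_sf_df2(x):
    """Chi-square upper-tail probability, valid only for df = 2."""
    return math.exp(-x / 2)


# Hypothetical toy data: 6 questions x 3 versions (1 = correct, 0 = incorrect).
TOY_RESULTS = [
    (0, 1, 1),
    (0, 0, 1),
    (1, 1, 1),
    (0, 1, 1),
    (0, 0, 0),
    (1, 1, 1),
]

if __name__ == "__main__":
    q, df = cochran_q(TOY_RESULTS)
    print(f"toy data: Q = {q:.3f}, df = {df}, p = {chi2_sf_df2(q):.4f}")
    # Tail probability for the Q value reported in the study (df = 2).
    print(f"reported Q = 55.433: p = {chi2_sf_df2(55.433):.2e}")
```

With three versions the test has 2 degrees of freedom, so the reported Q = 55.433 corresponds to a tail probability far below 0.001, consistent with the p-value given in the key points.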

Abstract

Artificial intelligence and large language models have significantly influenced medical education by enhancing learning experiences. While previous studies have assessed ChatGPT's performance on anatomy-related questions, a notable gap remains in understanding how its accuracy changes over time. This longitudinal study evaluated the progression of ChatGPT's accuracy using 120 five-option multiple-choice questions covering anatomical systems, written by anatomy faculty members. Incorrect responses were categorized as informational, logical, or combined errors, while correct responses were classified as genuine or guessed (selecting the correct option despite providing incorrect explanatory content). Performance was further evaluated in relation to question characteristics and cognitive levels according to Bloom's Taxonomy. Following a deliberate interval of approximately 2 years, the same question set was administered to ChatGPT-3.5, ChatGPT-4o, and ChatGPT-5 without providing feedback. Temporal differences in accuracy and reasoning were analyzed using Cochran's Q test. Correct-response rates increased significantly across versions, from 45.8% (ChatGPT-3.5) to 73.3% (ChatGPT-4o) and 86.7% (ChatGPT-5) (Q = 55.433, p < 0.001), accompanied by reductions in combined errors and guessed-correct responses. Informational errors remained predominant, accounting for 47/65 (72.3%), 26/32 (81.25%), and 12/16 (75.0%) of incorrect responses in ChatGPT-3.5, ChatGPT-4o, and ChatGPT-5, respectively. Performance improved significantly across successive versions at both lower and higher cognitive levels (Q = 40.478, p < 0.001; Q = 16.095, p < 0.001, respectively). The 2-year evolution of ChatGPT's anatomical knowledge highlights its potential as a supplementary resource for anatomy education and individualized learning. This study provides modest evidence of progressive improvements in anatomical accuracy and reasoning across successive model versions, suggesting that continued refinement may further enhance reliability.
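
The percentages in the abstract can be cross-checked against the reported error counts. The short sketch below does that arithmetic, assuming each version answered all 120 questions (which the reported denominators imply). The names `REPORTED_ERRORS`, `summarize`, and `SUMMARY` are hypothetical and used only for illustration.

```python
TOTAL_QUESTIONS = 120

# (informational errors, all incorrect responses) per version,
# taken from the counts reported in the abstract.
REPORTED_ERRORS = {
    "ChatGPT-3.5": (47, 65),
    "ChatGPT-4o": (26, 32),
    "ChatGPT-5": (12, 16),
}


def summarize(informational, incorrect, total=TOTAL_QUESTIONS):
    """Derive correct-response counts and percentages from error counts."""
    correct = total - incorrect
    return {
        "correct": correct,
        "correct_pct": round(100 * correct / total, 1),
        "informational_share_pct": round(100 * informational / incorrect, 2),
    }


SUMMARY = {
    version: summarize(*counts) for version, counts in REPORTED_ERRORS.items()
}

if __name__ == "__main__":
    for version, stats in SUMMARY.items():
        print(version, stats)
```

The derived correct-response rates (55, 88, and 104 of 120 questions) match the reported 45.8%, 73.3%, and 86.7%, so the counts and percentages in the abstract are internally consistent.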
