What question did this study set out to answer?

The research compares the performance of large language models in neuroanatomical diagnostics, focusing on accuracy and terminology use.

March 7, 2026 | Open Access

Evaluation of large language models in clinical neuroanatomy: a comparative scoring analysis based on accuracy, concordance, insight, and anatomical terminology accuracy

Key Points

The research compares the performance of large language models in neuroanatomical diagnostics, focusing on accuracy and terminology use.
The study uses two structured scoring frameworks for evaluation: ACI (accuracy, concordance, insight) and ATA (anatomical terminology accuracy).
It analyzes diagnostic performance in terms of accuracy and use of anatomical terminology.
It compares ChatGPT-4 and Gemini 2.5 across various neuroanatomical scenarios.
ChatGPT-4 demonstrated strong and stable diagnostic performance with high accuracy.
Gemini 2.5 showed inconsistent results and was particularly sensitive to prompt changes.
Both models can be valuable in clinical and educational settings.

Abstract

ChatGPT-4 demonstrates strong and stable diagnostic performance in neuroanatomical cases, with high accuracy and precise anatomical language. Gemini 2.5 shows potential, but is more sensitive to prompt variations and performs inconsistently in complex scenarios. Structured scoring frameworks like ACI and ATA offer valuable tools for evaluating LLMs in both clinical and educational settings.
