ChatGPT-4 demonstrates strong and stable diagnostic performance in neuroanatomical cases, with high accuracy and precise anatomical language. Gemini 2.5 shows potential but is more sensitive to prompt variations and performs inconsistently in complex scenarios. Structured scoring frameworks such as ACI and ATA are valuable tools for evaluating LLMs in both clinical and educational settings.
Abdullah Örs (Fri,) studied this question.