April 14, 2026 | Open Access

Large language models as clinical decision-support tools in multiple sclerosis and neuromyelitis optica spectrum disorders: A comparative study of ChatGPT-4o and neurologists

Key Points

The study assessed how effectively ChatGPT-4o manages complex decisions in multiple sclerosis and neuromyelitis optica spectrum disorders compared with neurologists.
The authors conducted a comparative analysis using 21 clinical vignettes derived from a multicenter research framework.
Responses from 290 neurologists were compared with those of ChatGPT-4o, with and without Retrieval-Augmented Generation (RAG).
The primary endpoint was guideline-adherent decision-making; the prevalence of therapeutic inertia was a secondary outcome.
ChatGPT-4o with RAG showed significantly higher guideline adherence than neurologists (80.5% vs. 66.5%; p = 0.001).
In multivariable models, ChatGPT-4o was an independent predictor of evidence-based decision-making (odds ratio 3.17; 95% confidence interval: 2.05-4.88).
The model showed lower rates of therapeutic inertia overall, although its performance matched that of neurologists in emerging biomarker scenarios.

Abstract

Background: Therapeutic inertia (TI) remains a critical barrier to optimizing outcomes in multiple sclerosis (MS) and neuromyelitis optica spectrum disorders (NMOSDs).

Objective: We evaluated the proficiency of ChatGPT-4o in addressing complex neuro-immunological management challenges compared to practicing neurologists.

Methods: We conducted a comparative analysis using 21 clinical vignettes derived from a multicenter research framework. Responses from 290 neurologists were benchmarked against ChatGPT-4o, both with and without Retrieval-Augmented Generation (RAG). The primary endpoint was guideline-adherent decision-making at the item level, with the prevalence of TI as a secondary clinical outcome. Scenarios included MS therapy escalation, aquaporin-4-IgG positive NMOSD management, and serum neurofilament light chain integration.

Results: ChatGPT-4o with RAG achieved significantly higher guideline adherence in decision-making than neurologists (80.5% vs. 66.5%; p = 0.001). Multivariable generalized estimating equation models identified ChatGPT-4o as an independent predictor of evidence-based decision-making (odds ratio 3.17; 95% confidence interval: 2.05-4.88; p < 0.0001). While the model demonstrated a lower propensity for TI overall, performance parity occurred in emerging biomarker scenarios where clinical consensus is still evolving.

Conclusions: ChatGPT-4o demonstrated superior guideline adherence and reduced TI compared to neurologists. Integrating large language models as clinical decision-support tools may enhance the standardization of neuro-immunological care and serve as a valuable adjunct to mitigate human cognitive biases.
