What question did this study set out to answer?

This study evaluates how well large language models can replicate neurologists’ stroke scoring metrics like NIHSS and ASPECTS.

May 8, 2026

Abstract Number: Esoc2026a1487
Automated Stroke Scoring: Agreement Between Large Language Models and Neurologists in NIHSS and ASPECTS

R Domingos Da Costa Lopes, Universidade Federal do Rio de Janeiro
Maria Carlos Pereira, Administração Regional de Saúde de Lisboa e Vale do Tejo
Carolina Gonçalves, Administração Regional de Saúde de Lisboa e Vale do Tejo

Key Points

This study evaluates how well large language models can replicate neurologists’ stroke scoring metrics like NIHSS and ASPECTS.
Processed 487 neurological examination records and CT scans using five LLMs.
Assessed agreement with neurologist ratings using various statistical methods including ICC and kappa.
Measured time reduction for assessment tasks with LLM support.
Mean NIHSS score was 10.9; LLMs showed good agreement with neurologists, particularly Gemini™, with an ICC of 0.964.
Weighted Cohen’s kappa for NIHSS categorization (mild/moderate/severe) was 0.827 with Gemini™, indicating almost perfect agreement.
LLM-based scoring reduced assessment time by up to 16.4 seconds per patient.
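
The kappa figure above compares severity categories rather than raw scores. A minimal sketch of that kind of calculation follows, under stated assumptions: the abstract does not say which weighting scheme was used, so linear weights are assumed here; the severe cut-off (≥16) comes from the abstract, while treating scores of 5 or below as mild is an assumption. The function names and the sample scores are hypothetical and are not the study's data.

```python
from collections import Counter


def nihss_category(score, mild_max=5, severe_min=16):
    """Map a total NIHSS score to 0 (mild), 1 (moderate) or 2 (severe).

    severe_min=16 follows the abstract; mild_max=5 is an assumption.
    """
    if score >= severe_min:
        return 2
    if score <= mild_max:
        return 0
    return 1


def weighted_kappa(rater_a, rater_b, n_categories=3):
    """Linearly weighted Cohen's kappa for ordinal labels 0..n_categories-1."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    observed = Counter(zip(rater_a, rater_b))  # joint counts per (a, b) pair
    marg_a = Counter(rater_a)                  # category counts for rater A
    marg_b = Counter(rater_b)                  # category counts for rater B
    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(n_categories):
        for j in range(n_categories):
            # Linear disagreement weight: 0 on the diagonal, 1 for the
            # most distant pair of categories.
            weight = abs(i - j) / (n_categories - 1)
            observed_disagreement += weight * observed[(i, j)] / n
            expected_disagreement += weight * marg_a[i] * marg_b[j] / (n * n)
    if expected_disagreement == 0:
        # Both raters used one and the same category: kappa is undefined.
        return float("nan")
    return 1.0 - observed_disagreement / expected_disagreement


# Hypothetical total NIHSS scores for six patients (illustration only).
neurologist = [3, 4, 9, 12, 17, 22]
model = [2, 5, 8, 16, 18, 21]
kappa = weighted_kappa(
    [nihss_category(s) for s in neurologist],
    [nihss_category(s) for s in model],
)
print(round(kappa, 3))
```

With linear weights, a mild/severe mismatch is penalized twice as heavily as a mild/moderate one, which suits an ordinal severity scale.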

Abstract

Background and aims: Accurate assessment of stroke severity and imaging findings is essential for patient management, yet documentation is often inconsistent. Large language models (LLMs) may enable efficient, standardized extraction of clinical metrics. This study evaluated the ability of LLMs to reproduce neurologists’ NIHSS and ASPECTS scores and Oxfordshire Community Stroke Project (OCSP) classifications.

Methods: Neurological examination records and non-contrast CT scans were processed using five LLMs (GPT-4o™, Gemini-2.5™, DeepSeekV2™, Claude-4.0™, and Perplexity™) to extract NIHSS, ASPECTS, and OCSP scores. Agreement with neurologist ratings was assessed using bias/mean error, absolute deviation, intraclass correlation coefficient (ICC), and weighted Cohen’s kappa.

Results: We included 487 patients (57.5% women; mean age 76.3±12.8 years). Mean NIHSS was 10.9; 32.2% had mild (5) and 23.6% severe stroke (≥16). Excellent correlation was observed, with a tendency toward NIHSS underestimation: bias ranged from −0.09 (GPT™) to −1.78 (Claude™), except for DeepSeek™ (+0.83). Absolute deviation was lowest for Gemini™ (1.08). The ICC for NIHSS was excellent for Gemini™ (0.964) and GPT™ (0.938). Kappa for NIHSS categorization (mild/moderate/severe) was 0.827 with Gemini™, demonstrating almost perfect agreement. For ASPECTS (mean 8.69±2.2), ICCs were uniformly excellent across models (0.929–0.968). Major ASPECTS misclassification (6–10 vs 0–5) was rare (1.2% with GPT-4o™ and Perplexity™). Agreement for OCSP classification was moderate, with the highest concordance for GPT™ (86.3%). LLM-based scoring reduced assessment time by up to 16.4 seconds per patient.

Conclusions: LLMs accurately reproduce neurologist-derived NIHSS and ASPECTS scores with minimal clinically relevant deviation, supporting their potential for scalable, automated extraction of stroke data from unstructured clinical records.
Conflict of interest: Rui Lopes has nothing to disclose.
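
The continuous agreement measures named in the Methods (bias/mean error, absolute deviation, ICC) can be illustrated with a short sketch. The abstract does not state which ICC form was used, so a two-way random-effects, absolute-agreement, single-rater ICC, often written ICC(2,1), is assumed here. The function names and the three-patient scores are hypothetical, not taken from the study; the six-by-four matrix is the textbook worked example from Shrout and Fleiss (1979), included only as a sanity check.

```python
def bias_and_mad(model_scores, reference_scores):
    """Return (mean signed error, mean absolute deviation), model minus reference."""
    if len(model_scores) != len(reference_scores) or not model_scores:
        raise ValueError("score lists must be non-empty and of equal length")
    diffs = [m - r for m, r in zip(model_scores, reference_scores)]
    n = len(diffs)
    return sum(diffs) / n, sum(abs(d) for d in diffs) / n


def icc_2_1(ratings):
    """ICC(2,1) from a list of rows: one row per patient, one column per rater.

    Assumed form: two-way random effects, absolute agreement, single rater.
    """
    n = len(ratings)      # number of patients
    k = len(ratings[0])   # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Mean squares from the two-way ANOVA decomposition.
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_err = sum(
        (ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Hypothetical NIHSS totals: model output vs neurologist reference.
bias, mad = bias_and_mad([10, 4, 18], [11, 4, 20])
print(bias, mad)

# Shrout and Fleiss (1979) example: 6 subjects rated by 4 judges.
shrout_fleiss = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc_2_1(shrout_fleiss), 2))
```

A negative bias corresponds to the underestimation described in the Results. The absolute-agreement form is chosen because a consistency-only ICC would not penalize a model that is systematically low by a constant amount.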
