What question did this study set out to answer?

This research explores how large language models aid surgical decision-making for lumbar disc herniation by assessing their performance with different clinical inputs.

April 12, 2026 | Open Access

Exploratory study of large language models in surgical decision-making for lumbar disc herniation: a multicenter analysis based on multisource clinical information

Key Points

Evaluated 48 lumbar disc herniation (LDH) cases from multiple centers using four LLMs: GPT-5, Gemini 2.5 Pro, DeepSeek-R1, and Grok-4.
Conducted binary classification (surgical vs. conservative treatment) under two input scenarios: Group A used radiology report text only, and Group B added manually summarized clinical information.
Measured performance metrics including sensitivity, specificity, positive predictive value, negative predictive value, accuracy, F1 score, and Cohen's kappa.
GPT-5 showed a sensitivity of 0.92 and accuracy of 0.75 using radiology reports alone, improving to 0.85 accuracy with clinical information.
Only Gemini showed a statistically significant difference between the two input scenarios (McNemar's test, P = 0.013).
Adding clinical information increased the coverage of high-confidence predictions in most models, but how well those high-confidence outputs matched actual clinical decisions varied across models.
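
The metrics listed above can all be derived from a single 2×2 confusion matrix. The following is a minimal sketch, not the authors' code. The function name `binary_metrics` is hypothetical, and the example counts are an assumption: the source does not report the confusion matrix or the class split, so the counts below were chosen only because an even 24/24 split would be consistent with the reported GPT-5 radiology-only values (sensitivity 0.92, specificity 0.58, accuracy 0.75).

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute common binary-classification metrics from confusion-matrix counts.

    "Positive" here means the surgical class. Assumes no denominator is zero.
    """
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    # Chance agreement for Cohen's kappa: product of marginal proportions per class.
    p_expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": accuracy,
        "f1": 2 * tp / (2 * tp + fp + fn),
        "kappa": (accuracy - p_expected) / (1 - p_expected),
    }


# Hypothetical counts (NOT reported in the source): 24 surgical, 24 conservative cases.
m = binary_metrics(tp=22, fp=10, fn=2, tn=14)
for name, value in m.items():
    print(f"{name}: {value:.2f}")
```

Under these assumed counts, sensitivity, specificity, and accuracy round to the reported 0.92, 0.58, and 0.75; the remaining values are illustrative only and should not be read as the study's results.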

Abstract

This study aimed to explore the performance of large language models (LLMs) in surgical decision-making for lumbar disc herniation (LDH), and to evaluate the impact of radiology report text and manually summarized clinical information on model decision outputs.

A total of 48 LDH cases from multiple centers were included. Four mainstream LLMs (GPT-5, Gemini 2.5 Pro, DeepSeek-R1, and Grok-4) were used to perform a binary classification task (surgical vs. conservative treatment). Two input scenarios were designed: Group A used radiology report text only, while Group B incorporated additional manually summarized clinical information based on the same reports. Primary performance metrics included sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, and F1 score. Cohen’s kappa was reported as a supplementary measure of agreement. Decision confidence was further examined in a stratified analysis.

Using radiology report text alone, GPT-5 demonstrated relatively strong classification performance, with a sensitivity of 0.92, specificity of 0.58, and accuracy of 0.75. After incorporating clinical information, its accuracy increased to 0.85, with improvements observed in specificity, PPV, NPV, and F1 score. Gemini and Grok also showed performance improvement following the addition of clinical information, whereas DeepSeek-R1 exhibited minimal change across input scenarios. McNemar’s test indicated that only Gemini showed a statistically significant difference between the two groups (P = 0.013). Confidence analysis showed that the inclusion of clinical information increased the coverage of high-confidence predictions in most models; however, the alignment between high-confidence outputs and actual clinical decisions varied across models.

This exploratory study suggests that adding clinical information, such as symptoms, disease duration, and prior treatment, to radiology report text may help some LLMs produce outputs that are more consistent with actual clinical decisions in LDH. However, the findings are limited by the small sample size, the quality of the input data, and the complexity of real clinical decision-making. Further validation in larger studies with more complete information is still needed.
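
McNemar's test, used above to compare the two input scenarios, looks only at the cases where a model was correct under one scenario and incorrect under the other. The sketch below shows the exact (binomial) two-sided form. It is an illustration under stated assumptions: the source does not say whether the exact or chi-square variant was used, the function name `mcnemar_exact` is hypothetical, and the per-case correctness flags are invented for demonstration rather than taken from the study.

```python
from math import comb


def mcnemar_exact(correct_a: list, correct_b: list) -> tuple:
    """Two-sided exact McNemar test on paired per-case correctness flags.

    Returns (b, c, p_value), where b counts cases correct only under scenario A
    and c counts cases correct only under scenario B.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("Both scenarios must cover the same cases")
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, so no evidence of a difference
    # Under the null hypothesis, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2**n
    return b, c, min(1.0, 2 * tail)


# Hypothetical flags for 48 cases (NOT the study's data):
# 30 correct in both, 9 correct only with clinical info, 1 correct only without, 8 wrong in both.
group_a = [True] * 30 + [False] * 9 + [True] * 1 + [False] * 8
group_b = [True] * 30 + [True] * 9 + [False] * 1 + [False] * 8

b, c, p_value = mcnemar_exact(group_a, group_b)
print(f"discordant pairs: b={b}, c={c}, p={p_value:.4f}")
```

Because the test depends only on the discordant pairs, a model can gain several points of accuracy between scenarios and still fall short of significance in a sample of 48 cases.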
