This study explored the performance of large language models (LLMs) in surgical decision-making for lumbar disc herniation (LDH) and evaluated how radiology report text and manually summarized clinical information affect model decision outputs.
A total of 48 LDH cases from multiple centers were included. Four mainstream LLMs (GPT-5, Gemini 2.5 Pro, DeepSeek-R1, and Grok-4) performed a binary classification task (surgical vs. conservative treatment). Two input scenarios were designed: Group A used radiology report text only, while Group B added manually summarized clinical information to the same reports. Primary performance metrics were sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, and F1 score. Cohen’s kappa was reported as a supplementary measure of agreement. Decision confidence was further examined in a stratified analysis.
Using radiology report text alone, GPT-5 performed relatively well, with a sensitivity of 0.92, specificity of 0.58, and accuracy of 0.75. After clinical information was added, its accuracy increased to 0.85, with improvements in specificity, PPV, NPV, and F1 score. Gemini 2.5 Pro and Grok-4 also improved when clinical information was added, whereas DeepSeek-R1 changed minimally between input scenarios. McNemar’s test indicated that only Gemini 2.5 Pro differed significantly between the two groups (P = 0.013). Confidence analysis showed that including clinical information increased the coverage of high-confidence predictions in most models; however, the alignment between high-confidence outputs and actual clinical decisions varied across models.
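For readers who want to see how the reported metrics relate to one another, the following is a minimal sketch that derives them from a 2×2 confusion matrix and runs an exact (binomial) McNemar test on paired predictions. It is an illustration, not the authors' analysis code. The function names and all counts are hypothetical: the confusion matrix assumes 24 surgical and 24 conservative cases, chosen only because the rounded values then match the sensitivity, specificity, and accuracy reported for GPT-5 with report text alone. The actual case split and discordant-pair counts are not stated here, and the McNemar variant the study used is also an assumption.

```python
from math import comb


def binary_metrics(tp, fp, fn, tn):
    """Standard metrics from a 2x2 confusion matrix (surgical = positive).

    Assumes every denominator is non-zero.
    """
    n = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    accuracy = (tp + tn) / n
    f1 = 2 * tp / (2 * tp + fp + fn)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    kappa = (accuracy - p_chance) / (1 - p_chance)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "accuracy": accuracy,
        "f1": f1,
        "kappa": kappa,
    }


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: cases correct in Group A but wrong in Group B.
    c: cases wrong in Group A but correct in Group B.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Hypothetical counts: 24 surgical and 24 conservative cases (n = 48).
group_a = binary_metrics(tp=22, fp=10, fn=2, tn=14)
for name, value in group_a.items():
    print(f"{name}: {value:.2f}")

# Hypothetical discordant pairs between the two input scenarios.
print(f"McNemar exact p = {mcnemar_exact(b=1, c=9):.3f}")
```

McNemar's test uses only the discordant pairs, that is, cases where a model was right under one input scenario and wrong under the other. This is why two scenarios can differ noticeably in accuracy yet fail to reach significance in a sample of 48 cases.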
This exploratory study suggests that adding clinical information, such as symptoms, disease duration, and prior treatment, to radiology report text may help some LLMs produce outputs that are more consistent with actual clinical decisions in LDH. However, the findings are limited by the small sample size, the quality of the input data, and the complexity of real clinical decision-making. Further validation in larger studies with more complete information is still needed.
Ma et al. studied this question.