Background: With the continued advancement of artificial intelligence (AI), large language models (LLMs) such as GPT-4 may assist clinicians in evaluating patient candidacy for spinal cord stimulation (SCS). We compared a general-purpose, non–fine-tuned LLM (GPT-4), an expert multidisciplinary team (MDT), and a clinician-input, rule-based e-Health decision-support tool. The study focused exclusively on decision agreement and did not assess clinical outcomes (eg, pain relief or device retention).

Methods: This single-center, retrospective cohort was conducted at Fondazione Istituto G. Giglio (Cefalù, Italy) and included 93 consecutive adults referred to the MDT for SCS evaluation between January 2022 and March 2024. The MDT issued binary recommendations (“proceed” vs “do not proceed”) as the reference standard. The e-Health tool generated “yes”, “maybe”, or “no” outputs from structured clinician-entered data. GPT-4 was applied zero-shot, using a single standardized prompt on anonymized vignettes within an offline environment. The primary endpoint was agreement (weighted κ) among MDT, e-Health, and GPT-4; sensitivity/specificity analyses explored three interpretations of “maybe”.

Results: The MDT recommended SCS for 91.4% of patients, compared with 54.8% for the e-Health tool and 46.2% for GPT-4. Agreement was moderate for MDT vs e-Health (κ = 0.51) and e-Health vs GPT-4 (κ = 0.46), and fair for MDT vs GPT-4 (κ = 0.29). GPT-4 demonstrated a more conservative profile, favoring specificity over sensitivity.

Conclusion: A non–fine-tuned GPT-4 approximated but did not replicate MDT decision-making, functioning as a high-specificity, low-sensitivity filter. A layered workflow combining rule-based tools with expert oversight and targeted LLM adaptation may best optimize SCS candidate selection.

Keywords: artificial intelligence, large-language models, spinal cord stimulation, chronic pain, patient selection, neuromodulation
Bianco et al. studied this question.
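
For readers unfamiliar with the statistics named in the Methods, the following is a minimal sketch of how a weighted κ and a "maybe"-dependent sensitivity/specificity could be computed. It is not the authors' analysis code. Several points are assumptions made for illustration only: the ordinal coding (no = 0, maybe = 1, yes = 2), the use of linear weights, the three handling rules for "maybe" (count as "yes", count as "no", or exclude), the function names `weighted_kappa` and `sens_spec_with_maybe`, and the toy ratings, which are invented and are not study data.

```python
from collections import Counter

# Assumed ordinal coding (not specified in the abstract).
CODES = {"no": 0, "maybe": 1, "yes": 2}


def weighted_kappa(a, b, n_levels=3):
    """Linear-weighted Cohen's kappa for two raters' ordinal codes 0..n_levels-1."""
    if len(a) != len(b) or not a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(a)
    span = n_levels - 1

    def weight(i, j):
        # Disagreement weight grows linearly with the distance between categories.
        return abs(i - j) / span

    observed = sum(weight(x, y) for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Expected disagreement if the two raters were independent.
    expected = sum(
        weight(i, j) * count_a[i] * count_b[j]
        for i in range(n_levels)
        for j in range(n_levels)
    ) / (n * n)
    if expected == 0:
        raise ValueError("kappa is undefined when both raters use one identical category")
    return 1.0 - observed / expected


def sens_spec_with_maybe(reference, tool, maybe_as):
    """Sensitivity and specificity of `tool` against a binary `reference`.

    maybe_as is "yes", "no", or "exclude" (hypothetical handling rules).
    Returns (sensitivity, specificity); a value is None when undefined.
    """
    if maybe_as not in ("yes", "no", "exclude"):
        raise ValueError("maybe_as must be 'yes', 'no', or 'exclude'")
    tp = fn = tn = fp = 0
    for ref, out in zip(reference, tool):
        if out == "maybe":
            if maybe_as == "exclude":
                continue  # drop the case entirely
            out = maybe_as
        if ref == "yes":
            if out == "yes":
                tp += 1
            else:
                fn += 1
        else:
            if out == "no":
                tn += 1
            else:
                fp += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return sensitivity, specificity


# Invented toy ratings for eight hypothetical patients (NOT study data).
mdt = ["yes", "yes", "yes", "yes", "yes", "yes", "no", "no"]
tool = ["yes", "yes", "maybe", "maybe", "no", "yes", "no", "maybe"]

kappa = weighted_kappa([CODES[r] for r in mdt], [CODES[r] for r in tool])
print("weighted kappa:", round(kappa, 3))
for rule in ("yes", "no", "exclude"):
    print(rule, sens_spec_with_maybe(mdt, tool, rule))
```

In this sketch, treating "maybe" as "yes" raises sensitivity at the cost of specificity, and treating it as "no" does the reverse, which is the trade-off the sensitivity/specificity analyses are described as exploring.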