March 3, 2026 | Open Access

Smart Processing and Intelligent Navigation for Evaluation (SPINE): Comparing Clinicians and AI Language Model (GPT-4) in Spinal Cord Stimulation Candidate Selection

Key points

The MDT recommended spinal cord stimulation for 91.4% of patients, compared with 54.8% for the e-Health tool and 46.2% for GPT-4.
Agreement was moderate for MDT vs e-Health (κ = 0.51) and e-Health vs GPT-4 (κ = 0.46), and fair for MDT vs GPT-4 (κ = 0.29), indicating some alignment but notable discrepancies in decisions.
The analysis used retrospective data from 93 patients and focused on decision agreement; clinical outcomes such as pain relief were not assessed.
The study suggests that a layered workflow combining rule-based tools with expert oversight and targeted LLM adaptation could improve SCS candidate selection.

Abstract

Background: With the continued advancement of artificial intelligence (AI), large language models (LLMs) such as GPT-4 may assist clinicians in evaluating patient candidacy for spinal cord stimulation (SCS). We compared a general-purpose, non–fine-tuned LLM (GPT-4), an expert multidisciplinary team (MDT), and a clinician-input, rule-based e-Health decision-support tool. The study focused exclusively on decision agreement and did not assess clinical outcomes (eg, pain relief or device retention).

Methods: This single-center, retrospective cohort was conducted at Fondazione Istituto G. Giglio (Cefalù, Italy) and included 93 consecutive adults referred to the MDT for SCS evaluation between January 2022 and March 2024. The MDT issued binary recommendations (“proceed” vs “do not proceed”) as the reference standard. The e-Health tool generated “yes”, “maybe”, or “no” outputs from structured clinician-entered data. GPT-4 was applied zero-shot, using a single standardized prompt on anonymized vignettes within an offline environment. The primary endpoint was agreement (weighted κ) among MDT, e-Health, and GPT-4; sensitivity/specificity analyses explored three interpretations of “maybe”.

Results: The MDT recommended SCS for 91.4% of patients, compared with 54.8% for the e-Health tool and 46.2% for GPT-4. Agreement was moderate for MDT vs e-Health (κ = 0.51) and e-Health vs GPT-4 (κ = 0.46), and fair for MDT vs GPT-4 (κ = 0.29). GPT-4 demonstrated a more conservative profile, favoring specificity over sensitivity.

Conclusion: A non–fine-tuned GPT-4 approximated but did not replicate MDT decision-making, functioning as a high-specificity, low-sensitivity filter. A layered workflow combining rule-based tools with expert oversight and targeted LLM adaptation may best optimize SCS candidate selection.

Keywords: artificial intelligence, large-language models, spinal cord stimulation, chronic pain, patient selection, neuromodulation
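
Illustrative note: the agreement and sensitivity/specificity analyses named in the Methods can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' analysis code. The abstract does not say which weighting scheme was used for κ, how the binary MDT decision was aligned with the three-level tool output, or what the three interpretations of "maybe" were. The linear weights, the three policies (`maybe_as_yes`, `maybe_as_no`, `maybe_excluded`), all function names, and the six toy labels are hypothetical and are not study data.

```python
import math


def weighted_kappa(rater_a, rater_b, categories):
    """Linearly weighted Cohen's kappa for two raters.

    `categories` lists the labels in order (assumed ordinal). With two
    categories this reduces to the ordinary unweighted kappa.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters need equal, non-zero numbers of ratings")
    k = len(categories)
    if k < 2:
        raise ValueError("need at least two categories")
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)
    counts = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        counts[index[a]][index[b]] += 1
    row = [sum(counts[i]) / n for i in range(k)]
    col = [sum(counts[i][j] for i in range(k)) / n for j in range(k)]

    def weight(i, j):
        # Disagreement weight: 0 on the diagonal, 1 for the extreme cells.
        return abs(i - j) / (k - 1)

    cells = [(i, j) for i in range(k) for j in range(k)]
    d_obs = sum(weight(i, j) * counts[i][j] / n for i, j in cells)
    d_exp = sum(weight(i, j) * row[i] * col[j] for i, j in cells)
    if d_exp == 0:
        return math.nan  # both raters used one identical category: undefined
    return 1 - d_obs / d_exp


def resolve_maybe(reference, tool, policy):
    """Turn three-level tool output into yes/no under an assumed policy."""
    pairs = list(zip(reference, tool))
    if policy == "maybe_excluded":
        pairs = [(r, t) for r, t in pairs if t != "maybe"]
    elif policy in ("maybe_as_yes", "maybe_as_no"):
        target = "yes" if policy == "maybe_as_yes" else "no"
        pairs = [(r, target if t == "maybe" else t) for r, t in pairs]
    else:
        raise ValueError(f"unknown policy: {policy}")
    return [r for r, _ in pairs], [t for _, t in pairs]


def sensitivity_specificity(reference, predicted):
    """Sensitivity and specificity of yes/no predictions vs a reference."""
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == "yes" and p == "yes")
    fn = sum(1 for r, p in pairs if r == "yes" and p == "no")
    tn = sum(1 for r, p in pairs if r == "no" and p == "no")
    fp = sum(1 for r, p in pairs if r == "no" and p == "yes")
    sens = tp / (tp + fn) if tp + fn else math.nan
    spec = tn / (tn + fp) if tn + fp else math.nan
    return sens, spec


# Hypothetical toy labels, NOT data from the study.
toy_reference = ["yes", "yes", "yes", "no", "yes", "no"]
toy_tool = ["yes", "maybe", "no", "no", "maybe", "maybe"]

for policy in ("maybe_as_yes", "maybe_as_no", "maybe_excluded"):
    ref, pred = resolve_maybe(toy_reference, toy_tool, policy)
    sens, spec = sensitivity_specificity(ref, pred)
    kappa = weighted_kappa(ref, pred, ["no", "yes"])
    print(policy, round(sens, 2), round(spec, 2), round(kappa, 2))
```

Running the loop shows how much the handling of "maybe" alone can shift sensitivity and specificity, which is presumably why the study reports all three interpretations instead of a single figure.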
