OBJECTIVE: Examine how manipulating the insight component of the CRISPE (Capacity and Role, Insight, Statement, Personality, Experiment) prompt framework affects the accuracy of ChatGPT’s physical therapy recommendations for neck pain. DESIGN: 3-by-2 factorial design. METHODS: Prompts for ChatGPT-4.0 were developed using CRISPE, with increasing levels of insight drawn from clinical practice guidelines (CPG) and/or a custom clinical inference table (CIT). Examination (n = 72) and intervention (n = 144) outputs were evaluated by independent teams using precision, recall, and F1 score. Inter-rater reliability was assessed using intraclass correlation coefficients. Fractional regression models with Benjamini-Hochberg adjustment examined the effects of the conditions. RESULTS: Attaching the CPG document significantly improved precision for examination (β = 1.30, p < 0.001) and intervention (β = 2.74, p < 0.001) recommendations. Attaching the CIT also improved precision for examination (β = 1.06, p < 0.001) and intervention (β = 0.98, p < 0.01) recommendations. A significant CPG × CIT interaction occurred for recall of examination recommendations (β = 1.39, p < 0.001). Attaching the CPG improved F1 score for examination (β = 0.38, p < 0.05) and intervention (β = 2.41, p < 0.001) recommendations. Across conditions, F1 scores ranged from 40.2% to 54.5% for examination and from 33.7% to 86.7% for interventions. CONCLUSION: ChatGPT’s accuracy varied by prompt strategy. Attaching guideline resources improved precision and F1 score but not recall. Variability across prompting strategies and accuracy metrics highlights the need for further research.
Peredo et al. studied this question.
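To make the three accuracy metrics concrete, the sketch below computes precision, recall, and F1 score for a single model output compared against a reference list of guideline-recommended items. It is only an illustration: the abstract does not describe how raters matched outputs to the guideline, so the exact set-matching approach, the function name precision_recall_f1, and the item labels are all assumptions.

```python
def precision_recall_f1(recommended, reference):
    """Score one output against a reference list (hypothetical helper).

    recommended: items the model suggested
    reference:   items the guideline supports (assumed gold standard)
    """
    recommended, reference = set(recommended), set(reference)
    true_positives = len(recommended & reference)

    # Precision: share of the model's suggestions that the reference supports.
    precision = true_positives / len(recommended) if recommended else 0.0
    # Recall: share of the reference items that the model actually suggested.
    recall = true_positives / len(reference) if reference else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative labels only; these are not taken from the study.
reference_items = {"item_a", "item_b", "item_c", "item_d"}
model_items = {"item_a", "item_b", "item_x"}

precision, recall, f1 = precision_recall_f1(model_items, reference_items)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case two of the three suggestions are supported (precision 2/3) while only two of the four reference items are covered (recall 1/2), which shows how an output can score well on precision while still missing much of the guideline.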