What question did this study set out to answer?

The study aims to evaluate how changes to the CRISPE prompt framework affect ChatGPT’s accuracy in physical therapy recommendations for neck pain.

April 25, 2026

Using the CRISPE Framework to Assess ChatGPT’s Accuracy of Physical Therapy Examination and Intervention Recommendations for Neck Pain

Key Points

The study evaluates how manipulating the insight component of the CRISPE prompt framework affects the accuracy of ChatGPT’s physical therapy recommendations for neck pain.
The study used a 3-by-2 factorial design, with ChatGPT-4.0 prompts built on the CRISPE framework.
Independent teams evaluated examination (n = 72) and intervention (n = 144) outputs using precision, recall, and F1 score.
Fractional regression models examined the effects of the conditions, and inter-rater reliability was assessed with intraclass correlation coefficients.
Attaching clinical practice guidelines (CPG) significantly improved precision for examination (β = 1.30, p<0.001) and intervention recommendations (β = 2.74, p<0.001).
Attaching a custom clinical inference table (CIT) improved examination precision (β = 1.06, p<0.001) and intervention precision (β = 0.98, p<0.01).
A significant CPG-by-CIT interaction affected recall of examination recommendations (β = 1.39, p<0.001).
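
The three accuracy metrics named above can be illustrated with a minimal sketch. It assumes each ChatGPT output is scored as a set of recommended items compared against a reference set of guideline-supported items; the summary does not describe the study's actual scoring procedure, and the function name and item labels below are hypothetical.

```python
def precision_recall_f1(recommended, reference):
    """Score one output against a reference set of items.

    Assumption: both arguments are collections of item labels, and an item
    counts as correct only if it appears in the reference set.
    """
    recommended, reference = set(recommended), set(reference)
    true_pos = len(recommended & reference)

    # Precision: share of recommended items that are in the reference set.
    precision = true_pos / len(recommended) if recommended else 0.0
    # Recall: share of reference items that were recommended.
    recall = true_pos / len(reference) if reference else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels, not taken from the study.
output_items = {"item_a", "item_b", "item_x", "item_y"}
reference_items = {"item_a", "item_b", "item_c", "item_d", "item_e"}

precision, recall, f1 = precision_recall_f1(output_items, reference_items)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this made-up case two of the four recommended items are correct, so precision is higher than recall: the output is fairly accurate in what it lists but misses most of the reference items. This is the same pattern the key points describe, where attached resources raised precision more consistently than recall.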

Abstract

OBJECTIVE: To examine how manipulating the insight component of the CRISPE (Capacity and Role, Insight, Statement, Personality, Experiment) prompt framework affects the accuracy of ChatGPT’s physical therapy recommendations for neck pain.

DESIGN: 3-by-2 factorial design.

METHODS: Prompts for ChatGPT-4.0 were developed using CRISPE, with increasing insight on clinical practice guidelines (CPG) and/or a custom clinical inference table (CIT). Examination (n = 72) and intervention (n = 144) outputs were evaluated by independent teams using precision, recall, and F1 score. Inter-rater reliability was assessed using intraclass correlation coefficients. Fractional regression models examined the effects of the conditions, with Benjamini-Hochberg adjustment.

RESULTS: Fractional regression models showed that attaching the CPG document significantly improved precision for examination (β = 1.30, p<0.001) and intervention recommendations (β = 2.74, p<0.001). Attaching the CIT improved precision for examination (β = 1.06, p<0.001) and intervention recommendations (β = 0.98, p<0.01). A significant CPG*CIT interaction occurred for recall of examination recommendations (β = 1.39, p<0.001). Attaching the CPG improved F1 score for examination (β = 0.38, p<0.05) and interventions (β = 2.41, p<0.001). Across conditions, F1 scores ranged from 40.2% to 54.5% for examination and from 33.7% to 86.7% for interventions.

CONCLUSION: ChatGPT’s accuracy varied by prompt strategy. Attaching guideline resources improved precision and F1 score but not recall. Variability across prompting strategies and accuracy metrics highlights the need for further research.
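
The Benjamini-Hochberg adjustment mentioned in the methods controls the false discovery rate when several tests are run together. The sketch below shows the standard step-up procedure in plain Python; the p-values are hypothetical and are not the study's results, and the function name is an illustration only.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, scaling each by m / rank and
    # enforcing monotonicity with a running minimum.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four tests.
raw = [0.01, 0.04, 0.03, 0.20]
for p, q in zip(raw, benjamini_hochberg(raw)):
    print(f"raw p = {p:.3f}  adjusted p = {q:.3f}")
```

Each adjusted value is compared against the chosen significance level in place of the raw p-value, so a result that is significant before adjustment may no longer be significant afterward.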
