March 19, 2026

Can AI chatbots guide patients and physicians about neck pain? A reliability and readability comparison of ChatGPT-4 and Gemini

Key Points

This research aims to compare the performance of two AI chatbots, ChatGPT 4.0 and Google Gemini, in providing guidance about neck pain.
Twenty-four patient-oriented questions and four clinical case scenarios were submitted to both chatbots.
Responses were evaluated for reliability, quality, understandability, and readability using validated tools.
Two experienced physicians assessed clinical responses for accuracy, safety, and usability on a 7-point Likert scale.
Gemini showed significantly higher reliability (p < 0.001) than ChatGPT 4.0.
ChatGPT 4.0 scored slightly higher on GQS and PEMAT-P, but the differences were not statistically significant.
Both chatbots had similar readability scores, and the outputs of both were rated difficult to read.

Abstract

Background: Artificial intelligence (AI)-based chatbots are increasingly used as sources of medical information. Given the high prevalence of neck pain as a musculoskeletal symptom, patients may commonly consult such tools for health-related guidance.

Objective: To evaluate and compare the performance of ChatGPT 4.0 and Google Gemini in addressing commonly asked patient questions and clinical case scenarios related to neck pain, focusing on their accuracy, quality, understandability, readability, reliability, and usability.

Methods: Twenty-four patient-oriented questions and four clinical case scenarios regarding neck pain were submitted to ChatGPT 4.0 and Google Gemini. Responses were evaluated using validated tools: modified DISCERN (mDISCERN) for reliability, Global Quality Scale (GQS) for quality, PEMAT-P for understandability and actionability, and Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL) for readability. Case-based responses were assessed for accuracy, safety, and usability on a 7-point Likert scale by two experienced physicians.

Results: Gemini demonstrated significantly higher reliability (mDISCERN, p < 0.001), whereas ChatGPT 4.0 had slightly higher, though statistically insignificant, GQS and PEMAT-P scores. Readability metrics were similar: ChatGPT's FRE was 48.78 and FKGL 9.08; Gemini's FRE was 47.12 and FKGL 9.11. Both models’ outputs were considered difficult to read. In clinical scenarios, both chatbots showed comparable accuracy, safety, and usability, with minor omissions noted.

Conclusion: ChatGPT 4.0 and Google Gemini provided similar performance in addressing neck pain-related queries. While both may support patient.
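The two readability indices named in the Methods are fixed formulas over word, sentence, and syllable counts. The sketch below shows how such scores can be computed and why FRE values near 48 fall in the "difficult" band. It is an illustration only, not the authors' tooling, which the abstract does not specify. The function names are hypothetical, and the syllable counter is a rough vowel-group heuristic assumed here for simplicity; dedicated readability tools count syllables more carefully, so scores on real text will differ slightly.

```python
import re


def flesch_reading_ease(words, sentences, syllables):
    """Standard Flesch Reading Ease formula (higher = easier to read)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Standard Flesch-Kincaid Grade Level formula (approximate US school grade)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def fre_band(score):
    """Map an FRE score to Flesch's conventional difficulty bands."""
    bands = [
        (90, "very easy"),
        (80, "easy"),
        (70, "fairly easy"),
        (60, "standard"),
        (50, "fairly difficult"),
        (30, "difficult"),
    ]
    for lower_bound, label in bands:
        if score >= lower_bound:
            return label
    return "very difficult"


def count_syllables(word):
    """ASSUMPTION: approximate syllables as runs of vowels (crude heuristic)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def text_counts(text):
    """Return (words, sentences, syllables) for a plain-text passage."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return len(words), len(sentences), syllables


if __name__ == "__main__":
    # Hypothetical sample text, not taken from either chatbot's responses.
    sample = "Neck pain is common. Gentle movement and good posture often help."
    w, s, syl = text_counts(sample)
    print(flesch_reading_ease(w, s, syl), flesch_kincaid_grade(w, s, syl))
    # The FRE values reported in the Results map to the same band.
    print(fre_band(48.78), fre_band(47.12))
```

Under Flesch's conventional bands, scores from 30 to 50 are labelled difficult, so both reported FRE values (48.78 and 47.12) land in that band, consistent with the Results.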

Authors

Dicle Rotinda Ozdas Sevgin

Elif Tarihci Cakmak

Gizem Yildirim Ogras

Journals

Journal of Back and Musculoskeletal Rehabilitation

Institutions

Istanbul University
