What does this research mean for the field?

Large language models provide generally accurate and moderately reliable responses to patient questions about lateral epicondylitis, but consistently produce text that is too difficult for average patients to read. Novelty: incremental. Consensus alignment: neutral.

June 1, 2026

Large Language Models’ Responses to Patient Questions on Lateral Epicondylitis: Multi-Institutional Orthopaedic Surgeon Evaluation

Abstract

Background: Lateral epicondylitis (tennis elbow) is a common cause of elbow pain. With the increasing use of the internet and artificial intelligence (AI) for health information, patients frequently consult large language models (LLMs). This study aimed to evaluate the accuracy, reliability, content quality, and readability of responses provided by four LLMs (ChatGPT-3.5, ChatGPT-4, Gemini, and Copilot) to frequently asked patient questions about lateral epicondylitis.

Methods: The author committee reviewed patient-oriented questions on lateral epicondylitis using Google searches and selected the 12 most frequently asked questions for inclusion. These questions were presented to four LLMs: ChatGPT-3.5, ChatGPT-4, Gemini, and Copilot. Responses were evaluated for accuracy using a five-point Likert scale, reliability using the modified DISCERN scale, quality using the Global Quality Scale (GQS), and readability using the Flesch Reading Ease Score (FRES).

Results: Perceived medical accuracy did not differ significantly among the LLMs (p = 0.579). Reliability differed significantly (modified DISCERN: p < 0.001), with Copilot and Gemini achieving higher scores than ChatGPT-4 (both p < 0.001) and Copilot also outperforming ChatGPT-3.5 (p = 0.002). Quality differed significantly (GQS: p < 0.001), with ChatGPT-3.5 and Gemini scoring higher than ChatGPT-4 (p = 0.001 and p = 0.006, respectively). Readability differed across models (FRES: p = 0.049); Gemini demonstrated higher readability than ChatGPT-3.5 (p = 0.040), while responses from all models were generally difficult to read. Response generation time differed significantly (p < 0.001), with ChatGPT-4 producing the slowest responses.

Conclusions: All evaluated LLMs provided generally accurate and moderately reliable responses to questions about tennis elbow, with differences observed across specific quality domains such as source transparency, readability, and response time. Models with citation capabilities demonstrated higher reliability in terms of source transparency, while readability remained a common limitation. LLMs show potential as supplementary patient information tools in orthopaedics; however, further refinement and improved readability are needed before widespread clinical use.
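
For readers unfamiliar with the readability measure named in the Methods, the following is a minimal sketch of the standard Flesch Reading Ease formula, 206.835 − 1.015 × (words per sentence) − 84.6 × (syllables per word), where higher scores indicate easier text. The abstract does not state which tool the authors used to compute FRES; the function names and the vowel-group syllable counter below are illustrative assumptions, and the syllable heuristic in particular will differ slightly from dictionary-based tools.

```python
import re


def count_syllables(word: str) -> int:
    """Estimate syllables by counting vowel groups.

    This is a rough heuristic (an assumption for illustration), not the
    syllable counter used in the study.
    """
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing "e" as silent, except in "-le" endings such as "table".
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text: str) -> float:
    """Return the Flesch Reading Ease Score for a block of English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


if __name__ == "__main__":
    plain = "Rest your arm. Use ice."
    dense = "Lateral epicondylitis necessitates conservative management initially."
    print(round(flesch_reading_ease(plain), 1))
    print(round(flesch_reading_ease(dense), 1))
```

Short sentences made of short words score higher, which is why dense clinical phrasing of the kind LLMs tend to produce pulls the score down.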
