What question did this study set out to answer?

This study evaluates how well ChatGPT provides consistent and accurate responses to orthopedic trauma questions compared to experts and a textbook.

June 4, 2026 | Open Access

ChatGPT in Orthopedic Trauma: Consistency, Accuracy, and Agreement With Textbook and Expert Opinion

Key Points

Compared ChatGPT's responses on orthopedic questions with expert consensus and a standard textbook.
Measured textual similarity with cosine similarity and BERT-based models, and reported readability indices alongside.
Employed Cohen's Kappa to assess agreement between ChatGPT, expert opinions, and textbook content.
ChatGPT showed high internal consistency with mean cosine similarity of 0.94-0.98.
ChatGPT agreed only partially with textbook content (mean similarity 0.88; Cohen's Kappa 0.16) and with the expert opinion survey (Cohen's Kappa 0.16).
Discrepancies appeared in age thresholds for arthroplasty and in preferred nail length for subtrochanteric fractures, suggesting reliance on outdated sources.
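
As a rough illustration of the internal-consistency measure, the sketch below computes the mean pairwise cosine similarity across repeated answers to one question. It assumes simple word-count vectors; the summary does not say how the study vectorized the text, and the function names here (`term_counts`, `cosine_similarity`, `mean_pairwise_similarity`) are hypothetical, not taken from the paper.

```python
import math
import re
from collections import Counter
from itertools import combinations


def term_counts(text):
    """Lowercase the text and count word tokens (a plain bag-of-words vector).

    Assumption: the study's actual text representation is not stated here.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction; treat as dissimilar
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(responses):
    """Average cosine similarity over every pair of repeated responses."""
    vectors = [term_counts(r) for r in responses]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two responses")
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)
```

Passing the answers collected for one question in repeated sessions to `mean_pairwise_similarity` gives a single score between 0 and 1 for count vectors. Values near 1 mean the sessions used largely the same wording, which says nothing about whether that wording is clinically correct.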

Abstract

Background: Artificial intelligence (AI) tools such as ChatGPT are increasingly used in clinical settings, yet their reliability remains unclear.

Methods: This study compared ChatGPT-4o's responses to clinically relevant orthopedic questions on femoral neck and trochanteric fractures with expert consensus and Rockwood and Green's textbook. Seven questions were submitted in repeated sessions to assess consistency. Textual similarity was measured using cosine similarity and Bidirectional Encoder Representations from Transformers (BERT)-based models, alongside readability indices and Cohen's Kappa for agreement.

Results: ChatGPT demonstrated high internal consistency (mean cosine similarity: 0.94-0.98) but only partial agreement with textbook content (mean similarity: 0.88; Cohen's Kappa: 0.16) and expert opinion survey (Cohen's Kappa: 0.16). Discrepancies were noted in age thresholds for arthroplasty and preferred nail length in subtrochanteric fractures, suggesting reliance on outdated sources.

Conclusion: Although ChatGPT provided coherent and readable answers, knowledge gaps persist. This highlights the need for curated, domain-specific models and transparent data sourcing. AI shows promise but should complement, not replace, clinical judgment and up-to-date references.
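
To make the agreement statistic concrete, here is a minimal sketch of Cohen's Kappa for two raters who each assign one categorical label per question. It assumes the answers have already been coded into categories; how the study coded ChatGPT's, the textbook's, and the experts' answers is not described in this summary, and the function name `cohens_kappa` is illustrative.

```python
import math
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa: (p_o - p_e) / (1 - p_e) for two equal-length label lists.

    p_o is the observed proportion of matching labels; p_e is the agreement
    expected by chance from each rater's own label frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in labels) / (n * n)
    if expected == 1.0:
        # Both raters used the same single label throughout; kappa is 0/0.
        return float("nan")
    return (observed - expected) / (1.0 - expected)
```

Kappa is 1 for perfect agreement and 0 when agreement is no better than chance. On the commonly cited Landis and Koch scale, values from 0.00 to 0.20 are labeled "slight" agreement, which is where the reported 0.16 falls.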
