Background: Artificial intelligence (AI) tools such as ChatGPT are increasingly used in clinical settings, yet their reliability remains unclear.
Methods: This study compared ChatGPT-4o's responses to clinically relevant orthopedic questions on femoral neck and trochanteric fractures with expert consensus and with Rockwood and Green's textbook. Seven questions were submitted in repeated sessions to assess consistency. Textual similarity was measured using cosine similarity and Bidirectional Encoder Representations from Transformers (BERT)-based models; readability was assessed with readability indices, and agreement with Cohen's Kappa.
Results: ChatGPT demonstrated high internal consistency (mean cosine similarity: 0.94-0.98) but only partial agreement with textbook content (mean similarity: 0.88; Cohen's Kappa: 0.16) and with the expert opinion survey (Cohen's Kappa: 0.16). Discrepancies were noted in the age thresholds for arthroplasty and in the preferred nail length for subtrochanteric fractures, suggesting reliance on outdated sources.
Conclusion: Although ChatGPT provided coherent and readable answers, knowledge gaps persist. This highlights the need for curated, domain-specific models and transparent data sourcing. AI shows promise but should complement, not replace, clinical judgment and up-to-date references.
Ronel et al. (Mon,) studied this question.
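The two agreement measures named in the Methods can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not say how texts were vectorized, so the bag-of-words counts used here are an assumption, and the function names `cosine_similarity` and `cohens_kappa` are hypothetical. Cohen's kappa is computed as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between word-count vectors of two texts.

    Assumption: simple lowercase whitespace tokenization; the study may
    have used a different representation (e.g. TF-IDF or embeddings).
    """
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Dot product over the words the two texts share.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


def cohens_kappa(rater_a, rater_b) -> float:
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters need the same non-zero number of labels")
    n = len(rater_a)
    # p_o: fraction of items on which the two raters agree.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # p_e: chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[k] * freq_b[k] for k in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

For example, `cohens_kappa(["y", "y", "n", "n"], ["y", "n", "n", "n"])` has an observed agreement of 0.75 and a chance agreement of 0.5, which gives a kappa of 0.5. A kappa near 0.16, as reported above, means agreement only slightly better than chance even when surface text similarity is high.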