Abstract

Purpose
The purpose of this study was to evaluate whether Chat Generative Pre‐trained Transformer (ChatGPT; Version 5.1) can reproduce frequentist meta‐analytic calculations with an accuracy comparable to that of established statistical software in orthopaedic research.

Methods
In this methodological comparison study, two previously published orthopaedic meta‐analyses with identical statistical architectures served as reference standards. For the random‐effects models, between‐study variance (τ²) was estimated using the Sidik–Jonkman method and uncertainty was quantified using the Hartung–Knapp adjustment; the common‐effect models assumed τ² = 0. The original data extraction tables were provided to ChatGPT‐5.1, which was instructed to perform the same analyses. ChatGPT‐generated pooled mean differences, confidence intervals and heterogeneity statistics (I², τ², p values) were compared with verified reference results obtained using the meta and metafor packages in R.

Results
Across seven evaluated outcomes, ChatGPT‐5.1 reproduced the direction of effects in all cases. Deviations from the reference meta‐analyses were classified as minor in three outcomes (43%), moderate in one outcome (14%) and major in three outcomes (43%). Agreement was highest in low‐heterogeneity settings, whereas substantial deviations occurred in outcomes with pronounced between‐study heterogeneity, particularly under random‐effects models.

Conclusion
ChatGPT‐5.1 demonstrates an emerging capability to approximate frequentist meta‐analytic calculations, particularly in low‐heterogeneity settings. However, its tendency to underestimate between‐study variability and to deviate in complex random‐effects scenarios limits its reliability as a standalone tool. At present, large language models may support exploratory analyses but cannot fully replace dedicated statistical software for meta‐analyses in orthopaedic research.

Level of Evidence
Level III.
Salzmann et al. studied this question.
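The random‐effects procedure named in the Methods (Sidik–Jonkman estimation of τ² followed by a Hartung–Knapp confidence interval) can be illustrated with a minimal Python sketch. This is not the authors' code, which used the meta and metafor packages in R; the function names, the input values and the hard‐coded t quantile are assumptions made for illustration only, and the Sidik–Jonkman step uses the common textbook form with an unweighted initial variance estimate.

```python
import math


def sidik_jonkman_tau2(y, v):
    """Sidik-Jonkman estimate of between-study variance (illustrative sketch).

    y: study effect estimates (e.g. mean differences)
    v: within-study sampling variances
    """
    k = len(y)
    if k < 2 or k != len(v):
        raise ValueError("need at least two studies with matching variances")
    # Initial estimate: unweighted variance of the effects, divided by k.
    y_bar = sum(y) / k
    tau2_0 = sum((yi - y_bar) ** 2 for yi in y) / k
    if tau2_0 == 0.0:
        # Guard added for this sketch: identical effects imply no heterogeneity.
        return 0.0
    w = [1.0 / (vi + tau2_0) for vi in v]
    mu = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    rss = sum(wi * (yi - mu) ** 2 for wi, yi in zip(w, y))
    return tau2_0 * rss / (k - 1)


def hartung_knapp(y, v, tau2, t_crit):
    """Pooled effect with a Hartung-Knapp adjusted standard error and CI.

    t_crit is the t quantile with k - 1 degrees of freedom, supplied by the
    caller (for example from a statistics library or a t table).
    """
    k = len(y)
    w = [1.0 / (vi + tau2) for vi in v]
    sum_w = sum(w)
    mu = sum(wi * yi for wi, yi in zip(w, y)) / sum_w
    # Hartung-Knapp variance: weighted residual variance scaled by (k - 1).
    q = sum(wi * (yi - mu) ** 2 for wi, yi in zip(w, y)) / ((k - 1) * sum_w)
    se = math.sqrt(q)
    return mu, se, (mu - t_crit * se, mu + t_crit * se)


# Hypothetical data: three studies with equal sampling variances.
effects = [1.0, 2.0, 3.0]
variances = [1.0, 1.0, 1.0]

tau2 = sidik_jonkman_tau2(effects, variances)
# 4.3027 approximates the 97.5% t quantile for 2 degrees of freedom.
mu, se, ci = hartung_knapp(effects, variances, tau2, t_crit=4.3027)
print(tau2, mu, se, ci)
```

Passing the t quantile as an argument keeps the sketch free of third‐party dependencies. Results for real data should be checked against dedicated software, which is the comparison the study itself makes.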