What does this research mean for the field?

ChatGPT 5.1 can approximate frequentist meta-analytic calculations produced by established statistical software in orthopaedic research, particularly in low-heterogeneity settings, but it is not yet reliable as a standalone tool.

What question did this study set out to answer?

This study aims to assess whether ChatGPT 5.1 can replicate frequentist meta-analytic calculations accurately.

March 13, 2026 | Open Access

Large language models are comparable with commonly used statistical software: A validation of GPT 5.1 for frequentist meta‐analysis in orthopaedics

Key Points

The study assessed whether ChatGPT 5.1 can accurately replicate frequentist meta-analytic calculations.
Two previously published orthopaedic meta-analyses served as reference standards.
Between-study variance (τ²) was estimated using the Sidik–Jonkman method.
Uncertainty in the random-effects models was quantified using the Hartung–Knapp adjustment.
ChatGPT-5.1 received the original data extraction tables and was instructed to perform the same analyses.
ChatGPT-generated results were compared with verified reference results from the R packages meta and metafor.
ChatGPT-5.1 reproduced the direction of effect in all seven evaluated outcomes.
Deviations from reference results were classified as minor in three outcomes (43%), moderate in one outcome (14%) and major in three outcomes (43%).
Agreement was highest in low-heterogeneity settings; substantial deviations occurred in outcomes with pronounced between-study heterogeneity, particularly under random-effects models.
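
To make the two named methods concrete, here is a minimal Python sketch of the Sidik–Jonkman estimate of τ² followed by a Hartung–Knapp confidence interval for the pooled effect. It is an illustration under stated assumptions, not the study's code: the function names are hypothetical, the inputs are per-study effect estimates and their sampling variances, and the t critical value for k − 1 degrees of freedom is passed in by the caller so that the sketch needs only the standard library.

```python
import math
from typing import Sequence, Tuple


def sidik_jonkman_tau2(effects: Sequence[float], variances: Sequence[float]) -> float:
    """Sidik-Jonkman estimate of between-study variance (hypothetical helper)."""
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two studies with matching variances")
    # Initial estimate: variance of the effects around their unweighted mean.
    mean_unweighted = sum(effects) / k
    tau2_init = sum((y - mean_unweighted) ** 2 for y in effects) / k
    if tau2_init == 0.0:
        return 0.0  # all study effects identical
    # Re-weight with the initial estimate, then rescale the weighted residuals.
    weights = [1.0 / (v + tau2_init) for v in variances]
    mu = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    rss = sum(w * (y - mu) ** 2 for w, y in zip(weights, effects))
    return tau2_init * rss / (k - 1)


def hartung_knapp_pool(
    effects: Sequence[float],
    variances: Sequence[float],
    tau2: float,
    t_crit: float,
) -> Tuple[float, float, float]:
    """Random-effects pooled estimate with a Hartung-Knapp interval.

    t_crit is the t quantile for k - 1 degrees of freedom, supplied by the
    caller (an assumption of this sketch, to avoid a SciPy dependency).
    """
    k = len(effects)
    weights = [1.0 / (v + tau2) for v in variances]
    sum_w = sum(weights)
    mu = sum(w * y for w, y in zip(weights, effects)) / sum_w
    # Hartung-Knapp variance: weighted residual sum of squares over (k - 1) * sum of weights.
    var_hk = sum(w * (y - mu) ** 2 for w, y in zip(weights, effects)) / ((k - 1) * sum_w)
    se = math.sqrt(var_hk)
    return mu, mu - t_crit * se, mu + t_crit * se
```

For a 95% interval, t_crit is the 0.975 quantile of the t distribution with k − 1 degrees of freedom, taken from a table or from a statistics library. With few studies this quantile is large, which is why the adjustment produces wider intervals than a normal approximation.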

Abstract

Purpose: The purpose of this study was to evaluate whether Chat Generative Pre‐trained Transformer (ChatGPT; Version 5.1) can reproduce frequentist meta‐analytic calculations with an accuracy comparable to established statistical software in orthopaedic research.

Methods: In this methodological comparison study, data from two previously published orthopaedic meta‐analyses with identical statistical architectures were used as reference standards. Between‐study variance (τ²) was estimated using the Sidik–Jonkman method and uncertainty was quantified using the Hartung–Knapp adjustment for the random‐effects models, while common‐effect models assume τ² = 0. Original data extraction tables were provided to ChatGPT‐5.1, which was instructed to perform the same analyses. ChatGPT‐generated pooled mean differences, confidence intervals and heterogeneity statistics (I², τ², p values) were compared with verified reference results obtained using the meta and metafor packages in R.

Results: Across seven evaluated outcomes, ChatGPT‐5.1 reproduced the direction of effects in all cases. Deviations compared with reference meta‐analyses were classified as minor in three outcomes (43%), moderate in one outcome (14%) and major in three outcomes (43%). Agreement was highest in low‐heterogeneity settings, whereas substantial deviations occurred in outcomes with pronounced between‐study heterogeneity, particularly under random‐effects models.

Conclusion: ChatGPT‐5.1 demonstrates emerging capability to approximate frequentist meta‐analytic calculations, particularly in low‐heterogeneity settings. However, its tendency to underestimate between‐study variability and to deviate in complex random‐effects scenarios limits its reliability as a standalone tool. At present, large language models may support exploratory analyses but cannot fully replace dedicated statistical software for meta‐analyses in orthopaedic research.

Level of Evidence: Level III.
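
The heterogeneity statistics named in the abstract can be illustrated with a short Python sketch of the common-effect model (τ² = 0), Cochran's Q and I². This is a sketch under assumptions rather than the study's procedure: the function name is hypothetical, I² is computed with the usual (Q − df)/Q definition truncated at zero, and the p value for Q (a chi-square tail probability) is left out because it needs a statistics library.

```python
from typing import Sequence, Tuple


def common_effect_heterogeneity(
    effects: Sequence[float], variances: Sequence[float]
) -> Tuple[float, float, float]:
    """Return (pooled estimate, Cochran's Q, I² in percent). Hypothetical helper."""
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two studies with matching variances")
    # Common-effect model: inverse-variance weights with tau² fixed at 0.
    weights = [1.0 / v for v in variances]
    mu = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (y - mu) ** 2 for w, y in zip(weights, effects))
    df = k - 1
    # I²: share of total variation attributed to heterogeneity, floored at 0.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return mu, q, i2
```

A result such as I² near 0% corresponds to the low-heterogeneity settings in which agreement was reported to be highest; large values flag outcomes where the choice of τ² estimator matters most.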
