What question did this study set out to answer?

This study aims to evaluate the accuracy of different AI models in providing guideline-based responses to metabolic and bariatric surgery questions.

March 28, 2026 · Open Access

Accuracy and Knowledge Base Evaluation of ChatGPT-4o, Gemini-2.0-Flash, and DeepSeek-V3 in Metabolic and Bariatric Surgery: an Expert-Rated Blinded Study

Key Points

Thirty standardized MBS questions were presented to each AI model.
Responses were anonymized and rated by 93 MBS experts on a 0–3 scale.
Statistical analysis included repeated-measures ANOVA and reliability assessments via Cronbach’s α and ICC.
DeepSeek-V3 achieved the highest mean score of 2.44, followed by ChatGPT-4o at 1.79 and Gemini 2.0 Flash at 1.63 (p < 0.001).
Eighty percent of DeepSeek-V3 responses were fully guideline-concordant, compared with 0% for ChatGPT-4o and 3.3% for Gemini 2.0 Flash.
Reliability assessments demonstrated excellent internal consistency (α > 0.90) and strong inter-rater reliability (ICC > 0.88).
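
As a quick sanity check on the concordance percentages, the sketch below converts them back into response counts. It assumes the percentages are taken over the 30 questions per model, which the summary implies but does not state outright; the helper name `implied_count` is hypothetical.

```python
N_QUESTIONS = 30  # from the study: thirty questions per model

# Reported share of fully guideline-concordant responses, in percent.
reported_pct = {
    "DeepSeek-V3": 80.0,
    "ChatGPT-4o": 0.0,
    "Gemini 2.0 Flash": 3.3,
}


def implied_count(pct: float, n: int = N_QUESTIONS) -> int:
    """Nearest whole number of responses consistent with a reported percentage.

    Assumption: the percentage uses n (the number of questions) as its denominator.
    """
    return round(pct / 100 * n)


counts = {model: implied_count(pct) for model, pct in reported_pct.items()}

for model, count in counts.items():
    recomputed = 100 * count / N_QUESTIONS
    print(f"{model}: {count}/{N_QUESTIONS} responses ({recomputed:.1f}%)")
```

Under that assumption, the reported figures correspond to 24, 0, and 1 of 30 responses, and each count rounds back to the published percentage.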

Abstract

Large language models (LLMs) are increasingly applied in medicine; however, their accuracy in guideline-driven, high-stakes specialties, such as metabolic and bariatric surgery (MBS), remains uncertain. This study evaluates the performance of ChatGPT-4o, Gemini 2.0 Flash, and DeepSeek-V3 in generating guideline-concordant responses to MBS clinical questions. Thirty standardized, guideline-based MBS questions were presented to each model. Responses were randomized in order, anonymized (blinded as Model A/B/C), and evaluated by 93 MBS experts using a validated 0–3 scale (0 = inaccurate; 3 = fully guideline-concordant). A repeated-measures ANOVA with Bonferroni correction tested model differences; reliability was assessed with Cronbach’s α and intraclass correlation coefficients (ICC). DeepSeek-V3 achieved the highest mean score (2.44 ± 0.40), followed by ChatGPT-4o (1.79 ± 0.46) and Gemini 2.0 Flash (1.63 ± 0.47) (p < 0.001). Eighty percent of DeepSeek-V3 responses were rated fully guideline-concordant, compared with 0% for ChatGPT-4o and 3.3% for Gemini 2.0 Flash. Internal consistency was excellent (Cronbach’s α > 0.90), and inter-rater reliability was strong (ICC > 0.88). When mapped against the QUEST evaluation framework, the study addressed Quality and Understanding but did not fully capture Expression, Safety, or Trust dimensions. DeepSeek-V3 outperformed ChatGPT-4o and Gemini 2.0 Flash in generating guideline-concordant responses in MBS. These results highlight the need for ongoing, domain-focused validation before clinical use.

This is the first randomized, blinded evaluation comparing ChatGPT-4o, Gemini 2.0 Flash, and DeepSeek-V3 in metabolic and bariatric surgery (MBS).
DeepSeek-V3 achieved the highest accuracy, with 80% of responses rated fully guideline-concordant, surpassing ChatGPT-4o and Gemini 2.0 Flash.
Expert agreement was excellent (Cronbach’s α > 0.90; ICC > 0.88), reinforcing the reliability of scoring.
The study partially aligns with the QUEST framework: Quality and Understanding were addressed; Expression, Safety, and Trust require further evaluation.
Findings underscore the need for domain-specific validation of LLMs before clinical integration in MBS.
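
For readers unfamiliar with the internal-consistency statistic named above, the following is a minimal sketch of Cronbach’s α using only the Python standard library. It is not the authors’ analysis code: the function name, the data layout (one row per rated response, one column per rater), and the toy ratings are all assumptions made for illustration, and the summary does not say whether raters or questions were treated as the "items".

```python
from statistics import variance


def cronbach_alpha(ratings: list[list[float]]) -> float:
    """Cronbach's alpha for a rectangular table of scores.

    Hypothetical layout: ratings[i][j] is the score that rater j gave to
    response i. Columns play the role of "items" in the usual formula:
        alpha = k / (k - 1) * (1 - sum(item variances) / variance(row totals))
    """
    n_rows = len(ratings)
    k = len(ratings[0]) if ratings else 0
    if n_rows < 2 or k < 2:
        raise ValueError("need at least 2 rows and 2 columns")
    if any(len(row) != k for row in ratings):
        raise ValueError("all rows must have the same number of columns")

    # Sample variance of each column (rater) across rows (responses).
    item_vars = [variance([row[j] for row in ratings]) for j in range(k)]
    # Sample variance of the per-row total score.
    total_var = variance([sum(row) for row in ratings])
    if total_var == 0:
        raise ValueError("row totals have zero variance; alpha is undefined")

    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Invented toy data on the 0-3 scale: 5 responses rated by 3 raters.
# These numbers are NOT from the study.
toy_ratings = [
    [3, 3, 2],
    [2, 2, 2],
    [1, 1, 0],
    [3, 2, 3],
    [0, 1, 1],
]

alpha = cronbach_alpha(toy_ratings)
print(f"Cronbach's alpha (toy data): {alpha:.3f}")
```

Values close to 1 indicate that the columns move together across rows, which is the sense in which α > 0.90 is described as excellent internal consistency.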
