Large language models (LLMs) are increasingly applied in medicine; however, their accuracy in guideline-driven, high-stakes specialties such as metabolic and bariatric surgery (MBS) remains uncertain. This study evaluates the performance of ChatGPT-4o, Gemini 2.0 Flash, and DeepSeek-V3 in generating guideline-concordant responses to MBS clinical questions. Thirty standardized, guideline-based MBS questions were presented to each model. Responses were presented in randomized order, anonymized (blinded as Model A/B/C), and evaluated by 93 MBS experts using a validated 0–3 scale (0 = inaccurate; 3 = fully guideline-concordant). A repeated-measures ANOVA with Bonferroni correction tested differences between models; reliability was assessed with Cronbach’s α and intraclass correlation coefficients (ICC). DeepSeek-V3 achieved the highest mean score (2.44 ± 0.40), followed by ChatGPT-4o (1.79 ± 0.46) and Gemini 2.0 Flash (1.63 ± 0.47). Internal consistency was excellent (Cronbach’s α > 0.90), and inter-rater reliability was strong (ICC > 0.88). When mapped against the QUEST evaluation framework, the study addressed the Quality and Understanding dimensions but did not fully capture Expression, Safety, or Trust. DeepSeek-V3 outperformed ChatGPT-4o and Gemini 2.0 Flash in generating guideline-concordant responses in MBS. These results highlight the need for ongoing, domain-focused validation before clinical use.

This is the first randomized, blinded evaluation comparing ChatGPT-4o, Gemini 2.0 Flash, and DeepSeek-V3 in metabolic and bariatric surgery (MBS).
DeepSeek-V3 achieved the highest accuracy, with 80% of responses rated fully guideline-concordant, surpassing ChatGPT-4o and Gemini 2.0 Flash.
Expert agreement was excellent (Cronbach’s α > 0.90; ICC > 0.88), supporting the reliability of the scoring.
The study partially aligns with the QUEST framework: Quality and Understanding were addressed; Expression, Safety, and Trust require further evaluation.
Findings underscore the need for domain-specific validation of LLMs before clinical integration in MBS.
Hany et al. studied this question.
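The abstract reports Cronbach’s α as one of its reliability measures but does not show how it is computed. The following is a minimal sketch of the standard formula, α = k/(k − 1) · (1 − Σ item variances / variance of row totals), not the authors' analysis code. It assumes each rater is treated as an "item" and each rated response as a row; the function name `cronbach_alpha` and the small `ratings` matrix are hypothetical and are not data from the study.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a matrix of scores.

    scores: list of rows, one row per rated response and one column
    per rater (assumption: raters play the role of "items").
    """
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two raters and two responses")

    # Sample variance of each rater's column of scores.
    columns = list(zip(*scores))
    item_variances = [variance(col) for col in columns]

    # Sample variance of the per-response total across raters.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")

    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical 0-3 ratings: 5 responses scored by 3 raters.
ratings = [
    [3, 3, 2],
    [2, 2, 2],
    [1, 1, 0],
    [3, 2, 3],
    [0, 1, 1],
]
alpha = cronbach_alpha(ratings)
print(f"Cronbach's alpha = {alpha:.3f}")
```

Values close to 1 indicate that raters rank the responses consistently. The study's threshold of α > 0.90 would be checked against the value returned for the full 93-rater score matrix.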