What question did this study set out to answer?

The aim is to evaluate the diagnostic and treatment capabilities of various large language models (LLMs) for disorders of gut-brain interaction (DGBIs).

February 16, 2026Open Access

Poster Session II A333 ARTIFICIAL INTELLIGENCE MODELS DEMONSTRATE PROMISING BUT SUBOPTIMAL PERFORMANCE IN DIAGNOSING AND TREATING DISORDERS OF GUT-BRAIN INTERACTION

Key Points

The aim is to evaluate the diagnostic and treatment capabilities of various large language models (LLMs) for disorders of gut-brain interaction (DGBIs).
Analyzed 68 clinical scenarios based on the Rome IV Multidimensional Clinical Profile (MDCP).
Five LLMs, including ChatGPT and Google Gemini, were used to generate diagnoses and treatment options.
Responses were compared with MDCP standards and expert consensus for accuracy.
Perplexity showed the highest diagnostic accuracy at 74%, while Microsoft Copilot and OpenEvidence had the lowest at 65%.
Treatment accuracy ranged from 53-54% for ChatGPT, Google Gemini, and Microsoft Copilot, but was only 41% for OpenEvidence and Perplexity.
Subgroup analysis indicated the best diagnostic performance for child/adolescent conditions and esophageal disorders, with weaker performance for anorectal disorders.

Abstract

Abstract Background Diagnosing disorders of gut-brain interaction (DGBIs) remains a persistent challenge in gastroenterology and primary care. Large language artificial intelligence models (LLM) have emerged as a potential tool to support clinical decision-making, yet their clinical applicability remains unclear. Aims To compare the diagnostic accuracy and treatment recommendations generated from five LLMs using clinical scenarios from the Rome IV Multidimensional Clinical Profile (MDCP). Methods 68 case scenarios representing various DGBIs were entered into Chat GPT 4.0, Google Gemini 2.5 Pro, Microsoft Copilot, OpenEvidence, and Perplexity. Each model was provided a standardized prompt to produce a diagnosis and treatment options based on the case scenario. Responses were evaluated against MDCP standards and expert consensus for diagnostic and treatment accuracy. For diagnostic accuracy, a response was correct if it matched the MDCP diagnosis or used an accepted historical term. Partially correct responses identified a clinical modifier but an incorrect MDCP diagnosis or only a part of the MDCP diagnosis. Incorrect responses did not match the MDCP diagnosis, identified an incorrect subtype of a correct diagnosis or included unrelated additional diagnoses. Results Diagnostic accuracy was highest with Perplexity (74%), ChatGPT (72%), Google Gemini (72%), and lowest with Microsoft Copilot (65%) and OpenEvidence (65%). Regarding treatment options, ChatGPT, Google Gemini and Microsoft Copilot each generated appropriate treatment options in between 53 to 54% of cases, compared with 41% each for OpenEvidence and Perplexity. When a subgroup analysis was conducted based on the MDCP diagnostic category, LLMs had the highest accuracy diagnosing child/adolescent conditions or esophageal disorders and poorest accuracy diagnosing anorectal disorders. For treatment options, LLMs had the highest accuracy for neonate/toddler disorders and gallbladder/sphincter of Oddi disorders and the poorest accuracy for gastroduodenal disorders. Statistical analysis revealed no significant differences between LLMs for diagnostic (χ2(4) = 4.26, p = 0.372) or treatment accuracy (χ2(4) = 8.069, p = 0.089). Conclusions LLMs demonstrated moderately accurate performance in determining the diagnosis of and treatment options for DGBIs. Diagnostic errors and omission of several options in the treatment of DGBIs highlights the risk of premature clinical adoption and need for further validation on strategies for clinical integration. Funding Agencies None

Read Full Paperexternally

Mark Helpful

Bookmark

Relay

View Full Paper