What question did this study set out to answer?

The objective is to evaluate the effectiveness of artificial intelligence chatbots in answering pediatric gastroenterology board-style questions.

April 10, 2026 | Open Access

Can artificial intelligence pass the test? Evaluating chatbot scores on pediatric gastroenterology board‐style questions

Key Points

The study evaluated GPT-3.5, GPT-4o, and Copilot, omitting Claude, Gemini, and specialized medical LLMs such as Med-PaLM 2.
Chatbots were tested on incomplete clinical vignettes because media files were lost in text-only transfer.
Prompting was minimal: a single instruction, “Please answer the following questions.”
46 questions (21%) contained media files that were not transmitted to the chatbots.
Each question was tested only once per model, although LLM outputs vary and multiple runs are recommended for reliable estimates.
Several domain categories contained only 2–6 questions, making performance comparisons unreliable.

Abstract

I read with interest the study by Roberts et al. [1] I identify several issues that warrant consideration. Loss of media files in text-only transfer represents a fundamental limitation. Clinical gastroenterology relies heavily on visual interpretation, including endoscopic images, radiographs, and other imaging modalities. The authors acknowledge that 46 questions (21%) contained media files that were not transmitted to the chatbots. Testing artificial intelligence on incomplete clinical vignettes does not accurately reflect the multimodal nature of board examinations. Modern large language models (LLMs) with native vision capabilities, such as GPT-4V, Gemini Pro Vision, and Claude 3, were available at the time of the study and would have provided a more valid assessment. [2, 3]

Another issue is model selection. The authors evaluated GPT-3.5, GPT-4o, and Copilot, but omitted other major models, including Claude, Gemini, and specialized medical LLMs such as Med-PaLM 2, which have demonstrated impressive performance on medical licensing examinations. [4]

The prompting methodology was minimal. The authors used a simple instruction: “Please answer the following questions.” Research has demonstrated that prompting techniques significantly influence LLM performance on medical reasoning tasks. Thoughtful structured prompts have been shown to significantly improve accuracy. [5, 6]

Additionally, each question was tested only once per model. LLMs exhibit variable outputs for identical inputs. Best practices in LLM evaluation recommend multiple runs to establish reliable estimates. [4] Single-run testing introduces unnecessary measurement uncertainty.

Furthermore, the domain-level analysis suffers from limited statistical power. Several categories contained only 2–6 questions, making performance comparisons unreliable. Conclusions drawn from such small samples are likely to be overgeneralized.

The author has no funding to report. The author declares no conflict of interest.
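The concern about small categories can be made concrete with a confidence interval. The following is a minimal sketch, assuming a 95% Wilson score interval as the measure of uncertainty; the function name `wilson_interval` and the scores used (3 of 4 correct, 150 of 200 correct) are hypothetical illustrations and are not data from the study.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Return the (lower, upper) Wilson score interval for a proportion.

    z=1.96 corresponds to an approximate 95% confidence level.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# Hypothetical scores, NOT data from the study: the same 75% accuracy
# observed on a 4-question category and on a 200-question pool.
small = wilson_interval(3, 4)
large = wilson_interval(150, 200)

for label, (low, high) in (("n=4", small), ("n=200", large)):
    print(f"{label}: 75% correct, 95% interval {low:.2f} to {high:.2f}")
```

With the same observed accuracy, the interval for the four-question category is several times wider than the one for the 200-question pool, which is why a difference between two small categories may reflect nothing more than sampling noise.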


Authors

Zvi Weizman

Journals

JPGN Reports

Institutions

Ben-Gurion University of the Negev
