March 12, 2026 | Open Access

Large Language Model‐Driven Analysis and Report Generation of Endoscopy Videos—A Pilot Study

Key Points

To assess whether a multimodal large language model (MLLM) produces adequate esophagogastroduodenoscopy (EGD) reports and whether a computer-aided detection/diagnosis (CAD) overlay affects performance.
Five complete EGD videos were analyzed with Gemini 2.5 Pro in paired versions: the clean video and the same video with a CAD overlay.
Five blinded endoscopists rated report adequacy in three domains.
MLLM accuracy for landmarks and lesions was assessed by two blinded expert endoscopists using a time-window rule (a detection within ±2 s of the expert-annotated timestamp counted as correct).
MLLM report completeness was rated adequate in 56% of ratings (14/25) with clean video vs. 48% (12/25) with overlay (p = 0.500).
Visualization adequacy ratings were identical for both conditions at 36% (9/25; p = 1.000).
Landmark agreement accuracy was higher with clean video (0.55) than with overlay (0.33; p = 0.029).

Abstract

Multimodal large language models (MLLMs) can automatically analyze clinical video, but evidence from full esophagogastroduodenoscopy (EGD) is limited, and the impact of on-screen computer-aided detection/diagnosis (CAD) overlays on MLLM behavior remains unclear. We tested whether an MLLM can produce clinically adequate EGD reports and whether a CAD overlay changes performance. We analyzed five complete EGD videos with Gemini 2.5 Pro in paired versions: (1) clean video and (2) the same video with a CAD overlay. Five blinded endoscopists rated report adequacy in three domains. MLLM accuracy for landmarks and lesions was further assessed by two blinded expert endoscopists using a time-window rule (a model detection counted as correct if it occurred within ±2 s of the expert-annotated timestamp). In this retrospective pilot study, five archived diagnostic EGD procedures from five patients were available as full-length videos. Across five raters, MLLM Completeness was judged adequate in 56.0% (14/25 ratings) with Clean-Video versus 48.0% (12/25 ratings) with Overlay-Video (p = 0.500). Visualization ratings were identical (36.0%, 9/25 ratings for both; p = 1.000). Lesion characteristics ratings were also identical (16.0%, 4/25 ratings for both; p = 1.000). For landmark agreement, overall MLLM accuracy with Clean-Video versus Overlay-Video was 0.55 (95% CI 0.43-0.67) versus 0.33 (0.23-0.46), p = 0.029; sensitivity was 0.53 (0.40-0.66) versus 0.35 (0.24-0.49), p = 0.122; and specificity was 0.67 (0.35-0.88) versus 0.22 (0.06-0.55), p = 0.125. In this pilot study, Gemini 2.5 Pro demonstrated inadequate performance for clinical EGD reporting. These hypothesis-generating findings suggest that substantial optimization and larger-scale validation are required before deployment.
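
The time-window rule described in the abstract can be illustrated with a minimal Python sketch. This is not the study's scoring code: the `Event` structure, the label names, the timestamps, and the requirement that labels match exactly are all assumptions made for illustration. Only the ±2 s window comes from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    label: str     # hypothetical landmark or lesion name, e.g. "pylorus"
    time_s: float  # timestamp in seconds from the start of the video


def is_correct(detection, annotations, window_s=2.0):
    """Return True if any expert annotation with the same label lies
    within +/- window_s seconds of the model detection."""
    return any(
        a.label == detection.label
        and abs(a.time_s - detection.time_s) <= window_s
        for a in annotations
    )


def score_detections(detections, annotations, window_s=2.0):
    """Apply the time-window rule to every model detection."""
    return [is_correct(d, annotations, window_s) for d in detections]


# Illustrative (made-up) timestamps, not data from the study.
annotations = [Event("z-line", 35.0), Event("pylorus", 120.0)]
detections = [
    Event("z-line", 36.5),    # 1.5 s from the annotation: inside the window
    Event("pylorus", 123.1),  # 3.1 s from the annotation: outside the window
    Event("papilla", 200.0),  # no annotation with this label
]
results = score_detections(detections, annotations)
print(results)
```

In this sketch a detection exactly 2 s from the annotation counts as correct; whether the study treated the boundary as inclusive is not stated in the abstract.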
