Multimodal large language models (MLLMs) can automatically analyze clinical video, but evidence from full-length esophagogastroduodenoscopy (EGD) is limited, and the impact of on-screen computer-aided detection/diagnosis (CAD) overlays on MLLM behavior remains unclear. We tested whether an MLLM can produce clinically adequate EGD reports and whether a CAD overlay changes its performance. In this retrospective pilot study, five archived diagnostic EGD procedures from five patients were available as full-length videos. We analyzed each video with Gemini 2.5 Pro in paired versions: (1) the clean video and (2) the same video with a CAD overlay. Five blinded endoscopists rated report adequacy in three domains (Completeness, Visualization, and Lesion characteristics). MLLM accuracy for landmarks and lesions was further assessed by two blinded expert endoscopists using a time-window rule: a model detection counted as correct if it occurred within ±2 s of the expert-annotated timestamp. Across five raters, MLLM Completeness was judged adequate in 56.0% (14/25 ratings) with Clean-Video versus 48.0% (12/25 ratings) with Overlay-Video (p = 0.500). Visualization adequacy was identical in both conditions (36.0%, 9/25 ratings; p = 1.000), as was Lesion characteristics adequacy (16.0%, 4/25 ratings; p = 1.000). For landmark agreement, overall MLLM accuracy with Clean-Video versus Overlay-Video was 0.55 (95% CI 0.43-0.67) versus 0.33 (0.23-0.46), p = 0.029; sensitivity was 0.53 (0.40-0.66) versus 0.35 (0.24-0.49), p = 0.122; and specificity was 0.67 (0.35-0.88) versus 0.22 (0.06-0.55), p = 0.125. In this pilot study, Gemini 2.5 Pro demonstrated inadequate performance for clinical EGD reporting, and the CAD overlay was associated with lower overall landmark accuracy. These hypothesis-generating findings suggest that substantial optimization and larger-scale validation are required before deployment.
Massimi et al. studied this question.
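The ±2 s time-window rule described in the abstract can be illustrated with a minimal Python sketch. The function names, the matching strategy (any detection within the window counts as a hit for a given expert annotation), and the timestamps are assumptions made for illustration; they are not taken from the study, whose exact matching procedure is not given here.

```python
from typing import List, Sequence

WINDOW_S = 2.0  # tolerance stated in the abstract: +/- 2 seconds


def is_within_window(detected_s: float, reference_s: float,
                     window_s: float = WINDOW_S) -> bool:
    """Return True if a detection lies within +/- window_s of a reference time."""
    return abs(detected_s - reference_s) <= window_s


def score_detections(detections_s: Sequence[float],
                     references_s: Sequence[float],
                     window_s: float = WINDOW_S) -> List[bool]:
    """For each expert-annotated timestamp, report whether any model
    detection falls inside the tolerance window (assumed matching rule)."""
    return [
        any(is_within_window(d, r, window_s) for d in detections_s)
        for r in references_s
    ]


# Hypothetical timestamps in seconds, for illustration only.
expert_times = [12.0, 47.5, 90.0]
model_times = [13.5, 51.0, 88.0]

hits = score_detections(model_times, expert_times)
# hits == [True, False, True]: 13.5 is 1.5 s from 12.0, nothing lies
# within 2 s of 47.5, and 88.0 is exactly 2 s from 90.0 (boundary included).
```

Whether the boundary case (exactly 2 s) counts as correct is itself an assumption here; a stricter reading would use `<` instead of `<=`.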