Automated interpretation of 3D CT scans remains challenging due to the complexity of volumetric perception, the heterogeneity of anatomical structures, and the need for clinically faithful report generation. Existing approaches predominantly rely on monolithic vision–language models that combine global perception, region-level reasoning, and language generation within a single inference pipeline, which limits interpretability and extensibility. In this work, we propose VLI-Agent, an agent-based framework for 3D CT interpretation that explicitly decomposes the interpretation process into coordinated, specialized agents. VLI-Agent harmonizes heterogeneous data assumptions across models and orchestrates complementary interpretation agents for global-level summarization, organ-level grounding, and knowledge-enhanced contextual reasoning, enabling flexible integration of diverse vision–language models without modifying their internal architectures. We instantiate VLI-Agent in three configurations of increasing interpretive capacity and evaluate them on the RadGenome-Chest CT validation set. Experimental results show that agent-based collaboration consistently improves clinical effectiveness over a monolithic baseline; the full configuration, which integrates multi-view and knowledge-enhanced interpretation, achieves the highest diagnostic accuracy while maintaining competitive language generation quality. Further analysis reveals that improvements in clinical effectiveness do not always correlate with surface-level linguistic similarity, highlighting the importance of clinically oriented evaluation. Overall, VLI-Agent provides a modular and extensible foundation for automated 3D CT interpretation and a principled pathway toward more accurate, interpretable, and clinically aligned vision–language systems for medical imaging.
Teng et al. (Fri,) studied this question.