What question did this study set out to answer?

The study asks whether decomposing automated 3D CT scan interpretation into a modular framework of specialized agents improves on a single monolithic model.

May 14, 2026 · Open Access

VLI-Agent: Vision-Language-Interpretation Agents for Automated 3D CT Understanding

Key Points

Aim: improve automated 3D CT scan interpretation with a modular framework of specialized agents.
The proposed VLI-Agent framework coordinates specialized agents for global-level summarization, organ-level grounding, and knowledge-enhanced contextual reasoning.
VLI-Agent is instantiated in three configurations of increasing interpretive capacity.
All configurations are evaluated on the RadGnome-Chest CT validation set.
Agent-based collaboration consistently improves clinical effectiveness over a monolithic baseline.
The full configuration, which integrates multi-view and knowledge-enhanced interpretation, achieves the highest diagnostic accuracy while maintaining competitive language generation quality.
Improvements in clinical effectiveness do not always correlate with surface-level linguistic similarity, which underscores the need for clinically oriented evaluation.
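
The last point can be illustrated with a toy example. The sketch below is not the study's evaluation protocol: the unigram-overlap score, the keyword-based effusion labeler, and the three report sentences are all hypothetical stand-ins chosen for illustration. It shows how a report that nearly copies the reference wording can be clinically wrong, while a loose paraphrase can be clinically right.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Surface similarity: F1 over overlapping lowercase word tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def asserts_effusion(report: str) -> bool:
    """Toy clinical labeler (assumption): True if 'effusion' is mentioned
    and no negation word appears before it."""
    tokens = report.lower().split()
    if "effusion" not in tokens:
        return False
    before = tokens[: tokens.index("effusion")]
    return not any(tok in {"no", "without"} for tok in before)


# Hypothetical report sentences, invented for this illustration.
reference = "small right pleural effusion is present"
near_copy = "no right pleural effusion is present"
paraphrase = "fluid consistent with effusion on the right side"

for name, text in [("near_copy", near_copy), ("paraphrase", paraphrase)]:
    same_label = asserts_effusion(text) == asserts_effusion(reference)
    print(f"{name}: overlap={unigram_f1(text, reference):.2f}, label matches={same_label}")
```

Here the negated near-copy scores higher on word overlap than the paraphrase, yet only the paraphrase agrees with the reference on the finding. Real clinical-effectiveness metrics rely on far more robust labelers than this keyword check.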

Abstract

Automated interpretation of 3D CT scans remains challenging due to the complexity of volumetric perception, heterogeneous anatomical structures, and the need for clinically faithful report generation. Existing approaches predominantly rely on monolithic vision–language models that combine global perception, region-level reasoning, and language generation within a single inference pipeline, limiting interpretability and extensibility. In this work, we propose VLI-Agent, an agent-based framework for 3D CT interpretation that explicitly decomposes the interpretation process into coordinated specialized agents. VLI-Agent harmonizes heterogeneous data assumptions across models and orchestrates complementary interpretation agents for global-level summarization, organ-level grounding, and knowledge-enhanced contextual reasoning, enabling flexible integration of diverse vision–language models without modifying their internal architectures. We instantiate VLI-Agent in three configurations with increasing interpretive capacity and evaluate them on the RadGnome-Chest CT validation set. Experimental results demonstrate that agent-based collaboration consistently improves clinical effectiveness compared to a monolithic baseline, with the full configuration achieving the highest diagnostic accuracy by integrating multi-view and knowledge-enhanced interpretation while maintaining competitive language generation quality. Further analysis reveals that improvements in clinical effectiveness may not always correlate with surface-level linguistic similarity, highlighting the importance of clinically oriented evaluation. Overall, VLI-Agent provides a modular and extensible foundation for automated 3D CT interpretation, offering a principled pathway toward more accurate, interpretable, and clinically aligned vision–language systems for medical imaging.
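
The decomposition described in the abstract can be pictured as a small orchestration loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: the agent functions, the `Context` class, the configuration names (`base`, `multi_view`, `full`), and the way agents are assigned to each configuration are all hypothetical, since the abstract does not specify them. Each agent is a stub that returns a string where a real system would call a vision-language model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Context:
    """Shared state passed between agents (hypothetical structure)."""
    scan_id: str
    findings: List[str] = field(default_factory=list)


Agent = Callable[[Context], str]


def global_agent(ctx: Context) -> str:
    # Stand-in for global-level summarization of the whole volume.
    return f"[global] overall summary for {ctx.scan_id}"


def organ_agent(ctx: Context) -> str:
    # Stand-in for organ-level grounding of region findings.
    return f"[organ] region-level findings for {ctx.scan_id}"


def knowledge_agent(ctx: Context) -> str:
    # Stand-in for knowledge-enhanced reasoning over earlier outputs.
    return f"[knowledge] context added to {len(ctx.findings)} prior findings"


# Assumed mapping: three configurations of increasing interpretive capacity.
CONFIGURATIONS: Dict[str, List[Agent]] = {
    "base": [global_agent],
    "multi_view": [global_agent, organ_agent],
    "full": [global_agent, organ_agent, knowledge_agent],
}


def interpret(scan_id: str, configuration: str) -> List[str]:
    """Run the agents of one configuration in order, sharing one context."""
    ctx = Context(scan_id)
    for agent in CONFIGURATIONS[configuration]:
        ctx.findings.append(agent(ctx))
    return ctx.findings


for line in interpret("scan-001", "full"):
    print(line)
```

Because each agent only reads and appends to the shared context, a model can be swapped or added by editing the configuration table, which mirrors the abstract's claim that models are integrated without modifying their internal architectures.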
