3D reasoning is crucial in areas such as robotics and autonomous driving. Because 3D data is costly to acquire, some recent methods enable LLMs to perform 3D reasoning from multi-view images, thereby transferring the powerful 2D reasoning capabilities of LLMs to 3D environments. However, these methods face two challenges: they either use redundant views, many of which are irrelevant to the question, or rely on globally aggregated multi-view representations, which lose fine-grained vision-language correlations. To tackle these challenges, we propose 3DMulti-LLM, which consists mainly of three components: a COT selector, a question-guided fusion block, and pre-trained LLMs. First, the COT selector leverages the chain-of-thought reasoning capabilities of LLMs to identify question-related multi-view images, which lets 3DMulti-LLM eliminate a substantial amount of interference from unnecessary viewpoints. Second, the question-guided fusion block integrates multi-view features via question-guided interaction among the viewpoints. Finally, the pre-trained LLMs reason about 3D scenes directly from the multi-view features. Notably, our approach understands the 3D scene solely through multi-view images, without requiring point cloud input or additional 3D feature extraction. In our experiments, 3DMulti-LLM surpasses existing 3D-input-free methods by +12.2% and +7.1% on the ScanQA and 3DMV-VQA datasets, respectively.
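The select-then-fuse idea described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the abstract gives no architectural details, so everything here is an assumption. In particular, `select_views` replaces the LLM-based COT selector with a plain cosine-similarity ranking, and `question_guided_fusion` is a simple question-conditioned attention pooling that stands in for the fusion block. The function names, feature shapes, and the parameter `k` are all hypothetical.

```python
import numpy as np


def select_views(view_feats, question_feat, k):
    """Keep the k views most similar to the question.

    ASSUMPTION: cosine similarity is only a placeholder for the paper's
    LLM chain-of-thought selector. Inputs must have non-zero norm.
    view_feats: (num_views, dim), question_feat: (dim,)
    """
    v = view_feats / np.linalg.norm(view_feats, axis=1, keepdims=True)
    q = question_feat / np.linalg.norm(question_feat)
    scores = v @ q
    # Indices of the top-k scores, returned in original view order.
    return np.sort(np.argsort(-scores)[:k])


def question_guided_fusion(view_feats, question_feat):
    """Pool view features with softmax weights derived from the question.

    ASSUMPTION: scaled dot-product attention with the question as the query;
    the paper's fusion block may differ substantially.
    Returns (fused_feature, weights).
    """
    dim = view_feats.shape[1]
    logits = view_feats @ question_feat / np.sqrt(dim)
    logits = logits - logits.max()  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum()
    return weights @ view_feats, weights


if __name__ == "__main__":
    # Three toy 2-D "view features" and one toy "question feature".
    views = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    question = np.array([1.0, 0.0])

    keep = select_views(views, question, k=2)
    fused, weights = question_guided_fusion(views[keep], question)
    print("kept views:", keep)
    print("weights:", weights)
    print("fused feature:", fused)
```

In a full system, the fused feature (or the per-view features after interaction) would be passed to the pre-trained LLM together with the question; that step is omitted because the abstract does not specify the interface.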
Boqiang Xu
Jianmin Wu
Wei Zhang
IEEE Transactions on Image Processing
Chinese Academy of Sciences
Xu et al. (Thu,) studied this question.
www.synapsesocial.com/papers/69e5c2d003c2939914028c82 — DOI: https://doi.org/10.1109/tip.2026.3683282