Multimodal Large Language Models (MLLMs) promise seamless integration of vision and language understanding. However, despite their strong performance, recent studies reveal that MLLMs often fail to effectively utilize visual information, frequently relying on textual cues instead. This survey provides a comprehensive analysis of the vision component in MLLMs, covering both application-level and architectural aspects. We investigate critical challenges such as weak spatial reasoning, poor fine-grained visual perception, and suboptimal fusion of visual and textual modalities. Additionally, we explore limitations in current vision encoders, benchmark inconsistencies, and their implications for downstream tasks. By synthesizing recent advancements, we highlight key research opportunities to enhance visual understanding, improve cross-modal alignment, and develop more robust and efficient MLLMs. Our observations emphasize the urgent need to elevate vision to an equal footing with language, paving the way for more reliable and perceptually aware multimodal models.
Anubhooti Jain
Mayank Vatsa
Richa Singh
Indian Institute of Technology Jodhpur
DOI: https://doi.org/10.24963/ijcai.2025/1164