Multimodal Large Language Models (MLLMs) promise seamless integration of vision and language understanding. However, despite their strong performance, recent studies reveal that MLLMs often fail to effectively utilize visual information, frequently relying on textual cues instead. This survey provides a comprehensive analysis of the vision component in MLLMs, covering both application-level and architectural aspects. We investigate critical challenges such as weak spatial reasoning, poor fine-grained visual perception, and suboptimal fusion of visual and textual modalities. Additionally, we explore limitations in current vision encoders, benchmark inconsistencies, and their implications for downstream tasks. By synthesizing recent advancements, we highlight key research opportunities to enhance visual understanding, improve cross-modal alignment, and develop more robust and efficient MLLMs. Our observations emphasize the urgent need to elevate vision to an equal footing with language, paving the way for more reliable and perceptually aware multimodal models.
Anubhooti Jain
Mayank Vatsa
Richa Singh
Indian Institute of Technology Jodhpur
DOI: https://doi.org/10.24963/ijcai.2025/1164