Large Vision-Language Models (LVLMs) have achieved strong results in general visual understanding but remain limited in fine-grained visual reasoning. This paper introduces LVLM-GR, a framework designed to improve detailed visual grounding and robust multimodal reasoning. The proposed Visual Concept Quantizer (VCQ) encodes images into discrete visual tokens through context-aware pooling and a semantic hierarchical codebook, effectively preserving fine-grained semantics. These visual tokens are then aligned with language via a lightweight Grounded Reasoning Adapter (GRA) based on LoRA-tuned adaptation atop a frozen LLaVA 1.5 13B backbone. Experiments on GQA, RefCOCO+, and A-OKVQA show that LVLM-GR achieves superior performance in fine-grained visual understanding, reasoning, and grounding, highlighting its potential for complex multimodal reasoning tasks in material-level and detailed visual analysis.
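The abstract does not specify how the VCQ assigns discrete tokens, so the following is only a minimal sketch of one common way a two-level (coarse-to-fine) codebook lookup can work, assuming nearest-centroid assignment under Euclidean distance. The function names `nearest` and `quantize_hierarchical`, the two-level layout, and all numeric values are hypothetical illustrations, not the authors' implementation. The context-aware pooling step is omitted.

```python
import math

def nearest(vec, codebook):
    """Return the index of the codebook entry closest to vec (Euclidean)."""
    return min(range(len(codebook)), key=lambda i: math.dist(vec, codebook[i]))

def quantize_hierarchical(features, coarse, fine):
    """Map each feature vector to a (coarse_id, fine_id) token pair.

    coarse: list of coarse centroids (hypothetical top level).
    fine:   fine[c] lists the fine centroids under coarse entry c.
    """
    tokens = []
    for vec in features:
        c = nearest(vec, coarse)      # pick the coarse concept first
        f = nearest(vec, fine[c])     # then refine within that concept
        tokens.append((c, f))
    return tokens

# Toy 2-D data; every value here is made up for illustration.
coarse = [(0.0, 0.0), (10.0, 10.0)]
fine = [
    [(-1.0, 0.0), (1.0, 0.0)],    # children of coarse entry 0
    [(9.0, 10.0), (11.0, 10.0)],  # children of coarse entry 1
]
features = [(0.9, 0.1), (9.2, 10.3), (-0.5, -0.2)]
tokens = quantize_hierarchical(features, coarse, fine)
print(tokens)  # [(0, 1), (1, 0), (0, 0)]
```

In a sketch like this, the coarse index carries the broad concept and the fine index carries the detail within it, which is one plausible reading of how a hierarchical codebook could retain fine-grained semantics.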
Reed et al. studied this question.
www.synapsesocial.com/papers/69054ffa1a99e50463de68ec — DOI: https://doi.org/10.20944/preprints202510.2397.v1
Elijah Reed
Jeremy Barnes