November 1, 2025 | Open Access

Enhancing Large Vision-Language Models via Quantized Grounded Reasoning

Abstract

Large Vision-Language Models (LVLMs) have achieved strong results in general visual understanding but remain limited in fine-grained visual reasoning. This paper introduces LVLM-GR, a framework designed to improve detailed visual grounding and robust multimodal reasoning. The proposed Visual Concept Quantizer (VCQ) encodes images into discrete visual tokens through context-aware pooling and a semantic hierarchical codebook, effectively preserving fine-grained semantics. These visual tokens are then aligned with language via a lightweight Grounded Reasoning Adapter (GRA) based on LoRA-tuned adaptation atop a frozen LLaVA 1.5 13B backbone. Experiments on GQA, RefCOCO+, and A-OKVQA show that LVLM-GR achieves superior performance in fine-grained visual understanding, reasoning, and grounding, highlighting its potential for complex multimodal reasoning tasks in material-level and detailed visual analysis.
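
The quantize-then-adapt pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the VCQ codebook is structured or searched, so the two-level nearest-centroid lookup below, and every name in it (`quantize_hierarchical`, `lora_linear`, the toy centroids), is an assumption made for illustration, not the authors' implementation. Context-aware pooling is omitted entirely. The adapter function follows the standard LoRA formulation, in which a frozen weight matrix W is augmented by a low-rank product BA.

```python
from math import dist


def nearest(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (Euclidean)."""
    return min(range(len(codebook)), key=lambda i: dist(vector, codebook[i]))


def quantize_hierarchical(features, coarse, fine):
    """Map each feature vector to a discrete (coarse_id, fine_id) token.

    Assumed layout (hypothetical): `coarse` is a list of centroids, and
    `fine[c]` is the list of child centroids under coarse entry `c`.
    """
    tokens = []
    for vec in features:
        c = nearest(vec, coarse)      # pick the coarse concept first
        f = nearest(vec, fine[c])     # then refine within that concept
        tokens.append((c, f))
    return tokens


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def lora_linear(x, W, A, B, scale=1.0):
    """Standard LoRA forward pass: y = W x + scale * B (A x).

    W (d_out x d_in) stays frozen; only A (r x d_in) and B (d_out x r)
    would be trained. Training itself is not shown here.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


# Toy data, invented for this example.
coarse = [[0.0, 0.0], [10.0, 10.0]]
fine = [
    [[-1.0, 0.0], [1.0, 0.0]],     # children of coarse entry 0
    [[9.0, 10.0], [11.0, 10.0]],   # children of coarse entry 1
]
features = [[0.9, 0.1], [10.8, 9.9], [-0.5, 0.2]]
tokens = quantize_hierarchical(features, coarse, fine)
# tokens == [(0, 1), (1, 1), (0, 0)]
```

When B is all zeros, which is the usual LoRA initialization, `lora_linear` returns exactly the frozen layer's output, so the adapter starts out as a no-op on the backbone.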

Cite This Study

Reed et al. studied this question.

synapsesocial.com/papers/69054ffa1a99e50463de68ec https://doi.org/10.20944/preprints202510.2397.v1