In this work, we introduce Mini-Gemini, a simple and effective framework for enhancing multi-modality Vision Language Models (VLMs). Although advances in VLMs have enabled basic visual dialog and reasoning, a performance gap persists relative to advanced models such as GPT-4 and Gemini. We propose a novel approach that narrows this gap by mining the potential of VLMs for better performance across various cross-modal tasks. The approach addresses three questions: (1) How can high-resolution visual tokens improve image understanding without lengthening the token sequence? (2) How can high-quality data improve the reasoning and generation abilities of VLMs? (3) How can the gap between open-source VLMs and proprietary models on reasoning-driven generation be closed? In particular, to enhance visual tokens, we propose to utilize an additional visual encoder for high-resolution refinement without increasing the visual token count. We further construct a high-quality dataset that promotes precise image comprehension and reasoning-based generation, expanding the operational scope of current VLMs. In general, Mini-Gemini further mines the potential of VLMs and empowers current frameworks with image understanding, reasoning, and generation simultaneously. The proposed model supports a series of dense and MoE Large Language Models (LLMs) from 2B to 34B. It attains 80.6% accuracy on the MMB benchmark (+5.4 vs. Gemini Pro) and 74.1% on TextVQA (+4.6 vs. LLaVA-NeXT), achieving leading performance in several zero-shot benchmarks and even surpassing the developed private models. Furthermore, experiments show that Mini-Gemini improves consistently with stronger LLMs, visual encoders, and data.
Yanwei Li
Yuechen Zhang
Chengyao Wang
IEEE Transactions on Pattern Analysis and Machine Intelligence
University of Hong Kong
Chinese University of Hong Kong
Hong Kong University of Science and Technology
Li et al. studied this question.
www.synapsesocial.com/papers/692b943e1d383f2b2a378aee — DOI: https://doi.org/10.1109/tpami.2025.3637265