In this work, we introduce Mini-Gemini, a simple and effective framework for enhancing multi-modality Vision Language Models (VLMs). Although recent VLMs support basic visual dialog and reasoning, a performance gap persists relative to advanced models such as GPT-4 and Gemini. We narrow this gap by mining the potential of VLMs for better performance across cross-modal tasks. Our approach addresses three questions: (1) How can high-resolution visual tokens improve image understanding without lengthening the token sequence? (2) How can high-quality data improve the reasoning and generation abilities of VLMs? (3) How can the gap between open-source VLMs and proprietary models on reasoning-driven generation be closed? To enhance visual tokens, we use an additional visual encoder for high-resolution refinement without increasing the visual token count. We further construct a high-quality dataset that promotes precise image comprehension and reasoning-based generation, expanding the operational scope of current VLMs. Overall, Mini-Gemini equips current frameworks with image understanding, reasoning, and generation simultaneously. It supports a series of dense and MoE Large Language Models (LLMs) from 2B to 34B, and attains 80.6% accuracy on the MMB benchmark (+5.4 vs. Gemini Pro) and 74.1% on TextVQA (+4.6 vs. LLaVA-NeXT), achieving leading performance on several zero-shot benchmarks and even surpassing the developed private models. In our experiments, Mini-Gemini also improves consistently with a stronger LLM, visual encoder, and data.
Li et al. studied this question.