Building a powerful vision-language model (VLM) necessitates a holistic system design encompassing model architecture, data curation, and training paradigms. In this paper, we present a longitudinal study of the InternVL series (v1.0 to v3.0), distilling its technical evolution into a systematic framework for constructing large-scale, high-performance VLMs. This framework is characterized by three pivotal technical shifts: 1) Perceptual Scaling: We develop a 6-billion-parameter vision encoder (InternViT-6B) and introduce a VLM-oriented alignment strategy, which bridges the representation gap between vision and language while enabling fine-grained, high-resolution perception. 2) Multimodal Alignment Scaling: We implement a multimodal dynamic high-resolution (mDHR) mechanism that provides a unified interface for single-image, multi-image, and video inputs. Combined with massive data curation and multi-scale model expansion, this shift pushes the performance frontier through systematic scaling. 3) Native Multimodal Pre-training: We transition from decoupled multi-stage tuning to a native multimodal continual pre-training paradigm. By jointly optimizing on interleaved multimodal and text-only data, the model achieves a deep synergy that preserves linguistic proficiency while internalizing visual-world knowledge. Extensive evaluations across a broad range of benchmarks demonstrate that models built upon this framework achieve state-of-the-art performance among open-source VLMs and rival leading proprietary systems. By formalizing these design principles, we offer a reproducible roadmap for future multimodal research. Code and models are available at https://github.com/OpenGVLab/InternVL.
Chen et al. (Thu,) studied this question.