What question did this study set out to answer?

The review aims to trace the evolution of vision-language models and identify changes in architecture and deployment practices.

June 4, 2026 · Open Access

Vision-Language Models in the Era of Multimodal Foundation Models

Key Points

The review analyzes twenty influential papers and technical reports published mainly between 2021 and 2025.
The models covered include CLIP, ALIGN, ALBEF, BLIP, Flamingo, LLaVA, GPT-4V, Gemini and others.
It examines changes in architecture, data construction, training objectives, evaluation practice and deployment challenges.
Early models used contrastive image-text alignment to learn transferable visual representations.
Recent models connect strong vision encoders with large language models via lightweight adapters and query transformers.
Unresolved problems include visual hallucination, weak spatial grounding, and high computational cost.

Abstract

Vision-language models (VLMs) have moved from task-specific image-text encoders to general multimodal foundation models capable of visual reasoning, captioning, retrieval, question answering, grounding, optical character recognition, instruction following and multi-image interaction. This review analyzes twenty influential papers and technical reports published mainly between 2021 and 2025, including CLIP, ALIGN, ALBEF, BLIP, Flamingo, CoCa, PaLI, BLIP-2, LLaVA, MiniGPT-4, InstructBLIP, Kosmos-2, Qwen-VL, CogVLM, GPT-4V, Gemini, InternVL, MM1, PaliGemma and Molmo/PixMo. The purpose is to identify how the field has changed in architecture, data construction, training objectives, evaluation practice and deployment challenges. The review shows that early contrastive alignment created transferable visual representations, while recent models increasingly connect strong vision encoders with large language models through lightweight adapters, query transformers, visual experts, instruction tuning and interleaved multimodal data. Current progress is driven not only by model scale, but also by the quality of captions, grounding data, OCR-rich samples, instruction datasets and safety evaluation. The main unresolved problems remain visual hallucination, weak spatial grounding, limited transparency of training data, high computational cost, benchmark saturation, and insufficient reliability in high-stakes domains. The paper concludes that the next stage of VLM research will depend on data-centric training, verifiable grounding, efficient open models, and evaluation protocols that measure real-world visual reasoning rather than only benchmark accuracy.
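To make the contrastive alignment mentioned in the abstract concrete, the following is a minimal sketch of a symmetric image-text contrastive loss in plain Python. It is an illustration only, not the implementation of CLIP, ALIGN or any other reviewed model: the function names, the toy embeddings and the temperature value of 0.07 are all assumptions chosen for the example.

```python
import math


def _normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def _cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Average of image-to-text and text-to-image losses over a batch of pairs.

    Row i of image_emb is assumed to be paired with row i of text_emb.
    The temperature default is an illustrative assumption.
    """
    images = [_normalize(v) for v in image_emb]
    texts = [_normalize(v) for v in text_emb]
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    transposed = [list(column) for column in zip(*logits)]
    return 0.5 * (_cross_entropy_diagonal(logits) + _cross_entropy_diagonal(transposed))


# Toy embeddings (hypothetical): three orthogonal "images" and their "captions".
image_emb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned = symmetric_contrastive_loss(image_emb, image_emb)
# Rotating the captions by one position breaks every pairing.
shuffled = symmetric_contrastive_loss(image_emb, image_emb[1:] + image_emb[:1])
print(aligned < shuffled)
```

In this toy setup the loss is close to zero when each image embedding matches its own caption embedding and grows large when the pairings are broken. That is the signal that pulls matched image-text pairs together and pushes mismatched pairs apart during training.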
