Vision-language models (VLMs) have moved from task-specific image-text encoders to general multimodal foundation models capable of visual reasoning, captioning, retrieval, question answering, grounding, optical character recognition, instruction following, and multi-image interaction. This review analyzes twenty influential papers and technical reports published mainly between 2021 and 2025: CLIP, ALIGN, ALBEF, BLIP, Flamingo, CoCa, PaLI, BLIP-2, LLaVA, MiniGPT-4, InstructBLIP, Kosmos-2, Qwen-VL, CogVLM, GPT-4V, Gemini, InternVL, MM1, PaliGemma, and Molmo/PixMo. Its purpose is to identify how the field has changed in architecture, data construction, training objectives, evaluation practice, and deployment challenges. The review shows that early contrastive alignment produced transferable visual representations, while recent models increasingly connect strong vision encoders to large language models through lightweight adapters, query transformers, visual experts, instruction tuning, and interleaved multimodal data. Current progress is driven not only by model scale but also by the quality of captions, grounding data, OCR-rich samples, and instruction datasets, and by safety evaluation. The main unresolved problems are visual hallucination, weak spatial grounding, limited transparency of training data, high computational cost, benchmark saturation, and insufficient reliability in high-stakes domains. The review concludes that the next stage of VLM research will depend on data-centric training, verifiable grounding, efficient open models, and evaluation protocols that measure real-world visual reasoning rather than benchmark accuracy alone.
Geldibayev et al. (Tue,) studied this question.