Vision-Language Models (VLMs) have transformed multimodal artificial intelligence, yet comprehensive syntheses of their architectural evolution, training paradigms, and domain-specific capabilities remain scarce. This systematic review, conducted according to PRISMA guidelines, analyzes research published from January 2021 to December 2025. From 928 records identified across seven digital libraries, 48 articles were retained for final synthesis. The review establishes a unified taxonomy of VLM architectures, categorizing them by core functional objective: vision-language understanding, vision-conditioned text generation, and multimodal-to-multimodal synthesis. These objectives are organized alongside architectural families defined by their coupling mechanisms: dual-encoder models optimized with a symmetric InfoNCE contrastive loss; fusion-based transformers that use cross-attention for fine-grained grounding; unified single-stream models that apply prefix language modeling over visual tokens; and modular bridge systems that connect pretrained vision encoders to large language models through query-based adapters such as Q-Former, with parameter-efficient tuning via LoRA. The study consolidates disparate training approaches into a multi-objective integration framework that combines contrastive alignment, masked language or image modeling, and reinforcement-based alignment via Group Relative Policy Optimization (GRPO). Ablation studies support this framework: for instance, LLaVA-1.5 shows a 31.7% accuracy drop on ScienceQA without contrastive pretraining, and MedVLM-R1 shows an 18.2% decrease in clinical report accuracy on MIMIC-CXR when GRPO is disabled. The study formalizes the compositionality gap as the KL-divergence between joint and factorized multimodal representations.
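As a sketch only, since the abstract does not give symbols: assume $z_v$ and $z_t$ denote the visual and textual representations, $p(z_v, z_t)$ their joint distribution, and $p(z_v)\,p(z_t)$ one possible factorized form. Under these assumed choices, a KL-based gap could be written as follows. The direction of the divergence and the specific factorization are assumptions, not taken from the study.

```latex
\Delta_{\mathrm{comp}}
  = D_{\mathrm{KL}}\!\left( p(z_v, z_t) \,\middle\|\, p(z_v)\, p(z_t) \right)
  = \mathbb{E}_{p(z_v, z_t)}\!\left[ \log \frac{p(z_v, z_t)}{p(z_v)\, p(z_t)} \right]
```

With this particular factorization the quantity coincides with the mutual information between the two modalities; a factorization over finer components (for example objects and attributes) would give a different quantity.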
This diagnostic metric provides a mathematical explanation for the performance deficits of 40 to 65% observed on benchmarks such as GQA and Winoground, linking these failures to architectural fusion bottlenecks and dataset biases. The study also examines four application verticals: standard multimodal interfaces, medical image-to-text reasoning, geospatial surveillance, and vision-language-action (VLA) robotics. These sectors are analyzed to determine how specific architectural configurations adapt to specialized data constraints. Evaluation consistently reveals critical limitations in robustness, interpretability, hallucination control, and out-of-domain generalization. Most studies remain lab-based, which highlights a significant gap between benchmark performance and real-world, safety-critical deployment. The review concludes by charting essential research directions: advancing neuro-symbolic and mixture-of-experts architectures to close the compositionality gap; developing spatiotemporal and multilingual grounding; implementing privacy-aware federated multimodal learning; and creating rigorous evaluation protocols for consistency and safety. This synthesis provides a foundational reference for developing robust, adaptable, and trustworthy next-generation multimodal systems.
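To make the dual-encoder objective named in the taxonomy concrete, the following is a minimal sketch of a symmetric InfoNCE loss over a batch of paired image and text embeddings. It is an illustration under stated assumptions, not the implementation of any reviewed model: the function name `symmetric_info_nce`, the helper names, and the default temperature of 0.07 are hypothetical choices, and plain Python lists stand in for encoder outputs.

```python
import math


def _normalize(vec):
    # L2-normalize one embedding; assumes the vector is non-zero.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_info_nce(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image InfoNCE losses.

    Pair k (image_embs[k], text_embs[k]) is treated as the positive;
    every other item in the batch acts as a negative.
    The temperature default is an assumed, illustrative value.
    """
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    n = len(imgs)

    # Cosine-similarity logits: rows index images, columns index texts.
    logits = [[_dot(i, t) / temperature for t in txts] for i in imgs]

    # Image-to-text direction: softmax over each row, target on the diagonal.
    i2t = -sum(_log_softmax(logits[k])[k] for k in range(n)) / n

    # Text-to-image direction: same computation on the transposed logits.
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(_log_softmax(cols[k])[k] for k in range(n)) / n

    return 0.5 * (i2t + t2i)
```

Averaging the two directions is what makes the loss symmetric: swapping the image and text arguments leaves the value unchanged, and correctly paired batches score lower than mismatched ones.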
Arifur Rahman (Sun,) studied this question.