Convolutional Neural Networks (CNNs) excel at local feature extraction but lack a global receptive field, while Vision Transformers (ViTs) capture global context but are computationally expensive and lack crucial inductive biases. To resolve this trade-off, CNN-Transformer hybrid models have emerged to combine these strengths, becoming a dominant architectural paradigm in computer vision. However, a comprehensive analysis of their evolutionary trajectory, and of how their core "local-global synergy" philosophy has spilled over into other architectures, is still lacking. This paper provides a systematic review of this evolution, charting its development through four key stages: (1) early modular splicing and replacement, (2) native synergistic design for unified architectures, (3) ideological fusion influencing pure CNN and Transformer paradigms, and (4) the current rise of "Poly-Hybrids" integrating emerging operators such as State-Space Models (SSMs). We analyze the critical challenges confronting these advanced models, including prohibitive resource barriers, black-box interpretability, and the fine-grained alignment gap in vision-language tasks. We conclude that the field is at an inflection point, where the pursuit of "stronger" models must yield to the need for "more reliable" ones. Future progress will hinge not only on performance but also on breakthroughs in sustainability, trustworthiness, and alignment, positioning these architectures as the perceptual bedrock for general world models.
Kun Liu (Wed,) studied this question.