Target detection, a core computer vision task, is widely applied in autonomous driving, industrial quality inspection, and other fields. However, traditional convolutional neural networks (CNNs) are limited by their local receptive fields and have difficulty modeling global contextual relationships, which leads to missed small targets and misjudged occluded targets in complex scenes. The Transformer, with its global attention mechanism, can effectively capture long-range dependencies in images, which improves the accuracy and efficiency of target detection. This paper analyzes the evolution of key models such as the Detection Transformer (DETR), the Deformable Detection Transformer (Deformable DETR), and the Shifted Window Transformer (Swin Transformer), explores why these models significantly improve average precision (AP) on the COCO dataset, and investigates end-to-end detection, sparse attention mechanisms, and hierarchical design. It concludes that lightweight design and multimodal techniques have great potential for Transformer models, and that future strategies such as dynamic sparsification and cross-modal alignment can further improve model performance. Despite the Transformer's accuracy breakthroughs, challenges remain in computational efficiency and hardware dependence. Lightweight design and multimodal fusion offer new solutions to these challenges and promise to advance Transformers in real-time and multi-scenario detection. This paper provides a comprehensive view of the application of Transformers in target detection and serves as a reference for future research directions.
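As a rough illustration of the global attention mechanism mentioned above, the following minimal sketch computes scaled dot-product self-attention over a short sequence of feature vectors. It is not the implementation used by DETR, Deformable DETR, or Swin Transformer: the function names (`softmax`, `global_self_attention`), the toy input, and the use of identity projections for queries, keys, and values are all assumptions made for brevity.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def global_self_attention(tokens):
    """Scaled dot-product self-attention over all tokens.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections); real models learn separate projections.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    outputs, weights = [], []
    for q in tokens:
        # Every query is scored against every token, near or far.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # Output is a weighted mix of all token features.
        outputs.append(
            [sum(w[j] * tokens[j][c] for j in range(len(tokens))) for c in range(d)]
        )
    return outputs, weights

# Hypothetical toy input: tokens 0 and 2 share the same feature,
# with a dissimilar token between them.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = global_self_attention(tokens)
```

In this toy input, token 0 attends to the similar token 2 exactly as strongly as to itself, regardless of the dissimilar token between them, which is the long-range behavior a local convolution kernel lacks. The cost is that every token is compared with every other token, which is the quadratic overhead that sparse and windowed attention designs aim to reduce.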
Zhi‐Ming Yu (Wed,) studied this question.