This paper reviews the evolution of deep learning-based image recognition models, focusing on key milestones in convolutional neural networks (CNNs), vision transformers (ViTs), and auxiliary techniques such as data augmentation, regularization, transfer learning, and model compression. It traces the development of CNN architectures, including LeNet-5, and discusses how they improved accuracy and addressed challenges such as vanishing gradients. It then examines the impact of transformers on image recognition through models such as ViT, Swin Transformer, and DeiT, and analyzes the supporting techniques that improve model performance and practicality. Finally, the paper summarizes the current state of image recognition, identifies open challenges, and outlines future research directions, with the aim of inspiring further innovation in both the theoretical foundations and real-world applications of deep learning for visual recognition.
Jing Chen studied this question.