Convolutional Neural Networks (CNNs) and Transformers have become the two dominant architectures in medical image segmentation. However, CNNs are limited in modeling long-range dependencies because convolution operations are local, while Transformers may overlook fine-grained local details. To combine the strengths of both while compensating for their weaknesses, this article proposes CAFormer, a parallel dual-branch network designed to capture local details and global contextual information simultaneously. In this architecture, the Transformer branch (BTB) extracts global semantic features, whereas the CNN branch (BCB) incorporates a Full Dynamic Convolutional Kernel (DCK) module and a Full-Scale Channel Attention (FSC) module to improve adaptability and representational flexibility across diverse features. In addition, a Prediction Head for Branch aggregation (BPH) module fuses the complementary features from the two branches. Extensive experiments on four public datasets (Kvasir-SEG, CVC-ClinicDB, GlaS, and ISIC 2017) show that CAFormer achieves Dice scores of 0.9394, 0.9481, 0.9381, and 0.9310, respectively. These scores surpass those of existing state-of-the-art methods, validating the segmentation capability of the proposed model.
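For readers unfamiliar with the metric reported above, the Dice score of a predicted mask P against a ground-truth mask G is commonly defined as 2|P ∩ G| / (|P| + |G|). The following is a minimal sketch of that standard definition for binary masks, not the authors' evaluation code; the function name `dice_score`, the convention of returning 1.0 when both masks are empty, and the toy arrays are assumptions made for illustration.

```python
import numpy as np


def dice_score(pred, target):
    """Dice coefficient between two binary masks (hypothetical helper)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)

    # |P ∩ G|: pixels marked as foreground in both masks.
    intersection = np.logical_and(pred, target).sum()
    # |P| + |G|: total foreground pixels across the two masks.
    total = pred.sum() + target.sum()

    # Assumed convention: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return float(2.0 * intersection / total)


# Toy 2x3 masks: 2 overlapping pixels, 3 foreground pixels in each mask,
# so the score is 2 * 2 / (3 + 3).
pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
target = np.array([[1, 0, 0],
                   [0, 1, 1]])
print(dice_score(pred, target))
```

In practice a network outputs probabilities, so a threshold (often 0.5) would be applied before calling such a function; how the paper thresholds and averages scores across images is not stated in this abstract.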
Cheng et al. studied this question.