Medical image segmentation is crucial for disease diagnosis and monitoring, but existing methods face challenges in capturing both local and global features efficiently. Convolutional Neural Network (CNN)-based approaches, such as UNet, excel at modeling local features but struggle to capture long-range dependencies. Transformer-based methods, such as Swin-UNet, can model global context but lack the spatial inductive bias needed for local feature extraction. Hybrid methods such as TransUNet and CS-UNet, which combine CNNs and Transformers, have shown promise but often come with increased model complexity and computational cost, limiting their practical applicability. To address these limitations, we propose GC-UNet, a lightweight and efficient segmentation network that uses the Global Context Vision Transformer (GC-ViT) in both its encoder and decoder. GC-UNet combines global context self-attention with local self-attention to model both long- and short-range spatial dependencies effectively. We also introduce two variants of GC-UNet: (1) Hi-GC-UNet, which adds depthwise convolution to improve local feature extraction, and (2) ECA-GC-UNet, which replaces the Squeeze-and-Excitation (SE) block with an Efficient Channel Attention (ECA) block to reduce model complexity in the encoder and decoder. The proposed method and its variants are evaluated on multiple medical image datasets, including the Synapse multi-organ abdominal CT dataset, the ACDC cardiac MRI dataset, and several polyp segmentation datasets. In terms of the Dice Similarity Coefficient (DSC) and Hausdorff Distance (HD) metrics, GC-UNet outperforms CNN-based and Transformer-based approaches, with notable gains in the segmentation of complex anatomical structures. Hi-GC-UNet outperforms GC-UNet on the ACDC dataset at a slightly larger model size, while ECA-GC-UNet outperforms GC-UNet on most datasets at a slightly smaller model size.
Furthermore, pre-training GC-UNet on the MedNet dataset, which contains over 200,000 medical images, yields better performance than pre-training on natural images (ImageNet). The proposed GC-UNet and its variants offer a practical and efficient solution for medical image segmentation, making them suitable for real-world clinical applications.
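For readers unfamiliar with the DSC metric used in the evaluation, the following is a minimal sketch of the standard per-class definition, DSC = 2|A ∩ B| / (|A| + |B|), where A and B are the predicted and ground-truth pixel sets for one class. It is not the evaluation code used in this work. The function name `dice_coefficient`, the flat-list input format, and the convention of returning 1.0 when a class is absent from both masks are assumptions made for illustration.

```python
def dice_coefficient(pred, target, label):
    """Dice Similarity Coefficient for one class over two flat label sequences.

    pred, target: equal-length sequences of integer class labels (one per pixel).
    label: the class whose overlap is measured.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of pixels")

    # |A| and |B|: pixels assigned to `label` in the prediction and ground truth.
    pred_count = sum(1 for p in pred if p == label)
    target_count = sum(1 for t in target if t == label)

    # |A ∩ B|: pixels where both masks agree on `label`.
    overlap = sum(1 for p, t in zip(pred, target) if p == label and t == label)

    denom = pred_count + target_count
    if denom == 0:
        # Assumed convention: a class absent from both masks counts as a perfect match.
        return 1.0
    return 2.0 * overlap / denom


# Toy example: a 6-pixel "image" with three classes (0 = background).
pred = [0, 1, 1, 2, 2, 2]
target = [0, 1, 2, 2, 2, 0]
dsc_class_2 = dice_coefficient(pred, target, label=2)
```

In this toy example, class 2 occupies three pixels in each mask and the two masks agree on two of them, so the score is 2·2 / (3 + 3) ≈ 0.667. A multi-class score is commonly reported as the mean of the per-class values.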
Alrfou et al. (Fri,) studied this question.