Accurate segmentation of medical images requires detailed local and global spatial information. The conventional U-Net model excels at local spatial feature extraction through residual convolutional blocks but struggles to capture global features. To resolve this issue, we propose the vision transformer U-Net (ViTUNet) framework, which uses the self-attention mechanism of the vision transformer (ViT) to capture global information while retaining the local feature extraction of U-Net. The proposed architecture adds vision transformers to the existing residual convolution blocks in the U-Net encoder path, thereby capturing both local and global features. The decoder path then rebuilds this information into high-quality segmentation maps with accurately highlighted boundaries/edges. The model is used to segment carious lesions in bitewing dental radiographs. These images are pre-processed using augmentation, morphological operations, and segmentation to identify the boundaries/edges of the regions of interest (caries/cavity). The proposed method is evaluated on an augmented dataset containing 3000 image–watershed mask pairs, of which 2400 were used for training and 600 for testing. The experimental results show significant improvements in segmentation performance, with 98.45% validation accuracy, a 97.88% validation Dice coefficient, and a 95.87% validation intersection over union (IoU) score. These results are superior to those of other conventional and state-of-the-art U-Net models, highlighting the impact of transformer-based hybrid architectures in improving medical image segmentation tasks.
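For readers unfamiliar with the two overlap metrics reported above, the following minimal sketch shows how the Dice coefficient and IoU are commonly computed for a single pair of binary masks. It illustrates the standard definitions only and is not the evaluation code used in this work; the function name `dice_and_iou`, the toy masks, and the convention of scoring two empty masks as a perfect match are assumptions made for the example.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks given as nested lists of 0/1.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Flatten both 2D masks into 1D sequences of 0/1 pixels.
    p = [int(bool(v)) for row in pred for v in row]
    t = [int(bool(v)) for row in target for v in row]
    if len(p) != len(t):
        raise ValueError("masks must have the same number of pixels")

    # Count pixels labeled foreground in both masks, and in each mask alone.
    intersection = sum(a & b for a, b in zip(p, t))
    pred_area, target_area = sum(p), sum(t)
    union = pred_area + target_area - intersection

    # Assumed convention: two empty masks count as a perfect match.
    if union == 0:
        return 1.0, 1.0

    dice = 2 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# Toy 2x3 masks: each has 3 foreground pixels, 2 of which overlap.
pred = [[0, 1, 1],
        [0, 1, 0]]
target = [[0, 1, 0],
          [0, 1, 1]]

dice, iou = dice_and_iou(pred, target)
print(f"Dice = {dice:.3f}, IoU = {iou:.3f}")  # Dice = 0.667, IoU = 0.500
```

For a single mask pair the two scores are related by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU. Scores averaged over many images need not satisfy this identity exactly.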
Majanga et al. studied this question.