What question did this study set out to answer?

The research aims to improve the segmentation of carious lesions in dental images by combining U-Net and vision transformer models.

April 12, 2026 | Open Access

ViTUNet: Vision Transformer U-Net Hybrid Model for Carious Lesions Segmentation on Bitewing Dental Images

Key Points

Proposed a hybrid model named ViTUNet combining U-Net and vision transformers.
Pre-processed bitewing dental radiographs using augmentation and morphological operations.
Evaluated the model on an augmented dataset of 3000 image-mask pairs, with 2400 for training and 600 for testing.
Achieved 98.45% validation accuracy on segmentation tasks.
Obtained a 97.88% validation Dice coefficient indicating high overlap with ground truth.
Achieved a 95.87% validation intersection over union (IoU), surpassing conventional U-Net models.
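The Dice coefficient and IoU reported above are both overlap measures between a predicted mask and a ground-truth mask. The following is a minimal sketch of how they are commonly computed for binary masks, not the authors' evaluation code. The function name `dice_and_iou`, the nested-list mask format, and the convention for two empty masks are assumptions made for illustration.

```python
def dice_and_iou(pred, truth):
    """Return (dice, iou) for two equally sized binary masks (nested lists of 0/1)."""
    flat_pred = [int(bool(v)) for row in pred for v in row]
    flat_truth = [int(bool(v)) for row in truth for v in row]
    if len(flat_pred) != len(flat_truth):
        raise ValueError("masks must have the same number of pixels")

    # Pixels marked as lesion in both masks.
    intersection = sum(p & t for p, t in zip(flat_pred, flat_truth))
    pred_area = sum(flat_pred)
    truth_area = sum(flat_truth)
    union = pred_area + truth_area - intersection

    if union == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0, 1.0

    dice = 2 * intersection / (pred_area + truth_area)
    iou = intersection / union
    return dice, iou


# Toy 3x3 example (hypothetical data, not from the study).
pred = [[0, 1, 1],
        [0, 1, 0],
        [0, 0, 0]]
truth = [[0, 1, 1],
         [0, 1, 1],
         [0, 0, 0]]
dice, iou = dice_and_iou(pred, truth)
```

For a single mask pair the two metrics are linked by Dice = 2·IoU / (1 + IoU). Applying that identity to the reported IoU of 95.87% gives roughly 97.89%, close to the reported Dice of 97.88%. Scores averaged over many images need not satisfy the identity exactly.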

Abstract

Accurate segmentation of medical images requires both local and global spatial information. The conventional U-Net model excels at local spatial feature extraction through residual convolutional blocks but struggles to capture global features. To resolve this issue, we propose the vision transformer U-Net (ViTUNet) framework, which uses the self-attention mechanism of the vision transformer (ViT) to capture global information while retaining U-Net's extraction of local features. The proposed architecture adds vision transformers to the existing residual convolution blocks in the U-Net encoder path, thereby capturing both local and global features. The decoder path then rebuilds this information into high-quality segmentation maps with accurately highlighted boundaries/edges. The model is used to segment carious lesions in bitewing dental radiographs. These images are pre-processed using augmentation, morphological operations, and segmentation to identify the boundaries/edges of the regions of interest (caries/cavity). The proposed method is evaluated on an augmented dataset containing 3000 image–watershed mask pairs, with 2400 images used for training and 600 for testing. The experimental results showed significant improvements in segmentation performance, achieving 98.45% validation accuracy, a 97.88% validation Dice coefficient, and a 95.87% validation intersection over union (IoU). These results are superior to those of other conventional and state-of-the-art U-Net models, highlighting the impact of transformer-based hybrid architectures on medical image segmentation tasks.
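To illustrate why self-attention captures global information, here is a minimal sketch of scaled dot-product self-attention over a handful of patch embeddings. It is not the ViTUNet implementation. It assumes identity query/key/value projections, whereas a real vision transformer uses learned projections, multiple heads, and positional embeddings. The names `self_attention` and `patches` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V (simplifying assumption).

    tokens: list of equal-length float vectors, one per image patch.
    Returns (mixed_tokens, attention_weights).
    """
    d = len(tokens[0])
    mixed, weights = [], []
    for q in tokens:
        # Similarity of this patch to every patch in the image, itself included.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted sum over all patches, near or far.
        mixed.append([sum(wj * v[c] for wj, v in zip(w, tokens)) for c in range(d)])
    return mixed, weights


# Four toy 2-dimensional patch embeddings (hypothetical values).
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
mixed, weights = self_attention(patches)
```

Every attention weight is strictly positive, so each output patch draws on every other patch regardless of distance. A convolution, by contrast, only sees a small neighbourhood per layer, which is the limitation the abstract attributes to the conventional U-Net.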

Cite This Study

Majanga et al. studied this question.

synapsesocial.com/papers/69db37ca4fe01fead37c5cdd https://doi.org/10.3390/app16083693
