ABSTRACT
Medical image segmentation is pivotal in clinical diagnosis and treatment planning. However, conventional CNN‐based methods often struggle to capture global context and to handle noise, especially in complex or ambiguous anatomical regions. To address these limitations, we introduce TransDiff‐HiSeg, a Transformer‐guided diffusion segmentation framework that combines the long‐range dependency modeling of Transformers with the denoising capability of diffusion models. It integrates a conditioned diffusion model, a binarized cross transformer, and adaptive feature fusion blocks. The framework comprises a parallel encoder, built from convolution and transformer blocks for robust feature extraction and noise suppression, and a decoder of stacked convolutional blocks that reconstructs the high‐resolution segmentation. The model supports sustainable healthcare by improving segmentation accuracy at reduced computational overhead, making it suitable for long‐term clinical integration. Extensive experiments on multi‐organ and brain tumor segmentation tasks demonstrate that TransDiff‐HiSeg consistently outperforms state‐of‐the‐art methods on Dice, Accuracy, and HD95 while maintaining a lightweight computational footprint. These results validate the efficacy and sustainability of our approach in real‐world medical image segmentation scenarios.
Verma et al. (Sun,) studied this question.