Accurate segmentation of colorectal polyps in colonoscopy images is crucial for early prevention and computer-aided diagnosis of colorectal cancer, yet large variations in polyp appearance, low polyp-mucosa contrast, and device-related imaging discrepancies still hinder robust performance, especially for small and flat lesions and cross-dataset generalization. To address these challenges, we propose a Dual-Encoder Global–Local Joint Feature Aggregation Network (DEGF-Net) that enhances feature fusion and improves generalization. DEGF-Net adopts a dual-encoder architecture that separately models long-range global context and fine-grained local textures. A Global Joint Feature Fusion Module (GFFM) employs global attention to align and aggregate high-level features from both branches into a unified representation, while an Upper-Lower Level Feature Fusion Module (UL-FM) performs residual multi-scale cross-layer fusion in the decoder to narrow the semantic gap between high-level semantics and low-level details and refine polyp boundaries. In addition, a multi-output hybrid loss is applied to the final and intermediate predictions to leverage deep supervision, accelerate convergence, and improve robustness. Experiments on two benchmark colonoscopy datasets, Kvasir-SEG and CVC-ClinicDB, show that under a unified setting, DEGF-Net achieves mean Dice scores of 0.933 and 0.958, respectively, surpassing recent CNN-based, Transformer-based, and hybrid architectures and exhibiting strong cross-dataset generalization. These results indicate that DEGF-Net can substantially improve automatic polyp segmentation and provide a promising technical basis for computer-aided colorectal cancer screening.

• A novel CNN-Transformer dual-encoder framework is proposed for colorectal polyp segmentation.
• A global joint feature fusion module explicitly aligns high-level CNN and Transformer semantics.
• A residual cross-scale fusion strategy bridges the semantic gap between global context and fine details.
• The proposed method achieves Dice scores of 0.933 and 0.958 on Kvasir-SEG and CVC-ClinicDB.
• Strong cross-dataset and cross-domain generalization is demonstrated on retinal and cell datasets.
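
For readers unfamiliar with the reported metric and the idea of a multi-output loss, the following is a minimal NumPy sketch. It shows the standard Dice score for binary masks and one plausible form of a deeply supervised hybrid loss. The abstract does not state which terms make up the hybrid loss or how the outputs are weighted, so the binary cross-entropy plus soft Dice combination, the equal default weights, and all function names here are illustrative assumptions, not the authors' implementation. Intermediate predictions are assumed to be already resized to the ground-truth resolution.

```python
import numpy as np


def dice_score(pred_mask, true_mask, eps=1e-7):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    intersection = np.logical_and(pred, true).sum()
    return (2.0 * intersection + eps) / (pred.sum() + true.sum() + eps)


def soft_dice_loss(prob, target, eps=1e-7):
    """Differentiable Dice surrogate computed on probabilities in [0, 1]."""
    prob = np.asarray(prob, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    intersection = (prob * target).sum()
    return 1.0 - (2.0 * intersection + eps) / (prob.sum() + target.sum() + eps)


def bce_loss(prob, target, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    prob = np.clip(np.asarray(prob, dtype=np.float64), eps, 1.0 - eps)
    target = np.asarray(target, dtype=np.float64)
    return float(-(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob)).mean())


def multi_output_hybrid_loss(probs, target, weights=None):
    """Weighted sum of (BCE + soft Dice) over final and intermediate outputs.

    ASSUMPTION: the BCE + Dice combination and the equal default weights are
    hypothetical choices; the source only says "multi-output hybrid loss".
    """
    if weights is None:
        weights = [1.0] * len(probs)
    total = 0.0
    for weight, prob in zip(weights, probs):
        total += weight * (bce_loss(prob, target) + soft_dice_loss(prob, target))
    return total


# Usage: one final output and one coarser intermediate output for a 2x2 mask.
target = np.array([[1, 1], [0, 0]], dtype=np.float64)
final_prob = np.array([[0.9, 0.8], [0.1, 0.2]])
side_prob = np.array([[0.7, 0.6], [0.3, 0.4]])
loss = multi_output_hybrid_loss([final_prob, side_prob], target)
dice = dice_score(final_prob > 0.5, target)
```

Supervising the intermediate outputs with the same target gives earlier decoder stages a direct training signal, which is the usual motivation for deep supervision; how the side outputs are weighted against the final one is a tuning choice left open here.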
Yu et al. (Mon,) studied this question.