This study compares three prominent deep learning architectures for single-image shadow removal: UNet, the Conditional Generative Adversarial Network (CGAN), and the Swin-Transformer. Denoising Diffusion Probabilistic Models (DDPM) receive additional theoretical consideration. The models are evaluated on the ISTD benchmark dataset using quantitative metrics (PSNR, SSIM, RMSE, MAE) and qualitative visual assessment, and the results establish a clear performance hierarchy. The Swin-Transformer consistently achieves the best results, excelling in detail preservation, artifact reduction, and global illumination consistency; this is attributed to its hierarchical structure and shifted-window self-attention mechanism. The CGAN achieves enhanced perceptual realism through adversarial training, while the UNet provides a computationally efficient baseline. The findings offer practical guidance for selecting a model according to application requirements and highlight the impact of architectural design. The analysis concludes by suggesting directions for future research, including hybrid models and the empirical application of diffusion models to high-fidelity image restoration.
Shangan Zhou studied this question.
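
For readers unfamiliar with the pixel-wise metrics named above, the following is a minimal sketch of how MAE, RMSE, and PSNR could be computed between a shadow-removed output and its shadow-free ground truth. It is an illustration under stated assumptions, not the study's evaluation code: the function names and the `max_val` parameter are hypothetical, images are assumed to be NumPy arrays of equal shape scaled to [0, 1], and no color-space conversion is applied (shadow-removal evaluations sometimes compute RMSE in a different color space). SSIM is omitted because it depends on windowed local statistics and is usually taken from an image-processing library.

```python
import numpy as np


def mae(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean absolute error over all pixels and channels."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    return float(np.mean(np.abs(diff)))


def rmse(pred: np.ndarray, target: np.ndarray) -> float:
    """Root mean squared error over all pixels and channels."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    return float(np.sqrt(np.mean(diff ** 2)))


def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB.

    max_val is the assumed peak intensity (1.0 for images in [0, 1]).
    Returns infinity when the two images are identical.
    """
    diff = pred.astype(np.float64) - target.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

Lower MAE and RMSE and higher PSNR indicate a closer match to the ground truth, which is the sense in which the abstract's performance hierarchy should be read.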