Translating synthetic aperture radar (SAR) imagery into optical imagery offers a practical way to interpret satellite SAR data and to compensate for missing optical images. To adapt to training datasets of varying size, this study proposes using pre-trained large-scale generative models for the SAR-to-optical translation task. The backbone network is a parameter-frozen latent diffusion model, augmented by ControlNet as a conditional branch that handles the SAR image input. To bridge the domain gap between satellite imagery and the pre-training dataset, the backbone network is fine-tuned through scale and shift adjustments of its features. To guide structural enhancement, a SAR feature extraction module is integrated as a branch in the decoder of a pre-trained variational autoencoder. A training strategy is proposed in which the backbone network and the decoder branch are fine-tuned separately. Evaluated against four algorithms, the method outperforms smaller models even with limited training. Generalization experiments and ablation studies elucidate the characteristics of pre-trained large-scale models.
Wei et al. (Mon,) studied this question.
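The scale and shift feature adjustment mentioned in the abstract can be illustrated with a toy sketch. The abstract gives no implementation details, so everything here is an assumption for illustration: the names `frozen_linear` and `ScaleShift`, the per-channel form y = gamma * h + beta, and the identity initialization are hypothetical and not taken from the paper. The sketch only shows the general idea that the backbone weights stay fixed while a small set of scale and shift parameters is tuned.

```python
# Toy sketch of scale-and-shift feature adjustment on a frozen layer.
# All names and values are hypothetical; this is not the paper's implementation.

def frozen_linear(x, weight):
    """Apply a frozen (non-trainable) linear map: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


class ScaleShift:
    """Per-channel affine adjustment y = gamma * h + beta.

    Only gamma and beta would be trained; the backbone weights stay fixed.
    """

    def __init__(self, channels):
        self.gamma = [1.0] * channels  # identity scale at initialization
        self.beta = [0.0] * channels   # zero shift at initialization

    def __call__(self, h):
        return [g * v + b for g, v, b in zip(self.gamma, h, self.beta)]


weight = [[1.0, 2.0], [0.5, -1.0]]  # toy frozen weights (assumed values)
x = [3.0, 4.0]                      # toy input feature vector

h = frozen_linear(x, weight)        # features from the frozen backbone
adapter = ScaleShift(channels=2)
y_init = adapter(h)                 # identical to h before any fine-tuning

adapter.gamma = [2.0, 1.0]          # stand-in for values learned by fine-tuning
adapter.beta = [0.0, 0.5]
y_tuned = adapter(h)                # adjusted features; weight is unchanged
```

With identity initialization the adapted model starts out reproducing the pre-trained features exactly, so fine-tuning begins from the pre-trained behavior and only the small `gamma` and `beta` vectors need to be learned.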