Semantic segmentation of moving objects in dynamic backgrounds faces core challenges such as background interference and blurred target features. This study proposes an innovative architecture that integrates Generative Adversarial Network (GAN) with Transformers. The GAN module enhances adaptability to dynamic backgrounds through adversarial training, while the self-attention mechanism in the Transformer captures long-range semantic dependencies. A gated fusion strategy is designed to achieve dynamic balancing of multimodal features. The method employs a conditional GAN to generate dynamic background samples with variations in illumination and motion blur. A Transformer-based encoder-decoder structure is used to model global contextual relationships. A temporal attention module is introduced to incorporate motion vector fields, improving temporal consistency. Additionally, a KL-divergence (KL) constrained semantic consistency loss optimizes the plausibility of generated samples. Experiments are conducted on both a multi-dimensional simulated dataset and the real-world KITTI dataset. Results show that the proposed model achieves an average Intersection over Union (IoU) of 85.6% in standard dynamic scenes, outperforming DeepLabv3 + by 9.2% points. In low-light and high-speed motion scenarios, the robustness index reaches 92.0%, 8.5 points higher than baseline models. Ablation studies demonstrate that removing the Transformer leads to a 6.7% drop in mIoU, while excluding the feature fusion module reduces robustness by 4.0%, confirming the necessity of both components. Temporal analysis reveals that the model maintains a stable performance of 84.5–86.5% over 20-frame sequences, with fluctuation reduced by 63% compared to baseline. The adversarial training improves the model’s adaptability to lighting changes by 5.3%. The multi-head self-attention (MSA) mechanism reduces long-range misclassification by 6.7%. The gated fusion strategy lowers false positive rates in background-disturbed regions by 12.8%. This framework optimizes segmentation through a generator-segmenter feedback loop, effectively balancing dynamic background noise suppression and semantic fidelity. The contributions are threefold: (1) The first semantic segmentation framework to deeply integrate GANs and Transformers. (2) A theoretical model for dynamic feature gating and semantic consistency constraints. (3) A standardized evaluation system covering 10 dynamic background types and five illumination gradients. This study provides key technical support for real-time environmental perception in autonomous driving and intelligent surveillance, advancing both the theoretical and practical frontiers of dynamic scene understanding.
Building similarity graph...
Analyzing shared references across papers
Loading...
Yiqiang Li
Chengdu Institute of Biology
ZhenBao Luo
Chengdu Institute of Biology
Tao Chen
Chengdu Institute of Biology
Scientific Reports
Chengdu Institute of Biology
Aviation Industry Corporation of China (China)
Building similarity graph...
Analyzing shared references across papers
Loading...
Li et al. (Sun,) studied this question.
synapsesocial.com/papers/69b256fe96eeacc4fcec5b3d — DOI: https://doi.org/10.1038/s41598-026-39249-1