To address the semantic disconnection and visual distortion between generated results and actual scenes in environmental art design, this paper proposes a consistent generation and simulation model based on a multimodal transformer. Traditional methods are limited in coordinating complex elements and ensuring spatial logic, which hinders design implementation. By integrating multi-source information, including text, sketches, and scene images, an end-to-end generation-simulation framework achieves a consistent mapping from design concept to high-fidelity visual output. On the public MIT ADE20K dataset, the model achieves significant improvements in visual fidelity (area under the curve of 0.92, an increase of 8.2%) and user preference (normalised discounted cumulative gain@10, an increase of 15.7%), with all key indicators statistically significant (p < 0.01). These results confirm the model's effectiveness in enhancing the automation and usability of environmental art design.
Li Ren (Thu,) studied this question.
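For readers unfamiliar with the user-preference metric named in the abstract, the following is a minimal sketch of how normalised discounted cumulative gain at a cutoff k (NDCG@k) is commonly computed. It assumes the linear-gain form with a log2 rank discount; the abstract does not state which variant the paper uses, and the function names and rating values here are hypothetical illustrations, not the paper's evaluation code or data.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items of a ranked list.

    Assumes linear gain: each relevance score is divided by log2(rank + 1),
    where rank is 1-based (so the 0-based index contributes log2(index + 2)).
    """
    return sum(rel / math.log2(idx + 2) for idx, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """NDCG@k: DCG of the given ranking divided by the DCG of the ideal ranking.

    `relevances` lists the relevance scores of the items in the order the
    model ranked them. Returns 0.0 when no item has positive relevance.
    """
    ideal_order = sorted(relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal_order, k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(relevances, k) / ideal_dcg


# Hypothetical user ratings for ten generated designs, in model-ranked order.
example_ratings = [3, 2, 3, 0, 1, 2, 0, 0, 1, 0]
score = ndcg_at_k(example_ratings, k=10)
```

A ranking that already places the highest-rated designs first scores 1.0, and the score falls as highly rated designs are pushed further down the list.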