• FocusRvNN enables AI-driven generation of large-scale 3D urban digital twins for sustainable planning, addressing data scarcity and computational costs while improving fidelity and diversity in complex urban scenes.
• Attention-based subgraph clustering boosts substructure extraction accuracy by 16.1% in unbounded urban settings, dynamically identifying high-frequency patterns and suppressing noise for reliable semantic anchors.
• GCN conditional embeddings enhance spatial relation precision by 13.3% for controllable geospatial modeling, enabling diversified generation and style transfer through graph-based representations.
• Multi-level splicing optimization ensures physical consistency, reduces semantic inconsistency by 15%, and raises scene completeness by 15% via coarse-to-fine hierarchical assembly.
• UE5 integration achieves faster real-time VR visualization for immersive health assessments, supporting infinite detail geometry and dynamic global illumination in large-scale interactions.

Generating large-scale 3D urban street scenes from structured data is a key challenge in geospatial computing and urban simulation. Existing generative approaches often struggle to reuse semantically coherent local structures, to encode relational constraints beyond isolated objects, and to organize unbounded outdoor layouts in a controllable and physically consistent manner, which limits their applicability to complex road networks and heterogeneous roadside infrastructure. To address these challenges at the algorithmic level, this paper proposes FocusRvNN, a focus-driven recursive variational neural network framework based on variational autoencoders (VAEs) and scene graphs.
The framework introduces a focus-driven attention mechanism to identify and cluster high-frequency geospatial substructures as reusable semantic building blocks, employs graph convolutional network (GCN) embeddings to encode inter-substructure spatial relations as conditional generation cues, and adopts a coarse-to-fine hierarchical assembly strategy to progressively compose large-scale layouts while enforcing physical and semantic consistency. The proposed framework is evaluated on the CarlaSC dataset, where it achieves an mIoU of 82.89% for layout consistency and generates a complete urban street scene in approximately 15 seconds under the tested hardware configuration. The generation pipeline is further integrated with Unreal Engine 5 to support interactive visualization and inspection. Because the evaluation is conducted primarily in simulated environments (the synthetic CarlaSC dataset, with supplementary semantic validation on the real-world Cityscapes dataset), the demonstrated applicability is to simulation-oriented workflows for urban planning studies and virtual environment design.
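For readers unfamiliar with the layout-consistency metric reported above, the following is a minimal sketch of how a mean intersection-over-union (mIoU) score is commonly computed from per-cell semantic labels. It is not the paper's evaluation code: the function name `mean_iou`, the flat label lists, and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that occur in pred or target.

    pred, target: equal-length sequences of integer class labels
    (hypothetical flattened layout cells, one label per cell).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    intersection = [0] * num_classes  # cells where both agree on class c
    pred_count = [0] * num_classes    # cells predicted as class c
    target_count = [0] * num_classes  # cells labelled as class c

    for p, t in zip(pred, target):
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1

    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:  # assumption: ignore classes absent from both inputs
            ious.append(intersection[c] / union)

    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes (e.g. road, sidewalk, building).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
```

In this toy example the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5. Other conventions, such as counting absent classes as zero or averaging per scene before averaging over the dataset, give different numbers, so the exact protocol matters when comparing against the reported figure.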
Fang et al. (Sun,) studied this question.