Accurate prediction of urban land use change at fine spatial scales is essential for developing healthy and sustainable cities, yet traditional simulation models struggle to capture local dynamics because fine-grained data are scarce and urban systems are modeled with insufficient complexity. To address these limitations, we propose a novel approach that combines pre-trained vision-language foundation models with spatial dynamic modeling to forecast detailed urban land use patterns. Specifically, we assembled a spatially dense set of street view images (SVIs) covering Shenzhen, China, and applied UrbanCLIP, a specialized vision-language prompting framework, to perform zero-shot inference of urban land use directly from the images, without labeled datasets or model retraining. The resulting fine-grained classifications delineate eight distinct urban land use types and yield a detailed urban functional map. These high-resolution patterns were then integrated into a spatial dynamic model, enhanced by polynomial regression, to simulate urban evolution toward 2035. The approach captures neighborhood influences, socioeconomic drivers, and urban planning policies. Our simulation provides actionable insights for sustainable development in Shenzhen by identifying areas for balanced growth, targeted infrastructure investment, and ecological preservation. Compared with conventional methods, our methodology significantly improves predictive accuracy and spatial granularity. By incorporating foundation models, it addresses traditional data constraints and offers scalable, robust tools for informed urban governance and decision-making.

• Proposed a VLM-enhanced framework to predict fine-grained urban land use changes.
• Achieved zero-shot land use inference based on street view images.
• Produced high-resolution simulations of Shenzhen's urban dynamics toward 2035.
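The zero-shot inference step described above can be illustrated with a minimal sketch of CLIP-style prompt matching: an image embedding is compared against one text-prompt embedding per land use class, and the most similar prompt gives the predicted class. This is not UrbanCLIP's actual implementation. The function names, the temperature value, the toy three-dimensional embeddings, and the class labels are all assumptions made for illustration; the abstract does not list the eight land use types, so the labels below are placeholders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_vec, prompt_vecs, temperature=0.01):
    """Pick the label whose prompt embedding is closest to the image embedding.

    prompt_vecs maps a label to the embedding of its text prompt.
    temperature is a hypothetical scaling factor applied before the softmax.
    """
    labels = list(prompt_vecs)
    logits = [cosine(image_vec, prompt_vecs[name]) / temperature for name in labels]
    # Numerically stable softmax over the scaled similarities.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = {name: e / total for name, e in zip(labels, exps)}
    best = max(probs, key=probs.get)
    return best, probs


# Toy embeddings (assumed); a real pipeline would take these from the
# image and text encoders of a pre-trained vision-language model.
prompt_vecs = {
    "residential": [1.0, 0.0, 0.0],  # placeholder label
    "commercial": [0.0, 1.0, 0.0],   # placeholder label
    "industrial": [0.0, 0.0, 1.0],   # placeholder label
}
image_vec = [0.2, 0.9, 0.1]

label, probs = zero_shot_classify(image_vec, prompt_vecs)
print(label)
```

Because the class set lives only in the prompt dictionary, changing the land use taxonomy in this sketch means editing prompts rather than retraining a classifier, which is the property the abstract attributes to zero-shot inference.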
Cai et al. studied this question.