Discovering urban functional zones (UFZs) is critical for understanding city spatial structures and supporting effective urban planning. Existing approaches to UFZ discovery typically rely on one of three costly strategies: (1) training large vision models directly on satellite imagery, which demands substantial computational resources; (2) leveraging crowdsourced data such as Points of Interest (POIs) from platforms like OpenStreetMap, which may be incomplete, inconsistent, or unavailable in many regions; or (3) collecting custom labeled data, which requires significant time, expense, and expert effort. Recently, large multi-modal models (LMMs) have emerged as a promising alternative, offering strong capabilities in interpreting visual content without requiring extensive data labeling. However, their performance remains limited on the UFZ discovery task, where they often struggle to capture the complex spatial and functional details and interactions of urban regions. To address this challenge, we propose a new approach that enhances LMMs’ ability to recognize urban functional zones by keeping the LMM encoders frozen and training only lightweight graph-based models, eliminating the need for LMM fine-tuning or additional pre-training. Specifically, our approach first partitions the target area into small regions along the road network, and an LMM independently generates visual and textual embeddings for each region using its image and text encoders. Two graphs are then constructed, one per modality, in which nodes represent regions with features given by the corresponding embeddings and edges encode spatial adjacency. Message passing on the two graphs thus captures spatial correlations in the visual and textual modalities. Graph clustering then yields prototypes that represent nearby zones with similar urban functions, and contrastive learning is further used to encourage cross-modal consistency.
Evaluation on four city districts, namely Philadelphia (PA, USA), Pudong (Shanghai, China), San Francisco (CA, USA), and Seattle (WA, USA), substantiates the effectiveness of our approach and shows that its performance is on par with supervised competitors.
Lian et al. (Wed,) studied this question.
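The graph stage described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the GCN-style normalization, the single propagation step, the symmetric InfoNCE loss, the random stand-in embeddings, and all function names (`normalize_adjacency`, `propagate`, `info_nce`) are assumptions chosen for illustration, and the clustering step is omitted.

```python
import numpy as np

def normalize_adjacency(adj):
    """Symmetrically normalize A + I (assumed GCN-style normalization)."""
    a_hat = adj + np.eye(adj.shape[0])          # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def propagate(a_norm, feats, weight):
    """One message-passing step: aggregate neighbors, project, apply ReLU."""
    return np.maximum(a_norm @ feats @ weight, 0.0)

def info_nce(z_a, z_b, temperature=0.1):
    """Symmetric InfoNCE: region i in one modality should match region i in the other."""
    z_a = z_a / (np.linalg.norm(z_a, axis=1, keepdims=True) + 1e-8)
    z_b = z_b / (np.linalg.norm(z_b, axis=1, keepdims=True) + 1e-8)
    logits = z_a @ z_b.T / temperature

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))      # positives lie on the diagonal

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)

# Hypothetical toy city: four regions adjacent along a path 0-1-2-3.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)

# Random stand-ins for the frozen LMM image and text encoder outputs.
visual = rng.normal(size=(4, 8))
textual = rng.normal(size=(4, 8))

# The only trainable parameters in this sketch: one lightweight projection per graph.
w_visual = rng.normal(size=(8, 4)) * 0.5
w_textual = rng.normal(size=(8, 4)) * 0.5

a_norm = normalize_adjacency(adj)
h_visual = propagate(a_norm, visual, w_visual)
h_textual = propagate(a_norm, textual, w_textual)
loss = info_nce(h_visual, h_textual)
print("cross-modal contrastive loss:", float(loss))
```

In a trainable version, the two projection matrices would be updated by gradient descent on this loss while the encoder outputs stay fixed, which is what keeps the approach lightweight.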