What question did this study set out to answer?

The central aim is to improve the discovery of urban functional zones by utilizing large multi-modal models without extensive data labeling.

May 16, 2026

Discovering Urban Functional Zones using Training-Free Large Multi-Modal Models

Key Points

Keeps large multi-modal model encoders frozen while training lightweight graph-based models.
Partitions the area into regions using road networks and generates embeddings with LMMs.
Constructs graphs to encode spatial adjacency and captures correlations through message-passing.
Applies graph clustering to derive prototypes that represent nearby zones with similar urban functions.
Evaluation on four city districts (Philadelphia, Pudong, San Francisco, and Seattle) shows performance on par with supervised competitors.
Recognizes urban functional zones without LMM fine-tuning or additional pre-training, which reduces the computational burden.
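
The graph step in the key points can be illustrated with a minimal sketch. It assumes a GCN-style mean aggregation over a region adjacency graph; the study does not state which message-passing layer it uses, so the normalization, the ReLU, the toy chain of five regions, and the random arrays standing in for frozen LMM embeddings are all illustrative assumptions.

```python
import numpy as np


def normalized_adjacency(edges, num_regions):
    """Symmetric adjacency with self-loops, scaled by D^-1/2 A D^-1/2."""
    adj = np.eye(num_regions)
    for i, j in edges:
        adj[i, j] = 1.0
        adj[j, i] = 1.0
    d_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
    return adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def message_pass(features, adj_norm, weight):
    """One propagation step: aggregate neighbors, project, apply ReLU."""
    return np.maximum(adj_norm @ features @ weight, 0.0)


rng = np.random.default_rng(0)
num_regions, embed_dim, hidden_dim = 5, 8, 4

# Placeholders for frozen LMM outputs (hypothetical values, one row per region).
visual = rng.normal(size=(num_regions, embed_dim))
textual = rng.normal(size=(num_regions, embed_dim))

# Hypothetical adjacency: five regions in a chain, neighbors share a road.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
adj_norm = normalized_adjacency(edges, num_regions)

# Only these small projection matrices would be trained; the LMM stays frozen.
w_visual = 0.1 * rng.normal(size=(embed_dim, hidden_dim))
w_textual = 0.1 * rng.normal(size=(embed_dim, hidden_dim))

# One graph per modality, both built on the same spatial adjacency.
h_visual = message_pass(visual, adj_norm, w_visual)
h_textual = message_pass(textual, adj_norm, w_textual)
```

Each region's output row mixes its own embedding with those of its road-adjacent neighbors, which is the sense in which message passing can capture spatial correlation.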

Abstract

Discovering urban functional zones (UFZs) is critical for understanding city spatial structures and supporting effective urban planning. Existing approaches to UFZ discovery typically rely on one of three costly strategies: (1) training large vision models directly on satellite imagery, which demands substantial computational resources; (2) leveraging crowdsourced data such as Points of Interest (POIs) from platforms like OpenStreetMap, which may be incomplete, inconsistent, or unavailable in many regions; or (3) collecting custom labeled data, which requires significant time, expense, and expert effort. Large multi-modal models (LMMs) have recently emerged as a promising alternative, offering strong capabilities in interpreting visual content without requiring extensive data labeling. However, their performance remains limited on the UFZ discovery task, where they often struggle to capture the complex spatial and functional details and interactions of urban regions. To address this challenge, we propose a new approach that enhances LMMs’ reasoning capability to recognize urban functional zones by keeping LMM encoders frozen while training only lightweight graph-based models, eliminating the need for LMM fine-tuning or additional pre-training. Specifically, our approach first partitions the target area into small regions along the road network; for each region, the LMM's image and text encoders independently generate visual and textual embeddings. Then, two graphs are constructed in which nodes represent regions with features defined by their respective embeddings, and edges encode their spatial adjacency. Message passing on the two graphs thus captures spatial correlation between the visual and textual modalities. Graph clustering then suggests prototypes representing nearby zones with similar urban functions, and contrastive learning is further leveraged to encourage cross-modal consistency.
Evaluation on four city districts, namely Philadelphia (PA, USA), Pudong (Shanghai, China), San Francisco (CA, USA), and Seattle (WA, USA), substantiates the effectiveness of our proposal and shows that its performance is on par with supervised competitors.
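
As a rough illustration of the cross-modal consistency objective mentioned in the abstract, the sketch below computes a symmetric InfoNCE-style contrastive loss between per-region visual and textual embeddings. The abstract does not name the exact loss, so the InfoNCE form, the temperature of 0.1, the function names, and the random toy embeddings are assumptions made for illustration only.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row maximum for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def cross_modal_infonce(visual, textual, temperature=0.1):
    """Symmetric InfoNCE: region i's two embeddings form the positive pair."""
    v = l2_normalize(visual)
    t = l2_normalize(textual)
    logits = v @ t.T / temperature
    idx = np.arange(len(v))
    loss_v2t = -log_softmax(logits)[idx, idx].mean()
    loss_t2v = -log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (loss_v2t + loss_t2v)


rng = np.random.default_rng(0)
visual = rng.normal(size=(6, 16))  # hypothetical region embeddings

# Textual embeddings that agree with the visual ones, up to small noise.
aligned_text = visual + 0.01 * rng.normal(size=visual.shape)
# The same embeddings assigned to the wrong regions.
shuffled_text = np.roll(aligned_text, 1, axis=0)

loss_aligned = cross_modal_infonce(visual, aligned_text)
loss_shuffled = cross_modal_infonce(visual, shuffled_text)
```

The loss is lower when a region's visual and textual embeddings agree than when they are mismatched, so minimizing it pushes the two modalities toward consistent region representations.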
