Fine-grained urban land use simulation: integrating spatial dynamic modeling with a pre-trained vision-language model
(2026) In Computers, Environment and Urban Systems 126.
- Abstract
- Accurate prediction of urban land use changes at fine spatial scales is essential for developing healthy and sustainable cities, yet traditional simulation models struggle to capture local dynamics due to limited availability of fine-grained data and insufficient complexity in modeling urban systems. To address these limitations, we propose a novel approach that leverages advances in pre-trained vision-language foundation models combined with spatial dynamic modeling to forecast detailed urban land use patterns. Specifically, we collected a spatially dense collection of street view images (SVIs) throughout Shenzhen, China, and applied UrbanCLIP, a specialized vision-language prompting framework, to perform zero-shot inference of urban land use directly from images without labeled datasets and model retraining. The resulting fine-grained classifications delineate eight distinct urban land use types, producing a detailed urban functional map. These high-resolution patterns were then integrated into a spatial dynamic model enhanced by polynomial regression to simulate urban evolution toward 2035. This approach effectively captures neighborhood influences, socioeconomic drivers, and urban planning policies. Our simulation provides actionable insights for sustainable development in Shenzhen by identifying areas for balanced growth, targeted infrastructure investments, and ecological preservation. Compared to conventional methods, our methodology significantly improves predictive accuracy and spatial granularity. By incorporating foundation models, our approach addresses traditional data constraints, offering scalable and robust tools for informed urban governance and decision-making.
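The zero-shot inference step the abstract describes can be sketched as CLIP-style prompt matching: each candidate land use class gets a text prompt, the image embedding is compared against all prompt embeddings, and the best match is taken as the label. The sketch below is a minimal illustration under assumptions, not the authors' UrbanCLIP implementation; the class labels are hypothetical (the paper specifies only that there are eight types), and random vectors stand in for embeddings that a real pre-trained vision-language model would produce.

```python
import numpy as np

# Hypothetical labels for the eight land use types; the exact class names
# used in the paper are not given in the abstract.
LAND_USE_CLASSES = [
    "residential", "commercial", "industrial", "transportation",
    "public service", "green space", "water body", "vacant land",
]

def zero_shot_classify(image_emb: np.ndarray, text_embs: np.ndarray):
    """Return the best-matching class and softmax scores over all classes.

    image_emb: (d,) embedding of one street view image.
    text_embs: (n_classes, d) embeddings of one text prompt per class.
    """
    # Cosine similarity between the image and each class prompt.
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img
    # Temperature-scaled softmax, as in CLIP (logit scale ~100).
    logits = 100.0 * sims
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return LAND_USE_CLASSES[int(np.argmax(probs))], probs

# Synthetic demo: build an "image" embedding close to the prompt for class 2,
# so the classifier should recover "industrial".
rng = np.random.default_rng(0)
text_embs = rng.normal(size=(8, 512))
image_emb = text_embs[2] + 0.1 * rng.normal(size=512)
label, probs = zero_shot_classify(image_emb, text_embs)
print(label)  # → industrial
```

Because the scoring is prompt-driven, swapping in a different land use taxonomy only requires new text prompts, with no labeled data or retraining — which is the property the abstract highlights.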
Please use this url to cite or link to this publication:
https://lup.lub.lu.se/record/0009a7c5-3dc1-462d-b55c-321d424a0466
- author
- Cai, Zipan ; Karvonen, Andrew LU ; Cong, Cong and Huang, Weiming
- organization
- publishing date
- 2026-02-26
- type
- Contribution to journal
- publication status
- published
- subject
- keywords
- Land use change, Vision-language models, Foundation models, Spatial dynamic modeling, Street view images
- in
- Computers, Environment and Urban Systems
- volume
- 126
- article number
- 102416
- pages
- 16 pages
- publisher
- Elsevier
- ISSN
- 0198-9715
- DOI
- 10.1016/j.compenvurbsys.2026.102416
- project
- Urban Arena
- language
- English
- LU publication?
- yes
- id
- 0009a7c5-3dc1-462d-b55c-321d424a0466
- date added to LUP
- 2026-02-26 18:24:52
- date last changed
- 2026-03-24 09:51:04
@article{0009a7c5-3dc1-462d-b55c-321d424a0466,
abstract = {{Accurate prediction of urban land use changes at fine spatial scales is essential for developing healthy and sustainable cities, yet traditional simulation models struggle to capture local dynamics due to limited availability of fine-grained data and insufficient complexity in modeling urban systems. To address these limitations, we propose a novel approach that leverages advances in pre-trained vision-language foundation models combined with spatial dynamic modeling to forecast detailed urban land use patterns. Specifically, we collected a spatially dense collection of street view images (SVIs) throughout Shenzhen, China, and applied UrbanCLIP, a specialized vision-language prompting framework, to perform zero-shot inference of urban land use directly from images without labeled datasets and model retraining. The resulting fine-grained classifications delineate eight distinct urban land use types, producing a detailed urban functional map. These high-resolution patterns were then integrated into a spatial dynamic model enhanced by polynomial regression to simulate urban evolution toward 2035. This approach effectively captures neighborhood influences, socioeconomic drivers, and urban planning policies. Our simulation provides actionable insights for sustainable development in Shenzhen by identifying areas for balanced growth, targeted infrastructure investments, and ecological preservation. Compared to conventional methods, our methodology significantly improves predictive accuracy and spatial granularity. By incorporating foundation models, our approach addresses traditional data constraints, offering scalable and robust tools for informed urban governance and decision-making.}},
author = {{Cai, Zipan and Karvonen, Andrew and Cong, Cong and Huang, Weiming}},
issn = {{0198-9715}},
keywords = {{Land use change; Vision-language models; Foundation models; Spatial dynamic modeling; Street view images}},
language = {{eng}},
month = {{02}},
publisher = {{Elsevier}},
journal = {{Computers, Environment and Urban Systems}},
title = {{Fine-grained urban land use simulation: integrating spatial dynamic modeling with a pre-trained vision-language model}},
url = {{http://dx.doi.org/10.1016/j.compenvurbsys.2026.102416}},
doi = {{10.1016/j.compenvurbsys.2026.102416}},
volume = {{126}},
year = {{2026}},
}