
Lund University Publications

LUND UNIVERSITY LIBRARIES

Fine-grained urban land use simulation : integrating spatial dynamic modeling with a pre-trained vision-language model

Cai, Zipan; Karvonen, Andrew (LU); Cong, Cong and Huang, Weiming (2026) In Computers, Environment and Urban Systems 126.
Abstract
Accurate prediction of urban land use changes at fine spatial scales is essential for developing healthy and sustainable cities, yet traditional simulation models struggle to capture local dynamics due to limited availability of fine-grained data and insufficient complexity in modeling urban systems. To address these limitations, we propose a novel approach that leverages advances in pre-trained vision-language foundation models combined with spatial dynamic modeling to forecast detailed urban land use patterns. Specifically, we collected a spatially dense collection of street view images (SVIs) throughout Shenzhen, China, and applied UrbanCLIP, a specialized vision-language prompting framework, to perform zero-shot inference of urban land use directly from images without labeled datasets or model retraining. The resulting fine-grained classifications delineate eight distinct urban land use types, producing a detailed urban functional map. These high-resolution patterns were then integrated into a spatial dynamic model enhanced by polynomial regression to simulate urban evolution toward 2035. This approach effectively captures neighborhood influences, socioeconomic drivers, and urban planning policies. Our simulation provides actionable insights for sustainable development in Shenzhen by identifying areas for balanced growth, targeted infrastructure investments, and ecological preservation. Compared to conventional methods, our methodology significantly improves predictive accuracy and spatial granularity. By incorporating foundation models, our approach addresses traditional data constraints, offering scalable and robust tools for informed urban governance and decision-making.
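The zero-shot classification step the abstract describes can be sketched as follows. A CLIP-style vision-language model embeds an image and a set of text prompts (one per land use class) into a shared space, and the class whose prompt is most similar to the image wins. This is a minimal NumPy sketch of that mechanism only: the random vectors below stand in for the UrbanCLIP image and text encoder outputs, and the class names and prompt wording are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Eight land use classes, per the abstract; the specific names here are
# illustrative assumptions, not taken from the paper.
LAND_USE_TYPES = [
    "residential", "commercial", "industrial", "institutional",
    "transportation", "green space", "water", "vacant land",
]

def zero_shot_classify(image_emb: np.ndarray, prompt_embs: np.ndarray) -> int:
    """Return the index of the class prompt most similar to the image.

    Both inputs are L2-normalized so the dot product equals cosine
    similarity, mirroring how CLIP-style models score image-text pairs.
    """
    image_emb = image_emb / np.linalg.norm(image_emb)
    prompt_embs = prompt_embs / np.linalg.norm(prompt_embs, axis=1, keepdims=True)
    sims = prompt_embs @ image_emb  # one cosine similarity per class prompt
    return int(np.argmax(sims))

# Toy demonstration: random vectors play the role of encoder outputs, and
# the "image" embedding is placed near the "commercial" prompt embedding.
rng = np.random.default_rng(0)
prompts = rng.normal(size=(len(LAND_USE_TYPES), 64))
image = prompts[1] + 0.05 * rng.normal(size=64)
print(LAND_USE_TYPES[zero_shot_classify(image, prompts)])  # prints "commercial"
```

In practice the embeddings would come from the pre-trained encoders, so no labeled training set or fine-tuning is needed — which is the "zero-shot, no retraining" property the abstract emphasizes.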
author
Cai, Zipan; Karvonen, Andrew (LU); Cong, Cong and Huang, Weiming
organization
publishing date
2026-02
type
Contribution to journal
publication status
published
subject
keywords
Land use change, Vision-language models, Foundation models, Spatial dynamic modeling, Street view images
in
Computers, Environment and Urban Systems
volume
126
article number
102416
pages
16 pages
publisher
Elsevier
ISSN
0198-9715
DOI
10.1016/j.compenvurbsys.2026.102416
project
Urban Arena
language
English
LU publication?
yes
id
0009a7c5-3dc1-462d-b55c-321d424a0466
date added to LUP
2026-02-26 18:24:52
date last changed
2026-03-24 09:51:04
@article{0009a7c5-3dc1-462d-b55c-321d424a0466,
  abstract     = {{Accurate prediction of urban land use changes at fine spatial scales is essential for developing healthy and sustainable cities, yet traditional simulation models struggle to capture local dynamics due to limited availability of fine-grained data and insufficient complexity in modeling urban systems. To address these limitations, we propose a novel approach that leverages advances in pre-trained vision-language foundation models combined with spatial dynamic modeling to forecast detailed urban land use patterns. Specifically, we collected a spatially dense collection of street view images (SVIs) throughout Shenzhen, China, and applied UrbanCLIP, a specialized vision-language prompting framework, to perform zero-shot inference of urban land use directly from images without labeled datasets and model retraining. The resulting fine-grained classifications delineate eight distinct urban land use types, producing a detailed urban functional map. These high-resolution patterns were then integrated into a spatial dynamic model enhanced by polynomial regression to simulate urban evolution toward 2035. This approach effectively captures neighborhood influences, socioeconomic drivers, and urban planning policies. Our simulation provides actionable insights for sustainable development in Shenzhen by identifying areas for balanced growth, targeted infrastructure investments, and ecological preservation. Compared to conventional methods, our methodology significantly improves predictive accuracy and spatial granularity. By incorporating foundation models, our approach addresses traditional data constraints, offering scalable and robust tools for informed urban governance and decision-making.}},
  author       = {{Cai, Zipan and Karvonen, Andrew and Cong, Cong and Huang, Weiming}},
  issn         = {{0198-9715}},
  keywords     = {{Land use change; Vision-language models; Foundation models; Spatial dynamic modeling; Street view images}},
  language     = {{eng}},
  month        = {{02}},
  publisher    = {{Elsevier}},
  series       = {{Computers, Environment and Urban Systems}},
  title        = {{Fine-grained urban land use simulation : integrating spatial dynamic modeling with a pre-trained vision-language model}},
  url          = {{http://dx.doi.org/10.1016/j.compenvurbsys.2026.102416}},
  doi          = {{10.1016/j.compenvurbsys.2026.102416}},
  volume       = {{126}},
  year         = {{2026}},
}