Evaluating large language model adaptation strategies for geospatial code generation
(2025) In Student thesis series INES, NGEM01 20251, Dept of Physical Geography and Ecosystem Science
- Abstract
- Recent advances in Large Language Models (LLMs) offer a promising alternative by enabling code generation from natural language. However, despite this progress, LLMs still struggle with spatial reasoning, structural fidelity, and robustness across diverse GIS datasets. This thesis systematically compares three LLM adaptation strategies, namely prompt engineering, retrieval-augmented generation (RAG), and QLoRA fine-tuning, for their effectiveness in geospatial code generation. A custom multi-agent evaluation framework is developed, testing six configurations on real-world datasets and validated QGIS scripts. The evaluation combines semantic metrics (CodeBERTScore, embedding similarity) with structural measures (CodeBLEU), and introduces AST-derived indicators of structural complexity and behavioral richness in generated code. Results show that while all strategies affect code structure, their impact on semantic fidelity is limited. Prompting and RAG enhance structural conformity by imposing external scaffolds, whereas QLoRA improves fluency and intent alignment but struggles with structural generalization. These findings highlight the need for structurally diverse supervision and varied training corpora to improve the reliability of fine-tuned LLMs in GIS programming contexts.
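The abstract does not define its AST-derived indicators, but the general idea can be illustrated with Python's standard-library ast module: parse generated code and derive simple structural measures from the tree. In the minimal sketch below, maximum nesting depth stands in for structural complexity and the number of distinct call targets for behavioral richness; both definitions are illustrative assumptions, not the thesis's own metrics.

```python
import ast

def max_depth(node: ast.AST, depth: int = 0) -> int:
    """Maximum AST nesting depth: a simple proxy for structural complexity."""
    children = list(ast.iter_child_nodes(node))
    if not children:
        return depth
    return max(max_depth(child, depth + 1) for child in children)

def distinct_calls(tree: ast.AST) -> int:
    """Number of distinct call targets: a simple proxy for behavioral richness."""
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):
                names.add(func.id)
            elif isinstance(func, ast.Attribute):
                names.add(func.attr)
    return len(names)

# A generated PyQGIS snippet is only parsed here, never executed,
# so no QGIS environment is required to score it.
generated = """
layer = iface.activeLayer()
for feature in layer.getFeatures():
    if feature.geometry().area() > 1000:
        print(feature.id())
"""
tree = ast.parse(generated)
print(max_depth(tree), distinct_calls(tree))
```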
- Popular Abstract
- Original title: Evaluating Large Language Model Adaptation Strategies for Geospatial Code Generation
Geographical data is all around us—from satellite images and city maps to data showing flood zones, roads, or forest cover. This kind of information helps people make better decisions about real-world problems. But using it often requires writing complex code in specialized tools like QGIS, which is a big barrier for non-experts.
Recent advances in artificial intelligence have introduced large language models like ChatGPT that can write code based on plain English instructions. But here’s the problem: these models are not yet good at handling the complicated formats and logic used in geographic information systems (GIS). They often make mistakes, misunderstand spatial relationships, or create code that doesn’t run.
This thesis explores three ways to help these models do better (a toy code sketch contrasting them follows the list):
• Prompt Engineering: Giving the model clearer instructions.
• RAG (Retrieval-Augmented Generation): Letting the model “look up” useful documentation.
• Fine-tuning: Training the model with real GIS examples.
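To make the contrast between the first two strategies concrete, here is a small, purely hypothetical Python sketch: a plain prompt versus a retrieval-augmented prompt that prepends the most relevant documentation snippet, scored by naive keyword overlap. The snippet texts, retrieval scoring, and prompt wording are all illustrative assumptions, not the pipeline used in the thesis.

```python
# Illustrative sketch only: a toy retriever and prompt builders for the two
# strategies described above. The documentation snippets and the overlap
# scoring are hypothetical stand-ins for a real retrieval pipeline.

DOC_SNIPPETS = [
    "QgsVectorLayer(path, name, 'ogr') loads a vector layer from disk.",
    "layer.getFeatures() iterates over all features in a vector layer.",
    "QgsProject.instance().addMapLayer(layer) adds a layer to the project.",
]

def retrieve(task: str, snippets: list[str]) -> str:
    """Naive keyword-overlap retrieval: return the best-matching snippet."""
    task_words = set(task.lower().split())
    return max(snippets, key=lambda s: len(task_words & set(s.lower().split())))

def plain_prompt(task: str) -> str:
    """Prompt engineering: clear instructions, no external context."""
    return f"Write a PyQGIS script that does the following:\n{task}"

def rag_prompt(task: str, snippets: list[str]) -> str:
    """RAG: let the model 'look up' documentation before answering."""
    context = retrieve(task, snippets)
    return f"Relevant documentation:\n{context}\n\n{plain_prompt(task)}"

print(rag_prompt("load a vector layer and add it to the project", DOC_SNIPPETS))
```

Fine-tuning, the third strategy, has no prompt-side counterpart in this sketch: it changes the model's weights by training on GIS examples rather than changing what the model is shown at inference time.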
The results showed that no single method solves everything. Prompting and RAG helped the AI organize its code better. Fine-tuning made it more fluent and better aligned with the user’s intent. However, even the best models still had trouble with complex logic and unfamiliar data formats.
In short, AI can help automate geospatial tasks, but it needs guidance both in how it's asked and what it learns from. With smarter prompting, training, and tools to check their work, language models could one day make GIS more accessible to everyone.
Please use this URL to cite or link to this publication:
http://lup.lub.lu.se/student-papers/record/9204646
- author
- Zhu, Kaiyuan
- supervisor
- organization
- course
- NGEM01 20251
- year
- 2025
- type
- H2 - Master's Degree (Two Years)
- subject
- keywords
- Geospatial Code Generation, Large Language Models, Prompt Engineering, Retrieval-Augmented Generation, QLoRA Fine-Tuning, Structure-Aware Code Evaluation.
- publication/series
- Student thesis series INES
- report number
- 716
- language
- English
- id
- 9204646
- date added to LUP
- 2025-06-24 14:11:10
- date last changed
- 2025-06-24 14:11:10
@misc{9204646,
  abstract = {{Recent advances in Large Language Models (LLMs) offer a promising alternative by enabling code generation from natural language. However, despite this progress, LLMs still struggle with spatial reasoning, structural fidelity, and robustness across diverse GIS datasets. This thesis systematically compares three LLM adaptation strategies, namely prompt engineering, retrieval-augmented generation (RAG), and QLoRA fine-tuning, for their effectiveness in geospatial code generation. A custom multi-agent evaluation framework is developed, testing six configurations on real-world datasets and validated QGIS scripts. The evaluation combines semantic metrics (CodeBERTScore, embedding similarity) with structural measures (CodeBLEU), and introduces AST-derived indicators of structural complexity and behavioral richness in generated code. Results show that while all strategies affect code structure, their impact on semantic fidelity is limited. Prompting and RAG enhance structural conformity by imposing external scaffolds, whereas QLoRA improves fluency and intent alignment but struggles with structural generalization. These findings highlight the need for structurally diverse supervision and varied training corpora to improve the reliability of fine-tuned LLMs in GIS programming contexts.}},
  author = {{Zhu, Kaiyuan}},
  language = {{eng}},
  note = {{Student Paper}},
  series = {{Student thesis series INES}},
  title = {{Evaluating large language model adaptation strategies for geospatial code generation}},
  year = {{2025}},
}