Evaluating large language model adaptation strategies for geospatial code generation
(2025) In Student thesis series INES, NGEM01 20251, Dept of Physical Geography and Ecosystem Science
- Abstract
- Recent advances in Large Language Models (LLMs) offer a promising alternative by enabling code generation from natural language. However, despite this progress, LLMs still struggle with spatial reasoning, structural fidelity, and robustness across diverse GIS datasets. This thesis systematically compares three LLM adaptation strategies, namely prompt engineering, retrieval-augmented generation (RAG), and QLoRA fine-tuning, for their effectiveness in geospatial code generation. A custom multi-agent evaluation framework is developed, testing six configurations on real-world datasets and validated QGIS scripts. The evaluation combines semantic metrics (CodeBERTScore, embedding similarity) with structural measures (CodeBLEU), and introduces AST-derived indicators of structural complexity and behavioral richness in generated code. Results show that while all strategies affect code structure, their impact on semantic fidelity is limited. Prompting and RAG enhance structural conformity by imposing external scaffolds, whereas QLoRA improves fluency and intent alignment but struggles with structural generalization. These findings highlight the need for structurally diverse supervision and varied training corpora to improve the reliability of fine-tuned LLMs in GIS programming contexts.
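The abstract does not define its AST-derived indicators, but the general idea can be illustrated with Python's standard-library ast module: parse generated code and derive simple structural measures from the tree. In the minimal sketch below, maximum nesting depth stands in for structural complexity and the number of distinct call targets for behavioral richness; both definitions are illustrative assumptions, not the thesis's own metrics.

```python
import ast

def max_depth(node: ast.AST, depth: int = 0) -> int:
    """Maximum AST nesting depth: a simple proxy for structural complexity."""
    children = list(ast.iter_child_nodes(node))
    if not children:
        return depth
    return max(max_depth(child, depth + 1) for child in children)

def distinct_calls(tree: ast.AST) -> int:
    """Number of distinct call targets: a simple proxy for behavioral richness."""
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):
                names.add(func.id)
            elif isinstance(func, ast.Attribute):
                names.add(func.attr)
    return len(names)

# A generated PyQGIS snippet is only parsed here, never executed,
# so no QGIS environment is required to score it.
generated = """
layer = iface.activeLayer()
for feature in layer.getFeatures():
    if feature.geometry().area() > 1000:
        print(feature.id())
"""
tree = ast.parse(generated)
print(max_depth(tree), distinct_calls(tree))
```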
- Popular Abstract
- Original title: Evaluating Large Language Model Adaptation Strategies for Geospatial Code Generation
Geographical data is all around us—from satellite images and city maps to data showing flood zones, roads, or forest cover. This kind of information helps people make better decisions about real-world problems. But using it often requires writing complex code in specialized tools like QGIS, which is a big barrier for non-experts.
Recent advances in artificial intelligence have introduced large language models like ChatGPT that can write code based on plain English instructions. But here’s the problem: these models are not yet good at handling the complicated formats and logic used in geographic information systems (GIS). They often make mistakes, misunderstand spatial relationships, or create code that doesn’t run.
This thesis explores three ways to help these models do better (a toy code sketch contrasting them follows the list):
• Prompt Engineering: Giving the model clearer instructions.
• RAG (Retrieval-Augmented Generation): Letting the model “look up” useful documentation.
• Fine-tuning: Training the model with real GIS examples.
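To make the contrast between the first two strategies concrete, here is a small, purely hypothetical Python sketch: a plain prompt versus a retrieval-augmented prompt that prepends the most relevant documentation snippet, scored by naive keyword overlap. The snippet texts, retrieval scoring, and prompt wording are all illustrative assumptions, not the pipeline used in the thesis.

```python
# Illustrative sketch only: a toy retriever and prompt builders for the two
# strategies described above. The documentation snippets and the overlap
# scoring are hypothetical stand-ins for a real retrieval pipeline.

DOC_SNIPPETS = [
    "QgsVectorLayer(path, name, 'ogr') loads a vector layer from disk.",
    "layer.getFeatures() iterates over all features in a vector layer.",
    "QgsProject.instance().addMapLayer(layer) adds a layer to the project.",
]

def retrieve(task: str, snippets: list[str]) -> str:
    """Naive keyword-overlap retrieval: return the best-matching snippet."""
    task_words = set(task.lower().split())
    return max(snippets, key=lambda s: len(task_words & set(s.lower().split())))

def plain_prompt(task: str) -> str:
    """Prompt engineering: clear instructions, no external context."""
    return f"Write a PyQGIS script that does the following:\n{task}"

def rag_prompt(task: str, snippets: list[str]) -> str:
    """RAG: let the model 'look up' documentation before answering."""
    context = retrieve(task, snippets)
    return f"Relevant documentation:\n{context}\n\n{plain_prompt(task)}"

print(rag_prompt("load a vector layer and add it to the project", DOC_SNIPPETS))
```

Fine-tuning, the third strategy, has no prompt-side counterpart in this sketch: it changes the model's weights by training on GIS examples rather than changing what the model is shown at inference time.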
The results showed that no single method solves everything. Prompting and RAG helped the AI organize its code better. Fine-tuning made it more fluent and better aligned with the user’s intent. However, even the best models still had trouble with complex logic and unfamiliar data formats.
In short, AI can help automate geospatial tasks, but it needs guidance both in how it's asked and what it learns from. With smarter prompting, training, and tools to check their work, language models could one day make GIS more accessible to everyone.
Please use this URL to cite or link to this publication:
http://lup.lub.lu.se/student-papers/record/9204646
- author
- Zhu, Kaiyuan
- supervisor
- organization
- course
- NGEM01 20251
- year
- 2025
- type
- H2 - Master's Degree (Two Years)
- subject
- keywords
- Geospatial Code Generation, Large Language Models, Prompt Engineering, Retrieval-Augmented Generation, QLoRA Fine-Tuning, Structure-Aware Code Evaluation.
- publication/series
- Student thesis series INES
- report number
- 716
- language
- English
- id
- 9204646
- date added to LUP
- 2025-06-24 14:11:10
- date last changed
- 2025-06-24 14:11:10
@misc{9204646,
  abstract = {{Recent advances in Large Language Models (LLMs) offer a promising alternative by enabling code generation from natural language. However, despite this progress, LLMs still struggle with spatial reasoning, structural fidelity, and robustness across diverse GIS datasets. This thesis systematically compares three LLM adaptation strategies, namely prompt engineering, retrieval-augmented generation (RAG), and QLoRA fine-tuning, for their effectiveness in geospatial code generation. A custom multi-agent evaluation framework is developed, testing six configurations on real-world datasets and validated QGIS scripts. The evaluation combines semantic metrics (CodeBERTScore, embedding similarity) with structural measures (CodeBLEU), and introduces AST-derived indicators of structural complexity and behavioral richness in generated code. Results show that while all strategies affect code structure, their impact on semantic fidelity is limited. Prompting and RAG enhance structural conformity by imposing external scaffolds, whereas QLoRA improves fluency and intent alignment but struggles with structural generalization. These findings highlight the need for structurally diverse supervision and varied training corpora to improve the reliability of fine-tuned LLMs in GIS programming contexts.}},
  author = {{Zhu, Kaiyuan}},
  language = {{eng}},
  note = {{Student Paper}},
  series = {{Student thesis series INES}},
  title = {{Evaluating large language model adaptation strategies for geospatial code generation}},
  year = {{2025}},
}