LUP Student Papers

LUND UNIVERSITY LIBRARIES

Estimating dissolved organic carbon in Lake Mälaren and Lake Erken, Sweden: A comparative study of spatial transferability and model generalisability using XGBoost and Sentinel-2 data

Albus, Christina Elena LU (2025) In Student thesis series INES NGEM01 20251
Dept of Physical Geography and Ecosystem Science
Abstract
Accurate monitoring of dissolved organic carbon (DOC) in lakes is essential for understanding aquatic carbon dynamics, assessing lake ecosystem health, and supporting water resource management. While remote sensing offers scalable alternatives to in-situ monitoring, machine learning models often struggle to generalise across systems. This thesis investigates the potential of the eXtreme Gradient Boosting (XGBoost) machine learning algorithm for estimating DOC concentrations in two lakes in central eastern Sweden: Lake Mälaren and Lake Erken. The XGBoost model integrates Sentinel-2 surface reflectance imagery, environmental variables from ERA5-Land reanalysis data, geographic coordinates and in-situ measurements. A locally trained XGBoost model for Lake Mälaren yielded accurate DOC predictions, characterised by low error metrics and consistent performance, suggesting high model reliability under varied feature combinations. The SHapley Additive exPlanations (SHAP) analysis identified latitude and catchment runoff as the primary predictive variables, whereas spectral reflectance features contributed the least. To evaluate spatial transferability and model generalisation, two approaches were applied: (1) the Cross-Lake Generalisation Model, which was trained on Lake Mälaren and independently tested on Lake Erken, and (2) the Lake-to-Lake Transferred Model, which used Lake Erken data for training and testing but utilised the hyperparameter configuration derived from Lake Mälaren. While the Cross-Lake Generalisation Model showed underfitting and poor generalisation, the Lake-to-Lake approach offered more stable, though still limited, predictive accuracy. These results highlight that while model architecture can transfer, successful application across lakes depends on local data structure, input alignment, and ecological context. SHAP proved essential in interpreting model logic and assessing generalisation.
Overall, the findings demonstrate that while XGBoost offers strong predictive performance in well-characterised lakes, its broader applicability remains constrained by differences in ecological context, data distribution, and variable relevance. The SHAP analysis offered valuable insight into system-specific predictor relevance and shifts in feature importance, enhancing interpretability. The results underscore the need for expanded predictor testing and the development of more transferable modelling frameworks to support scalable DOC monitoring across diverse freshwater systems.
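The two transfer set-ups named in the abstract differ only in which lake supplies the training data and which supplies the hyperparameters. The sketch below is an illustration of that difference, not the thesis's code. Everything in it is an assumption made for the example: the function names, the hyperparameter values, the 25% hold-out split, RMSE as the error metric, the `MeanRegressor` stand-in (used so the block runs without third-party packages), and the synthetic numbers, which are not thesis data. Any regressor with a `fit`/`predict` interface that takes hyperparameters as keyword arguments, such as `xgboost.XGBRegressor`, could be passed in place of the stand-in.

```python
import math
import random


class MeanRegressor:
    """Stand-in model (assumption): predicts the training mean, ignores params."""

    def __init__(self, **params):
        self.params = params

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def cross_lake_generalisation(model_cls, params, source, target):
    """Approach (1): train on the source lake, test on the target lake as-is."""
    X_src, y_src = source
    X_tgt, y_tgt = target
    model = model_cls(**params).fit(X_src, y_src)
    return rmse(y_tgt, model.predict(X_tgt))


def lake_to_lake_transfer(model_cls, params, target, test_fraction=0.25, seed=0):
    """Approach (2): keep the source-lake hyperparameters, but train and test
    on a random split of the target lake's own data."""
    X, y = target
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, int(len(idx) * test_fraction))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    model = model_cls(**params).fit([X[i] for i in train_idx], [y[i] for i in train_idx])
    return rmse([y[i] for i in test_idx], model.predict([X[i] for i in test_idx]))


# Hypothetical hyperparameters, imagined as tuned on the source lake.
params = {"max_depth": 4, "learning_rate": 0.1}

# Synthetic placeholder data (one dummy feature per sample); NOT thesis data.
source = ([[i] for i in range(8)], [8.0, 8.2, 7.8, 8.1, 7.9, 8.0, 8.3, 7.7])
target = ([[i] for i in range(8)], [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7])

cross_rmse = cross_lake_generalisation(MeanRegressor, params, source, target)
transfer_rmse = lake_to_lake_transfer(MeanRegressor, params, target)
print(f"cross-lake RMSE: {cross_rmse:.2f}, lake-to-lake RMSE: {transfer_rmse:.2f}")
```

The synthetic target values are deliberately offset from the source values, so the cross-lake error comes out larger here by construction; the numbers say nothing about the thesis's actual results.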
Popular Abstract
This thesis investigates how satellite data and machine learning can be used to estimate the concentration of dissolved organic carbon (DOC) in lakes, a key water quality indicator that affects both the environment and drinking water treatment. The goal was to see if a model trained on one lake could be used to predict DOC in another.
Traditionally, DOC is measured through manual water sampling, which is time-consuming and difficult to scale across many lakes. In this study, satellite imagery from Sentinel-2 and environmental data from ERA5-Land were combined to train a machine learning model, XGBoost, for estimating DOC levels in two Swedish lakes: Lake Mälaren and Lake Erken. The model performed well when trained and applied within a single lake, with location and runoff emerging as the most important factors for prediction.
The thesis further explored how transferable the model was between different lake systems. The model originally trained on Lake Mälaren was first tested on Lake Erken, but this approach performed poorly. In a second approach, the model was retrained using Lake Erken’s data while keeping the same internal settings from the Lake Mälaren model. This improved the results slightly but still fell short of a fully Erken-specific model. Even so, the same input variables proved important in both cases, suggesting they play a consistent role even when accuracy varies between transferred models.
The findings show that the XGBoost model for DOC estimation works well when trained and applied within the same lake system but struggles to maintain accuracy when transferred to another. These limitations underscore the need for adaptable, context-sensitive modelling frameworks and highlight the ecological and data-related variability among different input systems.
Please use this url to cite or link to this publication:
author
Albus, Christina Elena LU
supervisor
organization
alternative title
Estimating dissolved organic carbon in Swedish lakes: How well do machine learning models generalise across lakes?
course
NGEM01 20251
year
2025
type
H2 - Master's Degree (Two Years)
subject
keywords
Physical Geography and Ecosystem analysis, Dissolved Organic Carbon (DOC), Inland Waters, Machine Learning (ML), Remote Sensing (RS), Sentinel-2, SHAP, Water quality, XGBoost
publication/series
Student thesis series INES
report number
732
language
English
id
9204968
date added to LUP
2025-06-24 14:11:53
date last changed
2025-06-24 14:11:53
@misc{9204968,
  abstract     = {{Accurate monitoring of dissolved organic carbon (DOC) in lakes is essential for understanding aquatic carbon dynamics, assessing lake ecosystem health, and supporting water resource management. While remote sensing offers scalable alternatives to in-situ monitoring, machine learning models often struggle to generalise across systems. This thesis investigates the potential of the eXtreme Gradient Boosting (XGBoost) machine learning algorithm for estimating DOC concentrations in two lakes in central eastern Sweden: Lake Mälaren and Lake Erken. The XGBoost model integrates Sentinel-2 surface reflectance imagery, environmental variables from ERA5-Land reanalysis data, geographic coordinates and in-situ measurements. A locally trained XGBoost model for Lake Mälaren yielded accurate DOC predictions, characterised by low error metrics and consistent performance, suggesting high model reliability under varied feature combinations. The SHapley Additive exPlanations (SHAP) analysis identified latitude and catchment runoff as the primary predictive variables, whereas spectral reflectance features contributed the least. To evaluate spatial transferability and model generalisation, two approaches were applied: (1) the Cross-Lake Generalisation Model, which was trained on Lake Mälaren and independently tested on Lake Erken, and (2) the Lake-to-Lake Transferred Model, which used Lake Erken data for training and testing but utilised the hyperparameter configuration derived from Lake Mälaren. While the Cross-Lake Generalisation Model showed underfitting and poor generalisation, the Lake-to-Lake approach offered more stable, though still limited, predictive accuracy. These results highlight that while model architecture can transfer, successful application across lakes depends on local data structure, input alignment, and ecological context. SHAP proved essential in interpreting model logic and assessing generalisation.
Overall, the findings demonstrate that while XGBoost offers strong predictive performance in well-characterised lakes, its broader applicability remains constrained by differences in ecological context, data distribution, and variable relevance. The SHAP analysis offered valuable insight into system-specific predictor relevance and shifts in feature importance, enhancing interpretability. The results underscore the need for expanded predictor testing and the development of more transferable modelling frameworks to support scalable DOC monitoring across diverse freshwater systems.}},
  author       = {{Albus, Christina Elena}},
  language     = {{eng}},
  note         = {{Student Paper}},
  series       = {{Student thesis series INES}},
  title        = {{Estimating dissolved organic carbon in Lake Mälaren and Lake Erken, Sweden: A comparative study of spatial transferability and model generalisability using XGBoost and Sentinel-2 data}},
  year         = {{2025}},
}