LUP Student Papers

LUND UNIVERSITY LIBRARIES

Estimation of dissolved organic carbon from inland waters using remote sensing data and machine learning

Harkort, Lasse LU (2022) In Student thesis series INES NGEM01 20221
Dept of Physical Geography and Ecosystem Science
Abstract
This thesis presents the first attempt to estimate Dissolved Organic Carbon (DOC) in inland waters over a large-scale area using satellite data and machine learning (ML) methods. Four ML approaches, namely Random Forest Regression (RFR), Support Vector Regression (SVR), Gaussian Process Regression (GPR), and a Multilayer Backpropagation Neural Network (MBPNN), were tested to retrieve DOC using a filtered version of the recently published open source AquaSat dataset with more than 16 thousand samples across the continental US matched with satellite data from the Landsat 5, 7 and 8 missions. In this work, the AquaSat dataset was extended with environmental data from the ERA5-Land product.
Including environmental data considerably improved the prediction of DOC for all algorithms, with GPR showing the best and most robust performance at moderate estimation errors (RMSE: 4.08 mg/L). Permutation feature importance analysis showed that the most important variables for the machine learning approaches were the visible green band among the Landsat bands and the monthly average air temperature among the ERA5-Land variables. The results demonstrate the predictive strength of advanced ML approaches, such as GPR and MBPNN, when faced with a complex learning task, and highlight the important role of considering environmental processes to explain DOC variations over large scales.
While performance evaluation showed that DOC concentrations can be retrieved with adequate accuracy, algorithm development was challenged by the heterogeneous nature of large-scale open source in situ data, issues related to atmospheric correction, and the low spatial and temporal resolution of the environmental predictors. Although locally tuned models are likely to outperform the developed model in terms of accuracy, the model can address key issues of inland water remote sensing as a promising approach to overcome the lack of in situ measurements and to map large-scale trends of inland DOC dynamically over long time periods and seasons.
This research demonstrates how open source, large scale datasets like AquaSat in combination with ML and remote sensing can make research toward large scale estimations of inland water DOC more realistic while highlighting its remaining limitations and challenges.
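The permutation feature importance analysis and the RMSE metric mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the toy model, the synthetic data, and the names `rmse`, `permutation_importance`, and `toy_model` are hypothetical and chosen only for illustration. The idea is to shuffle one input variable at a time and measure how much the prediction error grows; a larger increase suggests the model relies more on that variable.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the target (e.g. mg/L)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in RMSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = rmse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle column j only, leaving all other columns untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(rmse(y, [predict(r) for r in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical toy data: columns stand in for a green-band reflectance,
# a monthly air temperature, and an unrelated variable.
data_rng = random.Random(42)
X = [[data_rng.uniform(0, 1), data_rng.uniform(0, 30), data_rng.uniform(0, 1)]
     for _ in range(200)]
y = [2.0 + 5.0 * g + 0.3 * t for g, t, _ in X]


def toy_model(row):
    """Stand-in for a trained regressor; it ignores the third column."""
    green, temp, _unused = row
    return 2.0 + 5.0 * green + 0.3 * temp


scores = permutation_importance(toy_model, X, y)
for name, score in zip(["green", "temperature", "unrelated"], scores):
    print(f"{name}: RMSE increase {score:.3f}")
```

In this toy setup the unrelated column receives an importance of zero because the stand-in model never reads it, while shuffling either of the other two columns raises the error.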
Popular Abstract
In this thesis, an attempt was made to estimate a water parameter called “Dissolved Organic Carbon” (DOC) in inland waters over a large area using satellite data. DOC in inland waters plays an important role in the global carbon cycle and affects public health. Therefore, it is important to find a way to monitor the DOC content of inland waters. For this purpose, four Machine Learning (ML) algorithms were trained. Machine learning algorithms are computer techniques that can learn a relationship between features when given lots of data about these features. Once the algorithms are trained, they can make estimations on new data. A large dataset of DOC measurements combined with data from satellites was used to train the ML algorithms to predict DOC from satellite data. In addition to the satellite data, environmental variables such as temperature were used to help the ML algorithms make better predictions.
The results of the thesis showed that adding environmental variables was important for improving the DOC estimates made by the ML algorithms. An ML algorithm called Gaussian Process Regression had the best overall performance of the four. An analysis of variable importance showed that the satellite band in the green wavelength range and the environmental variable temperature were the most important variables for estimating DOC.
The sample dataset was not of high quality, which made the estimations more difficult. ML algorithms that are trained on a single waterbody are probably more precise than the model developed in this study. However, the model developed in this project has the potential to map DOC over large scales and to overcome the common problem that too few samples exist for many waterbodies to analyse their DOC dynamics.
Overall, the research shows how large sample datasets in combination with ML and satellite data can be used for large-scale estimations of inland water DOC. At the same time, the research underlines the remaining limitations and challenges for this type of application.
author
Harkort, Lasse LU
organization
Dept of Physical Geography and Ecosystem Science
course
NGEM01 20221
year
2022
type
H2 - Master's Degree (Two Years)
keywords
dissolved organic carbon, machine learning, remote sensing, inland waters, water quality, open source data
publication/series
Student thesis series INES
report number
576
language
English
id
9092680
date added to LUP
2022-06-23 11:03:52
date last changed
2022-06-23 11:03:52
@misc{9092680,
  abstract     = {{This thesis presents the first attempt to estimate Dissolved Organic Carbon (DOC) in inland waters over a large-scale area using satellite data and machine learning (ML) methods. Four ML approaches, namely Random Forest Regression (RFR), Support Vector Regression (SVR), Gaussian Process Regression (GPR), and a Multilayer Backpropagation Neural Network (MBPNN), were tested to retrieve DOC using a filtered version of the recently published open source AquaSat dataset with more than 16 thousand samples across the continental US matched with satellite data from the Landsat 5, 7 and 8 missions. In this work, the AquaSat dataset was extended with environmental data from the ERA5-Land product.
Including environmental data considerably improved the prediction of DOC for all algorithms, with GPR showing the best and most robust performance at moderate estimation errors (RMSE: 4.08 mg/L). Permutation feature importance analysis showed that the most important variables for the machine learning approaches were the visible green band among the Landsat bands and the monthly average air temperature among the ERA5-Land variables. The results demonstrate the predictive strength of advanced ML approaches, such as GPR and MBPNN, when faced with a complex learning task, and highlight the important role of considering environmental processes to explain DOC variations over large scales.
While performance evaluation showed that DOC concentrations can be retrieved with adequate accuracy, algorithm development was challenged by the heterogeneous nature of large-scale open source in situ data, issues related to atmospheric correction, and the low spatial and temporal resolution of the environmental predictors. Although locally tuned models are likely to outperform the developed model in terms of accuracy, the model can address key issues of inland water remote sensing as a promising approach to overcome the lack of in situ measurements and to map large-scale trends of inland DOC dynamically over long time periods and seasons.
This research demonstrates how open source, large scale datasets like AquaSat in combination with ML and remote sensing can make research toward large scale estimations of inland water DOC more realistic while highlighting its remaining limitations and challenges.}},
  author       = {{Harkort, Lasse}},
  language     = {{eng}},
  note         = {{Student Paper}},
  series       = {{Student thesis series INES}},
  title        = {{Estimation of dissolved organic carbon from inland waters using remote sensing data and machine learning}},
  year         = {{2022}},
}