Lund University Publications

LUND UNIVERSITY LIBRARIES

Estimation of dissolved organic carbon from inland waters at a large scale using satellite data and machine learning methods

Harkort, Lasse and Duan, Zheng (2023). In Water Research 229.
Abstract

Dissolved Organic Carbon (DOC) in inland waters plays an essential role in the global carbon cycle and has significant public health effects. Machine learning (ML) together with remote sensing has emerged as a powerful and promising combination to quantify water quality parameters from space. However, inland water sample data for DOC are limited. Hence, little is known about the potential to quantify DOC content in inland waters, especially over large-scale areas. This study presents the first attempt to estimate DOC in inland waters over a large-scale area using satellite data and ML methods with the newly published open-source dataset AquaSat. Four ML approaches, namely Random Forest Regression (RFR), Support Vector Regression (SVR), Gaussian Process Regression (GPR), and a Multilayer Backpropagation Neural Network (MBPNN), were trained using more than 16 thousand samples across the continental United States matched with satellite data from the Landsat 5, 7, and 8 missions. Satellite data from the Landsat missions were further extended with environmental data from the ERA5-Land product and used as input to train the ML algorithms. Our results show that including environmental data as inputs considerably improved the prediction of DOC for all ML algorithms, with GPR showing the most promising performance, with moderate estimation errors (RMSE: 4.08 mg/L). Permutation feature importance analysis showed that the wavelength range in the visible green band (from Landsat) and the monthly average air temperature (from ERA5-Land) were the most important variables for the ML approaches. The results demonstrate the predictive strength of GPR and its useful ability to derive per-pixel standard deviations for detailed analysis. Our results further highlight the important role of considering environmental processes to explain DOC variations over large scales.
The application and performance of the GPR in mapping spatiotemporal variations of DOC in an entire water body were discussed by taking Lake Okeechobee (the 8th largest freshwater lake in the U.S.) as an illustrative example. While performance evaluation showed that DOC concentrations can be retrieved with adequate accuracy, algorithm development was challenged by the heterogeneous nature of large-scale open-source in situ data, issues related to atmospheric correction, and the low spatial and temporal resolution of the environmental predictors. This research demonstrates how open-source, large-scale datasets like AquaSat in combination with ML and satellite remote sensing can make research toward large-scale estimation of inland water DOC more realistic while highlighting its remaining limitations and challenges.
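
The workflow summarized in the abstract (a GPR model with per-sample standard deviations, scored by RMSE and inspected with permutation feature importance) can be illustrated with a minimal sketch. It is not the authors' implementation: the use of scikit-learn, the kernel choice, the three feature names, and the synthetic data are all assumptions made here for illustration, and the numbers it produces have no relation to the results reported in the paper.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): NOT AquaSat, Landsat, or ERA5-Land values.
rng = np.random.default_rng(0)
n_samples = 300
feature_names = ["green_reflectance", "red_reflectance", "monthly_air_temp_c"]
X = np.column_stack([
    rng.uniform(0.0, 0.2, n_samples),    # hypothetical green-band reflectance
    rng.uniform(0.0, 0.2, n_samples),    # hypothetical red-band reflectance
    rng.uniform(-5.0, 30.0, n_samples),  # hypothetical monthly mean air temperature
])
# Made-up relationship: the target depends on green and temperature, not on red.
doc_mg_l = 5.0 + 40.0 * X[:, 0] + 0.2 * X[:, 2] + rng.normal(0.0, 0.5, n_samples)

X_train, X_test, y_train, y_test = train_test_split(
    X, doc_mg_l, test_size=0.25, random_state=0
)

# Standardize the predictors so one kernel can handle their different scales.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Anisotropic RBF kernel plus a white-noise term (an illustrative choice).
kernel = RBF(length_scale=np.ones(3)) + WhiteKernel(noise_level=1.0)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)
gpr.fit(X_train_s, y_train)

# GPR returns a predictive standard deviation alongside each estimate.
y_pred, y_std = gpr.predict(X_test_s, return_std=True)
rmse = float(np.sqrt(np.mean((y_test - y_pred) ** 2)))

# Permutation importance: mean increase in RMSE when one feature is shuffled.
result = permutation_importance(
    gpr, X_test_s, y_test,
    scoring="neg_root_mean_squared_error", n_repeats=10, random_state=0,
)
importances = dict(zip(feature_names, result.importances_mean))

print(f"RMSE on held-out synthetic data: {rmse:.2f} mg/L")
for name, value in importances.items():
    print(f"{name}: {value:.3f}")
```

When a fitted model of this kind is applied to every water pixel of a scene, `y_std` plays the role of the per-pixel standard deviation mentioned in the abstract. In this toy setup, the feature that was deliberately left out of the synthetic target should receive an importance near zero.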

type: Contribution to journal
publication status: published
keywords: Dissolved organic carbon, Landsat, Machine learning, Open source data, Remote sensing, Water quality
in: Water Research
volume: 229
article number: 119478
publisher: Elsevier
external identifiers:
  • scopus:85144273532
  • pmid:36527868
ISSN: 0043-1354
DOI: 10.1016/j.watres.2022.119478
language: English
LU publication?: yes
id: edc8c04a-8b4d-4951-8a4b-0bead1aae1b6
date added to LUP: 2023-02-02 11:54:02
date last changed: 2024-06-14 13:22:27
@article{edc8c04a-8b4d-4951-8a4b-0bead1aae1b6,
  abstract     = {{Dissolved Organic Carbon (DOC) in inland waters plays an essential role in the global carbon cycle and has significant public health effects. Machine learning (ML) together with remote sensing has emerged as a powerful and promising combination to quantify water quality parameters from space. However, inland water sample data for DOC are limited. Hence, little is known about the potential to quantify DOC content in inland waters, especially over large-scale areas. This study presents the first attempt to estimate DOC in inland waters over a large-scale area using satellite data and ML methods with the newly published open-source dataset AquaSat. Four ML approaches, namely Random Forest Regression (RFR), Support Vector Regression (SVR), Gaussian Process Regression (GPR), and a Multilayer Backpropagation Neural Network (MBPNN), were trained using more than 16 thousand samples across the continental United States matched with satellite data from the Landsat 5, 7, and 8 missions. Satellite data from the Landsat missions were further extended with environmental data from the ERA5-Land product and used as input to train the ML algorithms. Our results show that including environmental data as inputs considerably improved the prediction of DOC for all ML algorithms, with GPR showing the most promising performance, with moderate estimation errors (RMSE: 4.08 mg/L). Permutation feature importance analysis showed that the wavelength range in the visible green band (from Landsat) and the monthly average air temperature (from ERA5-Land) were the most important variables for the ML approaches. The results demonstrate the predictive strength of GPR and its useful ability to derive per-pixel standard deviations for detailed analysis. Our results further highlight the important role of considering environmental processes to explain DOC variations over large scales.
The application and performance of the GPR in mapping spatiotemporal variations of DOC in an entire water body were discussed by taking Lake Okeechobee (the 8th largest freshwater lake in the U.S.) as an illustrative example. While performance evaluation showed that DOC concentrations can be retrieved with adequate accuracy, algorithm development was challenged by the heterogeneous nature of large-scale open-source in situ data, issues related to atmospheric correction, and the low spatial and temporal resolution of the environmental predictors. This research demonstrates how open-source, large-scale datasets like AquaSat in combination with ML and satellite remote sensing can make research toward large-scale estimation of inland water DOC more realistic while highlighting its remaining limitations and challenges.}},
  author       = {{Harkort, Lasse and Duan, Zheng}},
  issn         = {{0043-1354}},
  keywords     = {{Dissolved organic carbon; Landsat; Machine learning; Open source data; Remote sensing; Water quality}},
  language     = {{eng}},
  month        = {{02}},
  publisher    = {{Elsevier}},
  series       = {{Water Research}},
  title        = {{Estimation of dissolved organic carbon from inland waters at a large scale using satellite data and machine learning methods}},
  url          = {{http://dx.doi.org/10.1016/j.watres.2022.119478}},
  doi          = {{10.1016/j.watres.2022.119478}},
  volume       = {{229}},
  year         = {{2023}},
}