Deep learning model ensemble for remote sensing land use classification

Yang, Jingfan

Yang, Jingfan (2025) In Student thesis series INES NGEM01 20231
Dept of Physical Geography and Ecosystem Science

Abstract: This study investigates deep learning approaches for automated land use classification from high-resolution remote sensing imagery, comparing Convolutional Neural Network (CNN) and Vision Transformer (ViT) architectures.
Eight semantic segmentation models were evaluated on the HRSCD dataset, which contains 291 aerial image pairs (0.5 m resolution) labeled with four land use classes: building, agricultural, forest, and water. The models comprised CNN-based architectures (DeepLabv3+, HarDNet, STDC1-Seg50, STDC2-Seg50) and Transformer-based architectures (SETR, SegMenter, SegFormer-B1, SegFormer-B5).
Transformer models demonstrated superior performance, with SegMenter achieving the highest accuracy on Test Set I: Mean Pixel Accuracy (MPA) 94.22%, Mean Intersection over Union (MIoU) 78.67%, and Mean F1-Score (MF1) 87.77%. All models showed degraded performance on Test Set II, with average decreases of 3.79%, 10.87%, and 9.68% in MPA, MIoU, and MF1, respectively, indicating sensitivity to domain shift.
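As an illustration of the three metrics named above, the following minimal sketch derives MPA, MIoU, and mean F1 from a pixel-level confusion matrix. It assumes the common definitions (per-class pixel accuracy as TP / (TP + FN), averaged over classes); the thesis may define or average the metrics differently, and the function names `confusion_matrix` and `segmentation_metrics` are hypothetical.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Count pixels; rows are reference classes, columns are predicted classes."""
    idx = num_classes * y_true.ravel() + y_pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def segmentation_metrics(cm):
    """Return class-averaged MPA, MIoU and MF1 (assumed standard definitions)."""
    tp = np.diag(cm).astype(float)
    fn = cm.sum(axis=1) - tp   # reference pixels of the class that were missed
    fp = cm.sum(axis=0) - tp   # pixels wrongly assigned to the class
    with np.errstate(divide="ignore", invalid="ignore"):
        pa = tp / (tp + fn)                # per-class pixel accuracy
        iou = tp / (tp + fp + fn)          # per-class intersection over union
        f1 = 2 * tp / (2 * tp + fp + fn)   # per-class F1 score
    # nanmean skips classes that appear in neither reference nor prediction
    return {"MPA": np.nanmean(pa), "MIoU": np.nanmean(iou), "MF1": np.nanmean(f1)}

# Toy 2x3 label maps with three classes (illustrative values only)
y_true = np.array([[0, 0, 1], [1, 2, 2]])
y_pred = np.array([[0, 1, 1], [1, 2, 0]])
metrics = segmentation_metrics(confusion_matrix(y_true, y_pred, num_classes=3))
print(metrics)
```

Because MIoU penalizes both false positives and false negatives for every class, it typically drops faster than MPA when a rare class such as water is poorly segmented.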
The optimal ensemble configuration combined three models (STDC2-Seg50, SegMenter, SegFormer-B1) using weighted averaging, achieving MPA 94.42%, MIoU 81.69%, and MF1 89.71%. This represented improvements of 2.32%, 8.51%, and 5.81% over the best single model. Traditional classification methods (ISODATA, Maximum Likelihood, Random Forest, Support Vector Machine) achieved significantly lower comprehensive scores (31.20%-61.55%) than the ensemble's 88.61%.
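Weighted averaging of model outputs can be sketched as follows. This is a minimal illustration, not the thesis implementation: it assumes each model yields a per-pixel class-probability map of shape (classes, height, width), and the weights, array values, and the function name `weighted_average_ensemble` are hypothetical, since the record does not state the weights actually used.

```python
import numpy as np

def weighted_average_ensemble(prob_maps, weights):
    """Fuse per-model probability maps (each C x H x W) by a weighted mean.

    Returns the fused label map (H x W) and the fused probabilities (C x H x W).
    """
    stacked = np.stack(prob_maps, axis=0)        # shape (M, C, H, W)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                              # normalize weights to sum to 1
    fused = np.tensordot(w, stacked, axes=1)     # contract over the model axis
    return fused.argmax(axis=0), fused

# Three hypothetical models, two classes, a 1x2 image (illustrative values only)
p1 = np.array([[[0.9, 0.4]], [[0.1, 0.6]]])
p2 = np.array([[[0.2, 0.3]], [[0.8, 0.7]]])
p3 = np.array([[[0.6, 0.7]], [[0.4, 0.3]]])
labels, fused = weighted_average_ensemble([p1, p2, p3], weights=[0.5, 0.25, 0.25])
print(labels)
```

Averaging probabilities rather than hard labels lets a confident model outweigh two uncertain ones, which is one way complementary CNN and Transformer predictions can be combined.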
The superior performance of Transformers is attributed to self-attention mechanisms that enable better global context modeling. The ensemble approach successfully integrated complementary strengths: CNN models provided robust local feature extraction, while Transformers contributed global semantic understanding. Classification of the water class remained challenging due to severe data imbalance.
This research demonstrates that multi-model ensembles combining CNN and Transformer architectures provide a robust framework for operational land use mapping, though challenges persist in handling extreme class imbalance and cross-domain generalization.
Popular Abstract: When you look at satellite images of Earth, you can see forests, cities, farmland, and water bodies. But teaching computers to automatically identify these different land types from thousands of images is challenging. This research tackles this problem using artificial intelligence.
Traditional methods for classifying land use rely on basic image features like color and texture. While these work reasonably well, they struggle with today's high-resolution satellite images that contain complex details. This is where deep learning comes in – a type of AI that can learn to recognize patterns much like humans do.

This study tested eight different deep learning models on satellite images from France. The models fall into two main categories: CNNs (Convolutional Neural Networks), which have been the standard for image analysis, and Transformers, a newer technology originally developed for language processing but now adapted for images.
The research found that Transformer models generally performed better than CNNs, especially when dealing with imbalanced data – for instance, when there are many forest images but few water images in the dataset. However, individual models still had limitations.
To improve accuracy, the study combined multiple models using ensemble methods – essentially having different AI models "vote" on what they see. The best ensemble combined three models (STDC2-Seg50, SegMenter, and SegFormer-B1) and achieved about 89% accuracy in classifying land use types.
When compared to traditional classification methods like Random Forest and Support Vector Machines, the deep learning ensemble performed significantly better. The ensemble achieved an overall score of 88.61%, while traditional methods ranged from 31% to 62%.
This research has practical applications for environmental monitoring, urban planning, and agricultural management. The ensemble approach shows that combining different AI models can overcome individual weaknesses and provide more reliable results for real-world applications.

- Open Access

Please use this url to cite or link to this publication: http://lup.lub.lu.se/student-papers/record/9202884

author: Yang, Jingfan
supervisor: Lars Eklundh
organization: Dept of Physical Geography and Ecosystem Science
course: NGEM01 20231
year: 2025
type: H2 - Master's Degree (Two Years)
subject: Earth and Environmental Sciences
keywords: Physical Geography and Ecosystem analysis, Land Use Classification, Deep Learning, Convolutional Neural Networks, Transformer
publication/series: Student thesis series INES
report number: 746
language: English
id: 9202884
date added to LUP: 2025-06-19 11:39:39
date last changed: 2025-06-19 11:39:39

@misc{9202884,
  abstract     = {{This study investigates deep learning approaches for automated land use classification from high-resolution remote sensing imagery, comparing Convolutional Neural Network (CNN) and Vision Transformer (ViT) architectures.
Eight semantic segmentation models were evaluated on the HRSCD dataset, which contains 291 aerial image pairs (0.5 m resolution) labeled with four land use classes: building, agricultural, forest, and water. The models comprised CNN-based architectures (DeepLabv3+, HarDNet, STDC1-Seg50, STDC2-Seg50) and Transformer-based architectures (SETR, SegMenter, SegFormer-B1, SegFormer-B5).
Transformer models demonstrated superior performance, with SegMenter achieving the highest accuracy on Test Set I: Mean Pixel Accuracy (MPA) 94.22%, Mean Intersection over Union (MIoU) 78.67%, and Mean F1-Score (MF1) 87.77%. All models showed degraded performance on Test Set II, with average decreases of 3.79%, 10.87%, and 9.68% in MPA, MIoU, and MF1, respectively, indicating sensitivity to domain shift.
The optimal ensemble configuration combined three models (STDC2-Seg50, SegMenter, SegFormer-B1) using weighted averaging, achieving MPA 94.42%, MIoU 81.69%, and MF1 89.71%. This represented improvements of 2.32%, 8.51%, and 5.81% over the best single model. Traditional classification methods (ISODATA, Maximum Likelihood, Random Forest, Support Vector Machine) achieved significantly lower comprehensive scores (31.20%-61.55%) than the ensemble's 88.61%.
The superior performance of Transformers is attributed to self-attention mechanisms that enable better global context modeling. The ensemble approach successfully integrated complementary strengths: CNN models provided robust local feature extraction, while Transformers contributed global semantic understanding. Classification of the water class remained challenging due to severe data imbalance.
This research demonstrates that multi-model ensembles combining CNN and Transformer architectures provide a robust framework for operational land use mapping, though challenges persist in handling extreme class imbalance and cross-domain generalization.}},
  author       = {{Yang, Jingfan}},
  language     = {{eng}},
  note         = {{Student Paper}},
  series       = {{Student thesis series INES}},
  title        = {{Deep learning model ensemble for remote sensing land use classification}},
  year         = {{2025}},
}

LUP Student Papers

LUND UNIVERSITY LIBRARIES
