Lund University Publications

Random effects during training : Implications for deep learning-based medical image segmentation

Åkesson, Julius; Töger, Johannes and Heiberg, Einar (2024) In Computers in Biology and Medicine 180.
Abstract

Background: A single learning algorithm can produce deep learning-based image segmentation models that vary in performance purely due to random effects during training. This study assessed the effect of these random performance fluctuations on the reliability of standard methods of comparing segmentation models. Methods: The influence of random effects during training was assessed by running a single learning algorithm (nnU-Net) with 50 different random seeds for three multiclass 3D medical image segmentation problems, including brain tumour, hippocampus, and cardiac segmentation. Recent literature was sampled to find the most common methods for estimating and comparing the performance of deep learning segmentation models. Based on this, segmentation performance was assessed using both hold-out validation and 5-fold cross-validation, and the statistical significance of performance differences was measured using the paired t-test and the Wilcoxon signed-rank test on Dice scores. Results: For the different segmentation problems, the seed producing the highest mean Dice score statistically significantly outperformed between 0% and 76% of the remaining seeds when estimating performance using hold-out validation, and between 10% and 38% when estimating performance using 5-fold cross-validation. Conclusion: Random effects during training can cause high rates of statistically significant performance differences between segmentation models from the same learning algorithm. Whilst statistical testing is widely used in contemporary literature, our results indicate that a statistically significant difference in segmentation performance is a weak and unreliable indicator of a true performance difference between two learning algorithms.

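To illustrate the comparison method described in the abstract, the minimal Python sketch below computes per-case Dice scores for two models and applies the paired t-test and the Wilcoxon signed-rank test using SciPy. This is not the authors' code: the helper dice_score and the arrays dice_a and dice_b (30 hypothetical test cases) are illustrative assumptions, standing in for the per-case scores of, say, two random seeds of the same learning algorithm.

import numpy as np
from scipy import stats

def dice_score(pred: np.ndarray, truth: np.ndarray) -> float:
    """Dice similarity coefficient for binary masks: 2|A ∩ B| / (|A| + |B|)."""
    intersection = np.logical_and(pred, truth).sum()
    denom = pred.sum() + truth.sum()
    return 2.0 * intersection / denom if denom > 0 else 1.0

# Hypothetical per-case Dice scores for two models on the same 30 test cases.
# In the study's setting these would come from segmentations produced by two
# training runs (seeds) of the same learning algorithm.
rng = np.random.default_rng(0)
dice_a = np.clip(rng.normal(0.85, 0.05, size=30), 0.0, 1.0)
dice_b = np.clip(dice_a + rng.normal(0.01, 0.02, size=30), 0.0, 1.0)

# Both tests are paired: each test case yields one score per model.
t_res = stats.ttest_rel(dice_a, dice_b)   # paired t-test
w_res = stats.wilcoxon(dice_a, dice_b)    # Wilcoxon signed-rank test

print(f"paired t-test:        p = {t_res.pvalue:.4f}")
print(f"Wilcoxon signed-rank: p = {w_res.pvalue:.4f}")

A small p-value here only indicates that the two models differ; per the paper's conclusion, it does not reliably indicate that one learning algorithm is better than another, since two seeds of the same algorithm can also produce a significant difference.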
author
Åkesson, Julius; Töger, Johannes and Heiberg, Einar
organization
publishing date
2024
type
Contribution to journal
publication status
published
subject
keywords
Deep learning, Medical image segmentation, Performance comparisons, Random seeds, Randomness
in
Computers in Biology and Medicine
volume
180
article number
108944
publisher
Elsevier
external identifiers
  • pmid:39096609
  • scopus:85200221581
ISSN
0010-4825
DOI
10.1016/j.compbiomed.2024.108944
language
English
LU publication?
yes
id
cf0176e7-ae75-466b-9380-08041058a0c3
date added to LUP
2024-09-02 15:43:43
date last changed
2024-09-02 15:44:20
@article{cf0176e7-ae75-466b-9380-08041058a0c3,
  abstract     = {{Background: A single learning algorithm can produce deep learning-based image segmentation models that vary in performance purely due to random effects during training. This study assessed the effect of these random performance fluctuations on the reliability of standard methods of comparing segmentation models. Methods: The influence of random effects during training was assessed by running a single learning algorithm (nnU-Net) with 50 different random seeds for three multiclass 3D medical image segmentation problems, including brain tumour, hippocampus, and cardiac segmentation. Recent literature was sampled to find the most common methods for estimating and comparing the performance of deep learning segmentation models. Based on this, segmentation performance was assessed using both hold-out validation and 5-fold cross-validation, and the statistical significance of performance differences was measured using the paired t-test and the Wilcoxon signed-rank test on Dice scores. Results: For the different segmentation problems, the seed producing the highest mean Dice score statistically significantly outperformed between 0% and 76% of the remaining seeds when estimating performance using hold-out validation, and between 10% and 38% when estimating performance using 5-fold cross-validation. Conclusion: Random effects during training can cause high rates of statistically significant performance differences between segmentation models from the same learning algorithm. Whilst statistical testing is widely used in contemporary literature, our results indicate that a statistically significant difference in segmentation performance is a weak and unreliable indicator of a true performance difference between two learning algorithms.}},
  author       = {{Åkesson, Julius and Töger, Johannes and Heiberg, Einar}},
  issn         = {{0010-4825}},
  keywords     = {{Deep learning; Medical image segmentation; Performance comparisons; Random seeds; Randomness}},
  language     = {{eng}},
  publisher    = {{Elsevier}},
  series       = {{Computers in Biology and Medicine}},
  title        = {{Random effects during training : Implications for deep learning-based medical image segmentation}},
  url          = {{http://dx.doi.org/10.1016/j.compbiomed.2024.108944}},
  doi          = {{10.1016/j.compbiomed.2024.108944}},
  volume       = {{180}},
  year         = {{2024}},
}