Lund University Publications

Random effects during training : Implications for deep learning-based medical image segmentation

Åkesson, Julius; Töger, Johannes and Heiberg, Einar (2024) In Computers in Biology and Medicine 180.
Abstract

Background: A single learning algorithm can produce deep learning-based image segmentation models that vary in performance purely due to random effects during training. This study assessed the effect of these random performance fluctuations on the reliability of standard methods of comparing segmentation models. Methods: The influence of random effects during training was assessed by running a single learning algorithm (nnU-Net) with 50 different random seeds for three multiclass 3D medical image segmentation problems, including brain tumour, hippocampus, and cardiac segmentation. Recent literature was sampled to find the most common methods for estimating and comparing the performance of deep learning segmentation models. Based on this, segmentation performance was assessed using both hold-out validation and 5-fold cross-validation, and the statistical significance of performance differences was measured using the paired t-test and the Wilcoxon signed-rank test on Dice scores. Results: For the different segmentation problems, the seed producing the highest mean Dice score statistically significantly outperformed between 0% and 76% of the remaining seeds when estimating performance using hold-out validation, and between 10% and 38% when estimating performance using 5-fold cross-validation. Conclusion: Random effects during training can cause high rates of statistically significant performance differences between segmentation models from the same learning algorithm. Whilst statistical testing is widely used in contemporary literature, our results indicate that a statistically significant difference in segmentation performance is a weak and unreliable indicator of a true performance difference between two learning algorithms.

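To illustrate the comparison method described in the abstract, the minimal Python sketch below computes per-case Dice scores for two models and applies the paired t-test and the Wilcoxon signed-rank test using SciPy. This is not the authors' code: the helper dice_score and the arrays dice_a and dice_b (30 hypothetical test cases) are illustrative assumptions, standing in for the per-case scores of, say, two random seeds of the same learning algorithm.

import numpy as np
from scipy import stats

def dice_score(pred: np.ndarray, truth: np.ndarray) -> float:
    """Dice similarity coefficient for binary masks: 2|A ∩ B| / (|A| + |B|)."""
    intersection = np.logical_and(pred, truth).sum()
    denom = pred.sum() + truth.sum()
    return 2.0 * intersection / denom if denom > 0 else 1.0

# Hypothetical per-case Dice scores for two models on the same 30 test cases.
# In the study's setting these would come from segmentations produced by two
# training runs (seeds) of the same learning algorithm.
rng = np.random.default_rng(0)
dice_a = np.clip(rng.normal(0.85, 0.05, size=30), 0.0, 1.0)
dice_b = np.clip(dice_a + rng.normal(0.01, 0.02, size=30), 0.0, 1.0)

# Both tests are paired: each test case yields one score per model.
t_res = stats.ttest_rel(dice_a, dice_b)   # paired t-test
w_res = stats.wilcoxon(dice_a, dice_b)    # Wilcoxon signed-rank test

print(f"paired t-test:        p = {t_res.pvalue:.4f}")
print(f"Wilcoxon signed-rank: p = {w_res.pvalue:.4f}")

A small p-value here only indicates that the two models differ; per the paper's conclusion, it does not reliably indicate that one learning algorithm is better than another, since two seeds of the same algorithm can also produce a significant difference.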
author
Åkesson, Julius; Töger, Johannes and Heiberg, Einar
organization
publishing date
2024
type
Contribution to journal
publication status
published
subject
keywords
Deep learning, Medical image segmentation, Performance comparisons, Random seeds, Randomness
in
Computers in Biology and Medicine
volume
180
article number
108944
publisher
Elsevier
external identifiers
  • pmid:39096609
  • scopus:85200221581
ISSN
0010-4825
DOI
10.1016/j.compbiomed.2024.108944
language
English
LU publication?
yes
id
cf0176e7-ae75-466b-9380-08041058a0c3
date added to LUP
2024-09-02 15:43:43
date last changed
2024-09-02 15:44:20
@article{cf0176e7-ae75-466b-9380-08041058a0c3,
  abstract     = {{Background: A single learning algorithm can produce deep learning-based image segmentation models that vary in performance purely due to random effects during training. This study assessed the effect of these random performance fluctuations on the reliability of standard methods of comparing segmentation models. Methods: The influence of random effects during training was assessed by running a single learning algorithm (nnU-Net) with 50 different random seeds for three multiclass 3D medical image segmentation problems, including brain tumour, hippocampus, and cardiac segmentation. Recent literature was sampled to find the most common methods for estimating and comparing the performance of deep learning segmentation models. Based on this, segmentation performance was assessed using both hold-out validation and 5-fold cross-validation, and the statistical significance of performance differences was measured using the paired t-test and the Wilcoxon signed-rank test on Dice scores. Results: For the different segmentation problems, the seed producing the highest mean Dice score statistically significantly outperformed between 0% and 76% of the remaining seeds when estimating performance using hold-out validation, and between 10% and 38% when estimating performance using 5-fold cross-validation. Conclusion: Random effects during training can cause high rates of statistically significant performance differences between segmentation models from the same learning algorithm. Whilst statistical testing is widely used in contemporary literature, our results indicate that a statistically significant difference in segmentation performance is a weak and unreliable indicator of a true performance difference between two learning algorithms.}},
  author       = {{Åkesson, Julius and Töger, Johannes and Heiberg, Einar}},
  issn         = {{0010-4825}},
  keywords     = {{Deep learning; Medical image segmentation; Performance comparisons; Random seeds; Randomness}},
  language     = {{eng}},
  publisher    = {{Elsevier}},
  series       = {{Computers in Biology and Medicine}},
  title        = {{Random effects during training : Implications for deep learning-based medical image segmentation}},
  url          = {{http://dx.doi.org/10.1016/j.compbiomed.2024.108944}},
  doi          = {{10.1016/j.compbiomed.2024.108944}},
  volume       = {{180}},
  year         = {{2024}},
}