Multiclass risk models for ovarian malignancy : an illustration of prediction uncertainty due to the choice of algorithm

Ledger, Ashleigh; Ceusters, Jolien; Valentin, Lil; Testa, Antonia; Van Holsbeke, Caroline; Franchi, Dorella; Bourne, Tom; Froyman, Wouter; Timmerman, Dirk; Van Calster, Ben

Ledger, Ashleigh; Ceusters, Jolien; Valentin, Lil; Testa, Antonia; Van Holsbeke, Caroline; Franchi, Dorella; Bourne, Tom; Froyman, Wouter; Timmerman, Dirk and Van Calster, Ben (2023). In BMC Medical Research Methodology 23(1).

Abstract:

Background: Assessing malignancy risk is important to choose appropriate management of ovarian tumors. We compared six algorithms to estimate the probabilities that an ovarian tumor is benign, borderline malignant, stage I primary invasive, stage II-IV primary invasive, or secondary metastatic.

Methods: This retrospective cohort study used 5909 patients recruited from 1999 to 2012 for model development, and 3199 patients recruited from 2012 to 2015 for model validation. Patients were recruited at oncology referral or general centers and underwent an ultrasound examination and surgery ≤ 120 days later. We developed models using standard multinomial logistic regression (MLR), Ridge MLR, random forest (RF), XGBoost, neural networks (NN), and support vector machines (SVM). We used nine clinical and ultrasound predictors but developed models with or without CA125.

Results: Most tumors were benign (3980 in development and 1688 in validation data); secondary metastatic tumors were least common (246 and 172). The c-statistic (AUROC) to discriminate benign from any type of malignant tumor ranged from 0.89 to 0.92 for models with CA125, and from 0.89 to 0.91 for models without. The multiclass c-statistic ranged from 0.41 (SVM) to 0.55 (XGBoost) for models with CA125, and from 0.42 (SVM) to 0.51 (standard MLR) for models without. Multiclass calibration was best for RF and XGBoost. Estimated probabilities for a benign tumor in the same patient often differed by more than 0.2 (20% points) depending on the model. Net Benefit for diagnosing malignancy was similar for algorithms at the commonly used 10% risk threshold, but was slightly higher for RF at higher thresholds. Comparing models, between 3% (XGBoost vs. NN, with CA125) and 30% (NN vs. SVM, without CA125) of patients fell on opposite sides of the 10% threshold.

Conclusion: Although several models had similarly good performance, individual probability estimates varied substantially.
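The two threshold-based quantities mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the toy outcomes, and the toy risk estimates are all hypothetical, and it assumes that a patient counts as test-positive when the estimated malignancy risk (for a multiclass model, one minus the estimated probability of a benign tumor) is at or above the threshold. Net Benefit is computed with the standard decision-curve formula, TP/n - FP/n × t/(1 - t).

```python
def net_benefit(y_true, risk, threshold):
    """Net Benefit of classifying patients with risk >= threshold as malignant.

    y_true: 1 = malignant, 0 = benign. risk: estimated malignancy risk.
    """
    n = len(y_true)
    tp = sum(1 for y, r in zip(y_true, risk) if r >= threshold and y == 1)
    fp = sum(1 for y, r in zip(y_true, risk) if r >= threshold and y == 0)
    # False positives are weighted by the odds of the risk threshold.
    return tp / n - fp / n * threshold / (1 - threshold)


def opposite_side_fraction(risk_a, risk_b, threshold):
    """Fraction of patients that two models place on opposite sides of the threshold."""
    disagreements = sum(
        (a >= threshold) != (b >= threshold) for a, b in zip(risk_a, risk_b)
    )
    return disagreements / len(risk_a)


# Hypothetical toy data (not from the study): 8 patients, 3 of them malignant.
y = [1, 0, 0, 1, 0, 0, 0, 1]
model_a = [0.80, 0.05, 0.12, 0.30, 0.02, 0.08, 0.15, 0.09]
model_b = [0.70, 0.11, 0.04, 0.45, 0.03, 0.07, 0.20, 0.25]

nb_a = net_benefit(y, model_a, 0.10)
nb_b = net_benefit(y, model_b, 0.10)
disagree = opposite_side_fraction(model_a, model_b, 0.10)
print(f"Net Benefit A: {nb_a:.3f}, B: {nb_b:.3f}, opposite sides: {disagree:.3f}")
```

In this toy example the two models disagree on three of the eight patients at the 10% threshold, which is the kind of patient-level disagreement the abstract reports alongside similar overall performance.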

Please use this url to cite or link to this publication: https://lup.lub.lu.se/record/a72b31e4-7b50-4dd3-8a42-fb2418b3f722

author: Ledger, Ashleigh; Ceusters, Jolien; Valentin, Lil; Testa, Antonia; Van Holsbeke, Caroline; Franchi, Dorella; Bourne, Tom; Froyman, Wouter; Timmerman, Dirk and Van Calster, Ben
organization: Obstetric, Gynaecological and Prenatal Ultrasound Research (research group)
publishing date: 2023
type: Contribution to journal
publication status: published
subject: Cancer and Oncology
keywords: Calibration, Machine learning, Multiclass models, Ovarian Neoplasms, Prediction models
in: BMC Medical Research Methodology
volume: 23
issue: 1
article number: 276
publisher: BioMed Central (BMC)
external identifiers: scopus:85177765574, pmid:38001421
ISSN: 1471-2288
DOI: 10.1186/s12874-023-02103-3
language: English
LU publication?: yes
id: a72b31e4-7b50-4dd3-8a42-fb2418b3f722
date added to LUP: 2023-12-20 15:32:33
date last changed: 2026-02-08 13:58:27

@article{a72b31e4-7b50-4dd3-8a42-fb2418b3f722,
  abstract     = {{Background: Assessing malignancy risk is important to choose appropriate management of ovarian tumors. We compared six algorithms to estimate the probabilities that an ovarian tumor is benign, borderline malignant, stage I primary invasive, stage II-IV primary invasive, or secondary metastatic. Methods: This retrospective cohort study used 5909 patients recruited from 1999 to 2012 for model development, and 3199 patients recruited from 2012 to 2015 for model validation. Patients were recruited at oncology referral or general centers and underwent an ultrasound examination and surgery ≤ 120 days later. We developed models using standard multinomial logistic regression (MLR), Ridge MLR, random forest (RF), XGBoost, neural networks (NN), and support vector machines (SVM). We used nine clinical and ultrasound predictors but developed models with or without CA125. Results: Most tumors were benign (3980 in development and 1688 in validation data), secondary metastatic tumors were least common (246 and 172). The c-statistic (AUROC) to discriminate benign from any type of malignant tumor ranged from 0.89 to 0.92 for models with CA125, from 0.89 to 0.91 for models without. The multiclass c-statistic ranged from 0.41 (SVM) to 0.55 (XGBoost) for models with CA125, and from 0.42 (SVM) to 0.51 (standard MLR) for models without. Multiclass calibration was best for RF and XGBoost. Estimated probabilities for a benign tumor in the same patient often differed by more than 0.2 (20% points) depending on the model. Net Benefit for diagnosing malignancy was similar for algorithms at the commonly used 10% risk threshold, but was slightly higher for RF at higher thresholds. Comparing models, between 3% (XGBoost vs. NN, with CA125) and 30% (NN vs. SVM, without CA125) of patients fell on opposite sides of the 10% threshold. Conclusion: Although several models had similarly good performance, individual probability estimates varied substantially.}},
  author       = {{Ledger, Ashleigh and Ceusters, Jolien and Valentin, Lil and Testa, Antonia and Van Holsbeke, Caroline and Franchi, Dorella and Bourne, Tom and Froyman, Wouter and Timmerman, Dirk and Van Calster, Ben}},
  issn         = {{1471-2288}},
  keywords     = {{Calibration; Machine learning; Multiclass models; Ovarian Neoplasms; Prediction models}},
  language     = {{eng}},
  number       = {{1}},
  publisher    = {{BioMed Central (BMC)}},
  journal      = {{BMC Medical Research Methodology}},
  title        = {{Multiclass risk models for ovarian malignancy : an illustration of prediction uncertainty due to the choice of algorithm}},
  url          = {{http://dx.doi.org/10.1186/s12874-023-02103-3}},
  doi          = {{10.1186/s12874-023-02103-3}},
  volume       = {{23}},
  year         = {{2023}},
}
