Lund University Publications

A combined strategy of feature selection and machine learning to identify predictors of prediabetes.

De Silva, Kushan; Jönsson, Daniel (LU) and Demmer, Ryan T (2020). In Journal of American Medical Informatics Association 27(3), p. 369-406
Abstract
Objective: To identify predictors of prediabetes using feature selection and machine learning on a nationally representative sample of the US population.
Materials and methods: We analyzed n = 6346 men and women enrolled in the National Health and Nutrition Examination Survey 2013-2014. Prediabetes was defined using American Diabetes Association guidelines. The sample was randomly partitioned to training (n = 3174) and internal validation (n = 3172) sets. Feature selection algorithms were run on training data containing 156 preselected exposure variables. Four machine learning algorithms were applied on 46 exposure variables in original and resampled training datasets built using 4 resampling methods. Predictive models were tested on internal validation data (n = 3172) and external validation data (n = 3000) prepared from National Health and Nutrition Examination Survey 2011-2012. Model performance was evaluated using area under the receiver operating characteristic curve (AUROC). Predictors were assessed by odds ratios in logistic models and variable importance in others. The Centers for Disease Control (CDC) prediabetes screening tool was the benchmark to compare model performance.
Results: Prediabetes prevalence was 23.43%. The CDC prediabetes screening tool produced 64.40% AUROC. Seven optimal (≥ 70% AUROC) models identified 25 predictors including 4 potentially novel associations; 20 by both logistic and other nonlinear/ensemble models and 5 solely by the latter. All optimal models outperformed the CDC prediabetes screening tool (P < 0.05).
Discussion: Combined use of feature selection and machine learning increased predictive performance outperforming the recommended screening tool. A range of predictors of prediabetes was identified.

Conclusion: This work demonstrated the value of combining feature selection with machine learning to identify a wide range of predictors that could enhance prediabetes prediction and clinical decision-making.
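
Illustrative note (not part of the published abstract): the abstract describes a random partition into training and internal validation sets, and evaluation by AUROC. The sketch below shows one way those two steps might look in plain Python. It is not the authors' code. The function names `partition` and `auroc`, the seed, and the toy labels and scores are assumptions made for illustration, and the pairwise rank formulation of AUROC is a standard textbook definition rather than anything stated in the source.

```python
import random


def partition(ids, n_train, seed=0):
    """Randomly split ids into a training list of size n_train and a validation list.

    Hypothetical helper: the source only says the sample was "randomly partitioned".
    """
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


def auroc(labels, scores):
    """AUROC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Split sizes taken from the abstract: 6346 participants, 3174 for training.
train_ids, valid_ids = partition(range(6346), n_train=3174)

# Toy labels and scores (invented for illustration, not study data).
toy_auc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

With the split sizes from the abstract, the remaining 3172 participants form the validation list. The pairwise loop is quadratic in sample size, which is acceptable for a sketch; a sort-based rank computation would be the usual choice at scale.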
author
De Silva, Kushan; Jönsson, Daniel and Demmer, Ryan T
publishing date
2020
type
Contribution to journal
publication status
published
in
Journal of American Medical Informatics Association
volume
27
issue
3
pages
369 - 406
publisher
Oxford University Press
external identifiers
  • scopus:85079353320
  • pmid:31889178
ISSN
1527-974X
DOI
10.1093/jamia/ocz204
language
English
LU publication?
no
id
f8c460f5-3472-4bfc-a451-a4719d78bb77
date added to LUP
2020-12-01 14:28:03
date last changed
2022-05-12 08:12:50
@article{f8c460f5-3472-4bfc-a451-a4719d78bb77,
  abstract     = {{Objective: To identify predictors of prediabetes using feature selection and machine learning on a nationally representative sample of the US population.
Materials and methods: We analyzed n = 6346 men and women enrolled in the National Health and Nutrition Examination Survey 2013-2014. Prediabetes was defined using American Diabetes Association guidelines. The sample was randomly partitioned to training (n = 3174) and internal validation (n = 3172) sets. Feature selection algorithms were run on training data containing 156 preselected exposure variables. Four machine learning algorithms were applied on 46 exposure variables in original and resampled training datasets built using 4 resampling methods. Predictive models were tested on internal validation data (n = 3172) and external validation data (n = 3000) prepared from National Health and Nutrition Examination Survey 2011-2012. Model performance was evaluated using area under the receiver operating characteristic curve (AUROC). Predictors were assessed by odds ratios in logistic models and variable importance in others. The Centers for Disease Control (CDC) prediabetes screening tool was the benchmark to compare model performance.
Results: Prediabetes prevalence was 23.43%. The CDC prediabetes screening tool produced 64.40% AUROC. Seven optimal (≥ 70% AUROC) models identified 25 predictors including 4 potentially novel associations; 20 by both logistic and other nonlinear/ensemble models and 5 solely by the latter. All optimal models outperformed the CDC prediabetes screening tool (P < 0.05).
Discussion: Combined use of feature selection and machine learning increased predictive performance outperforming the recommended screening tool. A range of predictors of prediabetes was identified.
Conclusion: This work demonstrated the value of combining feature selection with machine learning to identify a wide range of predictors that could enhance prediabetes prediction and clinical decision-making.}},
  author       = {{De Silva, Kushan and Jönsson, Daniel and Demmer, Ryan T}},
  issn         = {{1527-974X}},
  language     = {{eng}},
  month        = {{03}},
  number       = {{3}},
  pages        = {{369--406}},
  publisher    = {{Oxford University Press}},
  journal      = {{Journal of American Medical Informatics Association}},
  title        = {{A combined strategy of feature selection and machine learning to identify predictors of prediabetes.}},
  url          = {{http://dx.doi.org/10.1093/jamia/ocz204}},
  doi          = {{10.1093/jamia/ocz204}},
  volume       = {{27}},
  year         = {{2020}},
}