Predictive modelling using a nationally representative database to identify the determinants of prediabetes; a machine learning analytic approach on the National Health and Nutrition Examination Survey (NHANES) 2013-2014

Ranakombu, Kushan Kumara de Silva LU (2018) MPHN40 20181
Social Medicine and Global Health
Abstract
ABSTRACT
Background: Prediabetes is a global epidemic with rising prevalence, but diagnosing it from traditional risk factors is challenging. Applying novel machine intelligence-based methods to public health databases could provide valuable insights into the disease process.
Aim: To build predictive models to elucidate the determinants of prediabetes using machine learning algorithms on a nationally representative sample of the US population.
Method: Two datasets containing general (n = 6346) and dental (n = 3167) variables were prepared from the National Health and Nutrition Examination Survey (NHANES) 2013-2014 and were randomly partitioned into training and internal validation data. Feature selection algorithms were run on the training data (n = 3174) containing 156 pre-selected general variables. Five machine learning algorithms were applied to the training data containing general (n = 3174) and dental (n = 1584) variables, as well as to resampled datasets built using four resampling methods. Predictive models were tested on internal validation data containing general (n = 3172) and dental (n = 1583) variables. External validation was done on two datasets containing general (n = 3000) and dental (n = 1500) variables prepared from NHANES 2011-2012. Model performance was evaluated using the area under the receiver operating characteristic curve (AUC). Determinants were elucidated by odds ratios in logistic regression models and by variable importance values in the other algorithms. The CDC prediabetes screening tool was chosen as the benchmark against which the performance of the optimal models was compared.
Results: Seven optimal (AUC > 70%) models built on the dataset containing general variables elucidated 25 determinants of prediabetes, including a few novel associations; 20 were identified by both logistic regression and the other non-linear/ensemble models, while 5 were elucidated solely by the latter. Dental variables by themselves were not predictive of prediabetes, and periodontitis appeared to be the only dental determinant. The optimal machine learning model (AUC = 71.6%) built on the data containing general variables outperformed the chosen benchmark, while that built on dental data equaled the performance of the screening tool.
Conclusion: A range of determinants of prediabetes was identified through validated and benchmarked models, highlighting the potential of a systematic, machine intelligence-based modelling approach on a public health database to elucidate the determinants of prediabetes, including novel predictors.
Keywords: prediabetes, determinants, machine learning, feature selection, NHANES
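The two evaluation quantities named in the Method section, AUC and odds ratios, can be illustrated with a minimal sketch. This is not the thesis's code: the function names (`auc`, `odds_ratio_from_coefficient`, `odds_ratio_2x2`) and the toy numbers are assumptions made for illustration, and only the Python standard library is used. The sketch computes AUC as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, and an odds ratio either from a logistic regression coefficient or from a 2x2 table.

```python
import math

def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive case receives the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def odds_ratio_from_coefficient(beta):
    """Odds ratio implied by a logistic regression coefficient: exp(beta)."""
    return math.exp(beta)

def odds_ratio_2x2(exposed_cases, exposed_noncases,
                   unexposed_cases, unexposed_noncases):
    """Cross-product odds ratio from a 2x2 table (hypothetical counts)."""
    return (exposed_cases * unexposed_noncases) / (
        exposed_noncases * unexposed_cases)

# Toy data, invented for illustration only (not NHANES values).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auc(toy_labels, toy_scores))        # 3 of 4 pairs ranked correctly
print(odds_ratio_2x2(20, 80, 10, 90))     # (20 * 90) / (80 * 10)
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking, so the thesis's ">70%" threshold for an optimal model corresponds to 0.70 on this scale. The pairwise loop is quadratic in sample size; for datasets of several thousand rows a rank-based implementation would be the more practical choice.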
Popular Abstract
POPULAR SCIENCE SUMMARY
Machine learning is a discipline that combines the computing power of modern computers with statistical techniques. Machine learning applications on large medical databases have been shown to provide novel insights into various health issues. This study applied machine learning techniques to a large public health database, the National Health and Nutrition Examination Survey (NHANES) 2013-2014, to identify multiple factors that may affect the development of prediabetes, a common condition across the world. Because prediabetes is reversible, identifying it early makes it possible to prevent progression to diabetes and to restore normal blood glucose levels. Nevertheless, timely identification of prediabetes is difficult, and current screening tools, which are based on a limited number of traditional risk factors, may fail to identify many prediabetic individuals.
Through a machine learning analytic approach, we identified an array of socio-economic, clinical, biochemical, and dental factors influencing prediabetes, many of which had been reported in previous studies. Interestingly, the study further revealed that several known diabetes risk markers may be potential indicators of prediabetes as well, providing new evidence and directions for future research on the disease. The findings indicate that data routinely collected in NHANES through questionnaires and simple tests, such as a person’s body measurements, vigorous exercise level, blood lipid level, blood pressure, various blood cell measurements, liver function profile and gum disease, can help identify people who are highly likely to develop prediabetes and may complement standard prediabetes risk assessment tools.
However, owing to the limitations of the study design, the findings do not confirm that these factors “cause” prediabetes. Further research is warranted to consolidate these findings, which, with a higher level of evidence, may eventually prove useful for clinical diagnosis and community-based screening of prediabetes.
author
Ranakombu, Kushan Kumara de Silva LU
course
MPHN40 20181
year
2018
type
H2 - Master's Degree (Two Years)
keywords
prediabetes, determinants, machine learning, feature selection, NHANES
language
English
id
8955158
date added to LUP
2018-08-06 08:50:52
date last changed
2018-08-06 08:50:52
@misc{8955158,
  abstract     = {ABSTRACT
Background: Prediabetes is a global epidemic with rising prevalence rates, but its diagnosis based on traditional risk factors is challenging. Application of novel machine-intelligence based methods to public health databases could provide valuable insights into the disease process.
Aim: To build predictive models to elucidate the determinants of prediabetes using machine learning algorithms on a nationally representative sample of the US population.
Method: Two datasets containing general (n = 6346) and dental (n = 3167) variables were prepared from the National Health and Nutrition Examination Survey (NHANES) 2013-2014 and were randomly partitioned to create train and internal validation data. Feature selection algorithms were run on the train (n = 3174) data containing 156 pre-selected general variables. Five machine learning algorithms were applied on train data containing general (n = 3174) and dental (n = 1584) variables as well as on re-sampled datasets built using 4 resampling methods. Predictive models were tested on internal validation data containing general (n = 3172) and dental (n = 1583) variables. External validation was done on 2 datasets containing general (n = 3000) and dental (n = 1500) variables prepared from the NHANES 2011-2012. Model performance was evaluated using area under the receiver operating characteristic curve (AUC). Determinants were elucidated by odds ratios in logistic regression models and by variable importance values in other algorithms. The CDC prediabetes screening tool was chosen as the benchmark against which the performance of optimal models was compared.
Results: Seven optimal (>70% AUC) models built on the dataset containing general variables elucidated 25 determinants of prediabetes including a few novel associations; 20 were identified by both logistic regression and other non-linear/ensemble models while 5 were solely elucidated by the latter. Dental variables by themselves were not predictive of, and periodontitis appeared the only dental determinant of, prediabetes. The optimal machine learning model (AUC = 71.6%) built on the data containing general variables outperformed the chosen benchmark while that built on dental data equaled the performance of the screening tool.
Conclusion: A range of determinants of prediabetes was identified through validated and benchmarked models highlighting the potential of a systematic, machine intelligence-based modelling approach on a public health database to elucidate the determinants of prediabetes including novel predictors.
Keywords: prediabetes, determinants, machine learning, feature selection, NHANES},
  author       = {Ranakombu, Kushan Kumara de Silva},
  keyword      = {prediabetes,determinants,machine learning,feature selection,NHANES},
  language     = {eng},
  note         = {Student Paper},
  title        = {Predictive modelling using a nationally representative database to identify the determinants of prediabetes; a machine learning analytic approach on the National Health and Nutrition Examination Survey (NHANES) 2013-2014},
  year         = {2018},
}