LUP Student Papers

LUND UNIVERSITY LIBRARIES

Reproducibility in Diabetes Research Articles Using Machine Learning Classifiers

Aliverdi, Faezeh LU (2024) STAN40 20241
Department of Statistics
Abstract
Diabetes is a chronic disease that affects many people worldwide. It greatly impairs the health of those who have it and places heavy demands on healthcare services. Early detection of diabetes can significantly improve treatment outcomes, lower the risk of future health complications, and help patients live healthier lives.
In this study, I evaluated whether the performance reported for various machine learning (ML) classifiers in four articles on diabetes prediction could be reproduced, using two datasets: the updated Pima Indians Diabetes Database and the Diabetes Health Indicator from the Behavioral Risk Factor Surveillance System. Reproducing results in medical ML research is crucial for validating accuracy, ensuring generalizability, and building trust. By testing and reproducing results, the medical community can integrate ML into disease prediction and treatment protocols more safely and effectively.
Using thorough data preprocessing, I prepared the datasets for analysis. I then applied a range of ML algorithms, including support vector machines, decision trees, random forest, naive Bayes, logistic regression, and k-nearest neighbors, to predict diabetes cases in these datasets. I compared the algorithms in terms of accuracy, precision, recall, F1-score, specificity, and area under the curve. The experimental results showed that random forest was the most effective algorithm across the datasets. The results also show that I could achieve approximately similar results to those in the previous articles on these datasets, and this held across the different ML algorithms. The slight differences between my findings and those in the previous articles may be due to updates in the Pima Indians Diabetes Database and the specific methods I used for data preparation.
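The evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's own code: the function names (`binary_metrics`, `roc_auc`) and the small example arrays are hypothetical and chosen only for illustration. It assumes binary labels where 1 means "diabetes" and 0 means "no diabetes", and it computes the area under the ROC curve as the fraction of (positive, negative) pairs in which the positive case receives the higher score.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute confusion-matrix metrics for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    def ratio(num: float, den: float) -> float:
        # Return 0.0 instead of failing when a denominator is empty.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)  # also called sensitivity
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "specificity": ratio(tn, tn + fp),
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example data (not taken from either dataset in the study).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]

metrics = binary_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

The pairwise AUC is quadratic in the number of samples, which is fine for a small illustration; for datasets of realistic size a library routine would normally be used instead.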
author
Aliverdi, Faezeh LU
supervisor
organization
course
STAN40 20241
year
type
H1 - Master's Degree (One Year)
subject
language
English
id
9160319
date added to LUP
2024-06-17 14:19:33
date last changed
2024-06-17 14:19:33
@misc{9160319,
  author       = {{Aliverdi, Faezeh}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{Reproducibility in Diabetes Research Articles Using Machine Learning Classifiers}},
  year         = {{2024}},
}