Neural network training with highly incomplete medical datasets
(2022) In Machine Learning: Science and Technology 3(3).
- Abstract
Neural network training and validation rely on the availability of large high-quality datasets. However, in many cases only incomplete datasets are available, particularly in health care applications, where each patient typically undergoes different clinical procedures or can drop out of a study. Since the data to train the neural networks need to be complete, most studies discard the incomplete datapoints, which reduces the size of the training data, or impute the missing features, which can lead to artifacts. Alas, both approaches are inadequate when a large portion of the data is missing. Here, we introduce GapNet, an alternative deep-learning training approach that can use highly incomplete datasets without overfitting or introducing artefacts. First, the dataset is split into subsets of samples containing all values for a certain cluster of features. Then, these subsets are used to train individual neural networks. Finally, this ensemble of neural networks is combined into a single neural network whose training is fine-tuned using all complete datapoints. Using two highly incomplete real-world medical datasets, we show that GapNet improves the identification of patients with underlying Alzheimer's disease pathology and of patients at risk of hospitalization due to Covid-19. Compared to commonly used imputation methods, this improvement suggests that GapNet can become a general tool to handle incomplete medical datasets.
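The three GapNet stages described in the abstract (split the data into feature clusters, train one sub-network per cluster on the rows complete for that cluster, then combine the sub-networks and fine-tune on the fully complete datapoints) can be sketched as follows. This is a toy illustration only: the dataset is synthetic, the feature clusters are chosen by hand, and single logistic units stand in for the paper's neural networks. None of the names here come from the GapNet code itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy incomplete dataset: 200 samples, two feature clusters of 3 features each.
# Some rows miss cluster 2 entirely, others miss cluster 1 (as when patients
# undergo different clinical procedures).
n, d = 200, 6
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -1.0, 0.5, 0.8, -0.6, 1.2])
y = (X @ w_true + 0.1 * rng.normal(size=n) > 0).astype(float)
X[:60, 3:] = np.nan        # these rows miss cluster 2
X[140:, :3] = np.nan       # these rows miss cluster 1
clusters = [slice(0, 3), slice(3, 6)]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(Xc, yc, epochs=300, lr=0.5):
    """One 'sub-network' (here just a logistic unit) fit by gradient descent."""
    w = np.zeros(Xc.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(Xc @ w + b)
        g = p - yc
        w -= lr * Xc.T @ g / len(yc)
        b -= lr * g.mean()
    return w, b

# Stage 1-2: one sub-model per feature cluster, trained on the rows that are
# complete for that cluster (no datapoints discarded, nothing imputed).
submodels = []
for sl in clusters:
    mask = ~np.isnan(X[:, sl]).any(axis=1)
    submodels.append(train_logistic(X[mask][:, sl], y[mask]))

# Stage 3: stack the sub-model outputs and fine-tune a final logistic layer
# using only the fully complete datapoints.
complete = ~np.isnan(X).any(axis=1)
Z = np.column_stack([sigmoid(X[complete][:, sl] @ w + b)
                     for sl, (w, b) in zip(clusters, submodels)])
w2, b2 = train_logistic(Z, y[complete])

p_final = sigmoid(Z @ w2 + b2)
acc = ((p_final > 0.5) == y[complete]).mean()
```

Note how each sub-model trains on 140 usable rows even though only 80 rows are fully complete; that enlarged effective training set is the core of the approach, with the fine-tuning stage learning how to weight the cluster-wise predictions.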
- author
- Chang, Yu Wei ; Natali, Laura ; Jamialahmadi, Oveis ; Romeo, Stefano ; Pereira, Joana B. LU and Volpe, Giovanni
- author collaboration
- organization
- publishing date
- 2022-09-01
- type
- Contribution to journal
- publication status
- published
- subject
- keywords
- Alzheimer disease, Covid-19, incomplete datasets, neural networks
- in
- Machine Learning: Science and Technology
- volume
- 3
- issue
- 3
- article number
- 035001
- publisher
- IOP Publishing
- external identifiers
-
- scopus:85134694123
- ISSN
- 2632-2153
- DOI
- 10.1088/2632-2153/ac7b69
- language
- English
- LU publication?
- yes
- id
- 6280781a-391b-415b-8d6c-3b22c95b8d02
- date added to LUP
- 2022-10-24 14:48:54
- date last changed
- 2024-02-17 15:42:17
@article{6280781a-391b-415b-8d6c-3b22c95b8d02,
  abstract  = {{<p>Neural network training and validation rely on the availability of large high-quality datasets. However, in many cases only incomplete datasets are available, particularly in health care applications, where each patient typically undergoes different clinical procedures or can drop out of a study. Since the data to train the neural networks need to be complete, most studies discard the incomplete datapoints, which reduces the size of the training data, or impute the missing features, which can lead to artifacts. Alas, both approaches are inadequate when a large portion of the data is missing. Here, we introduce GapNet, an alternative deep-learning training approach that can use highly incomplete datasets without overfitting or introducing artefacts. First, the dataset is split into subsets of samples containing all values for a certain cluster of features. Then, these subsets are used to train individual neural networks. Finally, this ensemble of neural networks is combined into a single neural network whose training is fine-tuned using all complete datapoints. Using two highly incomplete real-world medical datasets, we show that GapNet improves the identification of patients with underlying Alzheimer's disease pathology and of patients at risk of hospitalization due to Covid-19. Compared to commonly used imputation methods, this improvement suggests that GapNet can become a general tool to handle incomplete medical datasets.</p>}},
  author    = {{Chang, Yu Wei and Natali, Laura and Jamialahmadi, Oveis and Romeo, Stefano and Pereira, Joana B. and Volpe, Giovanni}},
  issn      = {{2632-2153}},
  keywords  = {{Alzheimer disease; Covid-19; incomplete datasets; neural networks}},
  language  = {{eng}},
  month     = {{09}},
  number    = {{3}},
  publisher = {{IOP Publishing}},
  series    = {{Machine Learning: Science and Technology}},
  title     = {{Neural network training with highly incomplete medical datasets}},
  url       = {{http://dx.doi.org/10.1088/2632-2153/ac7b69}},
  doi       = {{10.1088/2632-2153/ac7b69}},
  volume    = {{3}},
  year      = {{2022}},
}