
LUP Student Papers

LUND UNIVERSITY LIBRARIES

Improving Missing Data Imputation using Generative Adversarial Network-based Methods

Anderberg, Hanna LU and Wadell, Sofia (2023) In Master's Theses in Mathematical Sciences FMSM01 20231
Mathematical Statistics
Abstract
Organizations increasingly rely on data analysis, and data quality has accordingly become even more crucial. In this context, missing values pose a significant challenge, compromising the utility of the data. Ideally, data would be collected in a way that avoids missing values, but practical and cost constraints often make this infeasible. Consequently, various approaches have been developed to address the issue of missing values. Rather than discarding incomplete observations and thereby reducing the sample size, imputing the missing values has the potential to improve predictions and imputation outcomes. Furthermore, it is a relatively straightforward process in terms of cost and effort.

In addition, Generative Adversarial Networks (GANs) have recently gained attention as a breakthrough in machine learning, offering novel possibilities for data handling. This study explores two ways in which GANs can potentially improve data imputation. Firstly, the performance of an imputation-focused GAN model, GAIN, is compared against other state-of-the-art methods through an extensive evaluation. Secondly, the impact of incorporating synthesized data, generated by a GAN framework named CTGAN, into the training data of imputation models is evaluated.

Our findings reveal that GAIN was outperformed by other data imputation methods. Despite this, its potential is not questioned, as further optimization of hyperparameters and network structure specific to the data set is believed to enhance its performance. The results of this study do, however, emphasize the clear challenges posed by the time-consuming training and optimization processes of GANs in general.

Conversely, the additional data generated by CTGAN had a significant positive impact on the result of kNN imputation. Not only does the additional data strengthen kNN imputation's position as the most prominent method in the study in terms of predictive performance, but it also serves as the most significant contribution of this report, as the methodology has not been examined in previous research. Further, the practical feasibility of the method combined with its strong results makes it suitable for practical applications. To sum up, the findings underscore the potential for further enhancements in data imputation using GANs.
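
The idea of adding synthesized rows to the training data of kNN imputation can be illustrated with a minimal sketch. It is not the authors' implementation: the function `knn_impute`, the toy values, and the `synthetic` list (a hand-written stand-in for rows that would be sampled from a fitted generator such as CTGAN) are all assumptions made for illustration.

```python
import math


def knn_impute(rows, donors, k=2):
    """Fill None entries in `rows` with the mean of the k nearest donor rows.

    Distance is Euclidean over the features observed in the incomplete row.
    `donors` must be complete rows (no None values).
    """
    imputed = []
    for row in rows:
        observed = [j for j, v in enumerate(row) if v is not None]
        # Rank donors by their distance on the observed features only.
        nearest = sorted(
            donors,
            key=lambda d: math.sqrt(sum((row[j] - d[j]) ** 2 for j in observed)),
        )[:k]
        imputed.append([
            v if v is not None else sum(d[j] for d in nearest) / len(nearest)
            for j, v in enumerate(row)
        ])
    return imputed


# Real, fully observed training rows (toy values).
real_train = [[1.0, 10.0], [5.0, 50.0]]
# Hypothetical stand-in for rows sampled from a fitted generator such as CTGAN.
synthetic = [[2.0, 20.0], [3.0, 30.0]]

# One observation with its second feature missing.
incomplete = [[2.5, None]]

baseline = knn_impute(incomplete, real_train, k=2)
augmented = knn_impute(incomplete, real_train + synthetic, k=2)
```

In this toy setting the augmented donor pool places closer neighbours around the incomplete observation, so the imputed value changes; whether such a change is an improvement on real data depends on how faithfully the generator reproduces the true distribution.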
author
Anderberg, Hanna LU and Wadell, Sofia
supervisor
organization
course
FMSM01 20231
year
type
H2 - Master's Degree (Two Years)
subject
keywords
Missing Values, Data Imputation, Generative Adversarial Network, GAIN, CTGAN
publication/series
Master's Theses in Mathematical Sciences
report number
LUTFMS-3474-2023
ISSN
1404-6342
other publication id
2023:E23
language
English
id
9117454
date added to LUP
2023-05-31 16:41:30
date last changed
2023-06-02 13:03:32
@misc{9117454,
  abstract     = {{In a modern context, organizations increasingly rely on data analysis and the importance of data quality has accordingly become even more crucial. In this context, missing values pose a significant challenge, compromising the utility of the data. In an ideal scenario, data should be collected in a way that avoids missing values, but practical and cost constraints often make this infeasible. Consequently, various approaches have been developed to address the issue of missing values. Rather than discarding incomplete observations and compromising the sample size, imputing the missing values has the potential to improve predictions and imputation outcomes. Furthermore, it is a relatively straightforward process in terms of cost and effort.

In addition to this, Generative Adversarial Networks (GANs) have lately gained attention as a recent breakthrough in machine learning, offering novel possibilities for data handling. This study explores two aspects in which GANs can potentially improve data imputation. Firstly, the performance of an imputation-focused GAN model, GAIN, is compared against other state-of-the-art methods through an extensive evaluation. Secondly, the impact of incorporating synthesized data, generated by a GAN framework named CTGAN, into the training data of imputation models is evaluated.

Our findings reveal that GAIN was outperformed by other data imputation methods. Despite this, its potential is not questioned, as further optimization of hyperparameters and network structure specific to the data set is believed to enhance its performance. The result of this study however emphasizes the clear challenges of the time-consuming training and optimization processes of GANs in general.

Conversely, the additional data generated by CTGAN had a significant positive impact on the result of kNN imputation. Not only does the additional data strengthen kNN imputation's position as the most prominent method in the study in terms of predictive performance, but it also serves as the most significant contribution from this report as the methodology has not been examined in previous research. Further, the practical feasibility of the method combined with its strong results makes it suitable for practical applications. To sum up, the findings underscore the potential for further enhancements in data imputation using GANs.}},
  author       = {{Anderberg, Hanna and Wadell, Sofia}},
  issn         = {{1404-6342}},
  language     = {{eng}},
  note         = {{Student Paper}},
  series       = {{Master's Theses in Mathematical Sciences}},
  title        = {{Improving Missing Data Imputation using Generative Adversarial Network-based Methods}},
  year         = {{2023}},
}