Improving Missing Data Imputation using Generative Adversarial Network-based Methods

Anderberg, Hanna; Wadell, Sofia

Anderberg, Hanna (LU) and Wadell, Sofia (2023). In Master's Theses in Mathematical Sciences, FMSM01 20231
Mathematical Statistics

Abstract: Organizations increasingly rely on data analysis, and data quality has accordingly become even more crucial. In this context, missing values pose a significant challenge, compromising the utility of the data. Ideally, data would be collected in such a way that missing values are avoided, but practical and cost constraints often make this infeasible. Consequently, various approaches have been developed to address the issue of missing values. Rather than discarding incomplete observations and compromising the sample size, imputing the missing values has the potential to improve predictions and imputation outcomes. Furthermore, it is a relatively straightforward process in terms of cost and effort.

In addition, Generative Adversarial Networks (GANs) have recently gained attention as a breakthrough in machine learning, offering novel possibilities for data handling. This study explores two ways in which GANs can potentially improve data imputation. First, the performance of an imputation-focused GAN model, GAIN, is compared against other state-of-the-art methods through an extensive evaluation. Second, the impact of incorporating synthesized data, generated by a GAN framework named CTGAN, into the training data of imputation models is evaluated.

Our findings reveal that GAIN was outperformed by other data imputation methods. Despite this, its potential is not questioned, as further optimization of hyperparameters and network structure specific to the data set is believed to enhance its performance. The results of this study do, however, emphasize the clear challenges posed by the time-consuming training and optimization processes of GANs in general.

Conversely, the additional data generated by CTGAN had a significant positive impact on the results of kNN imputation. Not only does the additional data strengthen kNN imputation's position as the leading method in the study in terms of predictive performance, but it also constitutes the most significant contribution of this report, as the methodology has not been examined in previous research. Further, the practical feasibility of the method, combined with its strong results, makes it suitable for practical applications. To sum up, the findings underscore the potential for further enhancements in data imputation using GANs.
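
The idea of enlarging the training data of a kNN imputer with synthesized rows can be illustrated with a minimal, standard-library-only sketch. It is not the thesis's implementation: `knn_impute` is a simplified imputer written for this illustration, the toy data are invented, and `placeholder_synthesizer` is a hypothetical stand-in that merely resamples complete rows with small Gaussian noise where a trained generator such as CTGAN would be used.

```python
import math
import random


def knn_impute(rows, donors, k=3):
    """Fill None entries in `rows` with the mean of the k nearest donors.

    Distance is Euclidean over the features observed in the incomplete row.
    Donor rows are assumed to be complete (no None values).
    """
    imputed = []
    for row in rows:
        observed = [j for j, v in enumerate(row) if v is not None]
        # Rank donors by distance on the observed features only.
        ranked = sorted(
            donors,
            key=lambda d: math.sqrt(sum((row[j] - d[j]) ** 2 for j in observed)),
        )
        nearest = ranked[:k]
        filled = [
            v if v is not None else sum(d[j] for d in nearest) / len(nearest)
            for j, v in enumerate(row)
        ]
        imputed.append(filled)
    return imputed


def placeholder_synthesizer(complete_rows, n_new, rng, noise=0.05):
    """Hypothetical stand-in for a trained generator (NOT CTGAN itself):
    resamples complete rows and adds small Gaussian noise to each value."""
    return [
        [v + rng.gauss(0.0, noise) for v in rng.choice(complete_rows)]
        for _ in range(n_new)
    ]


rng = random.Random(0)

# Invented toy data: the second feature is roughly twice the first.
xs = [rng.uniform(0, 1) for _ in range(20)]
complete = [[x, 2 * x + rng.gauss(0.0, 0.1)] for x in xs]
incomplete = [[0.25, None], [0.75, None]]

# Augment the donor pool with synthesized rows before imputing.
synthetic = placeholder_synthesizer(complete, n_new=40, rng=rng)
augmented_pool = complete + synthetic

baseline = knn_impute(incomplete, complete, k=3)
augmented = knn_impute(incomplete, augmented_pool, k=3)
print("donors only:     ", baseline)
print("donors + synthetic:", augmented)
```

Whether the augmented pool helps depends on how faithfully the generator reproduces the joint distribution of the features; with the noise-based placeholder used here, no improvement should be expected or inferred.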

Please use this url to cite or link to this publication: http://lup.lub.lu.se/student-papers/record/9117454

author

Anderberg, Hanna (LU) and Wadell, Sofia

supervisor

Ted Kronvall (LU)

organization

Mathematical Statistics

course

FMSM01 20231

year

2023

type

H2 - Master's Degree (Two Years)

subject

Mathematics and Statistics

keywords

Missing Values, Data Imputation, Generative Adversarial Network, GAIN, CTGAN

publication/series

Master's Theses in Mathematical Sciences

report number

LUTFMS-3474-2023

ISSN

1404-6342

other publication id

2023:E23

language

English

id

9117454

date added to LUP

2023-05-31 16:41:30

date last changed

2023-06-02 13:03:32

@misc{9117454,
  abstract     = {{Organizations increasingly rely on data analysis, and data quality has accordingly become even more crucial. In this context, missing values pose a significant challenge, compromising the utility of the data. Ideally, data would be collected in such a way that missing values are avoided, but practical and cost constraints often make this infeasible. Consequently, various approaches have been developed to address the issue of missing values. Rather than discarding incomplete observations and compromising the sample size, imputing the missing values has the potential to improve predictions and imputation outcomes. Furthermore, it is a relatively straightforward process in terms of cost and effort.

In addition, Generative Adversarial Networks (GANs) have recently gained attention as a breakthrough in machine learning, offering novel possibilities for data handling. This study explores two ways in which GANs can potentially improve data imputation. First, the performance of an imputation-focused GAN model, GAIN, is compared against other state-of-the-art methods through an extensive evaluation. Second, the impact of incorporating synthesized data, generated by a GAN framework named CTGAN, into the training data of imputation models is evaluated.

Our findings reveal that GAIN was outperformed by other data imputation methods. Despite this, its potential is not questioned, as further optimization of hyperparameters and network structure specific to the data set is believed to enhance its performance. The results of this study do, however, emphasize the clear challenges posed by the time-consuming training and optimization processes of GANs in general.

Conversely, the additional data generated by CTGAN had a significant positive impact on the results of kNN imputation. Not only does the additional data strengthen kNN imputation's position as the leading method in the study in terms of predictive performance, but it also constitutes the most significant contribution of this report, as the methodology has not been examined in previous research. Further, the practical feasibility of the method, combined with its strong results, makes it suitable for practical applications. To sum up, the findings underscore the potential for further enhancements in data imputation using GANs.}},
  author       = {{Anderberg, Hanna and Wadell, Sofia}},
  issn         = {{1404-6342}},
  language     = {{eng}},
  note         = {{Student Paper}},
  series       = {{Master's Theses in Mathematical Sciences}},
  title        = {{Improving Missing Data Imputation using Generative Adversarial Network-based Methods}},
  year         = {{2023}},
}

LUP Student Papers

LUND UNIVERSITY LIBRARIES