Klusteranalys och bortfall: en studie i hur klusteranalys påverkas av imputation för variabelbortfall

Segersäll, Viktor; Berndtsson, Filip

Segersäll, Viktor (LU) and Berndtsson, Filip (LU) (2021) STAH11 20202
Department of Statistics

Abstract: The combination of item nonresponse and cluster analysis is a field that has not been fully explored. This study investigates what consequences a data set with missing values has for cluster analysis algorithms, and which imputation model should be recommended in order to minimize the standard error and best determine the optimal number of clusters. Basing our research on principal component analysis, k-means clustering and the average silhouette method, we examine how cluster analysis handles missing values.
First, we performed cluster analysis on three different data sets, applying the average silhouette method to estimate the number of clusters. Second, we used three different techniques for imputing missing values: imputation with the sample mean, imputation with the sample median, and imputation with MICE (multiple imputation by chained equations).
Third, we estimated the standard error for each imputation technique as a measure for drawing conclusions about which technique minimizes the errors and hence could be the optimal imputation technique for a cluster analysis.
Our results, based on the smallest estimated standard error, indicate that MICE imputation tends to generate the best estimate when compared with a data set without missing values. A simpler imputation technique, such as the sample mean or median, could nonetheless be considered if the assumptions for the MICE technique are not fulfilled or the data set is small.
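The workflow described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: it assumes Python with scikit-learn, synthetic data from `make_blobs`, values removed completely at random, and scikit-learn's `IterativeImputer` as a MICE-inspired stand-in (it produces a single imputation rather than multiple ones). The helper names `best_k` and `punch_holes` are hypothetical, and the root-mean-square error against the complete data is used here as an assumed proxy for the standard error discussed in the thesis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.metrics import silhouette_score


def best_k(X, k_range=range(2, 7), seed=0):
    """Return the k in k_range with the highest average silhouette width."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get)


def punch_holes(X, frac, rng):
    """Hypothetical helper: set a random fraction of cells to NaN (assumed MCAR)."""
    X_missing = X.copy()
    X_missing[rng.random(X.shape) < frac] = np.nan
    return X_missing


# Synthetic complete data (an assumption; the thesis uses three data sets of its own).
rng = np.random.default_rng(0)
X_full, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=0)
X_missing = punch_holes(X_full, frac=0.10, rng=rng)

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    # MICE-inspired chained-equations imputer; single imputation only.
    "iterative": IterativeImputer(max_iter=10, random_state=0),
}

# Reference: number of clusters chosen on the data without missing values.
k_full = best_k(X_full)

results = {}
for name, imputer in imputers.items():
    X_imp = imputer.fit_transform(X_missing)
    # Assumed error measure: RMSE between imputed and complete data.
    rmse = float(np.sqrt(np.mean((X_imp - X_full) ** 2)))
    results[name] = {"k": best_k(X_imp), "rmse": rmse}

print("complete data: k =", k_full)
for name, res in results.items():
    print(f"{name}: k = {res['k']}, rmse = {res['rmse']:.3f}")
```

Comparing each technique's chosen `k` and error against the complete-data reference mirrors the comparison described above; which technique comes out ahead depends on the data and the missingness mechanism.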

Please use this url to cite or link to this publication: http://lup.lub.lu.se/student-papers/record/9042567

author

Segersäll, Viktor (LU) and Berndtsson, Filip (LU)

supervisor

Jonas Wallin (LU)

organization

Department of Statistics

course

STAH11 20202

year

2021

type

M2 - Bachelor Degree

subject

Mathematics and Statistics

keywords

Cluster analysis, missing values, imputation

language

Swedish

id

9042567

date added to LUP

2023-02-14 11:42:54

date last changed

2023-02-14 11:42:54

@misc{9042567,
  abstract     = {{The combination of item nonresponse and cluster analysis is a field that has not been fully explored. This study investigates what consequences a data set with missing values has for cluster analysis algorithms, and which imputation model should be recommended in order to minimize the standard error and best determine the optimal number of clusters. Basing our research on principal component analysis, k-means clustering and the average silhouette method, we examine how cluster analysis handles missing values.
 First, we performed cluster analysis on three different data sets, applying the average silhouette method to estimate the number of clusters. Second, we used three different techniques for imputing missing values: imputation with the sample mean, imputation with the sample median, and imputation with MICE (multiple imputation by chained equations).
 Third, we estimated the standard error for each imputation technique as a measure for drawing conclusions about which technique minimizes the errors and hence could be the optimal imputation technique for a cluster analysis.
 Our results, based on the smallest estimated standard error, indicate that MICE imputation tends to generate the best estimate when compared with a data set without missing values. A simpler imputation technique, such as the sample mean or median, could nonetheless be considered if the assumptions for the MICE technique are not fulfilled or the data set is small.}},
  author       = {{Segersäll, Viktor and Berndtsson, Filip}},
  language     = {{swe}},
  note         = {{Student Paper}},
  title        = {{Klusteranalys och bortfall: en studie i hur klusteranalys påverkas av imputation för variabelbortfall}},
  year         = {{2021}},
}

LUP Student Papers

LUND UNIVERSITY LIBRARIES
