LUP Student Papers

LUND UNIVERSITY LIBRARIES

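Klusteranalys och bortfall: en studie i hur klusteranalys påverkas av imputation för variabelbortfall [Cluster analysis and nonresponse: a study of how cluster analysis is affected by imputation for item nonresponse]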

Segersäll, Viktor LU and Berndtsson, Filip LU (2021) STAH11 20202
Department of Statistics
Abstract
The combination of item nonresponse and cluster analysis is a field that has not been explored to its full extent. This study investigates what consequences missing values in a dataset have for cluster analysis algorithms, and which imputation method should be recommended in order to minimize the standard error and estimate the optimal number of clusters. Basing our research on principal component analysis, k-means clustering and the average silhouette method, we examine how cluster analysis handles missing values.
First, we performed cluster analysis on three different datasets, applying the average silhouette method to estimate the number of clusters. Second, we used three different techniques for imputing missing values: imputation with the sample mean, imputation with the sample median and imputation with MICE.
Third, we estimated the standard error for each imputation technique and used it as the measure for judging which technique minimizes the error and could therefore be the best imputation technique for a cluster analysis.
Our results, based on the smallest estimated standard error, indicate that MICE imputation tends to give the estimate closest to that obtained from a dataset without missing values. A simpler imputation technique, such as the sample mean or median, could nonetheless be considered if the assumptions of the MICE technique are not fulfilled or the dataset is small.
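The workflow described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the toy data, the function names (`impute`, `kmeans`, `average_silhouette`, `choose_k`) and the deterministic k-means initialization are all assumptions made for illustration, and only the Python standard library is used. MICE is left out because it needs a dedicated library, so only mean and median imputation are shown.

```python
import math
from statistics import mean, median

def impute(rows, strategy="mean"):
    """Replace None in each column with the mean or median of the observed values."""
    fill = mean if strategy == "mean" else median
    fills = [fill([v for v in col if v is not None]) for col in zip(*rows)]
    return [[fills[j] if v is None else v for j, v in enumerate(row)] for row in rows]

def kmeans(points, k, iters=50):
    """Lloyd's algorithm with a deterministic start (k evenly spaced rows, an assumption)."""
    centers = [points[i * len(points) // k] for i in range(k)]
    labels = []
    for _ in range(iters):
        # Assign every point to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the old center if a cluster happens to be empty.
            new_centers.append([mean(col) for col in zip(*members)] if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels

def average_silhouette(points, labels):
    """Mean silhouette width; singleton clusters score 0 by convention."""
    scores = []
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)
            continue
        a = mean(own)  # mean distance to the point's own cluster
        b = min(       # mean distance to the nearest other cluster
            mean([math.dist(p, q) for q, lab in zip(points, labels) if lab == c])
            for c in set(labels) if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return mean(scores)

def choose_k(points, candidates=(2, 3, 4)):
    """Pick the number of clusters with the largest average silhouette width."""
    return max(candidates, key=lambda k: average_silhouette(points, kmeans(points, k)))

# Invented toy data: two well-separated groups, then the same data with gaps.
complete = [[1.0, 1.2], [0.8, 1.1], [1.1, 0.9], [0.9, 1.0],
            [5.0, 5.1], [5.2, 4.9], [4.9, 5.0], [5.1, 4.8]]
masked = [[1.0, 1.2], [0.8, None], [1.1, 0.9], [None, 1.0],
          [5.0, 5.1], [5.2, None], [4.9, 5.0], [None, 4.8]]

print("complete:", choose_k(complete), average_silhouette(complete, kmeans(complete, 2)))
for strategy in ("mean", "median"):
    filled = impute(masked, strategy)
    print(strategy, choose_k(filled), average_silhouette(filled, kmeans(filled, 2)))
```

Comparing the silhouette width and chosen number of clusters on the imputed data against the complete data gives a rough sense of how much an imputation technique distorts the cluster structure. The thesis itself uses the estimated standard error as its measure, which this sketch does not reproduce.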
author: Segersäll, Viktor LU and Berndtsson, Filip LU
organization: Department of Statistics
course: STAH11 20202
year: 2021
type: M2 - Bachelor Degree
keywords: Cluster analysis, missing values, imputation
language: Swedish
id: 9042567
date added to LUP: 2023-02-14 11:42:54
date last changed: 2023-02-14 11:42:54
@misc{9042567,
  abstract     = {{The combination of item nonresponse and cluster analysis is a field that has not been explored to its full extent. This study investigates what consequences missing values in a dataset have for cluster analysis algorithms, and which imputation method should be recommended in order to minimize the standard error and estimate the optimal number of clusters. Basing our research on principal component analysis, k-means clustering and the average silhouette method, we examine how cluster analysis handles missing values.
 First, we performed cluster analysis on three different datasets, applying the average silhouette method to estimate the number of clusters. Second, we used three different techniques for imputing missing values: imputation with the sample mean, imputation with the sample median and imputation with MICE.
 Third, we estimated the standard error for each imputation technique and used it as the measure for judging which technique minimizes the error and could therefore be the best imputation technique for a cluster analysis.
 Our results, based on the smallest estimated standard error, indicate that MICE imputation tends to give the estimate closest to that obtained from a dataset without missing values. A simpler imputation technique, such as the sample mean or median, could nonetheless be considered if the assumptions of the MICE technique are not fulfilled or the dataset is small.}},
  author       = {{Segersäll, Viktor and Berndtsson, Filip}},
  language     = {{swe}},
  note         = {{Student Paper}},
  title        = {{Klusteranalys och bortfall: en studie i hur klusteranalys påverkas av imputation för variabelbortfall}},
  year         = {{2021}},
}