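Klusteranalys och bortfall: en studie i hur klusteranalys påverkas av imputation för variabelbortfall [Cluster analysis and nonresponse: a study of how cluster analysis is affected by imputation for item nonresponse]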
(2021) STAH11 20202, Department of Statistics
- Abstract
- The combination of item nonresponse and cluster analysis has not been fully explored. This study investigates what consequences missing values in a dataset have for cluster analysis algorithms, and which imputation model should be recommended in order to minimize the standard error and to identify the optimal number of clusters. Basing our research on principal component analysis, k-means clustering and the average silhouette model, we examine cluster analysis in depth and how it handles missing values.
Firstly, we performed cluster analysis on three different datasets, applying the average silhouette model to estimate the number of clusters. Secondly, we used three techniques for imputing missing values: imputation with the sample mean, imputation with the sample median, and imputation with MICE (multiple imputation by chained equations).
Thirdly, we estimated the standard error for each imputation technique and used it as a measure for determining which technique minimizes the error and hence may be the best choice for cluster analysis.
Our results, based on the smallest estimated standard error, indicate that MICE imputation tends to produce the estimate closest to that obtained from a dataset without missing values. A simpler imputation technique, such as the sample mean or median, could nonetheless be considered if the assumptions of the MICE technique are not fulfilled or the dataset is small.
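The workflow summarized in the abstract (impute, cluster, then pick the number of clusters by average silhouette) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`impute`, `kmeans`, `average_silhouette`, `best_k`) and the toy data are hypothetical, only mean and median imputation are shown (MICE and the principal component step are left out), and the farthest-point initialization of k-means is a simplification chosen to keep the example deterministic.

```python
import math
import statistics


def impute(rows, strategy):
    """Fill None entries column by column with the column mean or median."""
    fill = {"mean": statistics.mean, "median": statistics.median}[strategy]
    fills = [fill([v for v in col if v is not None]) for col in zip(*rows)]
    return [[fills[j] if v is None else v for j, v in enumerate(row)] for row in rows]


def kmeans(points, k, iters=100):
    """Lloyd's algorithm with a deterministic farthest-point start (assumption)."""
    centres = [list(points[0])]
    while len(centres) < k:
        # Next centre: the point farthest from all centres chosen so far.
        centres.append(list(max(points, key=lambda p: min(math.dist(p, c) for c in centres))))
    labels = None
    for _ in range(iters):
        new = [min(range(k), key=lambda c: math.dist(p, centres[c])) for p in points]
        if new == labels:  # assignments stable: converged
            break
        labels = new
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster becomes empty
                centres[c] = [statistics.mean(col) for col in zip(*members)]
    return labels


def average_silhouette(points, labels):
    """Mean silhouette width over all points, using Euclidean distance."""
    widths = []
    for i, p in enumerate(points):
        dists = {}  # cluster label -> distances from p to the other points in it
        for j, q in enumerate(points):
            if i != j:
                dists.setdefault(labels[j], []).append(math.dist(p, q))
        own = dists.pop(labels[i], None)
        if not own or not dists:
            widths.append(0.0)  # singleton cluster or a single cluster: width set to 0
            continue
        a = statistics.mean(own)                             # cohesion
        b = min(statistics.mean(d) for d in dists.values())  # separation
        widths.append((b - a) / max(a, b) if max(a, b) else 0.0)
    return statistics.mean(widths)


def best_k(points, candidates=(2, 3, 4)):
    """Return the candidate k with the largest average silhouette, plus all scores."""
    scores = {k: average_silhouette(points, kmeans(points, k)) for k in candidates}
    return max(scores, key=scores.get), scores


# Toy data (made up for illustration): two groups, two missing entries.
data = [
    [1.0, 1.1], [9.0, 9.2], [1.2, None], [9.1, 8.8],
    [0.9, 0.8], [None, 9.0], [1.1, 1.0], [8.9, 9.1],
]
for strategy in ("mean", "median"):
    k, scores = best_k(impute(data, strategy))
    print(strategy, "-> k =", k, {c: round(s, 3) for c, s in scores.items()})
```

Comparing the selected k and the silhouette scores across imputation strategies, and against the same data without missing entries, gives a rough sense of how much each imputation distorts the clustering; the study itself uses estimated standard errors for that comparison.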
Please use this url to cite or link to this publication:
http://lup.lub.lu.se/student-papers/record/9042567
- author
- Segersäll, Viktor LU and Berndtsson, Filip LU
- supervisor
- Jonas Wallin LU
- organization
- course
- STAH11 20202
- year
- 2021
- type
- M2 - Bachelor Degree
- subject
- keywords
- Cluster analysis, missing values, imputation
- language
- Swedish
- id
- 9042567
- date added to LUP
- 2023-02-14 11:42:54
- date last changed
- 2023-02-14 11:42:54
@misc{9042567,
  abstract = {{The combination of item nonresponse and cluster analysis has not been fully explored. This study investigates what consequences missing values in a dataset have for cluster analysis algorithms, and which imputation model should be recommended in order to minimize the standard error and to identify the optimal number of clusters. Basing our research on principal component analysis, k-means clustering and the average silhouette model, we examine cluster analysis in depth and how it handles missing values. Firstly, we performed cluster analysis on three different datasets, applying the average silhouette model to estimate the number of clusters. Secondly, we used three techniques for imputing missing values: imputation with the sample mean, imputation with the sample median, and imputation with MICE (multiple imputation by chained equations). Thirdly, we estimated the standard error for each imputation technique and used it as a measure for determining which technique minimizes the error and hence may be the best choice for cluster analysis. Our results, based on the smallest estimated standard error, indicate that MICE imputation tends to produce the estimate closest to that obtained from a dataset without missing values. A simpler imputation technique, such as the sample mean or median, could nonetheless be considered if the assumptions of the MICE technique are not fulfilled or the dataset is small.}},
  author = {{Segersäll, Viktor and Berndtsson, Filip}},
  language = {{swe}},
  note = {{Student Paper}},
  title = {{Klusteranalys och bortfall: en studie i hur klusteranalys påverkas av imputation för variabelbortfall}},
  year = {{2021}},
}