Structures in High-Dimensional Data: Intrinsic Dimension and Cluster Analysis

Johnsson, Kerstin

Johnsson, Kerstin ^LU (2016)

Abstract: With today's improved measurement and data storage technologies, it has become common to collect data in search of hypotheses rather than to test them, that is, to do exploratory data analysis. Finding patterns and structures in data is the main goal. This thesis deals with two kinds of structures that can convey relationships between different parts of data in a high-dimensional space: manifolds and clusters. They are in a way opposites of each other: a manifold structure shows that it is plausible to connect two distant points through the manifold, whereas a clustering shows that it is plausible to separate two nearby points by assigning them to different clusters. But clusters and manifolds can also be the same: each cluster can be a manifold of its own.

The first paper in this thesis concerns one specific aspect of a manifold structure, namely its dimension, also called the intrinsic dimension of the data. A novel estimator of intrinsic dimension, taking advantage of "the curse of dimensionality", is proposed and evaluated. It is shown to have, in general, less bias than estimators from the literature and can therefore better distinguish manifolds with different dimensions.

The second and third papers in this thesis concern cluster analysis of data generated by flow cytometry, a high-throughput single-cell measurement technology. In this area, clustering is routinely performed by manually assigning data points in two-dimensional plots in order to identify cell populations. This is a tedious and subjective task, especially since the data often have four, eight, twelve or even more dimensions, and the analysts need to decide which two dimensions to look at together, and in which order.

In the second paper of the thesis a new pipeline for automated cell population identification is proposed, which can process multiple flow cytometry samples in parallel using a hierarchical model that shares information between the clusterings of the samples, thus making corresponding clusters in different samples similar while allowing for variation in cluster location and shape.

In the third and final paper of the thesis, statistical tests for unimodality are investigated as a tool for quality control of automated cell population identification algorithms. It is shown that the different tests have different interpretations of unimodality and thus accept different kinds of clusters as sufficiently close to unimodal.

Please use this url to cite or link to this publication: https://lup.lub.lu.se/record/8404f72e-e760-436d-ad7f-1be15af4b3d1

author

Johnsson, Kerstin ^LU

supervisor

Magnus Fontes ^LU

opponent

Dr. Benno Schwikowski, Institut Pasteur Paris, France

organization

publishing date

2016-08-16

type

Thesis

publication status

published

subject

Computational Mathematics

edition

150

pages

188 pages

publisher

Centre for Mathematical Sciences, Lund University

defense location

Lecture hall MA:1, Annexet, Sölvegatan 20, Lund University, Faculty of Engineering

defense date

2016-09-09 13:15:00

ISBN

978-91-7623-920-9

978-91-7623-921-6

language

English

LU publication?

yes

id

8404f72e-e760-436d-ad7f-1be15af4b3d1

date added to LUP

2016-08-16 14:41:10

date last changed

2025-04-04 15:05:42

@phdthesis{8404f72e-e760-436d-ad7f-1be15af4b3d1,
  abstract     = {{With today's improved measurement and data storing technologies it has become common to collect data in search for hypotheses instead of for testing hypotheses---to do exploratory data analysis. Finding patterns and structures in data is the main goal. This thesis deals with two kinds of structures that can convey relationships between different parts of data in a high-dimensional space: manifolds and clusters. They are in a way opposites of each other: a manifold structure shows that it is plausible to connect two distant points through the manifold, a clustering shows that it is plausible to separate two nearby points by assigning them to different clusters. But clusters and manifolds can also be the same: each cluster can be a manifold of its own.

The first paper in this thesis concerns one specific aspect of a manifold structure, namely its dimension, also called the intrinsic dimension of the data. A novel estimator of intrinsic dimension, taking advantage of ``the curse of dimensionality'', is proposed and evaluated. It is shown that it has in general less bias than estimators from the literature and can therefore better distinguish manifolds with different dimensions.

The second and third paper in this thesis concern cluster analysis of data generated by flow cytometry---a high-throughput single-cell measurement technology. In this area, clustering is performed routinely by manual assignment of data in two-dimensional plots, to identify cell populations. It is a tedious and subjective task, especially since data often has four, eight, twelve or even more dimensions, and the analysts need to decide which two dimensions to look at together, and in which order.

In the second paper of the thesis a new pipeline for automated cell population identification is proposed, which can process multiple flow cytometry samples in parallel using a hierarchical model that shares information between the clusterings of the samples, thus making corresponding clusters in different samples similar while allowing for variation in cluster location and shape.

In the third and final paper of the thesis, statistical tests for unimodality are investigated as a tool for quality control of automated cell population identification algorithms. It is shown that the different tests have different interpretations of unimodality and thus accept different kinds of clusters as sufficiently close to unimodal.}},
  author       = {{Johnsson, Kerstin}},
  isbn         = {{978-91-7623-920-9}},
  language     = {{eng}},
  month        = {{08}},
  publisher    = {{Centre for Mathematical Sciences, Lund University}},
  school       = {{Lund University}},
  title        = {{Structures in High-Dimensional Data: Intrinsic Dimension and Cluster Analysis}},
  url          = {{https://lup.lub.lu.se/search/files/10994514/Kerstin_Johnsson_PhD_thesis.pdf}},
  year         = {{2016}},
}
