
Dimension Reduction and Signal Decomposition for Genotype–Phenotype Relations

Perby Henningsson, Rasmus (2017)
Abstract
Over the last few decades, DNA sequencing has developed from costing billions of dollars to get the complete sequence of the human genome, to being a routine procedure performed in labs all around the world. This has transformed the field of experimental biology since measurements can be done at a level of detail that was not possible before. Still, the relationship between genotype and low-level cellular processes on one hand, and high-level phenotypic traits on the other, tends to be very complex; measuring does not equal understanding. In the large data sets that are being gathered, it is often hard to uncover patterns that are truly meaningful, and not just arising by random chance.
In this work, we present novel methods for representing, exploring and visualizing genotype-phenotype data sets, with a particular focus on tracking changes driven by evolutionary processes as they occur. One challenge is to be able to quickly search for specific patterns in data coming from large genomes. We have adapted algorithms and data structures from the field of Information Retrieval, relying on inherent genomic structure to make efficient searches. In Paper I, we showcase these techniques with visualization of gene fusions in a study of paediatric B-cell precursor acute lymphoblastic leukaemia.
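As an illustration of the kind of Information Retrieval data structure that can speed up pattern searches in sequence data, here is a minimal sketch of an inverted k-mer index. This is an assumption made for illustration only: the abstract does not say which structures Paper I uses, and the function names (`build_kmer_index`, `find_pattern`) and toy sequences are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(sequences, k):
    """Map every k-mer to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index


def find_pattern(index, sequences, pattern, k):
    """Return exact occurrences of `pattern`, seeded by its first k-mer."""
    if len(pattern) < k:
        raise ValueError("pattern must be at least k characters long")
    hits = []
    # Only positions sharing the first k-mer are candidates; verify each one.
    for seq_id, pos in index.get(pattern[:k], []):
        if sequences[seq_id].startswith(pattern, pos):
            hits.append((seq_id, pos))
    return hits


# Toy data (hypothetical), not taken from the thesis.
sequences = {"geneA": "ACGTACGTGA", "geneB": "TTACGTGACC"}
index = build_kmer_index(sequences, k=4)
hits = find_pattern(index, sequences, "ACGTGA", k=4)
```

The index is built once, so each query only touches the positions that share its first k-mer instead of scanning every sequence.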
The complexity of biological processes, taken together with the fact that high-throughput measurements, such as DNA/RNA sequencing data, measure many different things at once, means that these data sets will often contain multiple overlaid signals. If data is collected in the field, rather than produced entirely under controlled conditions in the lab, such overlap is practically unavoidable. In Paper III, we present SMSSVD (SubMatrix Selection Singular Value Decomposition), a parameter-free unsupervised signal decomposition and dimension reduction method, particularly useful for data sets with many variables. By adaptively reducing the noise for each signal, SMSSVD creates a representation with many desirable properties inherited from the ordinary SVD, while being able to discover signals closer to the limit of detection.
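For orientation, the ordinary SVD that SMSSVD is compared against can be sketched as follows. This is not SMSSVD itself: it shows plain truncated-SVD dimension reduction on synthetic data with two overlaid signals. The data sizes, signal strengths and noise level are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vars, n_samples = 500, 40  # many variables, few samples (assumed sizes)

# Two overlaid rank-1 signals with orthogonal sample patterns, plus noise.
u1 = rng.normal(size=(n_vars, 1))
v1 = np.repeat([1.0, -1.0], n_samples // 2)[None, :]  # first half vs second half
u2 = rng.normal(size=(n_vars, 1))
v2 = np.tile([1.0, -1.0], n_samples // 2)[None, :]    # alternating samples
X = 3 * u1 @ v1 + 2 * u2 @ v2 + rng.normal(scale=0.5, size=(n_vars, n_samples))

# Center each variable across samples, then take the SVD.
X_centered = X - X.mean(axis=1, keepdims=True)
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

d = 2  # number of dimensions to keep
embedding = (s[:d, None] * Vt[:d]).T  # one d-dimensional point per sample
explained = (s[:d] ** 2).sum() / (s ** 2).sum()
```

Here the signals are strong relative to the noise, so two components capture most of the variance. The abstract's claim concerns the harder case where a signal lies close to the limit of detection and the ordinary SVD struggles to separate it from noise.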
In Paper II and Paper IV we describe models for representing genetically related but still heterogeneous microbial populations and show how the composition of the population determines the interaction with the host. The DISSEQT pipeline (DIStribution-based SEQuence space Time dynamics), developed in Paper IV, covers the entire workflow from read alignment to visualization of results. We model each population as a positive measure over sequence space and apply SMSSVD to get a robust representation. Using our model, we follow and visualize the evolutionary trajectories of the populations through time, highlighting important minority variants emerging. Finally, we demonstrate the relevance of our population model by showing that it can accurately predict the population fitness, whereas a model based on the consensus sequence fails.
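The contrast between a distribution-based representation and a consensus sequence can be sketched with a deliberately simplified model. The assumption here is that a population is summarized by per-site nucleotide frequencies of aligned, equal-length reads; this is a stand-in for illustration and not the DISSEQT model. The helper names and toy reads are hypothetical.

```python
import numpy as np

ALPHABET = "ACGT"


def site_frequencies(reads):
    """Per-site nucleotide frequencies for aligned reads of equal length."""
    counts = np.zeros((len(reads[0]), len(ALPHABET)))
    for read in reads:
        for pos, base in enumerate(read):
            counts[pos, ALPHABET.index(base)] += 1
    return counts / counts.sum(axis=1, keepdims=True)


def consensus(freqs):
    """Most frequent nucleotide at each site."""
    return "".join(ALPHABET[i] for i in freqs.argmax(axis=1))


# Two toy populations that differ only in the share of a minority variant.
pop_a = ["ACGT"] * 9 + ["ACTT"] * 1
pop_b = ["ACGT"] * 6 + ["ACTT"] * 4

freq_a, freq_b = site_frequencies(pop_a), site_frequencies(pop_b)
same_consensus = consensus(freq_a) == consensus(freq_b)
distance = np.abs(freq_a - freq_b).sum()  # L1 distance between the two summaries
```

Both populations share the consensus `ACGT`, so a consensus-based model cannot tell them apart, while the frequency representation registers the growing minority variant.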
author
Perby Henningsson, Rasmus
opponent
  • Professor Krogh, Anders, University of Copenhagen, Denmark
publishing date
2017
type
Thesis
publication status
published
keywords
Matematik, Bioinformatik, Dimensionsreduktion, Computational mathematics, Bioinformatics, Dimension reduction
pages
207 pages
publisher
Centre for Mathematical Sciences, Lund University
defense location
lecture hall Hörmandersalen, Centre for Mathematical Sciences, Sölvegatan 18, Lund University, Faculty of Engineering LTH, Lund
defense date
2018-02-02 13:15
ISBN
978-91-7753-480-8
978-91-7753-479-2
language
English
LU publication?
yes
id
9e162841-3bb5-44da-846f-6bf9d095212f
date added to LUP
2018-01-08 11:52:35
date last changed
2018-01-09 12:59:39
@phdthesis{9e162841-3bb5-44da-846f-6bf9d095212f,
  abstract     = {Over the last few decades, DNA sequencing has developed from costing billions of dollars to get the complete sequence of the human genome, to being a routine procedure performed in labs all around the world. This has transformed the field of experimental biology since measurements can be done at a level of detail that was not possible before. Still, the relationship between genotype and low-level cellular processes on one hand, and high-level phenotypic traits on the other, tends to be very complex; measuring does not equal understanding. In the large data sets that are being gathered, it is often hard to uncover patterns that are truly meaningful, and not just arising by random chance. In this work, we present novel methods for representing, exploring and visualizing genotype-phenotype data sets, with a particular focus on tracking changes driven by evolutionary processes as they occur. One challenge is to be able to quickly search for specific patterns in data coming from large genomes. We have adapted algorithms and data structures from the field of Information Retrieval, relying on inherent genomic structure to make efficient searches. In Paper I, we showcase these techniques with visualization of gene fusions in a study of paediatric B-cell precursor acute lymphoblastic leukaemia. The complexity of biological processes, taken together with the fact that high-throughput measurements, such as DNA/RNA sequencing data, measure many different things at once, means that these data sets will often contain multiple overlaid signals. If data is collected in the field, rather than produced entirely under controlled conditions in the lab, such overlap is practically unavoidable. In Paper III, we present SMSSVD (SubMatrix Selection Singular Value Decomposition), a parameter-free unsupervised signal decomposition and dimension reduction method, particularly useful for data sets with many variables. By adaptively reducing the noise for each signal, SMSSVD creates a representation with many desirable properties inherited from the ordinary SVD, while being able to discover signals closer to the limit of detection. In Paper II and Paper IV we describe models for representing genetically related but still heterogeneous microbial populations and show how the composition of the population determines the interaction with the host. The DISSEQT pipeline (DIStribution-based SEQuence space Time dynamics), developed in Paper IV, covers the entire workflow from read alignment to visualization of results. We model each population as a positive measure over sequence space and apply SMSSVD to get a robust representation. Using our model, we follow and visualize the evolutionary trajectories of the populations through time, highlighting important minority variants emerging. Finally, we demonstrate the relevance of our population model by showing that it can accurately predict the population fitness, whereas a model based on the consensus sequence fails.},
  author       = {Perby Henningsson, Rasmus},
  isbn         = {978-91-7753-480-8},
  keyword      = {Matematik,Bioinformatik,Dimensionsreduktion,Computational mathematics,Bioinformatics,Dimension reduction},
  language     = {eng},
  pages        = {207},
  publisher    = {Centre for Mathematical Sciences, Lund University},
  school       = {Lund University},
  title        = {Dimension Reduction and Signal Decomposition for Genotype–Phenotype Relations},
  year         = {2017},
}