Dimension Reduction and Signal Decomposition for Genotype–Phenotype Relations

Perby Henningsson, Rasmus

Perby Henningsson, Rasmus ^LU (2017)

Abstract: Over the last few decades, DNA sequencing has developed from costing billions of dollars to get the complete sequence of the human genome, to being a routine procedure performed in labs all around the world. This has transformed the field of experimental biology since measurements can be done at a level of detail that was not possible before. Still, the relationship between genotype and low-level cellular processes on one hand, and high-level phenotypic traits on the other, tends to be very complex; measuring does not equal understanding. In the large data sets that are being gathered, it is often hard to uncover patterns that are truly meaningful, and not just arising by random chance.
In this work, we present novel methods for representing, exploring and visualizing genotype-phenotype data sets, with a particular focus on tracking changes driven by evolutionary processes as they occur. One challenge is to be able to quickly search for specific patterns in data coming from large genomes. We have adapted algorithms and data structures from the field of Information Retrieval, relying on inherent genomic structure to make efficient searches. In Paper I, we showcase these techniques with visualization of gene fusions in a study of paediatric B-cell precursor acute lymphoblastic leukaemia.
Biological processes are complex, and high-throughput measurements such as DNA/RNA sequencing capture many different things at once, so these data sets often contain multiple overlaid signals. When data is collected in the field rather than produced entirely under controlled lab conditions, such overlap is practically unavoidable. In Paper III, we present SMSSVD (SubMatrix Selection Singular Value Decomposition), a parameter-free, unsupervised signal decomposition and dimension reduction method that is particularly useful for data sets with many variables. By adaptively reducing the noise for each signal, SMSSVD creates a representation that inherits many desirable properties from the ordinary SVD while detecting signals closer to the limit of detection.
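For orientation, the ordinary SVD that SMSSVD builds on can be sketched as follows. This is a minimal illustration of plain truncated-SVD dimension reduction, not the SMSSVD algorithm itself; the function name `svd_reduce`, the toy matrix, and the choice to center each variable are assumptions made for the example.

```python
import numpy as np

def svd_reduce(X, k):
    """Plain truncated SVD (not SMSSVD): keep the top-k singular triplets.

    X is a variables x samples matrix (an assumed layout for this sketch).
    """
    # Center each variable (row) so the SVD describes variation around the mean.
    Xc = X - X.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Low-dimensional sample coordinates, shape (k, n_samples).
    coords = s[:k, None] * Vt[:k, :]
    return U[:, :k], s[:k], coords

# Hypothetical toy data: 6 variables x 4 samples with two overlaid signals.
signal_a = np.outer([1, 1, 1, 0, 0, 0], [2, -2, 2, -2])
signal_b = np.outer([0, 0, 0, 1, -1, 0], [1, 1, -1, -1])
X = (signal_a + signal_b).astype(float)

U, s, coords = svd_reduce(X, k=2)
```

Because the toy matrix is noise-free and has rank two, `U @ coords` reproduces it exactly. With real, noisy data the weaker signal is harder to separate, which is the situation the abstract says SMSSVD addresses.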
In Paper II and Paper IV, we describe models for representing genetically related but still heterogeneous microbial populations and show how the composition of the population determines the interaction with the host. The DISSEQT pipeline (DIStribution-based SEQuence space Time dynamics), developed in Paper IV, covers the entire workflow from read alignment to visualization of results. We model each population as a positive measure over sequence space and apply SMSSVD to get a robust representation. Using our model, we follow and visualize the evolutionary trajectories of the populations through time, highlighting important emerging minority variants. Finally, we demonstrate the relevance of our population model by showing that it can accurately predict population fitness, whereas a model based on the consensus sequence fails.
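The contrast between a distribution-based population model and a consensus sequence can be illustrated with a small sketch. It is a simplification under stated assumptions and not the DISSEQT implementation: populations are represented here by per-position nucleotide frequencies computed from equal-length aligned reads, and the helper names and read data are hypothetical.

```python
from collections import Counter

def site_frequencies(reads):
    """Per-position nucleotide frequencies for equal-length aligned reads."""
    n = len(reads)
    return [
        {base: count / n for base, count in Counter(column).items()}
        for column in zip(*reads)
    ]

def consensus(reads):
    """Most common nucleotide at each position."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*reads)
    )

# Two hypothetical populations that share the same consensus sequence.
pop_a = ["ACGT", "ACGT", "ACGT", "ACGT"]
pop_b = ["ACGT", "ACGT", "ACGT", "ATGT"]  # one read carries a minority variant

freq_a = site_frequencies(pop_a)
freq_b = site_frequencies(pop_b)
```

Both populations have the consensus `ACGT`, yet their frequency representations differ at the second position. A consensus-based model cannot see that difference, while a distribution-based one can.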

Please use this url to cite or link to this publication: https://lup.lub.lu.se/record/9e162841-3bb5-44da-846f-6bf9d095212f

author

Perby Henningsson, Rasmus ^LU

supervisor

Magnus Fontes ^LU
Thoas Fioretos ^LU

opponent

Professor Krogh, Anders, University of Copenhagen, Denmark

organization

Mathematics (Faculty of Engineering)

publishing date

2017-12

type

Thesis

publication status

published

subject

Mathematical Sciences

keywords

Matematik, Bioinformatik, Dimensionsreduktion, Computational mathematics, Bioinformatics, Dimension reduction

pages

207 pages

publisher

Centre for Mathematical Sciences, Lund University

defense location

lecture hall Hörmandersalen, Centre for Mathematical Sciences, Sölvegatan 18, Lund University, Faculty of Engineering LTH, Lund

defense date

2018-02-02 13:15:00

ISBN

978-91-7753-479-2

978-91-7753-480-8

language

English

LU publication?

yes

id

9e162841-3bb5-44da-846f-6bf9d095212f

date added to LUP

2018-01-08 11:52:35

date last changed

2025-04-04 14:05:41

@phdthesis{9e162841-3bb5-44da-846f-6bf9d095212f,
  abstract     = {{Over the last few decades, DNA sequencing has developed from costing billions of dollars to get the complete sequence of the human genome, to being a routine procedure performed in labs all around the world. This has transformed the field of experimental biology since measurements can be done at a level of detail that was not possible before. Still, the relationship between genotype and low-level cellular processes on one hand, and high-level phenotypic traits on the other, tends to be very complex; measuring does not equal understanding. In the large data sets that are being gathered, it is often hard to uncover patterns that are truly meaningful, and not just arising by random chance. In this work, we present novel methods for representing, exploring and visualizing genotype-phenotype data sets, with a particular focus on tracking changes driven by evolutionary processes as they occur. One challenge is to be able to quickly search for specific patterns in data coming from large genomes. We have adapted algorithms and data structures from the field of Information Retrieval, relying on inherent genomic structure to make efficient searches. In Paper I, we showcase these techniques with visualization of gene fusions in a study of paediatric B-cell precursor acute lymphoblastic leukaemia. The complexity of biological processes, taken together with the fact that high-throughput measurements, such as DNA/RNA sequencing data, measure many different things at once, means that these data sets will often contain multiple overlaid signals. If data is collected in the field, rather than produced entirely under controlled conditions in the lab, it is practically unavoidable. In Paper III, we present SMSSVD – SubMatrix Selection Singular Value Decomposition, a parameter-free unsupervised signal decomposition and dimension reduction method, particularly useful for data sets with many variables.
By adaptively reducing the noise for each signal, SMSSVD creates a representation with many desirable properties inherited from the ordinary SVD, while being able to discover signals closer to the limit of detection. In Paper II and Paper IV we describe models for representing genetically related but still heterogeneous microbial populations and show how the composition of the population determines the interaction with the host. The DISSEQT pipeline (DIStribution-based SEQuence space Time dynamics) developed in Paper IV, covers the entire workflow from read alignment to visualization of results. We model each population as a positive measure over sequence space and apply SMSSVD to get a robust representation. Using our model, we follow and visualize the evolutionary trajectories of the populations through time, highlighting important minority variants emerging. Finally, we demonstrate the relevance of our population model by showing that it can accurately predict the population fitness, whereas a model based on the consensus sequence fails.}},
  author       = {{Perby Henningsson, Rasmus}},
  isbn         = {{978-91-7753-479-2}},
  keywords     = {{Matematik; Bioinformatik; Dimensionsreduktion; Computational mathematics; Bioinformatics; Dimension reduction}},
  language     = {{eng}},
  publisher    = {{Centre for Mathematical Sciences, Lund University}},
  school       = {{Lund University}},
  title        = {{Dimension Reduction and Signal Decomposition for Genotype–Phenotype Relations}},
  url          = {{https://lup.lub.lu.se/search/files/36565484/thesis_printed_final.pdf}},
  year         = {{2017}},
}

Lund University Publications