Modelling Allelic and DNA Copy Number Variations using Continuous-index Hidden Markov Models

Stjernqvist, Susann

Stjernqvist, Susann (LU) (2010). In Doctoral Theses in Mathematical Sciences 2010:7.

Abstract: Human cells usually contain two copies of each chromosome, but cancer cells can carry abnormalities: segments of chromosomes with an altered number of copies. These can be deletions as well as amplifications, and the lengths of the segments vary. Localising the aberrant regions is of great importance for increasing knowledge of the disease. In this thesis the copy numbers are modelled using Hidden Markov Models (HMMs). A hidden Markov process can be described as a Markov process observed in noise; it thus consists of two processes, an unobservable Markov process and the observed process.

In paper A we present a method for array comparative genomic hybridisation (aCGH) data from tiling BAC arrays, where the probes are rather long and may overlap. The probes are also of unequal length and unevenly spread over the genome, which motivates a continuous-index process. We assume that the Markov process has a discrete state space, and the parameters are estimated with a Monte Carlo EM (MCEM) algorithm. The model in paper B modifies the model in paper A so that the Markov process takes values in a continuous state space. This makes the method more realistic, since it can handle larger differences in the data, including systematic errors. In addition, we assume that some of the transition rates are common in order to obtain a parsimonious model. We take a Bayesian approach and use reversible jump MCMC to simulate the Markov process.
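
To illustrate why a continuous-index process suits unevenly spaced probes, the sketch below computes transition probabilities between two genomic positions as the matrix exponential of a rate matrix scaled by the distance between them. It is a minimal illustration and not the thesis's implementation: the three-state rate matrix, the positions and the function names (`mat_mul`, `transition_matrix`) are all assumptions made for this example.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(q, distance, terms=20):
    """Return P(distance) = exp(Q * distance) for a rate matrix Q.

    Uses a truncated Taylor series with scaling and squaring, which is
    adequate for the small matrices in this illustration.
    """
    n = len(q)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    # Halve the step until the scaled matrix has a small norm.
    norm = max(sum(abs(x) for x in row) for row in q) * distance
    squarings = max(0, math.ceil(math.log2(norm)) + 1) if norm > 0 else 0
    step = distance / 2 ** squarings
    a = [[x * step for x in row] for row in q]
    result = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for k in range(1, terms + 1):
        # term holds (Q * step)^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[r + t for r, t in zip(rr, tr)]
                  for rr, tr in zip(result, term)]
    # Undo the scaling: exp(A) = (exp(A / 2^s))^(2^s).
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


# Hypothetical three-state chain: 0 = loss, 1 = normal, 2 = gain.
# Rates are per unit of genomic distance (an assumed unit, e.g. Mb).
rates = [[-0.2, 0.2, 0.0],
         [0.05, -0.1, 0.05],
         [0.0, 0.2, -0.2]]
positions = [0.0, 0.3, 0.35, 2.0]  # unevenly spaced, made-up positions
for left, right in zip(positions, positions[1:]):
    p = transition_matrix(rates, right - left)
    print(f"gap {right - left:.2f}: P(stay normal) = {p[1][1]:.4f}")
```

Because the transition matrix depends on the gap, closely spaced probes are strongly tied to the same state while distant probes are only weakly coupled, which a fixed-step discrete-index chain cannot express directly.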

In paper C we present a model designed for SNP data, which consist of intensities for the two alleles at each SNP. We assume a discrete number of states, but keep the parsimonious approach from paper B, in which some of the transition rates are common. The SNPs are point measurements but are unevenly spread over the genome, which motivates a continuous-index process. In paper D we present an MCMC sampler suitable for hidden Markov models under a Bayesian approach. We alternate between updating the parameters and the trajectory, and for the latter update we present a sequential Monte Carlo method based on forward filtering-backward simulation. The method is applied to oligonucleotide copy number data with the same model as in paper B.
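
The forward filtering-backward simulation idea mentioned for paper D can be sketched in its simplest exact form, for a two-state hidden chain observed at unevenly spaced positions. This is not the sequential Monte Carlo method of paper D, which targets a continuous state space; it is the standard discrete-state version, and the Gaussian observation model, the rates, the data and the function names are assumptions made for this example.

```python
import math
import random


def two_state_transition(rate_up, rate_down, distance):
    """Closed-form transition matrix of a two-state continuous-index chain."""
    total = rate_up + rate_down
    decay = math.exp(-total * distance)
    p01 = rate_up / total * (1.0 - decay)
    p10 = rate_down / total * (1.0 - decay)
    return [[1.0 - p01, p01], [p10, 1.0 - p10]]


def gaussian_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def ffbs(positions, observations, means, sd, rate_up, rate_down, rng):
    """Forward filtering, then backward simulation of one state trajectory."""
    n = len(observations)
    total = rate_up + rate_down
    initial = [rate_down / total, rate_up / total]  # stationary distribution
    filtered = []
    trans = [None]  # trans[t] covers the gap between positions t-1 and t
    for t in range(n):
        if t == 0:
            predicted = initial
        else:
            gap = positions[t] - positions[t - 1]
            p = two_state_transition(rate_up, rate_down, gap)
            trans.append(p)
            prev = filtered[-1]
            predicted = [sum(prev[i] * p[i][j] for i in range(2))
                         for j in range(2)]
        weights = [predicted[j] * gaussian_pdf(observations[t], means[j], sd)
                   for j in range(2)]
        norm = sum(weights)
        filtered.append([w / norm for w in weights])
    # Backward pass: draw the last state from its filtering distribution,
    # then each earlier state given the state drawn just after it.
    path = [0] * n
    path[-1] = 0 if rng.random() < filtered[-1][0] else 1
    for t in range(n - 2, -1, -1):
        p = trans[t + 1]
        w = [filtered[t][i] * p[i][path[t + 1]] for i in range(2)]
        path[t] = 0 if rng.random() * (w[0] + w[1]) < w[0] else 1
    return filtered, path


# Made-up data: state 0 has mean 0.0 (normal), state 1 has mean 1.0 (gain).
rng = random.Random(0)
positions = [0.0, 0.4, 0.5, 2.0, 2.1, 5.0]
log_ratios = [0.0, 0.1, -0.1, 1.0, 0.9, 1.1]
filtered, path = ffbs(positions, log_ratios, means=[0.0, 1.0], sd=0.1,
                      rate_up=0.5, rate_down=0.5, rng=rng)
print(path)
```

In a full sampler of the kind the abstract describes, a trajectory update like this would alternate with an update of the parameters given the sampled trajectory.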
Popular abstract (translated from Swedish):

DNA, in the form of chromosomes, is found in the cells and contains information that controls most of the body's functions. Humans usually have 23 pairs of chromosomes, where one chromosome in each pair comes from the person's mother and the other from the father. DNA looks like a twisted ladder whose rungs consist of nucleotide base pairs.

In cancer cells there can be pieces of chromosomes that are present in a number of copies other than two. One possibility is that a piece of one of the original chromosomes has been copied so that there are extra copies of it, and another is that a piece of one of the copies has been lost. By identifying which parts of the chromosomes have an incorrect number of copies, knowledge about cancer can increase and, for example, methods for detecting and treating the disease can be improved.

Which segments have a deviating number of copies varies between patients, and to describe that variation it is appropriate to use a model that incorporates randomness. A suitable model is a Markov process, which describes the number of copies at each base pair position. The special feature of Markov processes is that the number of copies at a base pair position depends on the number of copies at adjacent base pair positions, but not on those further away. Markov processes are described by probabilities: if, for example, there are two copies at one base pair position, there is one probability that there are also two copies at the next position, another probability that there is one copy, a third probability that there are three copies, and so on. These processes are well suited to describing DNA copy numbers because adjacent base pairs often have the same number of copies.

Measuring the number of copies of the chromosomes is a complicated process that requires special technical equipment. As a result, the measurements contain various kinds of measurement errors, so it is not possible to determine directly how many copies there are at each base pair position. Markov processes alone are then not enough; the model must also include the measurement errors. These are also suitably described by a statistical model, and the result is what is called a hidden Markov model.

In this thesis, different kinds of hidden Markov models are used, and several statistical methods have been developed to analyse them, among other things by estimating the values of various parameters. This yields information such as that a certain region most likely contains four copies of the chromosomes while another most likely contains two. That information can then guide which genes should be studied further and where the chance of finding cancer-causing genes is greatest. It can also be interesting to compare the results from several patients and draw conclusions, for example that everyone with a certain number of copies in a certain region has the same variant of the disease.

Please use this URL to cite or link to this publication: https://lup.lub.lu.se/record/1681910

author: Stjernqvist, Susann (LU)
supervisor: Tobias Rydén (LU)
opponent: Professor Hössjer, Ola, Department of Mathematics, Stockholm University
organization: Mathematical Statistics
publishing date: 2010
type: Thesis
publication status: published
subject: Probability Theory and Statistics
keywords: Hidden Markov models, DNA copy number, allelic copy number, Markov chain Monte Carlo
in: Doctoral Theses in Mathematical Sciences
volume: 2010:7
pages: 161
publisher: Mathematical Statistics, Centre for Mathematical Sciences, Lund University
defense location: MH:C
defense date: 2010-10-22 10:15:00
ISSN: 1404-0034
language: English
LU publication?: yes
id: 6e092c3b-a54d-4c23-96f9-2aafa5a3c763 (old id 1681910)
date added to LUP: 2016-04-01 14:29:48
date last changed: 2025-04-04 14:21:12

@phdthesis{6e092c3b-a54d-4c23-96f9-2aafa5a3c763,
  abstract     = {{In human cells there are usually two copies of each chromosome, but in cancer cells abnormalities could exist. The differences consist of segments of chromosomes with an altered number of copies. There can be deletions as well as amplifications and the lengths of the segments can also vary. Localising the deviant regions is of great importance for increasing the knowledge of the disease. In this thesis the copy numbers are modelled using Hidden Markov Models (HMMs). A hidden Markov process can be described as a Markov process observed in noise; thus it consists of two different processes such that one is an unobservable Markov process, while the other is the observed process.

In paper A we present a method suitable for aCGH data from tiling BAC arrays, i.e. the probes are rather long and could overlap. In addition they are of unequal lengths and unevenly spread over the genome, which makes it suitable to apply a continuous-index process. We assume the Markov model to have a discrete state space and the parameters are estimated with an MCEM algorithm. The model in paper B is a modification of the model in paper A, such that the Markov process takes values in a continuous state space. This makes the method more realistic since it can handle larger differences in the data, including systematic errors. In addition we assume some of the transition rates to be common to get a parsimonious model. We take a Bayesian approach and use reversible jump MCMC to simulate the Markov process.

In paper C we present a model designed for SNP data which consists of allelic intensities for the two alleles at each SNP. We assume a discrete number of states, but keep the parsimonious approach from paper B such that some of the transition rates are common. The SNPs are point measurements but unevenly spread over the genome which motivates a continuous-index process. Further on in paper D we present an MCMC sampler, which is suitable for hidden Markov models, when taking a Bayesian approach. We alternate between updating the parameters and the trajectory, and for the latter update we present a sequential Monte Carlo method based on forward filtering-backward simulation. The method is applied on oligonucleotide copy number data with the same model as in paper B.}},
  author       = {{Stjernqvist, Susann}},
  issn         = {{1404-0034}},
  keywords     = {{Hidden Markov models; DNA copy number; allelic copy number; Markov chain Monte Carlo}},
  language     = {{eng}},
  publisher    = {{Mathematical Statistics, Centre for Mathematical Sciences, Lund University}},
  school       = {{Lund University}},
  series       = {{Doctoral Theses in Mathematical Sciences}},
  title        = {{Modelling Allelic and DNA Copy Number Variations using Continuous-index Hidden Markov Models}},
  volume       = {{2010:7}},
  year         = {{2010}},
}

Lund University Publications
