Pointwise and Genomewide Significance Calculations in Gene Mapping through Nonparametric Linkage Analysis: Theory, Algorithms and Applications

Ängquist, Lars

Ängquist, Lars (LU) (2007)

Abstract: In linkage analysis or, in a wider sense, gene mapping, one searches for disease loci along a genome. This is done by observing so-called marker genotypes (alleles) and phenotypes (affected/unaffected) in a pedigree set, i.e. a set of multigenerational families, in order to locate the loci corresponding to the underlying disease genes or, at least, to narrow down the genome regions of interest. In this context the key concept is the genetic inheritance of alleles in relation to the phenotype outcomes. A significant deviation from what is expected under random inheritance is taken as statistical evidence of genetic components, presumed to be located at the loci giving significant results.

In the thesis introduction we begin by outlining the genetic foundations of statistical genetics as well as some basic concepts, for instance the process of allelic inheritance, the genetic disease model, the pedigree set, the inheritance vector and various types of genetic information. Next, we give an introduction to one-locus nonparametric linkage analysis, focusing on significance calculations for nonparametric linkage (NPL) scores, and comment on the generalizations to two-locus procedures and on the related but contrasting approach of parametric linkage analysis. In the third section we very briefly discuss some competing and complementary subfields of statistical genetics, and finally we put the papers included in this thesis into context by summarizing their content.
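To make the inheritance vector and the notion of random inheritance concrete, the following minimal Python sketch enumerates all inheritance vectors of a single sib-pair at one locus. The bit encoding (one bit per meiosis, saying which of the parent's two alleles was transmitted) and the function names are illustrative assumptions, not notation taken from the thesis.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def ibd_count(vector):
    """Number of alleles two sibs share identical by descent (IBD).

    vector = (father->sib1, mother->sib1, father->sib2, mother->sib2);
    each bit says which of the parent's two alleles was passed on.
    """
    pat1, mat1, pat2, mat2 = vector
    return int(pat1 == pat2) + int(mat1 == mat2)


def null_ibd_distribution():
    """IBD distribution when all 16 inheritance vectors are equally likely."""
    vectors = list(product((0, 1), repeat=4))
    counts = Counter(ibd_count(v) for v in vectors)
    return {k: Fraction(c, len(vectors)) for k, c in sorted(counts.items())}


dist = null_ibd_distribution()
mean = sum(k * p for k, p in dist.items())
variance = sum((k - mean) ** 2 * p for k, p in dist.items())
print(dist, mean, variance)
```

Under random inheritance the sibs share 0, 1 or 2 alleles with probabilities 1/4, 1/2 and 1/4, so the sharing count has mean 1 and variance 1/2. Excess sharing among affected relatives is the kind of deviation a linkage score is designed to detect.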

Performing gene-mapping studies across the whole genome, or substantial parts of it, gives rise to problems of interpretation owing to multiple testing. The theme of the thesis is how to calculate significance levels and powers in several such contexts.
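The scale of the multiple-testing problem can be illustrated with a short Python sketch relating pointwise and genomewide significance levels. It assumes, purely for illustration, that the scan consists of a fixed number of independent tests. Real marker loci are correlated along a chromosome, so this is a rough calculation and not a method from the thesis; the function names are hypothetical.

```python
def genomewide_level(pointwise_alpha, n_tests):
    """P(at least one false positive) over n_tests independent tests."""
    return 1.0 - (1.0 - pointwise_alpha) ** n_tests


def pointwise_level(genomewide_alpha, n_tests):
    """Pointwise level needed to reach a target genomewide level
    (inverse of genomewide_level, again assuming independence)."""
    return 1.0 - (1.0 - genomewide_alpha) ** (1.0 / n_tests)


# Illustrative numbers only: 400 independent tests.
print(genomewide_level(0.05, 400))
print(pointwise_level(0.05, 400))
```

With many tests, a pointwise level of 0.05 gives a genomewide error probability close to one, which is why thresholds must be calibrated for the whole scan.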

The first two papers consider one-locus NPL analysis, i.e. where one searches for one disease gene at a time. In Paper A existing analytical approximations of significance levels are improved and extended. The suggested formula is based on extreme-value theory for stochastic processes and a general link function between a continuous version of an arbitrary distribution function and the standard normal distribution function. In Paper B a new variant of weighted simulation for stochastic processes is developed in order to calculate significance levels. The method can handle complete as well as incomplete marker data and is very fast compared with traditional Monte Carlo-based simulation algorithms.
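As a point of reference for what an analytical approximation of a genomewide significance level looks like, here is a minimal Python sketch of the classical extreme-value formula for the maximum of a normally distributed score process, in the form popularized by Lander and Kruglyak. It is not the improved formula of Paper A. It assumes a dense marker map and a standard normal marginal distribution, and the parameter values in the usage line (23 chromosomes, 33 Morgans, crossover rate 2) are illustrative assumptions.

```python
import math


def normal_tail(x):
    """Upper tail probability 1 - Phi(x) of the standard normal."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def genomewide_p_value(threshold, n_chromosomes, genome_length_morgans,
                       crossover_rate):
    """Approximate P(max score over the genome > threshold).

    expected_regions = [C + 2 * rho * G * T^2] * (1 - Phi(T)) is the
    approximate expected number of regions where the score exceeds T;
    the number of such regions is treated as Poisson distributed.
    """
    alpha = normal_tail(threshold)  # pointwise significance level
    expected_regions = (
        n_chromosomes
        + 2.0 * crossover_rate * genome_length_morgans * threshold ** 2
    ) * alpha
    return 1.0 - math.exp(-expected_regions)


# Illustrative values only (assumptions, not taken from the thesis).
print(genomewide_p_value(4.07, n_chromosomes=23,
                         genome_length_morgans=33.0, crossover_rate=2.0))
```

The formula makes the dependence on the crossover rate and genome length explicit: the faster the score process decorrelates along the genome, the more effectively independent tests there are and the higher the threshold must be.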

The last two papers address two-locus NPL analysis, i.e. where one is interested in diseases with genetic components based on two distinct (nonsyntenic) disease genes. In Paper C significance levels and powers for unconditional two-locus analysis, i.e. where one searches for two disease genes simultaneously, are derived and discussed for homogeneous pedigree sets consisting of affected sib-pairs. Finally, in Paper D, a general approach for calculating significance levels and powers in conditional two-locus analysis is developed. The conditional approach may be seen as a hybrid of one-locus and two-locus NPL analysis. Of central importance to this paper is the concept of the noncentrality parameter, which is essentially the expected value of the test statistic of interest, i.e. the NPL score, under a corresponding instance of the alternative hypothesis.
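The role of the noncentrality parameter can be illustrated in the simplest one-locus setting with a hedged Python sketch for affected sib-pairs. It assumes fully informative markers and a score that standardizes the number of alleles shared identical by descent (null mean 1, null variance 1/2) and sums over families. The sharing probabilities z1 and z2 under the alternative, the function names and the normal power approximation (which ignores the change in variance under the alternative) are illustrative assumptions, not the derivations of Papers C and D.

```python
import math


def normal_tail(x):
    """Upper tail probability 1 - Phi(x) of the standard normal."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def asp_noncentrality(n_pairs, z1, z2):
    """Expected standardized sib-pair score at the disease locus.

    z1, z2: probabilities that an affected pair shares 1 or 2 alleles
    IBD under the alternative; the null values are 1/2 and 1/4.
    """
    expected_ibd = z1 + 2.0 * z2
    return math.sqrt(2.0 * n_pairs) * (expected_ibd - 1.0)


def approx_pointwise_power(threshold, noncentrality):
    """Normal approximation to P(score > threshold) at the disease locus."""
    return normal_tail(threshold - noncentrality)


eta = asp_noncentrality(n_pairs=200, z1=0.5, z2=0.35)
print(eta, approx_pointwise_power(3.0, eta))
```

Under the null sharing probabilities the noncentrality parameter is zero, and it grows with the square root of the number of families, which is what makes it a convenient summary when comparing powers.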
Abstract (Swedish): Popular abstract (translated from Swedish)

In linkage analysis or, in a somewhat more general sense, gene mapping, one searches for disease genes along a genome. Here a genome can be interpreted as a set of whole chromosomes or pieces of chromosomes. For a set of connected multigenerational families one then observes, along the chromosome segments of the genome, marker data, i.e. genotypes consisting of alleles inherited from the paternal and maternal sides respectively. These observations are then analysed together with observations of the individuals' phenotypes, i.e. disease status (affected/unaffected/status unknown). The bottom line is that one wants to localize disease genes by finding abnormally strong links between the inheritance of alleles at certain chromosome positions (loci) and the distribution of phenotypes over the individuals included in the families. One wants to achieve this with the best possible precision. A key observation is that a deviation, significant in some sense, in the link between genotypes and phenotypes from what can be expected under the hypothesis of random inheritance statistically indicates a genetic component linked to the corresponding observed locus. (The concept of random inheritance originates with Gregor Mendel.) An interesting deviation usually consists of the members of each phenotype group sharing more inherited alleles with one another, in some sense, than can be considered reasonable under random inheritance.

The thesis introduction describes the basic genetic concepts that are important for the discipline of statistical genetics. In addition, fundamental concepts are introduced, for example: the process of inheritance of alleles; the genetic disease model, which statistically describes the framework for the link between phenotypes and genotypes, how prevalent the disease is and sometimes also where it is located; the data material consisting of observed families; the inheritance vector, which describes how the inheritance of alleles has taken place in a specific family; and various ways of describing the amount of available genetic information. After this, an introduction is given to so-called one-locus nonparametric linkage analysis, where the focus is on significance calculations for a certain type of test statistic (the NPL score). The term nonparametric refers to the fact that no assumption is made about the structure of the genetic model. One-locus analysis means that one searches for one disease locus at a time along the genome in question. Furthermore, loosely speaking, significance calculations are carried out in order to quantify whether interesting results found in the analysis deviate (in a statistical sense) sufficiently from the normal for one to dare believe that a disease-related locus has been found. If one searches for diseases linked to inheritance at two disease loci, one performs a two-locus analysis. Certain generalizations to this extended case, as well as connections to and differences from the alternative method of parametric linkage analysis, are also included in the introduction. As may be understood from the definition above, parametric analysis assumes knowledge of the underlying disease model. The third part of the introduction gives an overview of certain adjacent, alternative and complementary research fields within statistical genetics. Finally, the content of the four articles included in the thesis is summarized. This is also done very briefly below.

In general, searching for genes over substantially large chromosome regions gives rise to problems in interpreting significance because of so-called multiple testing. The main focus of the four papers of the thesis is to carry out significance calculations (including so-called power) in a reasonable way in various situations related to both one-locus and two-locus nonparametric linkage analysis in connection with genomewide multiple testing.

The first two articles deal with one-locus analysis. The first (Paper A) improves and extends certain existing analytical approximations for carrying out the related significance calculations. Analytical approximations here means that one derives formulas (closed-form expressions) into which one can insert the current values of the included parameters and thus directly obtain numerical approximations. The second article (Paper B) deals with the same problem, but here the approximations are based on so-called Monte Carlo simulations instead of fixed analytical approximation formulas. This type of simulation means, loosely speaking, that one randomly (exactly or approximately) generates (simulates) trajectories or processes of the type of interest and then analyses the outcome of these trajectories. In traditional Monte Carlo simulation for genomewide nonparametric linkage analysis, a computational problem often arises because it takes, in some sense, too long to generate sufficiently many trajectories that are interesting for the analysis. This is because the test statistic (the random variable that is plugged into the analytical or simulation-based approximation formulas) far too often takes values that are too low under our analysis scenario (our null hypothesis). To solve this we introduce a randomly placed artificial disease locus, which in simulations generally leads to higher values of the test statistic in the vicinity of this locus. To obtain a correct probabilistic interpretation we also correct for this procedure by weighting together the results of the different trajectories in a certain way with respect to the approximation formula (importance sampling, weighted simulation).
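The idea of an artificial, randomly placed disease locus combined with reweighting can be sketched in Python as follows. This is a minimal illustration under simplifying assumptions, not the algorithm of Paper B: the score process under the null hypothesis is modelled as a stationary Gaussian AR(1) sequence over equally spaced loci on one chromosome, marker data are treated as complete, and the names `rho` (correlation between adjacent loci) and `theta` (size of the artificial signal) are hypothetical.

```python
import math
import random


def simulate_shifted_path(n_loci, rho, tau, theta, rng):
    """Null AR(1) path with N(0, 1) margins plus a mean bump at locus tau.

    Adding theta * rho**|i - tau| is the same as exponentially tilting
    the null distribution by exp(theta * z[tau] - theta**2 / 2).
    """
    z = [rng.gauss(0.0, 1.0)]
    innovation_sd = math.sqrt(1.0 - rho * rho)
    for _ in range(1, n_loci):
        z.append(rho * z[-1] + innovation_sd * rng.gauss(0.0, 1.0))
    return [zi + theta * rho ** abs(i - tau) for i, zi in enumerate(z)]


def importance_sampling_p_value(threshold, n_loci, rho, n_sims,
                                theta=None, seed=0):
    """Estimate P(max_i Z_i > threshold) under the null by weighted simulation."""
    rng = random.Random(seed)
    theta = threshold if theta is None else theta
    total = 0.0
    for _ in range(n_sims):
        tau = rng.randrange(n_loci)  # randomly placed artificial locus
        z = simulate_shifted_path(n_loci, rho, tau, theta, rng)
        if max(z) > threshold:
            # Likelihood ratio of the sampling mixture (uniform over tau)
            # to the null distribution; its inverse is the weight.
            mixture = sum(math.exp(theta * zi - 0.5 * theta * theta)
                          for zi in z) / n_loci
            total += 1.0 / mixture
    return total / n_sims


print(importance_sampling_p_value(3.5, n_loci=100, rho=0.9, n_sims=2000))
```

Because the artificial locus pushes most simulated paths above the threshold, far fewer simulations are needed than with direct simulation under the null, while the weights keep the estimate unbiased for the null exceedance probability.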

The two concluding articles focus on two-locus analysis. The first of these (Paper C) deals with so-called unconditional two-locus analysis, which means that one searches for two different disease genes simultaneously. In our case the set of families consists solely of so-called affected sib-pairs, i.e. we have a homogeneous family material in which each family consists of a pair of parents and a pair of affected children of these parents. A general framework is laid out, with terminology and a discussion of different approaches and different types of related significance calculations (significance levels and power) for various possible situations. Finally, in the last article (Paper D), a general approach is developed for significance calculations and related quantities in so-called conditional two-locus analysis. The conditional analysis can be seen as a hybrid between one-locus and two-locus analysis in which one conditions on some type of information from a first conditioning locus before searching for a second locus. Here the conditioning loci may be given a priori or estimated from an initial one-locus analysis, and the information conditioned on may be one-locus results (from the test statistic) or the corresponding underlying inheritance vectors. This gives rise to sequential rather than simultaneous two-locus methods. One may also note that of central importance in this context is the concept of the noncentrality parameter, which, simply put, is equivalent to the expected value of the test statistic in question under a well-defined disease model (alternative hypothesis).

Please use this url to cite or link to this publication: https://lup.lub.lu.se/record/548075

author

Ängquist, Lars (LU)

supervisor

Ola Hössjer (LU)

opponent

Docent Nilsson, Staffan, Matematiska vetenskaper, Chalmers tekniska högskola

organization

Mathematical Statistics

publishing date

2007

type

Thesis

publication status

published

subject

Probability Theory and Statistics

keywords

programming, actuarial mathematics, cytogenetics, Mathematics, Statistics, operations research, Genetics, ROC curves, conditioning loci, optimal score functions, noncentrality parameter, cost adjusted relative efficiency, exponential tilting, importance sampling, Monte Carlo simulation, normal approximation, crossover rate, process maximum, analytical approximation, significance calculations, NPL score, conditional linkage analysis, two-locus linkage analysis, allele sharing, gene-gene interaction, composite hypotheses, genetic disease models, classes of score functions, nonparametric linkage analysis

pages

257 pages

publisher

Centre for Mathematical Sciences, Lund University

defense location

MH:A of Centre for Mathematical Sciences, Sölvegatan 18, Lund.

defense date

2007-03-16 09:15:00

external identifiers

other:ISRN: LUNFMS-1019-2006

ISBN

978-91-628-7068-3

language

English

LU publication?

yes

id

1ca5a2c1-e919-4493-8ca2-29947e650f37 (old id 548075)

date added to LUP

2016-04-01 16:08:06

date last changed

2025-04-04 15:07:42

@phdthesis{1ca5a2c1-e919-4493-8ca2-29947e650f37,
  abstract     = {{In linkage analysis or, in a wider sense, gene mapping one searches for disease loci along a genome. This is done by observing so called marker genotypes (alleles) and phenotypes (affecteds/unaffecteds) of a pedigree set, i.e. a set of multigenerational families, in order to locate the loci corresponding to the underlying disease genes or, at least, to narrow down the interesting genome regions. In this context the key concept is the genetic inheritance of alleles with respect to the phenotype outcomes. A significant deviation from what is expected under random inheritance is taken as statistical evidence of existing genetic components suggested to be located at the loci giving significant results.

In the thesis introduction we begin by outlining the needed genetical foundation of statistical genetics as well as some basic concepts, for instance, the process of allelic inheritance, the genetic disease model, the pedigree set, the inheritance vector and various types of genetic information. Next, we give an introduction to one-locus nonparametric linkage analysis focusing on significance calculations of nonparametric linkage (NPL) scores and, moreover, make some comments on the generalizations to two-locus procedures and the, related but contrasting, approach of parametric linkage analysis. In the third section we very briefly discuss some competing and complementary subfields within the context of statistical genetics and finally we put the papers included in this thesis into context by summarizing their content.

Performing gene mapping-studies through whole, or substantial parts of, the genome gives rise to interpretational problems according to multiple testing. The theme of the thesis is how to calculate significance levels and powers in several contexts of such kind.

In the first two papers one-locus NPL analysis, i.e. where one searches for one disease gene at a time, is considered. In Paper A existing analytical approximations of significance levels are improved and extended. The suggested formula is based on extreme-value theory for stochastic processes and a general link function between a continuous version of an arbitrary distribution function and the standard normal distribution function. In Paper B, in order to calculate significance levels, a new variant of weighted simulation for stochastic processes is developed. The method can handle complete as well as incomplete marker data and is very fast in relation to traditional methods of performing such simulations using Monte Carlo-based algorithms.

The last two papers are directed towards two-locus NPL analysis, i.e. where one is interested in diseases with genetic components based on two distinct (nonsyntenic) disease genes. In Paper C significance levels and powers using unconditional two-locus analysis, i.e. where one simultaneously searches for two disease genes, are derived and discussed for homogeneous pedigree sets based on units of affected sib-pairs. Finally, in Paper D, a general approach for calculation of significance levels and powers in conditional two-locus analysis is developed. The conditional approach might be seen as a hybrid of one-locus and two-locus NPL analysis. Of central importance to this paper is the concept of noncentrality parameters, which basically is the expected value of the test statistic of interest, i.e. the NPL score, under a corresponding instance of the alternative hypotheses.}},
  author       = {{Ängquist, Lars}},
  isbn         = {{978-91-628-7068-3}},
  keywords     = {{programming; actuarial mathematics; Statistik; cytogenetics; Genetik; cytogenetik; Mathematics; Matematik; Statistics; operations research; Genetics; ROC curves; conditioning loci; optimal score functions; noncentrality parameter; cost adjusted relative efficiency; exponential tilting; importance sampling; Monte Carlo simulation; normal approximation; crossover rate; process maximum; analytical approximation; significance calculations; NPL score; conditional linkage analysis; two-locus linkage analysis; Allele sharing; aktuariematematik; operationsanalys; programmering; gene-gene interaction; composite hypotheses; genetic disease models; classes of score functions; nonparametric linkage analysis}},
  language     = {{eng}},
  publisher    = {{Centre for Mathematical Sciences, Lund University}},
  school       = {{Lund University}},
  title        = {{Pointwise and Genomewide Significance Calculations in Gene Mapping through Nonparametric Linkage Analysis: Theory, Algorithms and Applications}},
  year         = {{2007}},
}

Lund University Publications
