Tools and annotations for variation

Schaafsma, Gerard C.P.

Tools and annotations for variation

Mark

Schaafsma, Gerard C.P. ^LU

(2017)

Abstract: Since the finishing of the Human Genome Project, many next-generation (NGS) or high-throughput sequencing platforms have emerged. One of the applications of NGS technology, variant discovery, can serve as a basis for precision medicine. Large sequencing projects are generating huge amounts of genetic variation data, which are stored in databases, either large central databases such as dbSNP, or gene- or disease-centered locus-specific databases (LSDBs). There are many variation databases with many different formats and varying quality. Apart from storage and analysis pipeline capacity problems, the interpretation of the variation is also an issue. Computational methods for predicting the effects of variants have been and are being... (More); Since the finishing of the Human Genome Project, many next-generation (NGS) or high-throughput sequencing platforms have emerged. One of the applications of NGS technology, variant discovery, can serve as a basis for precision medicine. Large sequencing projects are generating huge amounts of genetic variation data, which are stored in databases, either large central databases such as dbSNP, or gene- or disease-centered locus-specific databases (LSDBs). There are many variation databases with many different formats and varying quality. Apart from storage and analysis pipeline capacity problems, the interpretation of the variation is also an issue. Computational methods for predicting the effects of variants have been and are being developed, since experimental assessment of variation effects is often not feasible. Benchmark datasets are needed for the development and for performance assessment of such prediction methods.
We studied quality related aspects of variant databases and benchmark datasets. The online tool called VariOtator was developed to aid in the consistent use of the Variation Ontology, which was specifically developed to describe variation. Standardization is one aspect of database quality; the use of an ontology for variant annotation will contribute to the enhancement of it.
BTKbase is a locus-specific database containing information on variants in BTK, the gene involved in X-linked agammaglobulinemia (XLA), a primary immunodeficiency. If available, phenotypic data, i.e. the variant effects, are also provided. Statistics on variants and variation types showed that there is a wide spectrum of variants and variation types, and that the distribution of protein variants in the different BTK domains is not even.
The VariSNP database containing datasets with neutral (non-pathogenic) variants was generated by selecting variants from dbSNP and filtering for variants found in the ClinVar, PhenCode and SwissProt databases. Variants in these three databases are considered to be disease-related. The VariSNP database contains 13 datasets following the functional classification of dbSNP, and is updated on a regular basis.
To study the sensitivity to variation in different protein and disease groups, we predicted the pathogenicity of all possible single amino acid substitutions (SAASs) in all proteins in these groups, using the well-performing prediction method PON P2. Large differences in the proportions of harmful, benign and unknown variants were found, and distinctive patterns of SAAS types were found, both in the original and variant amino acids.
Representativeness is one quality aspect of variation benchmark datasets, and relates to the representation of the space of variants and their effects. We studied the coverage and distribution of protein features, including structure (CATH) and enzyme classification (EC), Pfam domains and Gene Ontology terms, in established benchmark datasets. None of the datasets is fully representative. Coverage of the features is in general better in the larger datasets, and better in the neutral datasets. At the higher levels of the CATH and EC classifications, all datasets were unbiased, but for the lower levels and other features, all datasets were biased. (Less)

Please use this url to cite or link to this publication: https://lup.lub.lu.se/record/60db6c14-c515-456e-9357-a366c6075d13

author

Schaafsma, Gerard C.P. ^LU

supervisor

Mauno Vihinen ^LU
Mattias Höglund ^LU

opponent

professor Keller, Andreas, Saarland University, Saarbrücken

organization

Protein Bioinformatics (research group)

publishing date

2017

type

Thesis

publication status

published

keywords

Annotation, genetic variation, benchmarks, databases, disease groups, pathogenicity, proteins, representativeness, sensitivity, variant effect analysis

pages

58 pages

publisher

Lund University, Faculty of Medicine

defense location

Dora Jacobsohn, BMC D15, Klinikgatan 32, Lund

defense date

2017-09-29 13:00:00

ISBN

978-91-7619-515-4

language

English

LU publication?

yes

additional info

ISSN: 1652-8220 Lund University, Faculty of Medicine Doctoral Dissertation Series 2017:132

id

60db6c14-c515-456e-9357-a366c6075d13

date added to LUP

2017-09-11 13:24:23

date last changed

2025-10-21 13:04:35

@phdthesis{60db6c14-c515-456e-9357-a366c6075d13,
  abstract     = {{Since the finishing of the Human Genome Project, many next-generation (NGS) or high-throughput sequencing platforms have emerged. One of the applications of NGS technology, variant discovery, can serve as a basis for precision medicine. Large sequencing projects are generating huge amounts of genetic variation data, which are stored in databases, either large central databases such as dbSNP, or gene- or disease-centered locus-specific databases (LSDBs). There are many variation databases with many different formats and varying quality. Apart from storage and analysis pipeline capacity problems, the interpretation of the variation is also an issue. Computational methods for predicting the effects of variants have been and are being developed, since experimental assessment of variation effects is often not feasible. Benchmark datasets are needed for the development and for performance assessment of such prediction methods.<br/>We studied quality related aspects of variant databases and benchmark datasets. The online tool called VariOtator was developed to aid in the consistent use of the Variation Ontology, which was specifically developed to describe variation. Standardization is one aspect of database quality; the use of an ontology for variant annotation will contribute to the enhancement of it.<br/>BTKbase is a locus-specific database containing information on variants in BTK, the gene involved in X-linked agammaglobulinemia (XLA), a primary immunodeficiency. If available, phenotypic data, i.e. the variant effects, are also provided. Statistics on variants and variation types showed that there is a wide spectrum of variants and variation types, and that the distribution of protein variants in the different BTK domains is not even.<br/>The VariSNP database containing datasets with neutral (non-pathogenic) variants was generated by selecting variants from dbSNP and filtering for variants found in the ClinVar, PhenCode and SwissProt databases. Variants in these three databases are considered to be disease-related. The VariSNP database contains 13 datasets following the functional classification of dbSNP, and is updated on a regular basis.<br/>To study the sensitivity to variation in different protein and disease groups, we predicted the pathogenicity of all possible single amino acid substitutions (SAASs) in all proteins in these groups, using the well-performing prediction method PON P2. Large differences in the proportions of harmful, benign and unknown variants were found, and distinctive patterns of SAAS types were found, both in the original and variant amino acids.<br/>Representativeness is one quality aspect of variation benchmark datasets, and relates to the representation of the space of variants and their effects. We studied the coverage and distribution of protein features, including structure (CATH) and enzyme classification (EC), Pfam domains and Gene Ontology terms, in established benchmark datasets. None of the datasets is fully representative. Coverage of the features is in general better in the larger datasets, and better in the neutral datasets. At the higher levels of the CATH and EC classifications, all datasets were unbiased, but for the lower levels and other features, all datasets were biased.}},
  author       = {{Schaafsma, Gerard C.P.}},
  isbn         = {{978-91-7619-515-4}},
  keywords     = {{Annotation, genetic variation, benchmarks, databases, disease groups, pathogenicity, proteins, representativeness, sensitivity, variant effect analysis}},
  language     = {{eng}},
  publisher    = {{Lund University, Faculty of Medicine}},
  school       = {{Lund University}},
  title        = {{Tools and annotations for variation}},
  url          = {{https://lup.lub.lu.se/search/files/31065430/Gerard_Schaafsma_incl_Cover.pdf}},
  year         = {{2017}},
}

Lund University Publications

LUND UNIVERSITY LIBRARIES

Tools and annotations for variation