Turning vice into virtue : Using Batch-Effects to Detect Errors in Large Genomic Datasets

Mafessoni, Fabrizio; Prasad, Rashmi B; Groop, Leif; Hansson, Ola; Prüfer, Kay

Mafessoni, Fabrizio; Prasad, Rashmi B; Groop, Leif; Hansson, Ola and Prüfer, Kay (2018) In Genome Biology and Evolution 10(10). p.2697-2708

Abstract: It is often unavoidable to combine data from different sequencing centers or sequencing platforms when compiling datasets with a large number of individuals. However, the different data are likely to contain specific systematic errors that will appear as SNPs. Here, we devise a method to detect systematic errors in combined datasets. To measure quality differences between individual genomes, we study pairs of variants that reside on different chromosomes and co-occur in individuals. The abundance of these pairs of variants in different genomes is then used to detect systematic errors due to batch effects. Applying our method to the 1000 Genomes dataset, we find that coding regions are enriched for errors, where about 1% of the higher-frequency variants are predicted to be erroneous, whereas errors outside of coding regions are much rarer (<0.001%). As expected, predicted errors are found less often than other variants in a dataset that was generated with a different sequencing technology, indicating that many of the candidates are indeed errors. However, predicted 1000 Genomes errors are also found in other large datasets; our observation is thus not specific to the 1000 Genomes dataset.
Our results show that batch effects can be turned into a virtue by using the resulting variation in large scale datasets to detect systematic errors.
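
The idea described in the abstract, counting pairs of variants that lie on different chromosomes and co-occur in the same individuals, can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`cooccurring_pairs`, `pair_burden`), the `min_shared` threshold, and the example data are all hypothetical, and the statistical testing used in the paper is not reproduced here.

```python
from collections import Counter
from itertools import combinations


def cooccurring_pairs(carriers, min_shared=2):
    """Return inter-chromosomal variant pairs shared by >= min_shared individuals.

    carriers: dict mapping an individual ID to the set of variants it carries,
    each variant written as a (chromosome, position) tuple.
    min_shared is an illustrative threshold, not a value from the paper.
    """
    pair_counts = Counter()
    for variants in carriers.values():
        # sorted() gives every pair a canonical order across individuals
        for a, b in combinations(sorted(variants), 2):
            if a[0] != b[0]:  # keep only pairs on different chromosomes
                pair_counts[(a, b)] += 1
    return {pair for pair, n in pair_counts.items() if n >= min_shared}


def pair_burden(carriers, pairs):
    """Count, per individual, how many of the given pairs it carries in full."""
    return {
        ind: sum(1 for a, b in pairs if a in variants and b in variants)
        for ind, variants in carriers.items()
    }


# Hypothetical toy data: four individuals and a handful of variants.
carriers = {
    "ind1": {("chr1", 100), ("chr2", 200), ("chr3", 300)},
    "ind2": {("chr1", 100), ("chr2", 200)},
    "ind3": {("chr3", 300)},
    "ind4": {("chr1", 100), ("chr1", 150)},
}

pairs = cooccurring_pairs(carriers)
burden = pair_burden(carriers, pairs)
print(pairs)
print(burden)
```

In this sketch, individuals with an unusually high burden relative to the rest of the dataset would be the ones to inspect for batch-specific errors; how "unusually high" is defined is left open here.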

Please use this url to cite or link to this publication: https://lup.lub.lu.se/record/3aeb382c-fa6d-4304-b442-679808487ac4

author

Mafessoni, Fabrizio; Prasad, Rashmi B; Groop, Leif; Hansson, Ola and Prüfer, Kay

publishing date

2018-09-10

type

Contribution to journal

publication status

published

in

Genome Biology and Evolution

volume

10

issue

10

pages

2697 - 2708

publisher

Oxford University Press

external identifiers

scopus:85054893761
pmid:30204860

ISSN

1759-6653

DOI

10.1093/gbe/evy199

language

English

LU publication?

yes

id

3aeb382c-fa6d-4304-b442-679808487ac4

date added to LUP

2018-09-14 17:55:41

date last changed

2026-01-08 23:21:52

@article{3aeb382c-fa6d-4304-b442-679808487ac4,
  abstract     = {{It is often unavoidable to combine data from different sequencing centers or sequencing platforms when compiling datasets with a large number of individuals. However, the different data are likely to contain specific systematic errors that will appear as SNPs. Here, we devise a method to detect systematic errors in combined datasets. To measure quality differences between individual genomes, we study pairs of variants that reside on different chromosomes and co-occur in individuals. The abundance of these pairs of variants in different genomes is then used to detect systematic errors due to batch effects. Applying our method to the 1000 Genomes dataset, we find that coding regions are enriched for errors, where about 1% of the higher-frequency variants are predicted to be erroneous, whereas errors outside of coding regions are much rarer (<0.001%). As expected, predicted errors are found less often than other variants in a dataset that was generated with a different sequencing technology, indicating that many of the candidates are indeed errors. However, predicted 1000 Genomes errors are also found in other large datasets; our observation is thus not specific to the 1000 Genomes dataset. Our results show that batch effects can be turned into a virtue by using the resulting variation in large scale datasets to detect systematic errors.}},
  author       = {{Mafessoni, Fabrizio and Prasad, Rashmi B and Groop, Leif and Hansson, Ola and Prüfer, Kay}},
  issn         = {{1759-6653}},
  language     = {{eng}},
  month        = {{09}},
  number       = {{10}},
  pages        = {{2697--2708}},
  publisher    = {{Oxford University Press}},
  journal      = {{Genome Biology and Evolution}},
  title        = {{Turning vice into virtue : Using Batch-Effects to Detect Errors in Large Genomic Datasets}},
  url          = {{http://dx.doi.org/10.1093/gbe/evy199}},
  doi          = {{10.1093/gbe/evy199}},
  volume       = {{10}},
  year         = {{2018}},
}

Lund University Publications
