Lund University Publications

LUND UNIVERSITY LIBRARIES

Principal Component Analyses (PCA)-based findings in population genetic studies are highly biased and must be reevaluated

Elhaik, Eran (2022) In Scientific Reports 12.
Abstract

Principal Component Analysis (PCA) is a multivariate analysis that reduces the complexity of datasets while preserving data covariance. The outcome can be visualized on colorful scatterplots, ideally with only a minimal loss of information. PCA applications, implemented in well-cited packages like EIGENSOFT and PLINK, are extensively used as the foremost analyses in population genetics and related fields (e.g., animal and plant or medical genetics). PCA outcomes are used to shape study design, identify, and characterize individuals and populations, and draw historical and ethnobiological conclusions on origins, evolution, dispersion, and relatedness. The replicability crisis in science has prompted us to evaluate whether PCA results are reliable, robust, and replicable. We analyzed twelve common test cases using an intuitive color-based model alongside human population data. We demonstrate that PCA results can be artifacts of the data and can be easily manipulated to generate desired outcomes. PCA adjustment also yielded unfavorable outcomes in association studies. PCA results may not be reliable, robust, or replicable as the field assumes. Our findings raise concerns about the validity of results reported in the population genetics literature and related fields that place a disproportionate reliance upon PCA outcomes and the insights derived from them. We conclude that PCA may have a biasing role in genetic investigations and that 32,000-216,000 genetic studies should be reevaluated. An alternative mixed-admixture population genetic model is discussed.
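
For readers unfamiliar with the method, the following is a minimal sketch of PCA itself, not the author's pipeline and not the EIGENSOFT or PLINK implementation. It assumes a hypothetical toy sample of RGB colors with arbitrary, uneven group sizes, and it uses plain power iteration with deflation as a stand-in for the eigendecomposition a real package would perform. All function names, the starting vector, and the sample counts are illustrative assumptions.

```python
import math

# Hypothetical toy sample: each "individual" is an RGB color. The uneven
# counts (3 red, 2 green, 1 blue, 1 black) are arbitrary illustration values.
SAMPLE = (
    [[1.0, 0.0, 0.0]] * 3    # red
    + [[0.0, 1.0, 0.0]] * 2  # green
    + [[0.0, 0.0, 1.0]]      # blue
    + [[0.0, 0.0, 0.0]]      # black
)


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def center(rows):
    """Subtract the per-column mean from every row."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Sample covariance matrix (d x d) of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_eigenpair(matrix, iterations=1000):
    """Leading eigenpair of a symmetric matrix via power iteration.

    The asymmetric starting vector is an assumption chosen so that it is
    unlikely to be orthogonal to the leading eigenvector.
    """
    v = [1.0, 0.5, 0.25][:len(matrix)]
    for _ in range(iterations):
        w = [dot(row, v) for row in matrix]
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient v^T M v gives the eigenvalue for the unit vector v.
    eigenvalue = dot(v, [dot(row, v) for row in matrix])
    return eigenvalue, v


def deflate(matrix, eigenvalue, v):
    """Remove one found component so the next largest can be extracted."""
    d = len(matrix)
    return [[matrix[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
            for i in range(d)]


centered = center(SAMPLE)
cov = covariance(centered)
lam1, pc1 = top_eigenpair(cov)
lam2, pc2 = top_eigenpair(deflate(cov, lam1, pc1))

# Coordinates of each individual on the first two principal components,
# i.e. the values that would be drawn on a two-dimensional PCA scatterplot.
scores = [[dot(row, pc1), dot(row, pc2)] for row in centered]

total_variance = sum(cov[i][i] for i in range(len(cov)))
print("share of variance on PC1 and PC2:", (lam1 + lam2) / total_variance)
```

Because the principal components are eigenvectors of the sample covariance matrix, changing the counts in `SAMPLE` changes that matrix and therefore the axes onto which every individual is projected.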
author
organization
publishing date
type
Contribution to journal
publication status
published
subject
keywords
Algorithms, Animals, Artifacts, Genetics, Population, Humans, Principal Component Analysis
in
Scientific Reports
volume
12
article number
14683
publisher
Nature Publishing Group
external identifiers
  • pmid:36038559
  • scopus:85136868348
ISSN
2045-2322
DOI
10.1038/s41598-022-14395-4
language
English
LU publication?
yes
additional info
© 2022. The Author(s).
id
fb257946-2b4c-464a-9746-7228357fa685
date added to LUP
2022-09-11 01:50:03
date last changed
2024-04-18 09:50:08
@article{fb257946-2b4c-464a-9746-7228357fa685,
  abstract     = {{Principal Component Analysis (PCA) is a multivariate analysis that reduces the complexity of datasets while preserving data covariance. The outcome can be visualized on colorful scatterplots, ideally with only a minimal loss of information. PCA applications, implemented in well-cited packages like EIGENSOFT and PLINK, are extensively used as the foremost analyses in population genetics and related fields (e.g., animal and plant or medical genetics). PCA outcomes are used to shape study design, identify, and characterize individuals and populations, and draw historical and ethnobiological conclusions on origins, evolution, dispersion, and relatedness. The replicability crisis in science has prompted us to evaluate whether PCA results are reliable, robust, and replicable. We analyzed twelve common test cases using an intuitive color-based model alongside human population data. We demonstrate that PCA results can be artifacts of the data and can be easily manipulated to generate desired outcomes. PCA adjustment also yielded unfavorable outcomes in association studies. PCA results may not be reliable, robust, or replicable as the field assumes. Our findings raise concerns about the validity of results reported in the population genetics literature and related fields that place a disproportionate reliance upon PCA outcomes and the insights derived from them. We conclude that PCA may have a biasing role in genetic investigations and that 32,000-216,000 genetic studies should be reevaluated. An alternative mixed-admixture population genetic model is discussed.}},
  author       = {{Elhaik, Eran}},
  issn         = {{2045-2322}},
  keywords     = {{Algorithms; Animals; Artifacts; Genetics, Population; Humans; Principal Component Analysis}},
  language     = {{eng}},
  month        = {{08}},
  publisher    = {{Nature Publishing Group}},
  series       = {{Scientific Reports}},
  title        = {{Principal Component Analyses (PCA)-based findings in population genetic studies are highly biased and must be reevaluated}},
  url          = {{http://dx.doi.org/10.1038/s41598-022-14395-4}},
  doi          = {{10.1038/s41598-022-14395-4}},
  volume       = {{12}},
  year         = {{2022}},
}