
LUP Student Papers

LUND UNIVERSITY LIBRARIES

PhyloPyPruner: Tree-based Orthology Inference for Phylogenomics with New Methods for Identifying and Excluding Contamination

Thalén, Felix (2018) BINP52 20181
Degree Projects in Bioinformatics
Abstract
Motivation: Large-scale phylogenetic analyses rely on orthology inference to curate sets of sequences related by speciation rather than gene duplication. Graph-based ‘orthology’ inference approaches cluster sequences together based on an all-versus-all BLAST, followed by filtering by hit fraction, Markov clustering, or both, but the output of such approaches often contains paralogous sequences. Tree-based orthology inference approaches solve this problem by inferring orthology based on phylogenetic trees generated from the output of such graph-based approaches. Unfortunately, contaminant sequences present in even a single taxon can cause such approaches to erroneously infer paralogy and unnecessarily exclude many sequences.
Results: We present PhyloPyPruner, a Python package for tree-based orthology inference that refines the output of a graph-based approach. The program builds on previous tools and implements new methods for identifying and removing contamination-like sequences. Our novel paralogy frequency (PF) metric is the number of inferred paralogs for a given operational taxonomic unit (OTU) divided by the number of alignments in which that OTU is present. Contamination typically results in two or more sequences from an OTU that do not form a clade in a single-gene tree, which tree-based orthology inference algorithms interpret as paralogy, so visualizing PF can help to identify OTUs with contamination. PhyloPyPruner can be configured to automatically remove OTUs with high PF relative to other OTUs. In addition, where two or more sequences from a single OTU are present, these sequences can be removed if the maximum pairwise distance between sequences from the OTU is relatively high compared to the average pairwise distance to sequences outside of the OTU. Further, two or more sequences from different OTUs with a zero (or nearly zero) pairwise distance may be the result of cross-contamination and can be removed on a per-alignment basis, based on a user-defined threshold. Finally, taxon jackknifing excludes OTUs one by one during tree-based orthology inference. This enables the user to identify taxa whose exclusion improves metrics of supermatrix quality such as the number of alignments retained or percent missing data. We demonstrate the utility of PhyloPyPruner by running it on three phylogenomic datasets, recovering more genes suitable for phylogeny reconstruction, while reducing missing data and identifying and reducing contamination.
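The paralogy frequency definition above can be illustrated with a minimal sketch. This is not PhyloPyPruner's own code: the function names, the input dictionaries, and the OTU labels are hypothetical, and the "mean plus a multiple of the standard deviation" cutoff is only one assumed reading of "high PF relative to other OTUs".

```python
from statistics import mean, pstdev


def paralogy_frequency(paralog_counts, alignment_counts):
    """PF per OTU: inferred paralogs divided by alignments containing the OTU.

    Both arguments are hypothetical dictionaries keyed by OTU name.
    """
    return {
        otu: paralog_counts.get(otu, 0) / n_alignments
        for otu, n_alignments in alignment_counts.items()
        if n_alignments > 0
    }


def flag_high_pf(pf, factor=1.0):
    """Flag OTUs whose PF exceeds mean + factor * standard deviation.

    The cutoff rule is an assumption made for illustration only.
    """
    values = list(pf.values())
    cutoff = mean(values) + factor * pstdev(values)
    return sorted(otu for otu, value in pf.items() if value > cutoff)


# Made-up counts: OTU "C" has many inferred paralogs for the number of
# alignments it appears in, as a contaminated sample might.
paralogs = {"A": 2, "B": 3, "C": 40, "D": 1}
alignments = {"A": 100, "B": 100, "C": 80, "D": 50}

pf = paralogy_frequency(paralogs, alignments)
flagged = flag_high_pf(pf, factor=1.0)
print(pf)
print(flagged)
```

With these invented counts, OTU "C" has a PF of 40/80 = 0.5 while the others stay at or below 0.03, so it is the only OTU above the cutoff.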
Popular Abstract
Finding the Needle in Hundreds of Haystacks Full of Fake Needles

Have you ever thought about how differences and similarities between languages reflect their relationship to one another? For example, the number three is similar in multiple languages (Swedish: tre, French: trois, Spanish: tres), suggesting that all of these variations originated from the same word. Also, the word ‘we’ is more similar across Germanic languages (Swedish: vi, German: wir) than it is between Germanic and Romance languages (French: nous, Spanish: nosotros, and Italian: noi). Such examples show that Germanic languages are more closely related to one another than they are to Romance languages. Scientists exploit the same kind of similarities in DNA all the time to draw conclusions about the relationships among different species, but how do they do it?

Although unreadable by us humans, the ‘code’ in which DNA is written – adenine (A), cytosine (C), guanine (G) and thymine (T) – is more similar between species that are closely related than between those that are more distantly related. Genes are strings of DNA that perform a specific function. Some of these are crucial to our survival – such as the genes for converting the energy stored in the food we eat into a type of energy that our bodies can use – and so these tend to stay the same over time. Others (those that encode the color of your hair, for example) are more prone to change. Because genes are passed from one generation to the next, we can derive the relationships between different species by comparing differences and similarities in the same gene (just like we compared the same word in different languages).

Genes can make copies of themselves, and these different ‘versions’ of the same gene may diverge over time and start to perform other functions. When comparing genes from different species, we must make sure that we are not comparing different copies of a gene. Another problem is that the way in which we sample species and extract DNA is imperfect, and sometimes DNA from our species of interest is contaminated with a small amount of DNA from another organism. For example, sometimes when we ‘sequence’ DNA, we obtain data not just from the animal of interest, but from its gut contents, microorganisms growing on it, or parasites living inside of it. This problem is magnified because the amount of information generated by today's machines for reading DNA is so large that it is no longer possible for scientists to manually verify each sample. If we want to investigate the relationship between species, how do we know that our samples are free of these issues?

Computer programs exist for finding genes that are similar in different species and some can even tell gene copies apart from non-copies. Few programs, however, exist for getting rid of mislabeled or mixed up DNA from different species. We have developed an entirely new program for telling gene copies apart from non-copied genes and getting rid of non-target DNA. Read more and get your copy from https://gitlab.com/fethalen/phylopypruner.

Master’s Degree Project in Biology/Molecular Biology/Bioinformatics 60 credits 2018
Department of Biology, Lund University

Advisor: Kevin M. Kocot (1, 2)
1. Department of Biological Sciences, University of Alabama, Tuscaloosa, AL, 35487, U.S.A.
2. Alabama Museum of Natural History, University of Alabama, Tuscaloosa, AL, 35487, U.S.A.
author
Thalén, Felix
supervisor
organization
course
BINP52 20181
year
type
H2 - Master's Degree (Two Years)
subject
language
English
id
8963554
date added to LUP
2018-11-28 13:51:26
date last changed
2018-11-28 13:51:26
@misc{8963554,
  abstract     = {{Motivation: Large-scale phylogenetic analyses rely on orthology inference to curate sets of sequences related by speciation rather than gene duplication. Graph-based ‘orthology’ inference approaches cluster sequences together based on an all-versus-all BLAST, followed by filtering by hit fraction, Markov clustering, or both, but the output of such approaches often contains paralogous sequences. Tree-based orthology inference approaches solve this problem by inferring orthology based on phylogenetic trees generated from the output of such graph-based approaches. Unfortunately, contaminant sequences present in even a single taxon can cause such approaches to erroneously infer paralogy and unnecessarily exclude many sequences.
Results: We present PhyloPyPruner, a Python package for tree-based orthology inference that refines the output of a graph-based approach. The program builds on previous tools in addition to implementing new methods for differentiating and removing contamination-like sequences. Our novel paralogy frequency (PF) metric calculates the number of inferred paralogs for a given operational taxonomic unit (OTU) divided by the number of alignments in which that OTU is present. Because contamination typically results in two or more sequences from an OTU that do not form a clade in a single-gene tree, which is interpreted as paralogy by tree-based orthology inference algorithms, visualizing PF can help to identify OTUs with contamination. PhyloPyPruner can be configured to automatically remove OTUs with high PF relative to other OTUs. In addition, in cases where two or more sequences from a single OTU are present, these sequences can be removed if the maximum pairwise distance between sequences from the OTU is relatively high compared to the average pairwise distance to sequences outside of the OTU. Further, two or more sequences from different OTUs with a zero (or nearly zero) pairwise distance may be the result of cross-contamination and can be removed on a per-alignment basis, based on a user-defined threshold. Finally, taxon jackknifing excludes OTUs one by one during tree-based orthology inference. This enables the user to identify taxa whose exclusion improves metrics of supermatrix quality such as the number of alignments retained or percent missing data. We demonstrate the utility of PhyloPyPruner by running it on three phylogenomic datasets, recovering more genes suitable for phylogeny reconstruction, while reducing missing data and identifying and reducing contamination.}},
  author       = {{Thalén, Felix}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{PhyloPyPruner: Tree-based Orthology Inference for Phylogenomics with New Methods for Identifying and Excluding Contamination}},
  year         = {{2018}},
}