LUP Student Papers

LUND UNIVERSITY LIBRARIES

Pipeline for metagenomics-based population genomics

Delgado, Fernando (2020) BINP51 20201
Degree Projects in Bioinformatics
Abstract
Bacterioplankton are a fundamental component of the marine ecosystem. They contribute significantly to primary production and carbon fixation, and to the recycling of nutrients and elements. With shotgun metagenomics and genomic binning, the genomes of individual species can be assembled as Metagenome-Assembled Genomes (MAGs) without cultivation. By analysing the genomes, one can gain insights into the functional capabilities of the organisms. The genomes can also be used to study how the populations of individual species are genetically structured in time and space and in relation to various environmental factors. When metagenomics is carried out on an environmental sample, allele frequencies of single-nucleotide polymorphisms (SNPs) can be obtained by mapping the reads to a reference genome. By analysing several samples, SNP patterns can be compared between them and used to infer population structures. Recently, a computer program, POGENOM, was developed that estimates several population genetic parameters for a genome in relation to a set of samples. In this thesis, a pipeline named Input_POGENOM was developed that automatically generates the required input to POGENOM. The pipeline increases the reproducibility of the data analysis and simplifies the use of POGENOM. Its performance was validated by completing a comparative study of genome-level diversity and differentiation of MAGs from the Baltic Sea, using metagenomic data from samples spanning various environmental gradients of the Baltic Sea and three samples from the Caspian Sea.
Popular Abstract
Pipeline for metagenomics-based population genomics

Each litre of seawater contains around 1 billion microscopic bacterioplankton. Although invisible, these organisms are fundamental components of the marine ecosystem by contributing significantly to the primary production and carbon fixation, and by recycling nutrients and elements. Most bacterioplankton are very hard to cultivate. Still, by using shotgun metagenomics and bioinformatic tools, one can assemble the genomes of individual species without cultivation, and by analysing their genomes, one can gain insights into their functional capabilities. The genomes can also be used to study how the populations of individual species are structured in time and space and in relation to various environmental factors.

When metagenomics is conducted on an environmental sample, the shotgun reads are derived from thousands to millions of individual cells of each species. By mapping the reads to a reference genome, allele frequencies of single-nucleotide polymorphisms (SNPs) within the sample can be obtained. By analysing several samples, SNP patterns can be compared among them and used to infer population structures. A computer program, POGENOM, was recently developed in the Andersson group that calculates several population genetic parameters for a genome in relation to a set of samples. As minimal input, POGENOM takes a file in the Variant Call Format (VCF). This input file is generated by mapping one or several metagenomic samples (paired-read files) against a reference genome with a read aligner, and then calling variants with a variant caller. To increase the reproducibility of the data analysis and to simplify the use of POGENOM, a pipeline for generating the required input is needed, especially for large datasets. The aims of this project were: (1) to develop a pipeline that generates the input files for POGENOM, and (2) to validate the pipeline's performance by completing a comparative population study of genome-level diversity and differentiation of 43 Metagenome-Assembled Genomes (MAGs) from the Baltic Sea, using metagenomic data from 65 samples spanning various environmental gradients (e.g., salinity and temperature) of the Baltic Sea, and 3 samples from the Caspian Sea.
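
To make the idea of per-sample allele frequencies concrete, the following is a minimal Python sketch. It is not taken from POGENOM or Input_POGENOM: the function name `allele_frequencies`, the sample names, and the read counts are hypothetical, and the sketch simply assumes that the number of mapped reads supporting each allele at a SNP position is already known for each sample.

```python
from typing import Dict, List


def allele_frequencies(allele_depths: List[int]) -> List[float]:
    """Return the frequency of each allele, given the read count per allele."""
    total = sum(allele_depths)
    if total == 0:
        raise ValueError("no reads cover this position")
    return [count / total for count in allele_depths]


# Hypothetical read counts at one SNP position:
# [reads supporting the reference allele, reads supporting the alternative allele]
counts_by_sample: Dict[str, List[int]] = {
    "sample_A": [30, 10],
    "sample_B": [5, 15],
}

# Allele frequencies per sample; differences between samples at many such
# positions are the kind of signal used to infer population structure.
freqs = {name: allele_frequencies(depths) for name, depths in counts_by_sample.items()}

for name, values in freqs.items():
    print(name, values)
```

In this made-up example the alternative allele is the minority in one sample and the majority in the other, which illustrates how SNP patterns can differ between samples.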

The developed pipeline, Input_POGENOM, can produce the required input files for POGENOM for multiple genomes in parallel. For each genome, it bases the variant calling only on those metagenome samples whose coverage depth and breadth exceed user-specified values. The pipeline can also run a quick pre-screening step in which only a subset of the reads from each sample is mapped. From this subset it estimates the coverage of each sample and determines which samples should be included. This pre-screening step aims to reduce the analysis runtime significantly by avoiding the full mapping of irrelevant samples. For the mapped samples, the pipeline can down-sample to a target median coverage to avoid biases due to uneven coverage. Parameters can be easily modified in the pipeline configuration file. Additionally, the pipeline has been fully documented and is publicly available.
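
The sample selection and down-sampling logic described above can be illustrated with a short Python sketch. This is an assumption-laden illustration and not the Input_POGENOM implementation: the function names, the use of median per-position depth as the depth measure, the definition of breadth as the fraction of positions covered by at least one read, the use of "greater than or equal to" for the thresholds, and all numbers are hypothetical.

```python
from statistics import median
from typing import Dict, List, Tuple


def coverage_stats(depths: List[int]) -> Tuple[float, float]:
    """Return (median depth, breadth) for a list of per-position read depths.

    Breadth is taken here as the fraction of positions with depth > 0
    (an assumption made for this sketch).
    """
    if not depths:
        raise ValueError("empty depth list")
    breadth = sum(1 for d in depths if d > 0) / len(depths)
    return median(depths), breadth


def select_samples(
    depths_by_sample: Dict[str, List[int]], min_depth: float, min_breadth: float
) -> List[str]:
    """Keep only samples that reach both thresholds (>= is an assumption)."""
    selected = []
    for name, depths in depths_by_sample.items():
        med, breadth = coverage_stats(depths)
        if med >= min_depth and breadth >= min_breadth:
            selected.append(name)
    return selected


def downsample_fraction(median_depth: float, target_median: float) -> float:
    """Fraction of reads to keep so that the median depth approaches the target."""
    if median_depth <= 0:
        raise ValueError("median depth must be positive")
    return min(1.0, target_median / median_depth)


# Hypothetical per-position depths for one genome in two samples.
depths_by_sample = {
    "s1": [20, 22, 18, 0, 21],
    "s2": [1, 0, 0, 2, 0],
}

kept = select_samples(depths_by_sample, min_depth=10, min_breadth=0.5)
fraction = downsample_fraction(coverage_stats(depths_by_sample["s1"])[0], target_median=10)
print(kept, fraction)
```

With these made-up values only `s1` passes both thresholds, and half of its reads would be retained to approach a target median depth of 10. A pre-screening step could apply the same selection to depths estimated from a subset of reads before any full mapping is done.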

In summary, we developed a pipeline that considerably simplifies the use of POGENOM and increases the reproducibility of the data analysis. We demonstrated its utility by revealing population structures that correlate with environmental factors in Baltic Sea bacteria.

Master’s Degree Project in Bioinformatics BINP50 credits 30.
Department of Biology, Lund University
Advisor: Anders F. Andersson
KTH Royal Institute of Technology, School of Engineering Sciences in Chemistry, Biotechnology and Health, Department of Gene Technology, Science for Life Laboratory, Stockholm, Sweden.
author: Delgado, Fernando
course: BINP51 20201
year: 2020
type: H2 - Master's Degree (Two Years)
language: English
id: 9039601
date added to LUP: 2021-02-05 11:22:40
date last changed: 2021-02-05 11:22:40
@misc{9039601,
  abstract     = {{Bacterioplankton is a fundamental component of the marine ecosystem. They contribute significantly to the primary production and carbon fixation, and to the nutrients and elements recycling. With shotgun metagenomics and genomic binning, the genomes of individual species can be assembled (Metagenome-Assembled Genomes - MAGs) without cultivation. By analysing the genomes, one can gain insights into the functional capabilities of the organisms. The genomes can also be used to study how the populations of individual species are genetically structured in time and space and in relation to various environmental factors. When metagenomics is carried out on an environmental sample, allele frequencies of single-nucleotide polymorphisms (SNPs) can be obtained, by mapping the reads to a reference genome. By analysing several samples, SNP patterns between these can be compared and used to infer population structures. Recently, a computer program POGENOM was developed that estimates several population genetic parameters for a genome in relation to a set of samples. In this thesis, a pipeline for automatically generating the required input to POGENOM was developed, named Input_POGENOM. The pipeline increases the reproducibility of the data analysis and simplifies the use of POGENOM. Its performance was validated by completing a comparative study of genome-level diversity and differentiation of MAGs of the Baltic Sea, using metagenomic data from samples spanning various environmental gradients of the Baltic Sea and three samples of the Caspian Sea.}},
  author       = {{Delgado, Fernando}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{Pipeline for metagenomics-based population genomics}},
  year         = {{2020}},
}