
LUP Student Papers

LUND UNIVERSITY LIBRARIES

SubKluster: Novel method to bin scaffolds from cereal genomes into subgenomes using substring frequency analysis

Kalbskopf, Victor (2023) BINP50 20182
Degree Projects in Bioinformatics
Abstract
The genome of the Belinda variety of hexaploid oat (Avena sativa) has recently been sequenced and assembled. This project aims to improve the assembly by clustering the thousands of scaffolds into their three ancestral subgenomes using Principal Component Analysis (PCA) of k-mer and repeat-element frequencies. The method was developed using a chromosome-level assembly of hexaploid wheat (Triticum aestivum), whose sequences formed highly distinguishable clusters in the PCA plot that matched the true subgenomes, indicating that the method has merit. The longest oat scaffolds, which together make up 90% of the genome (N90), were processed in the same manner, which resulted in 2 clusters: one with about one third of the 3-copy BUSCOs (Benchmarking Universal Single-Copy Orthologs) and another with two thirds. The latter cluster could then be subdivided into two clusters, with about half of the 2-copy BUSCOs in each cluster. A 1:1:1 ratio of BUSCOs across the clusters would indicate that the subgenomes are separating into their respective clusters. The clustering is not as clear as in the wheat example, but the length of the scaffolds or the state of the assembly may have a very large effect on the efficacy of the method. It is hoped that this method, with additional improvements, could be used to assess the assemblies of other large polyploid genomes and be part of a larger pipeline for understanding crop genome evolution.
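
As a rough illustration of the idea in the abstract, the sketch below counts normalized k-mer frequencies per scaffold and projects the scaffolds onto their first principal component. It is not the SubKluster implementation: the function names, the choice of k = 2, the toy sequences, and the use of power iteration in place of a full PCA library are all assumptions made for this example.

```python
from collections import Counter
from itertools import product
import math


def kmer_frequencies(seq, k=2):
    """Normalized k-mer frequencies over the ACGT alphabet, in a fixed order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[km] for km in kmers) or 1  # avoid division by zero
    return [counts[km] / total for km in kmers]


def pc1_scores(rows, iterations=100):
    """Score each row on the first principal component (power iteration)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Start from the centered row with the largest norm so the starting
    # vector lies in the row space of the data.
    v = list(max(centered, key=lambda r: sum(x * x for x in r)))
    for _ in range(iterations):
        proj = [sum(x * w for x, w in zip(row, v)) for row in centered]
        v = [sum(proj[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(w * w for w in v))
        if norm == 0:  # all rows identical: no variance to explain
            return [0.0] * n
        v = [w / norm for w in v]
    return [sum(x * w for x, w in zip(row, v)) for row in centered]


# Toy scaffolds (hypothetical data): two are rich in AG repeats, two in CT repeats.
scaffolds = {
    "scaffold_1": "AG" * 50 + "ACGT" * 5,
    "scaffold_2": "AG" * 40 + "ACGT" * 10,
    "scaffold_3": "CT" * 50 + "ACGT" * 5,
    "scaffold_4": "CT" * 40 + "ACGT" * 10,
}
features = [kmer_frequencies(seq) for seq in scaffolds.values()]
scores = pc1_scores(features)
for name, score in zip(scaffolds, scores):
    print(f"{name}\t{score:+.3f}")
```

Scaffolds with similar repeat content should land on the same side of zero on the first component; the sign itself is arbitrary. Real scaffolds would call for larger k, more components, and a clustering step on top of the projection.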
Popular Abstract
Too many puzzle pieces: Oats has a messy genome

Imagine putting together a puzzle where all the puzzle pieces look very similar. And it has 12 billion pieces. And each piece has a copy. And someone mixed in two more puzzle sets that are slightly different. So you have 72 billion very similar puzzle pieces which will make 6 slightly different puzzles when assembled. Yes, even computers struggle with this. Which is why it’s taking so long to sequence the oat genome. Yes, the cereal you eat. We’re struggling to know what’s going on in its genome because there are 6 copies and repeating puzzle pieces.

When we try to sequence it, we get only little parts where we think we’ve managed to put together a few thousand pieces into a fragment here or there. But we don’t know which of the 6 puzzles (genomes) each fragment belongs to. Once we know that, we can connect up the fragments into larger parts to form the right puzzle.

So I looked for patterns in the puzzle pieces, and found too many. Literally millions. Some of the patterns were the same 2 letters repeated over and over, for example AGAGAGAG… However, I found a pattern in these patterns. They turned up more or less often depending on which puzzle they belonged to.

I tested whether this would work on the wheat genome, which is even larger and also comes in 6 copies. We have managed to sequence wheat quite well, so I chopped up the puzzles that make up wheat into fragments about the size of the oat fragments, but kept track of which puzzle each one belonged to. Then I counted how often the patterns occurred in each fragment, and saw a beautiful pattern. The patterns formed signatures that identified their parent puzzle.
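
For readers who like to see the idea in code, here is a small sketch of the wheat test: chop labelled sequences into equal-sized fragments, remember which subgenome each fragment came from, and count a repeat pattern in each one. The function names, the toy sequences, and the fragment length are made up for illustration; only the A, B and D subgenome labels reflect real wheat.

```python
def fragment_with_labels(chromosomes, fragment_length):
    """Split each sequence into fixed-size fragments, keeping its subgenome label.

    `chromosomes` maps a name to a (subgenome_label, sequence) pair.
    A trailing piece shorter than `fragment_length` is dropped.
    """
    fragments = []
    for name, (label, seq) in chromosomes.items():
        for start in range(0, len(seq), fragment_length):
            piece = seq[start:start + fragment_length]
            if len(piece) == fragment_length:
                fragments.append((f"{name}:{start}", label, piece))
    return fragments


def count_motif(seq, motif):
    """Count occurrences of `motif` in `seq`, allowing overlaps."""
    width = len(motif)
    return sum(1 for i in range(len(seq) - width + 1) if seq[i:i + width] == motif)


# Toy stand-ins for wheat chromosomes (hypothetical sequences).
toy_chromosomes = {
    "chr1A": ("A", "AGAGAGAGAGAG"),
    "chr1B": ("B", "CTCTAGCTCTCT"),
    "chr1D": ("D", "GGAGGGAGGGAG"),
}
for frag_id, label, piece in fragment_with_labels(toy_chromosomes, 6):
    print(frag_id, label, count_motif(piece, "AG"))
```

Because every fragment carries its label, any clustering of the fragments can be checked against the known answer before the same counting is applied to unlabelled oat scaffolds.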

I ran into problems though. Some of the oats fragments were very short, so the signatures couldn’t be found. And over the millions of years that Oats has evolved, many puzzles have swapped fragments, which means their signatures got mixed up too. However, I also know that all plants should have certain genes just to live, and I know that each puzzle should have 1 copy of those genes, so I could use that information to get the signatures back.

Thanks to machine learning, this process can be made much faster and more accurate, and we’re also learning how to get much bigger fragments from the sequencing machines, so the repeats are less of a problem. Thanks to this, we now have a pretty good idea what the oat genome looks like, and it’s just as messy as we expected.

Master’s Degree Project in Bioinformatics 30 credits
2018
Department of Biology, Lund University

Advisor: Dag Ahren
Department of Biology
author: Kalbskopf, Victor
course: BINP50 20182
year: 2023
type: H2 - Master's Degree (Two Years)
language: English
id: 9114079
date added to LUP: 2023-05-05 16:17:38
date last changed: 2023-05-05 16:17:38
@misc{9114079,
  abstract     = {{The genome of the Belinda variety of hexaploid oat (Avena sativa) has recently been sequenced and assembled. This project aims to improve the assembly by clustering the thousands of scaffolds into their three ancestral subgenomes using Principal Component Analysis (PCA) of k-mer and repeat-element frequencies. The method was developed using a chromosome-level assembly of hexaploid wheat (Triticum aestivum), whose sequences formed highly distinguishable clusters in the PCA plot that matched the true subgenomes, indicating that the method has merit. The longest oat scaffolds, which together make up 90% of the genome (N90), were processed in the same manner, which resulted in 2 clusters: one with about one third of the 3-copy BUSCOs (Benchmarking Universal Single-Copy Orthologs) and another with two thirds. The latter cluster could then be subdivided into two clusters, with about half of the 2-copy BUSCOs in each cluster. A 1:1:1 ratio of BUSCOs across the clusters would indicate that the subgenomes are separating into their respective clusters. The clustering is not as clear as in the wheat example, but the length of the scaffolds or the state of the assembly may have a very large effect on the efficacy of the method. It is hoped that this method, with additional improvements, could be used to assess the assemblies of other large polyploid genomes and be part of a larger pipeline for understanding crop genome evolution.}},
  author       = {{Kalbskopf, Victor}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{SubKluster: Novel method to bin scaffolds from cereal genomes into subgenomes using substring frequency analysis}},
  year         = {{2023}},
}