A Novel Method for Predicting Ribosomal RNA Genes in Prokaryotic Genomes

Loman, Torkel

Loman, Torkel (2017) BINP30 20161
Degree Projects in Bioinformatics

Abstract: With the increased efficiency of sequencing methods, large quantities of genomic information are quickly becoming available. To make use of this, methods are needed that can discern useful information from these vast data quantities. Such information could be the presence and position of specified genes with known properties. Due to their highly conserved sequences and their prevalence across all genomes, the ribosomal RNA (rRNA) genes have a wide range of applications within bioinformatics. Previous methods for predicting rRNA genes include RNAmmer and barrnap, which both use an approach based on Hidden Markov Models (HMMs). However, these methods are problematic for a number of reasons. Here we present a new method for rRNA gene prediction that uses a new, k-mer-based approach. This method provides a large improvement in specificity for predicting 5S rRNA genes as well as higher precision in pinpointing the ends of the rRNA gene. At the same time, it preserves the high sensitivity of earlier methods with a decrease in running time. We also demonstrate the ability of rRNA gene predictors to find potential errors in the RefSeq annotation database.
Popular Abstract: Rapid and Comprehensive Diagnosis of Bacterial Infections

The first step of medical treatment is to make a diagnosis, that is, to determine the cause of the patient's problems. In the case of a broken leg this is easy, but if a patient instead arrives at the hospital complaining about nausea, fatigue and headache after a trip to Thailand, it is more difficult. Until recently it has been up to the medical expertise of the doctor to determine the cause of the symptoms. However, with modern technology it has become increasingly possible to automate the diagnosis process. One such advance concerns bacterial identification. New methods are being developed that enable us to identify all the bacteria present in a sample. Since bacteria cause many diseases, it should be possible to determine with a small and simple test whether a bacterial infection is the cause of a patient's suffering.

Using Genes for Species Identification
This method is based on DNA technology. As you might know, DNA is what defines each individual (it is our biological blueprint). As well as small DNA differences between individuals (that's why we all look different), there are larger differences between different species. The genes are the pieces of DNA that govern certain characteristics. For example, there would be a gene for hairiness. In humans that gene would cause a relative lack of hair growth, while the horse's gene would cause a medium amount of hair and the yak's gene a lot of it. A snake would lack this gene altogether. Now, with modern DNA technology it is possible to extract DNA from any biological sample (like blood, skin, saliva...) and identify the genes present in that sample. If you have blood from a human, a horse and a yak, it is possible to find the gene for hairiness. If you then got the samples mixed up, it would be possible to label them according to which sample has genes for little (human), medium (horse) and a lot (yak) of hair growth. If the samples were from a snake and a frog, you would instead have to use some other gene. It is exactly this principle we use when we want to identify which bacteria are currently infecting a patient.

Identifying Infectious Bacteria
When we want to diagnose a bacterial infection we will first need a target gene. This needs to be a gene that can be found in all bacteria (or else we will get the snake and frog problem for hairiness). In practice this will be the so-called 16S rRNA gene (which is responsible for protein production; since all organisms need proteins, they also all have this gene for producing them). We also need a bit of preparation: we need to know what this 16S rRNA gene looks like for all the bacteria that we want to identify. This leads us to another advantage of the 16S rRNA gene: it has been identified for almost all bacteria. From this we can build our diagnostic tool. The first step is to take a sample, which will have to be done by the doctor or nurse. Next, that sample is put into some "Magic Machine". The machine first extracts all DNA in the sample and sends information about it to a computer. This computer will then identify all 16S rRNA genes in the sample. All these genes will then be matched against a database of known 16S rRNA genes. For each gene one can thus identify which bacterium that gene originated from. One can then conclude that this bacterium must have been present in the patient and is a possible cause of disease.
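The database-matching step described above can be illustrated with a toy sketch. It is not the tool used in practice: the function names, the species labels and the short sequences are all made up for illustration, and similarity is measured here as the fraction of shared k-mers (Jaccard similarity), which is only one of several ways such a comparison could be done.

```python
def kmer_set(seq, k=6):
    """Return the set of all substrings of length k in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def identify(gene, database, k=6):
    """Return (name, score) of the database entry most similar to gene.

    Similarity is the Jaccard index of the two k-mer sets; this is an
    illustrative choice, not a claim about any real pipeline.
    """
    query = kmer_set(gene, k)
    best_name, best_score = None, 0.0
    for name, reference in database.items():
        ref_kmers = kmer_set(reference, k)
        union = query | ref_kmers
        score = len(query & ref_kmers) / len(union) if union else 0.0
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


# Hypothetical mini-database; real 16S rRNA genes are roughly 1500 bases long.
database = {
    "species_A": "ATGCGTACGTTAGCCTAGGCTAACGTTGCATCGATCGG",
    "species_B": "TTGACCGGTATTCCGGAAGTCCATGGCATTAGGCCAAT",
}
# A gene found in the sample: species_A with one changed base.
found_gene = "ATGCGTACGTCAGCCTAGGCTAACGTTGCATCGATCGG"
name, score = identify(found_gene, database)
```

Here `name` is the label of the closest reference and `score` indicates how close the match is; a low best score would suggest that the bacterium is not in the database at all.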

Finding 16S rRNA Genes
This process requires a few steps to work: DNA extraction, 16S rRNA identification, 16S rRNA database construction and finally matching the 16S rRNA genes found to the database. What I have developed is the method for finding 16S rRNA genes in DNA data. This method does not say which bacterium the gene originates from (that is the next step) but only that it is a 16S rRNA gene, which can then be matched to the database. How does my method do that? It uses the database of 16S rRNA genes to compress the information about what a 16S rRNA gene looks like. Then it can quickly scan the DNA with that information to find matches. These matches are 16S rRNA genes. But if we are going to use the database of all bacterial 16S rRNA genes to identify what a 16S rRNA gene looks like, can't we just compare all the DNA sequences in the sample to all the 16S rRNA sequences in the database directly? And if so, what use is there for the method I created? Yes, it is possible to do that! But no, it is a terrible idea! Computers are limited in how fast they can do things, and this would take a few weeks. The extra step that I have made can reduce the time to only minutes, and if you are dealing with a plague victim that is a very good thing.
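To make the idea of "compressing" the database and then scanning concrete, here is a minimal sketch. It is an assumption-laden illustration and not the thesis's actual algorithm: the reference genes are condensed into a set of short words (k-mers), and the DNA is then scanned for stretches where most k-mers are known. The function names and the parameters `k`, `window` and `min_fraction` are hypothetical choices.

```python
def build_kmer_set(reference_genes, k=8):
    """Condense the reference genes into the set of k-mers they contain."""
    kmers = set()
    for gene in reference_genes:
        gene = gene.upper()
        for i in range(len(gene) - k + 1):
            kmers.add(gene[i:i + k])
    return kmers


def scan(dna, kmers, k=8, window=40, min_fraction=0.8):
    """Return merged (start, end) regions where known k-mers are dense.

    A window of `window` bases is reported when at least `min_fraction`
    of the k-mers lying fully inside it are found in `kmers`.
    """
    dna = dna.upper()
    n_kmers = len(dna) - k + 1
    per_window = window - k + 1  # number of k-mers fully inside one window
    if n_kmers < per_window or per_window <= 0:
        return []
    # One set lookup per position: 1 if the k-mer starting there is known.
    hits = [1 if dna[i:i + k] in kmers else 0 for i in range(n_kmers)]
    regions = []
    count = sum(hits[:per_window])
    for start in range(n_kmers - per_window + 1):
        if start > 0:
            # Slide the window one base: add the new k-mer, drop the old one.
            count += hits[start + per_window - 1] - hits[start - 1]
        if count / per_window >= min_fraction:
            end = start + window
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], end)  # extend previous region
            else:
                regions.append((start, end))
    return regions
```

Each position in the DNA costs one set lookup, regardless of how many reference genes went into the set. That is the intuition behind the speed-up over comparing the sample to every database sequence directly. The reported boundaries are only approximate in this sketch, and a real predictor would need a further step to pinpoint the gene ends.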

Master’s Degree Project in Bioinformatics 30 credits 2016/17
Department of Biology, Lund University

Advisor: Björn Canbäck

Please use this url to cite or link to this publication: http://lup.lub.lu.se/student-papers/record/8914064

author: Loman, Torkel
supervisor: Björn Canbäck (LU)
organization: Degree Projects in Bioinformatics
course: BINP30 20161
year: 2017
type: H2 - Master's Degree (Two Years)
subject: Biology and Life Sciences
language: English
id: 8914064
date added to LUP: 2017-06-12 11:47:03
date last changed: 2017-06-12 11:47:03

@misc{8914064,
  abstract     = {{With the increased efficiency of sequencing methods, large quantities of genomic information are quickly becoming available. To make use of this, methods are needed that can discern useful information from these vast data quantities. Such information could be the presence and position of specified genes with known properties. Due to their highly conserved sequences and their prevalence across all genomes the ribosomal RNA (rRNA) genes have a wide range of application within bioinformatics. Previous methods for predicting rRNA genes include RNAmmer and barrnap which both use an approach based on Hidden Markov Models (HMM). However these methods are problematic due to a number of reasons. Here we present a new method for rRNA gene prediction that uses a new, k-mer based, approach. This method provides a large improvement in specificity for predicting 5S rRNA genes as well as higher precision in pinpointing the ends of the rRNA gene. At the same time it preserves the high sensitivity of earlier methods with a decrease in running time. We also demonstrate the ability of rRNA gene predictors to find potential errors in the RefSeq annotation database.}},
  author       = {{Loman, Torkel}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{A Novel Method for Predicting Ribosomal RNA Genes in Prokaryotic Genomes}},
  year         = {{2017}},
}

LUP Student Papers

LUND UNIVERSITY LIBRARIES
