Optimization of mutation detection from breast cancer RNA-seq data and prediction of patient outcome

Gladchuk, Sergii (2018) BINP32 20172
Degree Projects in Bioinformatics
Abstract
Breast cancer is the most common type of cancer among the female population in Sweden. Patients' genomic mutations are an important source of information for treatment choice and outcome prediction. RNA-sequencing (RNA-seq) has become the de facto standard for expression profiling and can also be used for the detection of genomic variants. However, RNA-seq-based variant detection suffers from a high rate of false positives, which hinders its utility. Here we developed an optimized annotation pipeline and filtering strategy for processing variants derived from RNA-seq, with the aim of decreasing false positive calls and obtaining sets of true mutations that can be used in different applications, such as genomic targets for liquid biopsy or prediction of patient outcome with machine learning algorithms.
Popular Abstract
Reliable variant calling – a hidden dimension of RNA sequencing

RNA sequencing (RNA-seq) is a technology to analyze the transcriptome – the expressed regions of the genome – by simultaneously reading millions of sequences derived from RNA molecules isolated from the sample(s) of interest. After the obtained sequences are aligned to a reference genome, they can be used either to quantify gene expression or to reconstruct the transcriptome and identify modifications (e.g., gene fusions and alternative splicing). Another possibility, albeit technically challenging and therefore rarely used, is to use RNA-seq data for variant calling – the process of identifying variants (mutations) in the expressed genome. The goal of this work was to make RNA-seq-based variant calling more viable by upgrading an existing general-use variant calling pipeline with rich annotations and filters, in order to obtain true genomic mutations from breast cancer RNA-seq data while removing false positive calls.

Breast cancer is the most common type of cancer among women in Sweden. As is true for all types of cancer, breast cancer is caused by mutations in the human genome. These mutations alter cell functions, leading to almost unrestricted and uncontrolled growth and, at later disease stages, to spread to other parts of the body. By detecting and interpreting the mutations, it may be possible to better direct the course of treatment for cancer patients and improve patient survival. Since DNA molecules are the primary ‘permanent’ storage of a cell's genomic information, while messenger RNA (mRNA) molecules are the pieces of that information expressed at any given time, sequencing of DNA (DNA-seq) is the first-choice method for identifying any ‘bugs’ (mutations) in the genome. RNA-seq, on the other hand, is mostly used for the analysis of gene expression, gene fusions and alternative splicing. Yet because RNA-seq is very similar to DNA-seq and has the same single-base resolution, it can be used for mutation detection as well. One big drawback, besides the fact that mRNA represents only the expressed part of the genome, is the huge number of falsely detected mutations. This is because mRNAs are not identical copies of DNA: after transcription they are spliced and modified in many other ways (post-transcriptional modifications). It is also more challenging to align RNA sequencing reads accurately to the human reference genome. Together, these factors make it hard to distinguish between real genomic mutations and noise caused either by the complex nature of RNA or by sequencing errors.

During this project, we developed a process to annotate each mutation with rich, publicly available information. This information was subsequently used to develop filters that distinguish between true and false positive mutations. These filters allowed us to remove potentially false mutations and reduce the initial number of called mutations 200-fold. The resulting set of RNA-seq mutations was validated in different ways. One of them was to check whether the majority of DNA-seq mutations in the PIK3CA gene (one of the most frequently mutated genes in breast cancer) were also identified from RNA-seq data. To our great satisfaction, 71% of the DNA-seq mutations were called from RNA-seq data and survived the filters. This result is presented in Figure 1, where the top lollipops represent mutations derived from RNA-seq and the bottom ones mutations derived from DNA-seq.
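
The filter-then-validate idea described above can be illustrated with a minimal Python sketch. It is not the pipeline from the thesis: the annotation fields (`depth`, `alt_fraction`, `known_rna_editing_site`), the thresholds and the toy variants are all hypothetical and chosen only for illustration. The sketch filters RNA-seq calls on their annotations and then computes the fraction of DNA-seq mutations that are recovered among the filtered RNA-seq calls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """A variant identified by position and alleles (toy representation)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def passes_filters(annotations, min_depth=10, min_alt_fraction=0.1):
    """Return True if a call survives the (hypothetical) annotation filters.

    The field names and thresholds are assumptions made for this sketch,
    not the filters used in the thesis.
    """
    if annotations.get("depth", 0) < min_depth:
        return False
    if annotations.get("alt_fraction", 0.0) < min_alt_fraction:
        return False
    if annotations.get("known_rna_editing_site", False):
        return False
    return True


def recovered_fraction(dna_variants, rna_variants):
    """Fraction of DNA-seq variants that are also present in the RNA-seq set."""
    dna = set(dna_variants)
    if not dna:
        return 0.0
    return len(dna & set(rna_variants)) / len(dna)


# Toy data only: positions and annotation values are invented.
rna_calls = {
    Variant("chr3", 1000, "A", "G"): {"depth": 50, "alt_fraction": 0.40},
    Variant("chr3", 2000, "C", "T"): {"depth": 4, "alt_fraction": 0.50},
    Variant("chr3", 3000, "A", "G"): {"depth": 80, "alt_fraction": 0.30,
                                      "known_rna_editing_site": True},
    Variant("chr3", 4000, "G", "A"): {"depth": 30, "alt_fraction": 0.25},
}
dna_calls = [
    Variant("chr3", 1000, "A", "G"),
    Variant("chr3", 2000, "C", "T"),
    Variant("chr3", 4000, "G", "A"),
]

filtered_rna = [v for v, ann in rna_calls.items() if passes_filters(ann)]
print(f"RNA-seq calls kept: {len(filtered_rna)} of {len(rna_calls)}")
print(f"DNA-seq mutations recovered: {recovered_fraction(dna_calls, filtered_rna):.0%}")
```

In this toy example the low-coverage call and the call at a flagged RNA-editing site are discarded, so one of the three DNA-seq mutations is lost along with the noise. The same trade-off between removing false positives and retaining true mutations is what the 71% figure above measures.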

Another direction of the project was to develop a machine-learning model for predicting patient survival from the mutations identified by our pipeline. The model performance (prediction accuracy) was not high; possible explanations are that patient follow-up times were still short and that mutations alone are not sufficient for good prediction.

The variant calling pipeline, upgraded with the annotation pipeline and filters, was also used in other breast cancer studies with RNA-seq data. A subset of the potentially true mutations identified by the pipeline was successfully detected in DNA molecules with very sensitive laboratory methods. These experiments gave us independent reassurance that our pipeline implementation can correctly identify genomic mutations – the hidden dimension of RNA-seq.


Master’s Degree Project in Bioinformatics 60 credits 2018
Department of Biology, Lund University

Advisors: Lao Saal, Christian Brueffer
Department of Clinical Sciences, Lund University, Medicon Village, Building 404-B2,
SE-22381, Lund, Sweden
author
Gladchuk, Sergii
supervisor
organization
course
BINP32 20172
year
type
H2 - Master's Degree (Two Years)
subject
language
English
id
8948028
date added to LUP
2018-06-11 15:48:18
date last changed
2018-06-11 15:48:18
@misc{8948028,
  abstract     = {Breast cancer is the most common type of cancer among the female population in Sweden. Genomic mutations of patients are a very important source of information for treatment choice and outcome prediction. RNA-sequencing (RNA-seq) has become the de facto standard for expression profiling and can also be used for the detection of genomic variants. However, RNA-seq-based variant detection suffers from a high rate of false positives, which hinders its utility. Herein we developed an optimized annotation pipeline and filtering strategy for processing variants derived from RNA-seq with the aim to decrease false positive calls and obtain sets of true mutations, which can be used in different applications, such as genomic targets for liquid biopsy, or prediction of patient outcome with machine learning algorithms.},
  author       = {Gladchuk, Sergii},
  language     = {eng},
  note         = {Student Paper},
  title        = {Optimization of mutation detection from breast cancer RNA-seq data and prediction of patient outcome},
  year         = {2018},
}