Optimization of mutation detection from breast cancer RNA-seq data and prediction of patient outcome

Gladchuk, Sergii

Gladchuk, Sergii (2018) BINP32 20172
Degree Projects in Bioinformatics

Abstract: Breast cancer is the most common type of cancer among the female population in Sweden. Genomic mutations of patients are a very important source of information for treatment choice and outcome prediction. RNA-sequencing (RNA-seq) has become the de facto standard for expression profiling and can also be used for the detection of genomic variants. However, RNA-seq-based variant detection suffers from a high rate of false positives, which hinders its utility. Herein we developed an optimized annotation pipeline and filtering strategy for processing variants derived from RNA-seq with the aim to decrease false positive calls and obtain sets of true mutations, which can be used in different applications, such as genomic targets for liquid biopsy, or prediction of patient outcome with machine learning algorithms.
Popular Abstract: Reliable variant calling – a hidden dimension of RNA sequencing

RNA sequencing (RNA-seq) is a technology to analyze the transcriptome – the expressed regions of the genome – by simultaneously reading millions of sequences derived from RNA molecules isolated from the sample(s) of interest. After aligning the obtained sequences to a reference genome, they can be used either to quantify gene expression or to reconstruct the transcriptome to identify modifications (e.g., gene fusions and alternative splicing). Another, albeit technically challenging and therefore rarely used, possibility is to use RNA-seq data for variant calling – the process of identifying variants (mutations) in the expressed genome. The goal of this work was to make RNA-seq-based variant calling more viable by upgrading an existing general-use variant calling pipeline with rich annotations and filters in order to obtain true genomic mutations from breast cancer RNA-seq data while removing false positive calls.

Breast cancer is the most common type of cancer among women in Sweden. As is true for all types of cancer, breast cancer is caused by mutations in the human genome. These mutations alter cell functions, leading to almost unrestricted and uncontrolled growth and, at later disease stages, spread to other parts of the body. By detecting and interpreting these mutations, it may be possible to better direct the course of treatment for cancer patients and improve patient survival. Since DNA molecules are the primary ‘permanent’ storage of a cell's genomic information, whereas messenger RNA (mRNA) molecules are the pieces of that information expressed at any given time, sequencing of DNA (DNA-seq) is the first-choice method for identifying ‘bugs’ (mutations) in the genome. RNA-seq, on the other hand, is mostly used for the analysis of gene expression, gene fusions and alternative splicing. However, because RNA-seq is very similar to DNA-seq and has the same one-base resolution, it can be used for mutation detection as well. One big drawback, besides the fact that mRNA represents only the expressed part of the genome, is the huge number of falsely detected mutations. This is because mRNAs are not identical copies of DNA: after transcription they are spliced and modified in many other ways (post-transcriptional modifications). It is also more challenging to align RNA sequencing reads accurately to the human reference genome. Together, these factors make it hard to distinguish real genomic mutations from noise caused either by the complex nature of RNA or by sequencing errors.

During this project, we developed a process to annotate each mutation with rich, publicly available information. This information was subsequently used to develop filters that distinguish between true and false positive mutations. These filters allowed us to filter out potential false mutations and reduce the initial number of called mutations 200-fold. The resulting set of RNA-seq mutations was validated in several ways. One was to check whether the majority of DNA-seq mutations in the PIK3CA gene (one of the most frequently mutated genes in breast cancer) were also identified from RNA-seq data. To our great satisfaction, 71% of the DNA-seq mutations were called from RNA-seq data and survived the filters. This result is presented in Figure 1, where the top lollipops represent mutations derived from RNA-seq and the bottom ones those from DNA-seq.
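The annotate-filter-validate idea described above can be illustrated with a minimal Python sketch. It is not the pipeline from the thesis: the annotation fields (`depth`, `alt_fraction`, `known_rna_edit`), the thresholds, and the toy variants with made-up positions are all assumptions chosen for illustration. It only shows how a fold reduction and the fraction of DNA-seq mutations recovered from RNA-seq could be computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """One called variant with a few hypothetical annotations."""
    gene: str
    pos: int               # toy coordinate, not a real genomic position
    ref: str
    alt: str
    depth: int             # reads covering the site (assumed annotation)
    alt_fraction: float    # fraction of reads supporting the alt allele
    known_rna_edit: bool   # site listed as an RNA-editing site (assumed)

    @property
    def key(self):
        """Identity used to match RNA-seq calls against DNA-seq calls."""
        return (self.gene, self.pos, self.ref, self.alt)


def passes_filters(v, min_depth=10, min_alt_fraction=0.1):
    """Illustrative filter; the thresholds are assumptions, not the thesis values."""
    return (
        v.depth >= min_depth
        and v.alt_fraction >= min_alt_fraction
        and not v.known_rna_edit
    )


def fold_reduction(n_before, n_after):
    """How many times smaller the call set is after filtering."""
    if n_after == 0:
        raise ValueError("no variants left after filtering")
    return n_before / n_after


def recall(rna_calls, dna_keys):
    """Fraction of DNA-seq mutations also present among the RNA-seq calls."""
    if not dna_keys:
        raise ValueError("empty DNA-seq reference set")
    found = {v.key for v in rna_calls} & set(dna_keys)
    return len(found) / len(dna_keys)


# Toy RNA-seq calls: one plausible mutation and three typical false positives.
calls = [
    Variant("PIK3CA", 100, "A", "G", depth=50, alt_fraction=0.40, known_rna_edit=False),
    Variant("PIK3CA", 200, "C", "T", depth=4, alt_fraction=0.50, known_rna_edit=False),
    Variant("PIK3CA", 300, "G", "A", depth=80, alt_fraction=0.02, known_rna_edit=False),
    Variant("PIK3CA", 400, "T", "C", depth=60, alt_fraction=0.30, known_rna_edit=True),
]
kept = [v for v in calls if passes_filters(v)]

# Toy DNA-seq reference set for the same gene.
dna_keys = {("PIK3CA", 100, "A", "G"), ("PIK3CA", 500, "C", "G")}

print("fold reduction:", fold_reduction(len(calls), len(kept)))
print("recall vs DNA-seq:", recall(kept, dna_keys))
```

Matching on gene, position and alleles keeps the comparison strict; a real comparison would also have to account for mutations in regions that are not expressed and therefore cannot be seen in RNA-seq at all.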

Another direction of the project was to develop a machine-learning model for predicting patient survival based on the mutations identified by our pipeline. The model's performance (accuracy of prediction) was not high; possible explanations are that patient follow-up times were still short and that mutations alone are not sufficient for good prediction.

The variant calling pipeline, upgraded with the annotation pipeline and filters, was also used in other breast cancer studies with RNA-seq data. A subset of the potentially true mutations identified by the pipeline was successfully detected in DNA molecules with very sensitive laboratory methods. These experiments gave us independent reassurance that our pipeline implementation can correctly identify genomic mutations, the hidden dimension of RNA-seq.

Master’s Degree Project in Bioinformatics 60 credits 2018
Department of Biology, Lund University

Advisors: Lao Saal, Christian Brueffer
Department of Clinical Sciences, Lund University, Medicon Village, Building 404-B2,
SE-22381, Lund, Sweden

Please use this url to cite or link to this publication: http://lup.lub.lu.se/student-papers/record/8948028

author

Gladchuk, Sergii

supervisor

Lao Saal (LU)
Christian Brueffer (LU)

organization

Degree Projects in Bioinformatics

course

BINP32 20172

year

2018

type

H2 - Master's Degree (Two Years)

subject

Biology and Life Sciences

language

English

id

8948028

date added to LUP

2018-06-11 15:48:18

date last changed

2018-06-11 15:48:18

@misc{8948028,
  abstract     = {{Breast cancer is the most common type of cancer among the female population in Sweden. Genomic mutations of patients are a very important source of information for treatment choice and outcome prediction. RNA-sequencing (RNA-seq) has become the de facto standard for expression profiling and can also be used for the detection of genomic variants. However, RNA-seq-based variant detection suffers from a high rate of false positives, which hinders its utility. Herein we developed an optimized annotation pipeline and filtering strategy for processing variants derived from RNA-seq with the aim to decrease false positive calls and obtain sets of true mutations, which can be used in different applications, such as genomic targets for liquid biopsy, or prediction of patient outcome with machine learning algorithms.}},
  author       = {{Gladchuk, Sergii}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{Optimization of mutation detection from breast cancer RNA-seq data and prediction of patient outcome}},
  year         = {{2018}},
}

LUP Student Papers

LUND UNIVERSITY LIBRARIES
