
LUP Student Papers

LUND UNIVERSITY LIBRARIES

Development and comparison of feature-selection pipelines for predicting patient outcome from RNA-seq data in breast cancer

Chi, Xu (2022) BINP52 20211
Degree Projects in Bioinformatics
Abstract
In 2020, breast cancer became the most common type of cancer worldwide. Although advances in treatment and technology have led to high 5-year survival, long-term survival in breast cancer remains poor. Gene expression profiling of patients is an important source of information for outcome prediction. The Sweden Cancerome Analysis Network-Breast (SCAN-B) was initiated in 2009 and provides rich RNA-sequencing (RNA-seq) data for breast cancer research. Herein, we tested a de novo RNA-seq assembly pipeline at two aggregation levels of expression: gene and transcript. To select informative features associated with patient outcomes, a customized ensemble feature selection model based on the penalized Cox proportional hazards (PH) model was developed and compared with a univariate method and a single-run Cox PH model. This method can be employed to select features in unbalanced, ultra-high-dimensional, time-to-event data. In recurrence-free interval prediction, the transcript features selected with the univariate method gave the best performance (Uno's concordance index: 0.825), followed by gene features selected with the customized ensemble method (Uno's concordance index: 0.819).
Popular Abstract
Development and comparison of feature-selection pipelines for predicting patient outcome from RNA-seq data in breast cancer

Breast cancer is the most common cancer worldwide. In the Western world, breast cancer has a high 5-year survival rate of above 90%. However, late relapses after 5 years are relatively common. The Sweden Cancerome Analysis Network-Breast (SCAN-B) was initiated in 2009 and began enrolling patients in 2010. Within SCAN-B, breast tumors from hospitals across a wide geographic area of Sweden are routinely processed and RNA-sequenced. RNA sequencing is a technology that can evaluate the quantity and sequences of RNA in a sample. The purpose of this project is to develop and compare a set of pipelines that can select a handful of meaningful genes or transcripts out of thousands and use their expression values to predict death or recurrence in breast cancer patients.

First, a de novo assembly pipeline was developed to reconstruct novel transcripts from short nucleotide sequences. Using 500 random SCAN-B samples and two genes of specific interest, the developed pipeline detected samples containing novel transcripts more sensitively than the previous pipeline. The newly constructed transcript also showed a significant difference between the four subtypes of breast cancer.

Afterward, a de novo annotation file containing information on novel transcripts was generated from 2874 SCAN-B samples with estrogen receptor (ER) positive and human epidermal growth factor receptor 2 (HER2) negative status. SCAN-B's original pipeline uses GENCODE version 27 as the annotation file to guide the assembly and quantification of transcripts. For comparison with the original pipeline, this project used the de novo annotation file and GENCODE version 32 to estimate gene and transcript expression. After the gene and transcript expression values were extracted with a customized script, three feature selection methods were applied to these datasets.

A customized feature selection method inspired by ensemble learning was developed. It was compared with a univariate method, which tests features one by one, and a penalized Cox model, which evaluates multiple features at the same time. The customized method can be used to select significant features in unbalanced, ultra-high-dimensional, time-to-event data.
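
This record does not describe how the customized ensemble method works internally, so the following is only a generic sketch of one common form of ensemble feature selection: run a base selector on many resampled subsets and keep the features that are chosen often. Everything in it is an assumption made for illustration: the function names, the balanced undersampling of non-event samples, the `min_freq` threshold, and especially the base selector, which is a simple mean-gap ranking standing in for a penalized Cox model.

```python
import random
from collections import Counter


def top_k_by_mean_gap(rows, events, k):
    """Stand-in base selector (NOT a Cox model, purely illustrative).

    Ranks features by the absolute gap in mean expression between
    samples with an event and samples without one, and returns the top k.
    """
    def mean(values):
        return sum(values) / len(values)

    gaps = {}
    for name in rows[0]:
        with_event = [r[name] for r, e in zip(rows, events) if e]
        without_event = [r[name] for r, e in zip(rows, events) if not e]
        gaps[name] = abs(mean(with_event) - mean(without_event))
    return sorted(gaps, key=gaps.get, reverse=True)[:k]


def ensemble_select(rows, events, base_selector, n_runs=50, min_freq=0.8, seed=0):
    """Keep features chosen by base_selector in at least min_freq of the runs.

    Each run uses all event samples plus an equally sized random draw of
    non-event samples (an assumed way of handling unbalanced data).
    """
    rng = random.Random(seed)
    event_idx = [i for i, e in enumerate(events) if e]
    other_idx = [i for i, e in enumerate(events) if not e]
    votes = Counter()
    for _ in range(n_runs):
        drawn = rng.sample(other_idx, min(len(event_idx), len(other_idx)))
        idx = event_idx + drawn
        chosen = base_selector([rows[i] for i in idx], [events[i] for i in idx])
        votes.update(chosen)
    return sorted(f for f, count in votes.items() if count / n_runs >= min_freq)
```

In practice the base selector would be replaced by whatever model is being ensembled; the voting step is what makes the final feature list less sensitive to any single resampled fit.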

In conclusion, it is possible to predict patient outcomes (recurrence and/or death) using transcript expression levels. In recurrence-free interval (RFi) prediction, the transcript features selected with the univariate method gave the best performance (Uno's concordance index: 0.825), followed by gene features selected with the customized ensemble method (Uno's concordance index: 0.819). This project forms a basis for further work on developing predictive signatures for breast cancer patient outcomes.
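
For readers unfamiliar with the reported metric, Uno's concordance index measures how often a model assigns the higher risk score to the patient who experiences the event earlier, with each comparable pair weighted by the inverse squared probability of remaining uncensored. The sketch below is a minimal pure-Python illustration, not the code used in the thesis. The function names and the `tau` truncation argument are hypothetical, and the handling of tied times (events treated as preceding censoring) is a simplifying assumption.

```python
from collections import Counter


def censoring_km(times, events):
    """Kaplan-Meier estimate G(t) of the probability of being uncensored past t.

    Censored observations (event flag 0) are treated as the 'events' of
    the censoring process.
    """
    censored_at = Counter(t for t, e in zip(times, events) if not e)
    event_at = Counter(t for t, e in zip(times, events) if e)
    steps = []  # (time, value of G from that time onward)
    g = 1.0
    for u in sorted(censored_at):
        # Assumption: at tied times, patients with an event leave the risk set first.
        at_risk = sum(1 for t in times if t >= u) - event_at[u]
        g *= 1.0 - censored_at[u] / at_risk
        steps.append((u, g))

    def G(t):
        value = 1.0
        for u, g_u in steps:
            if u > t:
                break
            value = g_u
        return value

    return G


def uno_c_index(times, events, risk, tau=float("inf")):
    """Inverse-probability-of-censoring weighted concordance (Uno's C)."""
    G = censoring_km(times, events)
    num = den = 0.0
    for i in range(len(times)):
        if not events[i] or times[i] >= tau:
            continue  # only observed events before tau anchor a pair
        g_i = G(times[i])
        if g_i <= 0:
            continue  # weight undefined; skip this anchor
        w = 1.0 / g_i ** 2
        for j in range(len(times)):
            if times[j] > times[i]:
                den += w
                if risk[i] > risk[j]:
                    num += w
                elif risk[i] == risk[j]:
                    num += 0.5 * w  # tied risk scores count as half
    if den == 0:
        raise ValueError("no comparable pairs")
    return num / den
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. With no censoring all weights equal 1 and the measure reduces to the unweighted concordance index.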


Master’s Degree Project in Bioinformatics 60 credits 2022
Department of Biology, Lund University

Advisor: Lao Saal
Translational Oncogenomics Unit, Lund University
author: Chi, Xu
course: BINP52 20211
year: 2022
type: H2 - Master's Degree (Two Years)
language: English
id: 9077458
date added to LUP: 2022-03-22 13:36:24
date last changed: 2022-03-22 13:36:24
@misc{9077458,
  abstract     = {{In 2020, breast cancer became the most common type of cancer worldwide. While treatment and technology have radically improved, leading to high 5-year survival, the long-term survival of breast cancer is still poor. Gene expression profiling of patients is an important source of information for outcome prediction. The Sweden Cancerome Analysis Network-Breast (SCAN-B) was initiated in 2009 and provides rich RNA-sequencing (RNA-seq) data for breast cancer research. Herein, we tested a de novo RNA-seq sequence data assembly pipeline on two aggregation levels of expression: gene and transcript. With the aim to select informative features associated with patient outcomes, a customized ensemble feature selection model based on penalized Cox proportional hazards (PH) model was developed and compared with the univariate method and single-run Cox PH model. This method can be employed to select features in unbalanced, ultra-high-dimensional, time-to-event data. In recurrence-free interval prediction, the transcript features selected with univariate method gave the best performance (the Uno’s concordance index: 0.825), and followed by gene features selected with customized ensemble method (the Uno’s concordance index: 0.819).}},
  author       = {{Chi, Xu}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{Development and comparison of feature-selection pipelines for predicting patient outcome from RNA-seq data in breast cancer}},
  year         = {{2022}},
}