Lund University Publications

Improved detection of clinically relevant fusion transcripts in cancer by machine learning classification

Hafstað, Völundur; Häkkinen, Jari; Larsson, Malin; Staaf, Johan; Vallon-Christersson, Johan and Persson, Helena (2023). In BMC Genomics 24(1).
Abstract

BACKGROUND: Genomic rearrangements in cancer cells can create fusion genes that encode chimeric proteins or alter the expression of coding and non-coding RNAs. In some cancer types, fusions involving specific kinases are used as targets for therapy. Fusion genes can be detected by whole genome sequencing (WGS) and targeted fusion panels, but RNA sequencing (RNA-Seq) has the advantageous capability of broadly detecting expressed fusion transcripts.

RESULTS: We developed a pipeline for validation of fusion transcripts identified in RNA-Seq data using matched WGS data from The Cancer Genome Atlas (TCGA) and applied it to 910 tumors from 11 different cancer types. This resulted in 4237 validated gene fusions, 3049 of them with at least one identified genomic breakpoint. Utilizing validated fusions as true positive events, we trained a machine learning classifier to predict true and false positive fusion transcripts from RNA-Seq data. The final precision and recall metrics of the classifier were 0.74 and 0.71, respectively, in an independent dataset of 249 breast tumors. Application of this classifier to all samples with RNA-Seq data from these cancer types vastly extended the number of likely true positive fusion transcripts and identified many potentially targetable kinase fusions. Further analysis of the validated gene fusions suggested that many are created by intrachromosomal amplification events with microhomology-mediated non-homologous end-joining.

CONCLUSIONS: A classifier trained on validated fusion events increased the accuracy of fusion transcript identification in samples without WGS data. This allowed the analysis to be extended to all samples with RNA-Seq data, facilitating studies of tumor biology and increasing the number of detected kinase fusions. Machine learning could thus be used in identification of clinically relevant fusion events for targeted therapy. The large dataset of validated gene fusions generated here presents a useful resource for development and evaluation of fusion transcript detection algorithms.
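The precision and recall reported in the abstract can be related to counts of true and false calls. The following is a minimal sketch, not the authors' evaluation code: the function names and the example counts are hypothetical, and the F1 score is derived here from the two reported values rather than taken from the publication.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Return (precision, recall) from confusion-matrix counts.

    tp: predicted fusions that are validated (true positives)
    fp: predicted fusions that are not validated (false positives)
    fn: validated fusions the classifier rejected (false negatives)
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, chosen only to illustrate the arithmetic.
p, r = precision_recall(tp=74, fp=26, fn=30)
print(f"precision={p:.2f} recall={r:.2f}")

# F1 derived from the precision (0.74) and recall (0.71) stated in the abstract.
print(f"F1={f1_score(0.74, 0.71):.2f}")  # F1=0.72
```

A precision of 0.74 means that roughly three in four fusion transcripts accepted by the classifier were validated, while a recall of 0.71 means that roughly seven in ten validated fusions were retained.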

type
Contribution to journal
publication status
published
in
BMC Genomics
volume
24
issue
1
article number
783
publisher
BioMed Central (BMC)
external identifiers
  • scopus:85180200111
  • pmid:38110872
ISSN
1471-2164
DOI
10.1186/s12864-023-09889-y
language
English
LU publication?
yes
additional info
© 2023. The Author(s).
id
25584de2-08e7-4009-8cc6-87d45a4429b7
date added to LUP
2023-12-20 09:08:12
date last changed
2024-04-19 07:02:58
@article{25584de2-08e7-4009-8cc6-87d45a4429b7,
  abstract     = {{BACKGROUND: Genomic rearrangements in cancer cells can create fusion genes that encode chimeric proteins or alter the expression of coding and non-coding RNAs. In some cancer types, fusions involving specific kinases are used as targets for therapy. Fusion genes can be detected by whole genome sequencing (WGS) and targeted fusion panels, but RNA sequencing (RNA-Seq) has the advantageous capability of broadly detecting expressed fusion transcripts. RESULTS: We developed a pipeline for validation of fusion transcripts identified in RNA-Seq data using matched WGS data from The Cancer Genome Atlas (TCGA) and applied it to 910 tumors from 11 different cancer types. This resulted in 4237 validated gene fusions, 3049 of them with at least one identified genomic breakpoint. Utilizing validated fusions as true positive events, we trained a machine learning classifier to predict true and false positive fusion transcripts from RNA-Seq data. The final precision and recall metrics of the classifier were 0.74 and 0.71, respectively, in an independent dataset of 249 breast tumors. Application of this classifier to all samples with RNA-Seq data from these cancer types vastly extended the number of likely true positive fusion transcripts and identified many potentially targetable kinase fusions. Further analysis of the validated gene fusions suggested that many are created by intrachromosomal amplification events with microhomology-mediated non-homologous end-joining. CONCLUSIONS: A classifier trained on validated fusion events increased the accuracy of fusion transcript identification in samples without WGS data. This allowed the analysis to be extended to all samples with RNA-Seq data, facilitating studies of tumor biology and increasing the number of detected kinase fusions. Machine learning could thus be used in identification of clinically relevant fusion events for targeted therapy. The large dataset of validated gene fusions generated here presents a useful resource for development and evaluation of fusion transcript detection algorithms.}},
  author       = {{Hafstað, Völundur and Häkkinen, Jari and Larsson, Malin and Staaf, Johan and Vallon-Christersson, Johan and Persson, Helena}},
  issn         = {{1471-2164}},
  language     = {{eng}},
  month        = {{12}},
  number       = {{1}},
  publisher    = {{BioMed Central (BMC)}},
  series       = {{BMC Genomics}},
  title        = {{Improved detection of clinically relevant fusion transcripts in cancer by machine learning classification}},
  url          = {{http://dx.doi.org/10.1186/s12864-023-09889-y}},
  doi          = {{10.1186/s12864-023-09889-y}},
  volume       = {{24}},
  year         = {{2023}},
}