Lund University Publications

A comparison of rule-based and centroid single-sample multiclass predictors for transcriptomic classification

Eriksson, Pontus LU ; Marzouka, Nour-Al-Dain LU ; Sjödahl, Gottfrid LU ; Bernardo, Carina LU ; Liedberg, Fredrik LU and Höglund, Mattias LU (2022) In Bioinformatics 38(4). p.1022-1029
Abstract

MOTIVATION: Gene expression-based multiclass prediction, such as tumor subtyping, is a non-trivial bioinformatic problem. Most classifier methods operate by comparing expression levels relative to other samples. Methods that base predictions on the expression pattern within a sample have been proposed as an alternative. As these methods are invariant to the cohort composition and can be applied to a sample in isolation, they can collectively be termed single sample predictors (SSP). Such predictors could potentially be used for preprocessing-free classification of new samples and be built to function across different expression platforms where proper batch and dataset normalization is challenging. Here we evaluate the behavior of several multiclass single sample predictors based on binary gene-pair rules (k-Top Scoring Pairs, Absolute Intrinsic Molecular Subtyping, and a new Random Forest approach) and compare them to centroids built with centered or raw expression values, with the criteria that an optimal predictor should have high accuracy, overcome differences in tumor purity, be robust across expression platforms, and provide an informative prediction output score.

RESULTS: We found that gene-pair-based SSPs showed excellent performance on many expression-based classification tasks. The three methods differed in prediction score output, handling of tied scores, and behavior in low purity samples. The k-Top Scoring Pairs and Random Forest approach both achieved high classification accuracy while providing an informative prediction score. Although gene-pair-based SSPs have been touted as being cross-platform compatible (through training on mixed platform data), out-of-the-box compatibility with a new dataset remains a potential issue that warrants cohort-to-cohort verification.

AVAILABILITY: Our R package 'multiclassPairs' (https://cran.r-project.org/package=multiclassPairs) (https://doi.org/10.1093/bioinformatics/btab088) is freely available and enables easy training, prediction, and visualization using the gene-pair rule-based Random Forest SSP method and provides additional multiclass functionalities to the switchBox k-Top-Scoring Pairs package.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
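
ILLUSTRATION: The idea of a binary gene-pair rule described in the abstract can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used in multiclassPairs or switchBox: the gene names, the rule list, the class labels, and the function `predict_single_sample` are all hypothetical, and returning `None` on a tie is just one possible way to handle tied scores. Each rule asks whether one gene is expressed higher than another within the same sample, so the prediction does not depend on other samples or on any monotone rescaling of the values.

```python
from collections import Counter

# Hypothetical rules: (gene_a, gene_b, class that gets a vote when gene_a > gene_b).
RULES = [
    ("GENE_A", "GENE_B", "subtype_1"),
    ("GENE_C", "GENE_D", "subtype_1"),
    ("GENE_B", "GENE_A", "subtype_2"),
    ("GENE_D", "GENE_C", "subtype_2"),
    ("GENE_E", "GENE_A", "subtype_3"),
    ("GENE_E", "GENE_C", "subtype_3"),
]


def predict_single_sample(expr, rules=RULES):
    """Classify one sample from within-sample gene-pair comparisons.

    Returns (label, scores), where scores maps each class to the fraction
    of its rules that hold, and label is None if the top score is tied.
    """
    fired = Counter()  # rules that hold in this sample, per class
    total = Counter()  # rules defined per class
    for gene_a, gene_b, label in rules:
        total[label] += 1
        if expr[gene_a] > expr[gene_b]:
            fired[label] += 1
    scores = {label: fired[label] / total[label] for label in total}
    best = max(scores.values())
    winners = [label for label, score in scores.items() if score == best]
    return (winners[0] if len(winners) == 1 else None), scores


# One sample on a log-like scale (made-up values).
sample = {"GENE_A": 9.0, "GENE_B": 4.0, "GENE_C": 7.5, "GENE_D": 2.0, "GENE_E": 5.0}
label, scores = predict_single_sample(sample)

# The same sample after a monotone transform (here 2**x) keeps every
# pairwise ordering, so the rules give the same answer.
rescaled = {gene: 2.0 ** value for gene, value in sample.items()}
label_rescaled, scores_rescaled = predict_single_sample(rescaled)

# A sample in which no rule holds ties all classes at a score of zero.
flat = {gene: 1.0 for gene in sample}
label_flat, scores_flat = predict_single_sample(flat)
```

In this sketch the per-class vote fraction doubles as a simple prediction score, and the flat sample shows why tie handling is a design decision that a real predictor has to make explicitly.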

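author
Eriksson, Pontus ; Marzouka, Nour-Al-Dain ; Sjödahl, Gottfrid ; Bernardo, Carina ; Liedberg, Fredrik and Höglund, Mattias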
organization
publishing date
type
Contribution to journal
publication status
published
subject
in
Bioinformatics
volume
38
issue
4
pages
8 pages
publisher
Oxford University Press
external identifiers
  • pmid:34788787
  • scopus:85128991948
ISSN
1367-4803
DOI
10.1093/bioinformatics/btab763
language
English
LU publication?
yes
id
5592dfb0-d273-4a4b-a541-4fab264c0b79
date added to LUP
2021-11-23 11:39:41
date last changed
2024-06-13 17:02:21
@article{5592dfb0-d273-4a4b-a541-4fab264c0b79,
  abstract     = {{MOTIVATION: Gene expression-based multiclass prediction, such as tumor subtyping, is a non-trivial bioinformatic problem. Most classifier methods operate by comparing expression levels relative to other samples. Methods that base predictions on the expression pattern within a sample have been proposed as an alternative. As these methods are invariant to the cohort composition and can be applied to a sample in isolation, they can collectively be termed single sample predictors (SSP). Such predictors could potentially be used for preprocessing-free classification of new samples and be built to function across different expression platforms where proper batch and dataset normalization is challenging. Here we evaluate the behavior of several multiclass single sample predictors based on binary gene-pair rules (k-Top Scoring Pairs, Absolute Intrinsic Molecular Subtyping, and a new Random Forest approach) and compare them to centroids built with centered or raw expression values, with the criteria that an optimal predictor should have high accuracy, overcome differences in tumor purity, be robust across expression platforms, and provide an informative prediction output score. RESULTS: We found that gene-pair-based SSPs showed excellent performance on many expression-based classification tasks. The three methods differed in prediction score output, handling of tied scores, and behavior in low purity samples. The k-Top Scoring Pairs and Random Forest approach both achieved high classification accuracy while providing an informative prediction score. Although gene-pair-based SSPs have been touted as being cross-platform compatible (through training on mixed platform data), out-of-the-box compatibility with a new dataset remains a potential issue that warrants cohort-to-cohort verification. AVAILABILITY: Our R package 'multiclassPairs' (https://cran.r-project.org/package=multiclassPairs) (https://doi.org/10.1093/bioinformatics/btab088) is freely available and enables easy training, prediction, and visualization using the gene-pair rule-based Random Forest SSP method and provides additional multiclass functionalities to the switchBox k-Top-Scoring Pairs package. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.}},
  author       = {{Eriksson, Pontus and Marzouka, Nour-Al-Dain and Sjödahl, Gottfrid and Bernardo, Carina and Liedberg, Fredrik and Höglund, Mattias}},
  issn         = {{1367-4803}},
  language     = {{eng}},
  month        = {{02}},
  number       = {{4}},
  pages        = {{1022--1029}},
  publisher    = {{Oxford University Press}},
  series       = {{Bioinformatics}},
  title        = {{A comparison of rule-based and centroid single-sample multiclass predictors for transcriptomic classification}},
  url          = {{http://dx.doi.org/10.1093/bioinformatics/btab763}},
  doi          = {{10.1093/bioinformatics/btab763}},
  volume       = {{38}},
  year         = {{2022}},
}