Lund University Publications

SRIQ clustering : A fusion of Random Forest, QT clustering, and KNN concepts

Karlström, Jacob; Aine, Mattias; Staaf, Johan and Veerla, Srinivas (2022) In Computational and Structural Biotechnology Journal 20. p. 1567-1579
Abstract

Gene expression profiling together with unsupervised analysis methods, typically clustering methods, has been used extensively in cancer research to unravel, e.g., new molecular subtypes that hold promise of disease refinement that may ultimately benefit patients. However, many of the commonly used methods require a prespecified number of clusters to extract and frequently require some type of feature pre-selection, e.g. variance filtering. This introduces subjectivity to the process of cluster discovery and the definition of putative novel tumor subtypes. Here, we introduce SRIQ, a novel unsupervised clustering method that could circumvent some of the issues in commonly used unsupervised analysis methods. SRIQ incorporates concepts from random forest machine learning as well as quality threshold and k-nearest neighbor clustering. It is implemented as a Java and Python pipeline including data pre-processing, differential expression analysis, and pathway analysis. Using 434 lung adenocarcinomas profiled by RNA sequencing, we demonstrate the technical reproducibility of SRIQ and benchmark its performance against the commonly used consensus clustering method. Based on differential gene expression analysis and auxiliary molecular data, we show that SRIQ can define new tumor subsets that appear biologically relevant and consistent, and that these new subgroups seem to refine existing transcriptional subtypes that were defined using consensus clustering. Together, this provides support that SRIQ may be a useful new tool for unsupervised analysis of gene expression data from human malignancies.
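As an illustration of one concept named in the abstract, the sketch below shows a toy quality threshold (QT) clustering in Python. It is not the SRIQ implementation and does not reproduce the authors' pipeline; the function name `qt_cluster`, the parameter `max_diameter`, and the use of Euclidean distance are assumptions made for this example only. It is meant to show how a diameter threshold, rather than a prespecified number of clusters, can determine how many groups emerge.

```python
from math import dist  # Euclidean distance, Python 3.8+


def qt_cluster(points, max_diameter):
    """Toy QT clustering (illustrative only, not SRIQ).

    Returns a list of clusters, each a sorted list of point indices.
    The number of clusters is not given in advance; it follows from
    max_diameter (hypothetical parameter name).
    """
    remaining = set(range(len(points)))
    clusters = []
    while remaining:
        best = []
        # Grow one candidate cluster from every unassigned point.
        for seed in sorted(remaining):
            candidate = [seed]
            pool = sorted(remaining - {seed})

            def farthest(p):
                # Largest distance from p to any current member.
                return max(dist(points[p], points[m]) for m in candidate)

            while pool:
                # Add the point that keeps the candidate's diameter smallest.
                nxt = min(pool, key=farthest)
                if farthest(nxt) > max_diameter:
                    break  # any further point would exceed the threshold
                candidate.append(nxt)
                pool.remove(nxt)
            if len(candidate) > len(best):
                best = candidate
        # Keep the largest candidate, remove its points, and repeat.
        clusters.append(sorted(best))
        remaining -= set(best)
    return clusters


# Made-up 2-D points: two tight groups and one outlier.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
clusters = qt_cluster(points, max_diameter=2.0)
print(clusters)
```

In this toy setting a tighter threshold yields more, smaller clusters and a looser one yields fewer, larger clusters. For expression data a correlation-based distance would be a more typical choice than Euclidean distance, which is used here only to keep the example short.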

type: Contribution to journal
publication status: published
keywords: Clustering, Gene expression, KNN, Lung adenocarcinoma, Molecular subtypes, QT clustering, Random Forest
in: Computational and Structural Biotechnology Journal
volume: 20
pages: 13 pages
publisher: Research Network of Computational and Structural Biotechnology
external identifiers:
  • scopus:85127468445
  • pmid:35465158
ISSN: 2001-0370
DOI: 10.1016/j.csbj.2022.03.036
language: English
LU publication: yes
id: 0ba7b2b9-b95c-48a1-a404-98a51025a3f2
date added to LUP: 2022-05-09 14:02:49
date last changed: 2024-04-13 16:38:03
@article{0ba7b2b9-b95c-48a1-a404-98a51025a3f2,
  abstract     = {{Gene expression profiling together with unsupervised analysis methods, typically clustering methods, has been used extensively in cancer research to unravel, e.g., new molecular subtypes that hold promise of disease refinement that may ultimately benefit patients. However, many of the commonly used methods require a prespecified number of clusters to extract and frequently require some type of feature pre-selection, e.g. variance filtering. This introduces subjectivity to the process of cluster discovery and the definition of putative novel tumor subtypes. Here, we introduce SRIQ, a novel unsupervised clustering method that could circumvent some of the issues in commonly used unsupervised analysis methods. SRIQ incorporates concepts from random forest machine learning as well as quality threshold and k-nearest neighbor clustering. It is implemented as a Java and Python pipeline including data pre-processing, differential expression analysis, and pathway analysis. Using 434 lung adenocarcinomas profiled by RNA sequencing, we demonstrate the technical reproducibility of SRIQ and benchmark its performance against the commonly used consensus clustering method. Based on differential gene expression analysis and auxiliary molecular data, we show that SRIQ can define new tumor subsets that appear biologically relevant and consistent, and that these new subgroups seem to refine existing transcriptional subtypes that were defined using consensus clustering. Together, this provides support that SRIQ may be a useful new tool for unsupervised analysis of gene expression data from human malignancies.}},
  author       = {{Karlström, Jacob and Aine, Mattias and Staaf, Johan and Veerla, Srinivas}},
  issn         = {{2001-0370}},
  keywords     = {{Clustering; Gene expression; KNN; Lung adenocarcinoma; Molecular subtypes; QT clustering; Random Forest}},
  language     = {{eng}},
  pages        = {{1567--1579}},
  publisher    = {{Research Network of Computational and Structural Biotechnology}},
  journal      = {{Computational and Structural Biotechnology Journal}},
  title        = {{SRIQ clustering : A fusion of Random Forest, QT clustering, and KNN concepts}},
  url          = {{http://dx.doi.org/10.1016/j.csbj.2022.03.036}},
  doi          = {{10.1016/j.csbj.2022.03.036}},
  volume       = {{20}},
  year         = {{2022}},
}