A community effort to identify and correct mislabeled samples in proteogenomic studies

Yoo, Seungyeul; Shi, Zhiao; Wen, Bo; Kho, SoonJye; Pan, Renke; Feng, Hanying; Chen, Hong; Carlsson, Anders; Edén, Patrik; Ma, Weiping; Raymer, Michael; Maier, Ezekiel J.; Tezak, Zivana; Johansson, Elaine; Hinton, Denise; Rodriguez, Henry; Zhu, Jun; Boja, Emily; Wang, Pei; Zhang, Bing

Yoo, Seungyeul; Shi, Zhiao; Wen, Bo; Kho, SoonJye; Pan, Renke; Feng, Hanying; Chen, Hong; Carlsson, Anders (LU); Edén, Patrik (LU); Ma, Weiping, et al. (2021). In Patterns 2(5).

Abstract: Sample mislabeling or misannotation has been a long-standing problem in scientific research, particularly prevalent in large-scale, multi-omic studies due to the complexity of multi-omic workflows. There exists an urgent need for implementing quality controls to automatically screen for and correct sample mislabels or misannotations in multi-omic studies. Here, we describe a crowdsourced precisionFDA NCI-CPTAC Multi-omics Enabled Sample Mislabeling Correction Challenge, which provides a framework for systematic benchmarking and evaluation of mislabel identification and correction methods for integrative proteogenomic studies. The challenge received a large number of submissions from domestic and international data scientists, with highly variable performance observed across the submitted methods. Post-challenge collaboration between the top-performing teams and the challenge organizers has created an open-source software, COSMO, with demonstrated high accuracy and robustness in mislabeling identification and correction in simulated and real multi-omic datasets.
Abstract (Swedish): In a community effort to combat sample mislabeling in multi-omic studies, the submitted computational solutions showed a wide range of accuracy. The final collaborative product, COSMO, achieves high performance. Applying COSMO to published datasets demonstrates the tool's biological impact.

Please use this url to cite or link to this publication: https://lup.lub.lu.se/record/6db11878-fef5-4572-8ae5-4094615c5b88

author

Yoo, Seungyeul; Shi, Zhiao; Wen, Bo; Kho, SoonJye; Pan, Renke; Feng, Hanying; Chen, Hong; Carlsson, Anders (LU); Edén, Patrik (LU); Ma, Weiping; Raymer, Michael; Maier, Ezekiel J.; Tezak, Zivana; Johansson, Elaine; Hinton, Denise; Rodriguez, Henry; Zhu, Jun; Boja, Emily; Wang, Pei and Zhang, Bing

organization

publishing date

2021-05-14

type

Contribution to journal

publication status

published

subject

Bioinformatics and Computational Biology

keywords

proteomics, genomics, mislabeling

in

Patterns

volume

2

issue

5

article number

100245

pages

14 pages

publisher

Cell Press

external identifiers

scopus:85105706455
pmid:34036290

ISSN

2666-3899

DOI

10.1016/j.patter.2021.100245

language

English

LU publication?

yes

id

6db11878-fef5-4572-8ae5-4094615c5b88

date added to LUP

2021-05-19 11:45:42

date last changed

2025-12-02 02:49:42

@article{6db11878-fef5-4572-8ae5-4094615c5b88,
  abstract     = {{Sample mislabeling or misannotation has been a long-standing problem in scientific research, particularly prevalent in large-scale, multi-omic studies due to the complexity of multi-omic workflows. There exists an urgent need for implementing quality controls to automatically screen for and correct sample mislabels or misannotations in multi-omic studies. Here, we describe a crowdsourced precisionFDA NCI-CPTAC Multi-omics Enabled Sample Mislabeling Correction Challenge, which provides a framework for systematic benchmarking and evaluation of mislabel identification and correction methods for integrative proteogenomic studies. The challenge received a large number of submissions from domestic and international data scientists, with highly variable performance observed across the submitted methods. Post-challenge collaboration between the top-performing teams and the challenge organizers has created an open-source software, COSMO, with demonstrated high accuracy and robustness in mislabeling identification and correction in simulated and real multi-omic datasets.}},
  author       = {{Yoo, Seungyeul and Shi, Zhiao and Wen, Bo and Kho, SoonJye and Pan, Renke and Feng, Hanying and Chen, Hong and Carlsson, Anders and Edén, Patrik and Ma, Weiping and Raymer, Michael and Maier, Ezekiel J. and Tezak, Zivana and Johansson, Elaine and Hinton, Denise and Rodriguez, Henry and Zhu, Jun and Boja, Emily and Wang, Pei and Zhang, Bing}},
  issn         = {{2666-3899}},
  keywords     = {{proteomics, genomics, mislabeling}},
  language     = {{eng}},
  month        = {{05}},
  number       = {{5}},
  publisher    = {{Cell Press}},
  journal      = {{Patterns}},
  title        = {{A community effort to identify and correct mislabeled samples in proteogenomic studies}},
  url          = {{http://dx.doi.org/10.1016/j.patter.2021.100245}},
  doi          = {{10.1016/j.patter.2021.100245}},
  volume       = {{2}},
  year         = {{2021}},
}

Lund University Publications

LUND UNIVERSITY LIBRARIES
