Lund University Publications

Applying Mondrian Cross-Conformal Prediction to Estimate Prediction Confidence on Large Imbalanced Bioactivity Data Sets

Sun, Jiangming; Carlsson, Lars; Ahlberg, Ernst; Norinder, Ulf; Engkvist, Ola and Chen, Hongming (2017) In Journal of Chemical Information and Modeling 57(7). p.1591-1598
Abstract

Conformal prediction has been proposed as a more rigorous way to define prediction confidence than the application domain concepts previously used in QSAR modeling. One main advantage of the method is that it provides a prediction region, potentially containing multiple predicted labels, in contrast to the single-valued (regression) or single-label (classification) predictions output by standard QSAR modeling algorithms. Standard conformal prediction might not be suitable for imbalanced data sets. Therefore, Mondrian cross-conformal prediction (MCCP), which combines Mondrian inductive conformal prediction with cross-fold calibration sets, has been introduced. In this study, the MCCP method was applied to 18 publicly available data sets with imbalance levels ranging from 1:10 to 1:1000 (ratio of active to inactive compounds). Our results show that MCCP in general performed well on bioactivity data sets with various imbalance levels. More importantly, unlike standard machine learning methods, the method not only provides prediction confidence and prediction regions but also produces valid predictions for the minority class. In addition, a compound similarity-based nonconformity measure was investigated. Our results demonstrate that although it gives valid predictions, its efficiency is much worse than that of model-dependent metrics.
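The combination of Mondrian (class-conditional) calibration with cross-fold calibration sets described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the toy scores, and the choice to average p-values across folds are assumptions made for illustration; the nonconformity scores are taken as given, whereas in practice they would come from a model trained on each fold (for example, an SVM decision value).

```python
from statistics import mean


def mondrian_p_value(cal_scores, test_score):
    """Class-conditional p-value: the share of calibration scores from the
    SAME class that are at least as nonconforming as the test score."""
    n_at_least = sum(1 for s in cal_scores if s >= test_score)
    return (n_at_least + 1) / (len(cal_scores) + 1)


def mccp_p_values(fold_cal_scores, fold_test_scores):
    """Aggregate per-fold Mondrian p-values for one test compound.

    fold_cal_scores:  one dict per fold, label -> calibration scores of that label
    fold_test_scores: one dict per fold, label -> score of the test compound
                      under the hypothesis that it carries that label
    Assumption: folds are combined by the mean; other aggregations exist.
    """
    labels = list(fold_test_scores[0])
    return {
        label: mean(
            mondrian_p_value(cal[label], test[label])
            for cal, test in zip(fold_cal_scores, fold_test_scores)
        )
        for label in labels
    }


def prediction_region(p_values, significance):
    """Keep every label whose p-value exceeds the significance level."""
    return {label for label, p in p_values.items() if p > significance}


# Toy, made-up scores for two folds (higher score = more nonconforming).
fold_cal = [
    {"active": [0.2, 0.6, 0.9], "inactive": [0.1, 0.2, 0.3, 0.4]},
    {"active": [0.3, 0.5, 0.8], "inactive": [0.1, 0.15, 0.25, 0.5]},
]
fold_test = [
    {"active": 0.4, "inactive": 0.7},
    {"active": 0.5, "inactive": 0.6},
]

p_values = mccp_p_values(fold_cal, fold_test)
region = prediction_region(p_values, significance=0.25)
```

Because each label is calibrated only against calibration compounds of the same label, the scarce active class gets its own p-value rather than being swamped by the inactive majority. Depending on the significance level, the region may contain one label, both labels, or none.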

author
Sun, Jiangming; Carlsson, Lars; Ahlberg, Ernst; Norinder, Ulf; Engkvist, Ola and Chen, Hongming
publishing date
2017
type
Contribution to journal
publication status
published
keywords
Machine learning, Support Vector Machine
in
Journal of Chemical Information and Modeling
volume
57
issue
7
pages
1591 - 1598
publisher
The American Chemical Society (ACS)
external identifiers
  • pmid:28628322
  • scopus:85025698990
ISSN
1549-9596
DOI
10.1021/acs.jcim.7b00159
language
English
LU publication?
no
additional info
Funding Information: This research has received funding from the ExCAPE project within the European Union's Horizon 2020 framework under Grant Agreement no. 671555. We thank Prof. Alex Gammerman and Dr. Paolo Toccaceli (Royal Holloway, University of London) for helpful discussions. The research at Swetox (UN) was supported by Stockholm County Council, Knut & Alice Wallenberg Foundation, and Swedish Research Council FORMAS. Publisher Copyright: © 2017 American Chemical Society.
id
d82f2fcd-1ff2-4f6f-adf1-f212faa8a479
date added to LUP
2023-04-24 15:34:28
date last changed
2024-04-05 18:39:53
@article{d82f2fcd-1ff2-4f6f-adf1-f212faa8a479,
  abstract     = {{Conformal prediction has been proposed as a more rigorous way to define prediction confidence compared to other application domain concepts that have earlier been used for QSAR modeling. One main advantage of such a method is that it provides a prediction region potentially with multiple predicted labels, which contrasts to the single valued (regression) or single label (classification) output predictions by standard QSAR modeling algorithms. Standard conformal prediction might not be suitable for imbalanced data sets. Therefore, Mondrian cross-conformal prediction (MCCP) which combines the Mondrian inductive conformal prediction with cross-fold calibration sets has been introduced. In this study, the MCCP method was applied to 18 publicly available data sets that have various imbalance levels varying from 1:10 to 1:1000 (ratio of active/inactive compounds). Our results show that MCCP in general performed well on bioactivity data sets with various imbalance levels. More importantly, the method not only provides confidence of prediction and prediction regions compared to standard machine learning methods but also produces valid predictions for the minority class. In addition, a compound similarity based nonconformity measure was investigated. Our results demonstrate that although it gives valid predictions, its efficiency is much worse than that of model dependent metrics.}},
  author       = {{Sun, Jiangming and Carlsson, Lars and Ahlberg, Ernst and Norinder, Ulf and Engkvist, Ola and Chen, Hongming}},
  issn         = {{1549-9596}},
  keywords     = {{Machine learning; Support Vector Machine}},
  language     = {{eng}},
  month        = {{07}},
  number       = {{7}},
  pages        = {{1591--1598}},
  publisher    = {{The American Chemical Society (ACS)}},
  journal      = {{Journal of Chemical Information and Modeling}},
  title        = {{Applying Mondrian Cross-Conformal Prediction to Estimate Prediction Confidence on Large Imbalanced Bioactivity Data Sets}},
  url          = {{http://dx.doi.org/10.1021/acs.jcim.7b00159}},
  doi          = {{10.1021/acs.jcim.7b00159}},
  volume       = {{57}},
  year         = {{2017}},
}