Lund University Publications

LUND UNIVERSITY LIBRARIES

Accurate prediction of B-form/A-form DNA conformation propensity from primary sequence : A machine learning and free energy handshake

Gupta, Abhijit; Kulkarni, Mandar [LU] and Mukherjee, Arnab (2021). In Patterns 2(9).
Abstract

DNA carries the genetic code of life, with different conformations associated with different biological functions. Predicting the conformation of DNA from its primary sequence, although desirable, is a challenging problem owing to the polymorphic nature of DNA. We have deployed a host of machine learning algorithms, including the popular state-of-the-art LightGBM (a gradient boosting model), for building prediction models. We used the nested cross-validation strategy to address the issues of “overfitting” and selection bias. This simultaneously provides an unbiased estimate of the generalization performance of a machine learning algorithm and allows us to tune the hyperparameters optimally. Furthermore, we built a secondary model based on SHAP (SHapley Additive exPlanations) that offers crucial insight into model interpretability. Our detailed model-building strategy and robust statistical validation protocols tackle the formidable challenge of working on small datasets, which is often the case in biological and medical data.
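
The nested cross-validation strategy mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: a tiny k-nearest-neighbour classifier stands in for LightGBM, the data are synthetic, and every function name here (make_toy_data, k_folds, nested_cv, and so on) is hypothetical. The point is only the structure: the inner loop tunes a hyperparameter using training data alone, and the outer loop scores that choice on data the tuning never saw.

```python
import random


def make_toy_data(n_per_class=30, seed=42):
    """Synthetic 1-D points (an assumption): class 0 near 0.0, class 1 near 3.0."""
    rng = random.Random(seed)
    data = [(rng.gauss(0.0, 1.0), 0) for _ in range(n_per_class)]
    data += [(rng.gauss(3.0, 1.0), 1) for _ in range(n_per_class)]
    return data


def k_folds(n, k, seed):
    """Shuffle indices 0..n-1 and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def split(data, folds, i):
    """Return (train, test) with fold i held out as the test set."""
    test = [data[j] for j in folds[i]]
    train = [data[j] for f, fold in enumerate(folds) if f != i for j in fold]
    return train, test


def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (k is odd)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def accuracy(train, test, k):
    """Fraction of test points whose label is predicted correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)


def inner_cv_score(train, k, n_folds, seed):
    """Mean accuracy of hyperparameter k over folds of the training set only."""
    folds = k_folds(len(train), n_folds, seed)
    scores = []
    for i in range(n_folds):
        inner_train, inner_val = split(train, folds, i)
        scores.append(accuracy(inner_train, inner_val, k))
    return sum(scores) / len(scores)


def nested_cv(data, grid, outer_folds=5, inner_folds=3, seed=0):
    """Return one (chosen k, held-out accuracy) pair per outer fold."""
    folds = k_folds(len(data), outer_folds, seed)
    results = []
    for i in range(outer_folds):
        train, test = split(data, folds, i)
        # Inner loop: select k without ever touching the outer test fold.
        best_k = max(grid, key=lambda k: inner_cv_score(train, k, inner_folds, seed + 1))
        # Outer loop: score the tuned model on the untouched fold.
        results.append((best_k, accuracy(train, test, best_k)))
    return results


results = nested_cv(make_toy_data(), grid=[1, 3, 5, 7])
for best_k, acc in results:
    print(f"chosen k={best_k}, held-out accuracy={acc:.2f}")
```

Averaging the held-out accuracies gives the generalization estimate. Because each outer test fold plays no part in choosing k, that estimate avoids the selection bias that arises when the same folds are used both to tune and to score a model.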

type
Contribution to journal
publication status
published
subject
keywords
DNA conformation, DNA sequence, DSML 2: Proof-of-Concept: Data science output has been formulated, implemented, and tested for one domain/problem, genome, LightGBM, machine learning, nested cross-validation
in
Patterns
volume
2
issue
9
article number
100329
publisher
Cell Press
external identifiers
  • pmid:34553171
  • scopus:85123578294
ISSN
2666-3899
DOI
10.1016/j.patter.2021.100329
language
English
LU publication?
yes
id
8cb39788-1c38-4e4f-bc7e-76d30a77d940
date added to LUP
2022-04-12 16:46:54
date last changed
2024-03-24 16:55:34
@article{8cb39788-1c38-4e4f-bc7e-76d30a77d940,
  abstract     = {{DNA carries the genetic code of life, with different conformations associated with different biological functions. Predicting the conformation of DNA from its primary sequence, although desirable, is a challenging problem owing to the polymorphic nature of DNA. We have deployed a host of machine learning algorithms, including the popular state-of-the-art LightGBM (a gradient boosting model), for building prediction models. We used the nested cross-validation strategy to address the issues of “overfitting” and selection bias. This simultaneously provides an unbiased estimate of the generalization performance of a machine learning algorithm and allows us to tune the hyperparameters optimally. Furthermore, we built a secondary model based on SHAP (SHapley Additive exPlanations) that offers crucial insight into model interpretability. Our detailed model-building strategy and robust statistical validation protocols tackle the formidable challenge of working on small datasets, which is often the case in biological and medical data.}},
  author       = {{Gupta, Abhijit and Kulkarni, Mandar and Mukherjee, Arnab}},
  issn         = {{2666-3899}},
  keywords     = {{DNA conformation; DNA sequence; DSML 2: Proof-of-Concept: Data science output has been formulated, implemented, and tested for one domain/problem; genome; LightGBM; machine learning; nested cross-validation}},
  language     = {{eng}},
  month        = {{09}},
  number       = {{9}},
  publisher    = {{Cell Press}},
  series       = {{Patterns}},
  title        = {{Accurate prediction of B-form/A-form DNA conformation propensity from primary sequence : A machine learning and free energy handshake}},
  url          = {{http://dx.doi.org/10.1016/j.patter.2021.100329}},
  doi          = {{10.1016/j.patter.2021.100329}},
  volume       = {{2}},
  year         = {{2021}},
}