
Lund University Publications


Converting the genomic knowledge base to build protein specific machine learning prediction models; a classification study on thermophilic serine protease

Sunny, Jithin S.; Kumar, Atul LU; Nisha, Khairun and Saleena, Lilly M. (2022) In Biologia 77(12). p. 3615-3622
Abstract

Several machine learning models have been formulated for protein classification based on thermostability, an important prerequisite for industrial usage; described herein is a classification model for a specific enzyme, serine protease. For building the classifier, 283 thermophilic and 200 mesophilic bacterial genomes were mined for their respective serine protease sequences. Features were extracted from 760 sequences, followed by feature selection. We deployed a random forest-based classifier that identified thermophilic and non-thermophilic serine proteases with an accuracy of 97.11%, higher than other benchmark machine learning methods. Knowledge of thermostability and amino acid positional shifts can be vital for downstream protein engineering techniques. Thus, a web platform has been proposed to emphasize the real-time application of this enzyme-specific classification model. We designed a framework that helps protein engineers combine their sequence data with the classification model and use it to align query sequences against custom databases, identifying similar novel enzymes along with their thermophilic nature.
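
The abstract does not state which features were extracted from the sequences. As a minimal sketch, assuming amino acid composition (a common choice for sequence-based classification, not confirmed here as the authors' feature set), a fixed-length feature vector per sequence could be computed as follows. The names `amino_acid_composition` and `feature_vector` are hypothetical.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so every vector is aligned.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard residue in a protein sequence.

    Assumption: non-standard symbols (e.g. X, gaps) are ignored rather than
    counted; the record above does not say how the authors handled them.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def feature_vector(sequence):
    """Return the composition as a list of 20 floats in AMINO_ACIDS order."""
    composition = amino_acid_composition(sequence)
    return [composition[aa] for aa in AMINO_ACIDS]
```

Vectors of this kind, one per sequence and paired with a thermophilic or mesophilic label, are the sort of input a random forest classifier would be trained on; the feature selection step and the classifier itself are not shown here.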

author
Sunny, Jithin S.; Kumar, Atul; Nisha, Khairun and Saleena, Lilly M.
organization
publishing date
type
Contribution to journal
publication status
published
subject
keywords
Machine learning, Protein engineering, Random forest, Serine protease, Thermophilic
in
Biologia
volume
77
issue
12
pages
8 pages
publisher
Springer
external identifiers
  • scopus:85139161262
ISSN
0006-3088
DOI
10.1007/s11756-022-01214-4
language
English
LU publication?
yes
id
147b78d7-d470-4316-ba8c-d1ebd9bcca7f
date added to LUP
2022-12-14 13:24:08
date last changed
2023-04-06 03:20:09
@article{147b78d7-d470-4316-ba8c-d1ebd9bcca7f,
  abstract     = {{Several machine learning models have been formulated for protein classification based on an important prerequisite for industrial usage, thermostability, and described herein a classification model for a specific enzyme; serine protease. For building the classifier, 283 thermophilic and 200 mesophilic bacterial genomes were mined for their respective serine protease sequences. Features were extracted from 760 sequences, followed by feature selection. We deployed a random forest-based classifier that identified thermophilic and non-thermophilic serine proteases with an accuracy of 97.11%, higher than other benchmark machine learning methods. Knowledge of thermostability and amino acid positional shifts can be vital for downstream protein engineering techniques. Thus, a web platform has been proposed to emphasize the real-time application of this enzyme-specific classification model. We designed a framework that can aid protein engineers in combining their sequence data and the classification model and employ it to align query sequences against the custom databases and identify similar novel enzymes along with their thermophilic nature.}},
  author       = {{Sunny, Jithin S. and Kumar, Atul and Nisha, Khairun and Saleena, Lilly M.}},
  issn         = {{0006-3088}},
  keywords     = {{Machine learning; Protein engineering; Random forest; Serine protease; Thermophilic}},
  language     = {{eng}},
  number       = {{12}},
  pages        = {{3615--3622}},
  publisher    = {{Springer}},
  journal      = {{Biologia}},
  title        = {{Converting the genomic knowledge base to build protein specific machine learning prediction models; a classification study on thermophilic serine protease}},
  url          = {{http://dx.doi.org/10.1007/s11756-022-01214-4}},
  doi          = {{10.1007/s11756-022-01214-4}},
  volume       = {{77}},
  year         = {{2022}},
}