Converting the genomic knowledge base to build protein specific machine learning prediction models; a classification study on thermophilic serine protease
(2022) In Biologia 77(12). p.3615-3622
- Abstract
Several machine learning models have been formulated for protein classification based on an important prerequisite for industrial usage, thermostability, and described herein a classification model for a specific enzyme; serine protease. For building the classifier, 283 thermophilic and 200 mesophilic bacterial genomes were mined for their respective serine protease sequences. Features were extracted from 760 sequences, followed by feature selection. We deployed a random forest-based classifier that identified thermophilic and non-thermophilic serine proteases with an accuracy of 97.11%, higher than other benchmark machine learning methods. Knowledge of thermostability and amino acid positional shifts can be vital for downstream protein engineering techniques. Thus, a web platform has been proposed to emphasize the real-time application of this enzyme-specific classification model. We designed a framework that can aid protein engineers in combining their sequence data and the classification model and employ it to align query sequences against the custom databases and identify similar novel enzymes along with their thermophilic nature.
- author
- Sunny, Jithin S. ; Kumar, Atul LU ; Nisha, Khairun and Saleena, Lilly M.
- organization
- publishing date
- 2022-12
- type
- Contribution to journal
- publication status
- published
- subject
- keywords
- Machine learning, Protein engineering, Random forest, Serine protease, Thermophilic
- in
- Biologia
- volume
- 77
- issue
- 12
- pages
- 8 pages
- publisher
- Springer
- external identifiers
- scopus:85139161262
- ISSN
- 0006-3088
- DOI
- 10.1007/s11756-022-01214-4
- language
- English
- LU publication?
- yes
- id
- 147b78d7-d470-4316-ba8c-d1ebd9bcca7f
- date added to LUP
- 2022-12-14 13:24:08
- date last changed
- 2023-04-06 03:20:09
@article{147b78d7-d470-4316-ba8c-d1ebd9bcca7f,
  abstract  = {{Several machine learning models have been formulated for protein classification based on an important prerequisite for industrial usage, thermostability, and described herein a classification model for a specific enzyme; serine protease. For building the classifier, 283 thermophilic and 200 mesophilic bacterial genomes were mined for their respective serine protease sequences. Features were extracted from 760 sequences, followed by feature selection. We deployed a random forest-based classifier that identified thermophilic and non-thermophilic serine proteases with an accuracy of 97.11%, higher than other benchmark machine learning methods. Knowledge of thermostability and amino acid positional shifts can be vital for downstream protein engineering techniques. Thus, a web platform has been proposed to emphasize the real-time application of this enzyme-specific classification model. We designed a framework that can aid protein engineers in combining their sequence data and the classification model and employ it to align query sequences against the custom databases and identify similar novel enzymes along with their thermophilic nature.}},
  author    = {Sunny, Jithin S. and Kumar, Atul and Nisha, Khairun and Saleena, Lilly M.},
  issn      = {{0006-3088}},
  keywords  = {{Machine learning; Protein engineering; Random forest; Serine protease; Thermophilic}},
  language  = {{eng}},
  number    = {{12}},
  pages     = {{3615--3622}},
  publisher = {{Springer}},
  journal   = {{Biologia}},
  title     = {{Converting the genomic knowledge base to build protein specific machine learning prediction models; a classification study on thermophilic serine protease}},
  url       = {{http://dx.doi.org/10.1007/s11756-022-01214-4}},
  doi       = {{10.1007/s11756-022-01214-4}},
  volume    = {{77}},
  year      = {{2022}},
}