Classification of sequence tags from tandem mass spectrometry spectra using machine learning models

Ortís Sunyer, Júlia

Ortís Sunyer, Júlia (2022) BINP51 20212
Degree Projects in Bioinformatics

Abstract: Motivation: Proteomics is the large-scale study of all the proteins found in a cell, tissue or organism. In the last few years, thanks to developments in mass spectrometry and bioinformatics, proteomics has led research in several fields, ranging from medicine to agriculture. De novo protein sequencing can be used to reconstruct the amino acid sequence of a protein. It uses the protein’s molecular weight, its mass spectrometry spectrum, and bioinformatics tools to reconstruct the sequence without the use of a database. This avoids problems such as the limited amount of data available in databases. Nonetheless, more research is needed to optimize the tools and data extraction, especially to deal with the ambiguous spectra of long peptides. In this project, several machine learning models were built using TensorFlow and Keras. The aim was for at least one of the models to correctly distinguish sequence tags extracted from tandem mass spectrometry spectra from fake tags.

Results: Seven machine learning models were successfully built to classify sequence tags from tandem mass spectrometry spectra. Upon evaluation, two of the models dealt with the data better than the others according to several evaluation metrics (confusion matrix outcomes, accuracy, precision, recall and area under the curve) and classified the true tags of each spectrum largely correctly.
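The evaluation metrics named above can be illustrated with a minimal Python sketch. It is not the thesis code: the function names (`confusion_counts`, `summarize`, `roc_auc`), the 0.5 decision threshold and the labels and scores are all made up for illustration, with 1 standing for a true tag and 0 for a fake tag.

```python
from typing import Dict, Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Count confusion matrix outcomes for binary labels (1 = true tag, 0 = fake tag)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == 1:
            counts["tp" if truth == 1 else "fp"] += 1
        else:
            counts["fn" if truth == 1 else "tn"] += 1
    return counts


def summarize(counts: Dict[str, int]) -> Dict[str, float]:
    """Derive accuracy, precision and recall from the confusion matrix outcomes."""
    tp, fp, tn, fn = counts["tp"], counts["fp"], counts["tn"], counts["fn"]
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve, computed as the fraction of (positive, negative)
    pairs in which the positive has the higher score (ties count as half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and model scores, invented for this example.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

counts = confusion_counts(y_true, y_pred)
print(counts)
print(summarize(counts))
print(roc_auc(y_true, scores))
```

On this toy data, accuracy, precision and recall each come out at 0.75 and the area under the curve at 0.9375. Unlike the other three metrics, the area under the curve does not depend on the chosen threshold, which is one reason to report it alongside them when comparing models.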
Popular Abstract: Machine learning models in the study of proteins

Proteins are complex molecules that play indispensable roles in the correct functioning of every living organism. Proteins are made up of smaller blocks called amino acids, which are attached to each other to form a chain or, as it is more widely known, an amino acid sequence. The proteome, the name given to the full set of proteins of an organism, can vary greatly not only between species, but also between individuals of the same species, or even within the same individual at different points in time. The study of proteomes is called proteomics and is very important in many fields, such as medicine and agriculture.

The proteome of an organism can be studied in the lab in order to obtain a spectrum for each of its proteins. In this case, a spectrum is a plot of the fragments that make up each protein, arranged by their mass-to-charge ratio. The spectrum can then be used to reconstruct the protein sequence. There are several methods for doing this, including database search, de novo protein sequencing and hybrid methods. Database search consists of searching for a match to the spectrum in a database of known proteins. Even though this technique is very useful, it depends on already known data, which can lead to problems when studying the proteomes of less-studied organisms. For this reason, de novo protein sequencing has been gaining popularity. It avoids this issue by building the amino acid sequence from scratch using the sequence’s tags, its spectrum and the protein’s mass. The sequence’s tags are short amino acid sequences derived from the protein sequence. Hybrid methods combine database search and de novo protein sequencing.
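One common way to picture how a sequence tag relates to a spectrum is that the mass gap between neighbouring fragment peaks can correspond to the mass of a single amino acid. The Python sketch below illustrates that general idea only. This record does not describe how tags were extracted in the thesis, so the function names (`residue_for_gap`, `tag_from_peaks`), the 0.02 Da tolerance and the peak values are assumptions made for illustration. The residue masses are standard monoisotopic values for a subset of amino acids.

```python
from typing import List, Optional

# Standard monoisotopic residue masses in daltons (subset only).
# Isoleucine is omitted because it has the same mass as leucine (L).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "N": 114.04293,
    "D": 115.02694, "K": 128.09496, "E": 129.04259, "F": 147.06841,
}


def residue_for_gap(gap: float, tol: float = 0.02) -> Optional[str]:
    """Return the amino acid whose residue mass is closest to `gap`,
    or None if no residue lies within the tolerance `tol` (in Da)."""
    best = None
    for aa, mass in RESIDUE_MASS.items():
        err = abs(gap - mass)
        if err <= tol and (best is None or err < best[1]):
            best = (aa, err)
    return best[0] if best else None


def tag_from_peaks(peaks: List[float], tol: float = 0.02) -> Optional[str]:
    """Read a tag from singly charged peaks assumed to belong to one fragment
    ion series. Returns None if any gap fails to match a residue mass."""
    if len(peaks) < 2:
        return None
    ordered = sorted(peaks)
    tag = []
    for low, high in zip(ordered, ordered[1:]):
        aa = residue_for_gap(high - low, tol)
        if aa is None:
            return None
        tag.append(aa)
    return "".join(tag)


# Hypothetical peak list: an arbitrary starting mass followed by gaps
# equal to the residue masses of G, A and S.
example_peaks = [200.0, 257.02146, 328.05857, 415.0906]
print(tag_from_peaks(example_peaks))
print(tag_from_peaks([200.0, 250.0]))
```

Real spectra contain noise peaks, missing fragments and multiple charge states, so many candidate tags read this way will be wrong. That is the kind of ambiguity that makes separating true tags from fake ones a classification problem.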

Machine learning is a type of artificial intelligence in which software is trained to become gradually more accurate at making predictions. It uses previously collected data as input to predict new values. This can be very useful in de novo protein sequencing, as it allows the process to be automated and optimized. Machine learning models are widely used, both in day-to-day applications such as Google Translate and in research, for example in the search for new drug targets.

The aim of my project was to use machine learning algorithms to see whether it was possible to distinguish real amino acid sequence tags from fake ones, with the tags coming from tandem mass spectrometry spectra. This is similar to how Google classifies spam email. The classification of sequence tags was done without a database, using only the protein’s spectrum, the amino acid sequence and the amino acids’ masses. As this was successfully achieved, further machine learning architectures were built to test whether different structures made a difference in the classification of the data.

The results from all the machine learning models show that their architecture affected the classification of the sequence tags. More specifically, two of the models predicted the data better than the others according to the statistical methods used to evaluate and compare the models.

Master’s Degree Project in Bioinformatics, 45 credits, 2022
Department of Biology, Lund University
Advisors: Lars Malmström and Carlos Gueto Tettay
Infection Medicine Proteomics, BMC D13, Lund University

Please use this url to cite or link to this publication: http://lup.lub.lu.se/student-papers/record/9102740

author

Ortís Sunyer, Júlia

supervisor

Lars Malmström

organization

Degree Projects in Bioinformatics

course

BINP51 20212

year

2022

type

H2 - Master's Degree (Two Years)

subject

Biology and Life Sciences

language

English

id

9102740

date added to LUP

2022-11-04 11:51:34

date last changed

2022-11-04 11:51:34

@misc{9102740,
  abstract     = {{Motivation: Proteomics is the large-scale study of all the proteins found in a cell, tissue or organism. In the last few years, and thanks to the development of mass spectrometry and bioinformatics, proteomics has led the research in several fields, ranging from medicine to agriculture. In order to reconstruct the amino acid sequence, de novo protein sequencing can be used. It uses the protein’s molecular weight, its mass spectrometry spectrum, and bioinformatics tools to reconstruct the sequence without the use of a database. This avoids problems such as the limited amount of data found in the databases. Nonetheless, more research needs to be carried out to optimize the tools and data extraction, especially to deal with the ambiguous spectra of long peptides. In this project, several machine learning algorithms were created using TensorFlow and Keras. The aim was for at least one of the models to correctly distinguish sequence tags extracted from tandem mass spectrometry spectra from fake tags.

Results: Seven machine learning models were successfully built to classify sequence tags from tandem mass spectrometry spectra. Upon evaluation of the models, two of them dealt with the data better, according to several statistical parameters (confusion matrix outcomes, accuracy, precision, recall and area under the curve) and managed to classify the true tags of each spectrum largely correctly.}},
  author       = {{Ortís Sunyer, Júlia}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{Classification of sequence tags from tandem mass spectrometry spectra using machine learning models}},
  year         = {{2022}},
}

LUP Student Papers

LUND UNIVERSITY LIBRARIES
