
LUP Student Papers

LUND UNIVERSITY LIBRARIES

Identification of Modified Peptides using Open and Conventional Search Engines

Ekvall, Emilia LU and Bjereus, Helena (2021) KIMM05 20211
Department of Immunotechnology
Abstract
Mass spectrometry-based proteomics has matured into an analytical tool applicable in many areas of life science. Limitations remain, however, and much effort has gone into developing search engines that identify peptides in fragmentation spectra more accurately. This report investigates four search engines, MSGF+, Andromeda, MSFragger and pFind, in order to assess the advantages and limitations of open and closed approaches in the computational analysis of Post-Translational Modifications (PTMs). A ground-truth dataset of synthetic peptides was used to validate the results. pFind produced the most correct identifications, but also the most false positives. MSFragger tended to be the least sensitive to changes in the False Discovery Rate (FDR). Each search engine identified 43-70 percent of the synthetic peptides at target FDRs of 1 and 10 percent. To extend the evaluation to biological data, one of the most comprehensive mass spectrometry datasets available for a cell line was also studied. It contained 166,620 identified unique peptide sequences. MSGF+ produced the most peptide hits in the biological dataset, which did not match pFind's predominance in the synthetic dataset. This highlights the difficulty of validating which search engine performed best. The search engines' results overlapped more for common modifications, and it is hypothesised that the true FDR is higher for less frequent modifications. For example, for the common modification phosphorylation the engines' identification sets overlapped by 34-57 percent, compared with 5-21 percent for the uncommon modification methylation. The final conclusion is that, when minimizing false positive identifications is crucial, no search engine is precise enough to be used alone with confidence, especially for low-frequency modifications.
The credibility of PTM identifications can be increased by using a combination of search engines. It is therefore recommended to further investigate the limitations of search engines by including more variables.
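The overlap percentages quoted in the abstract compare the sets of identifications reported by different engines. The abstract does not state how the overlap was computed, so the following is only a minimal sketch under an assumed definition: shared identifications divided by the union of both sets. The function name `overlap_percent` and all peptide entries are hypothetical and are not taken from the thesis.

```python
from itertools import combinations


def overlap_percent(a: set, b: set) -> float:
    """Shared identifications as a percentage of the union of both sets.

    Assumed definition; the thesis may have normalised differently.
    """
    union = a | b
    if not union:
        return 0.0
    return 100.0 * len(a & b) / len(union)


# Hypothetical (peptide sequence, modification) pairs per engine.
engine_hits = {
    "MSGF+": {("PEPTIDEK", "Phospho"), ("SAMPLER", "Phospho"), ("TESTK", "Methyl")},
    "MSFragger": {("PEPTIDEK", "Phospho"), ("SAMPLER", "Phospho")},
    "pFind": {("PEPTIDEK", "Phospho"), ("OTHERK", "Methyl")},
}

# Pairwise overlap for every combination of two engines.
pairwise = {
    (x, y): overlap_percent(engine_hits[x], engine_hits[y])
    for x, y in combinations(engine_hits, 2)
}

for pair, pct in pairwise.items():
    print(pair, round(pct, 1))
```

Restricting each set to a single modification type before calling `overlap_percent` would give a per-modification comparison of the kind described for phosphorylation and methylation.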
Popular Abstract
Identification of Modified Peptides using Mass Spectrometry-based Search Engines.

Post-Translational Modifications (PTMs) are important for the function of proteins in various cellular processes. Disruption of PTMs can impair crucial biological processes, which can in turn lead to the development of various diseases. Current computational methods for PTM detection are highly laborious and have low specificity. There is therefore an urgent need for more efficient and specific computational analysis tools that predict PTMs with higher accuracy. Proteins control and catalyze almost all cellular processes. Together, they form a structured entity called the proteome. Across all species, proteins make up as much as 50 percent of the cell's dry mass. The proteome network defines the cell's functional state as it adapts to external and internal changes, and determines its phenotype. Understanding the proteome is a central challenge in biology.

Identification of the Proteome
Strategies for studying the proteome and its associated molecular mechanisms have been developed over the past decades. In general, they involve isolating a protein and then analysing it, with a focus on the function and structure of the molecule. This is usually done with established methods from biophysics and biotechnology. Technological developments have furthermore made it possible to measure proteomes on a large scale, which generates extensive datasets. The development of Mass Spectrometry (MS)-based methods has changed the field of proteomic identification. Reasons for this success include the inherent specificity of MS in identification, its sensitivity and its generic workflow. Together, these properties give MS the capacity to identify and quantify almost any expressed protein. Limitations remain, however: no single MS-based method can measure all proteins by itself, so a combination of methods is often used. Computational analysis has made it possible to derive biological insight from the proteomic datasets generated in MS experiments. Specifically, computational search engines identify the peptides of a proteome by comparing experimentally produced spectra with theoretical spectra derived from public protein sequence databases. In recent years, open search methods have been developed in an attempt to improve the identification of PTMs compared with conventional search engines.
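To illustrate the comparison between experimental and theoretical spectra, the sketch below generates singly charged b- and y-ion m/z values for candidate peptides and ranks the candidates by the number of matched peaks. It is a deliberately simplified illustration and not the scoring used by MSGF+, Andromeda, MSFragger or pFind. The function names, the 0.02 m/z tolerance, the candidate peptides and the "observed" peak list are all assumptions made for the example.

```python
# Monoisotopic residue masses in Da for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def fragment_mz(peptide: str) -> list[float]:
    """Singly charged b- and y-ion m/z values for an unmodified peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        b_ions.append(sum(masses[:i]) + PROTON)          # N-terminal fragment
        y_ions.append(sum(masses[i:]) + WATER + PROTON)  # C-terminal fragment
    return b_ions + y_ions


def matched_peaks(observed_mz: list[float], peptide: str, tol: float = 0.02) -> int:
    """Count theoretical fragments with an observed peak within tol (m/z)."""
    return sum(
        any(abs(obs - theo) <= tol for obs in observed_mz)
        for theo in fragment_mz(peptide)
    )


def best_candidate(observed_mz: list[float], candidates: list[str], tol: float = 0.02) -> str:
    """Return the candidate peptide with the most matched fragment peaks."""
    return max(candidates, key=lambda p: matched_peaks(observed_mz, p, tol))


# Hypothetical spectrum: the first five b ions of PEPTIDEK stand in for real peaks.
observed = fragment_mz("PEPTIDEK")[:5]
best = best_candidate(observed, ["PEPTIDEK", "SAMPLER"])
print(best)
```

A shared-peak count keeps the example short. Real engines also weigh peak intensities, charge states and the precursor mass, and a closed search only considers candidates whose mass matches the precursor within a narrow window.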

Characterizing Protein Modifications
The characterization of protein modifications remains a challenge in proteomics. More specifically, a central issue is the mapping of PTMs. PTMs are amino acid side chain modifications that occur in some proteins after biosynthesis. MS-based proteomics is well suited to the study of PTMs. The most frequently studied types of PTMs include phosphorylation, methylation, acetylation and ubiquitylation. The general functions of PTMs are shown in Figure 1.
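Each of the PTM types named above adds a characteristic mass to a peptide, which is what an open search can exploit: the difference between the observed precursor mass and the theoretical mass of the unmodified peptide hints at the modification. The sketch below shows this idea with the standard monoisotopic mass shifts. The function name `annotate_delta`, the 0.01 Da tolerance and the example masses are assumptions for illustration, and ubiquitylation is represented by the GlyGly remnant left after tryptic digestion.

```python
# Monoisotopic mass shifts in Da for the four PTM types named in the text.
KNOWN_SHIFTS = {
    "phosphorylation": 79.966331,
    "methylation": 14.015650,
    "acetylation": 42.010565,
    "ubiquitylation (GlyGly remnant)": 114.042927,
}


def annotate_delta(observed_mass: float, theoretical_mass: float, tol: float = 0.01):
    """Name the modification whose mass shift explains the observed delta.

    Returns "unmodified" when the masses agree within tol, the name of a
    known shift when one matches, and None when nothing matches.
    """
    delta = observed_mass - theoretical_mass
    if abs(delta) <= tol:
        return "unmodified"
    for name, shift in KNOWN_SHIFTS.items():
        if abs(delta - shift) <= tol:
            return name
    return None


# Hypothetical peptide of 1000.0 Da observed about 79.966 Da heavier.
label = annotate_delta(1000.0 + 79.966331, 1000.0)
print(label)
```

A mass shift alone does not localise the modification to a residue, and several modifications or combinations can share nearly the same delta, which is one reason why open-search results need careful validation.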

Proteomic Networks
Proteins rarely function on their own; they typically assemble into macromolecular complexes. Future developments in MS and computational analysis will open up the field to more applications. Beyond the current focus on protein signaling, a more comprehensive understanding of proteomic networks will be required. This will deepen the understanding of cellular processes and of how proteins interact across all areas of biology. It will also improve disease modelling technology, an area that holds much promise.

Conclusion
Our analysis of search engines indicated that combining engines can be preferable for increasing the accuracy of PTM identifications, especially for uncommon PTMs. This applies equally to open and conventional search engines. The investigation also demonstrated that there is room for improvement in the number of PTMs identified. Overcoming the limitations in accuracy and efficiency of PTM analysis in large proteomic datasets will be challenging, but success would greatly advance the understanding of proteins' biological function and thereby improve disease modeling.
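One simple way to combine engines is to keep only the identifications reported by a minimum number of them. The sketch below illustrates that idea and is not the procedure used in the thesis. The function name `consensus`, the `min_engines` threshold and the example identifications are hypothetical.

```python
from collections import Counter


def consensus(hits_by_engine: dict, min_engines: int = 2) -> set:
    """Identifications reported by at least min_engines different engines."""
    counts = Counter(
        hit
        for hits in hits_by_engine.values()
        for hit in set(hits)  # count each engine at most once per hit
    )
    return {hit for hit, n in counts.items() if n >= min_engines}


# Hypothetical (spectrum id, peptide, modification) identifications.
reported = {
    "MSGF+": [("scan1", "PEPTIDEK", "Phospho"), ("scan2", "SAMPLER", "Methyl")],
    "Andromeda": [("scan1", "PEPTIDEK", "Phospho")],
    "MSFragger": [("scan1", "PEPTIDEK", "Phospho"), ("scan3", "TESTK", "Acetyl")],
    "pFind": [("scan2", "SAMPLER", "Methyl"), ("scan3", "OTHERK", "Acetyl")],
}

agreed = consensus(reported, min_engines=2)
print(sorted(agreed))
```

Raising `min_engines` trades sensitivity for confidence: fewer identifications survive, but each is supported by more independent engines.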
author
Ekvall, Emilia LU and Bjereus, Helena
supervisor
organization
alternative title
Open and Closed Searches for Detection of Post-Translational Modifications in Proteomics Data
course
KIMM05 20211
year
2021
type
H2 - Master's Degree (Two Years)
subject
keywords
Proteomics, Software, Proteome Informatics, Open Modification Search, Shotgun Proteomics, Bioinformatics, Post-Translational Modifications
language
English
id
9055038
date added to LUP
2021-08-18 13:52:11
date last changed
2021-08-18 13:52:11
@misc{9055038,
  abstract     = {{Mass spectrometry-based proteomics has matured into an analytical tool applicable in many areas of life science. Limitations remain, however, and much effort has gone into developing search engines that identify peptides in fragmentation spectra more accurately. This report investigates four search engines, MSGF+, Andromeda, MSFragger and pFind, in order to assess the advantages and limitations of open and closed approaches in the computational analysis of Post-Translational Modifications (PTMs). A ground-truth dataset of synthetic peptides was used to validate the results. pFind produced the most correct identifications, but also the most false positives. MSFragger tended to be the least sensitive to changes in the False Discovery Rate (FDR). Each search engine identified 43-70 percent of the synthetic peptides at target FDRs of 1 and 10 percent. To extend the evaluation to biological data, one of the most comprehensive mass spectrometry datasets available for a cell line was also studied. It contained 166,620 identified unique peptide sequences. MSGF+ produced the most peptide hits in the biological dataset, which did not match pFind's predominance in the synthetic dataset. This highlights the difficulty of validating which search engine performed best. The search engines' results overlapped more for common modifications, and it is hypothesised that the true FDR is higher for less frequent modifications. For example, for the common modification phosphorylation the engines' identification sets overlapped by 34-57 percent, compared with 5-21 percent for the uncommon modification methylation. The final conclusion is that, when minimizing false positive identifications is crucial, no search engine is precise enough to be used alone with confidence, especially for low-frequency modifications.
The credibility of PTM identifications can be increased by using a combination of search engines. It is therefore recommended to further investigate the limitations of search engines by including more variables.}},
  author       = {{Ekvall, Emilia and Bjereus, Helena}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{Identification of Modified Peptides using Open and Conventional Search Engines}},
  year         = {{2021}},
}