Optimal peptide quantification via machine learning enhanced fragment ion ranking in DIA-MS proteomics

Lu, Lina

Lu, Lina (2022) BINP51 20212
Degree Projects in Bioinformatics

Abstract: In a standard mass spectrometry workflow, acquired mass spectra are searched against a library of peptides to extract peptide-spectrum matches (PSMs). Peptides are normally quantified by aggregating the intensities of fragment ions extracted from MS/MS spectra. However, quantifying peptides by summing the intensities of all fragment ions in PSMs can give inaccurate results, because multiple fragment ions can interfere with each other in complex samples. This project aims to use machine learning to enhance fragment ion ranking for every precursor to ensure optimal peptide quantification. Here, we describe a workflow that leverages machine learning to pick only the most confident fragments extracted for each potential peptide for quantification. We demonstrate the usability of the workflow on yeast standard benchmark data, showing that the average accuracy of quantification and differential expression across all optimized methods is 22.46% higher than that of the standard workflow. In addition, we investigate the performance of our workflow on existing complex and low-fold-change proteomic data containing four species and demonstrate the generalizability of our models on unrelated, diverse data sets.
Popular Abstract: Simple, efficient and accurate DIA-MS analysis using enhanced fragment ion ranking via machine learning

In biological research, it is usually necessary to compare the expression of a protein in different states or in different individuals to find disease-related targets. This requires quantitative analysis of the protein. Earlier methods for identifying and quantifying proteins using biomolecules are low-throughput: typically, only a limited number of proteins can be identified and quantified in one experiment. In recent years, mass spectrometers have been used in proteomics research. After the protein is digested and ionized, the mass spectrometer can detect the electrical charge, the mass, and the intensity of each ion. These data are used to identify and quantify proteins. Theoretical ion information for proteins, based on past experience, is stored in a library, and comparing the data obtained from the mass spectrometer with the information in the library makes it possible to efficiently identify which proteins are in the sample. The intensities of the ions corresponding to a protein are then used to quantify that protein. Many successful mass spectrometry proteomics methods have been published in previous studies, especially for DIA-MS, which combine mass spectrometers with powerful data analysis software and allow all proteins to be identified and quantified from samples containing thousands of proteins.

Machine learning is a commonly used data analysis method in MS quantification experiments. Generally, the output of software that compares the actual and theoretical spectra of proteins contains a large number of false matches. Downstream data analysis software uses machine learning to identify these false matches. After the false matches are removed, using the intensities of all fragment ions in the correct matches to calculate the protein abundance can improve the accuracy of the result.
These past methods solved most of the problems, but ignored the fact that fragment ions commonly interfere with each other in complex samples. Some fragment ions in a correct match may still be unreliable when certain features are considered. For example, the actual intensity of a high-intensity fragment ion may be higher than the theoretical intensity in the library, possibly because the fragment ion is shared by multiple peptides, so the intensity of this fragment ion cannot fully represent any one peptide. These conditions lead to inaccurate quantification.
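
The interference example above can be illustrated with a minimal sketch. It is not the method used in the thesis: the function name `flag_interfered`, the median-ratio reference, the `tolerance` value, and all intensities are assumptions chosen only to show how a fragment whose observed intensity is too high relative to its library intensity could be flagged.

```python
from statistics import median


def flag_interfered(observed, library, tolerance=1.5):
    """Flag fragment ions whose observed/library intensity ratio is unusually high.

    observed:  dict mapping fragment ion name -> measured intensity
    library:   dict mapping fragment ion name -> theoretical relative intensity
    tolerance: hypothetical cutoff; a fragment is flagged when its ratio exceeds
               tolerance times the median ratio over all shared fragments
    """
    # Ratio of measured to expected intensity for every fragment present in both.
    ratios = {
        ion: observed[ion] / library[ion]
        for ion in library
        if ion in observed and library[ion] > 0
    }
    if not ratios:
        return {}
    # The median ratio serves as a robust estimate of the peptide's true scale.
    reference = median(ratios.values())
    return {ion: ratio > tolerance * reference for ion, ratio in ratios.items()}


# Hypothetical library and measured intensities for one peptide.
library = {"y4": 1.0, "y5": 0.6, "y6": 0.4, "b3": 0.2}
observed = {"y4": 1000.0, "y5": 610.0, "y6": 1200.0, "b3": 195.0}

flags = flag_interfered(observed, library)
print(flags)
```

In this made-up example only `y6` is flagged: its measured intensity is about three times what the library ratio predicts, which is the pattern expected when a fragment ion is shared with another peptide.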

Improvements and future possibilities
To address this problem, we used machine learning and algorithms to further identify and filter false fragment ions, optimize the ordering of fragment ions, and combine the intensities in different ways to quantify the proteins. The results showed a 22% increase in accuracy compared to the standard workflow. We also found that the machine learning models can be used on other unrelated, diverse datasets. Once a good machine learning model is trained, it can be reused for other datasets, which saves the model training step and makes the whole process easy to use.
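
As a rough sketch of the idea described above, the snippet below filters fragment ions by a confidence score, ranks the remaining ones, and sums the intensities of the top-ranked fragments. The scores are assumed to come from some trained classifier that is not shown here; the function name `quantify_top_n`, the parameters `n` and `min_score`, and all values are hypothetical and are not taken from the thesis.

```python
def quantify_top_n(fragments, n=3, min_score=0.5):
    """Sum the intensities of the n highest-scoring fragment ions.

    fragments: list of dicts with keys "ion", "intensity", and "score", where
               "score" is a hypothetical per-fragment confidence from a model
    n:         number of top-ranked fragments to aggregate (assumed value)
    min_score: fragments scoring below this are treated as false and dropped
    """
    # Filter out fragments the model considers unreliable.
    kept = [f for f in fragments if f["score"] >= min_score]
    # Rank the remaining fragments from most to least confident.
    ranked = sorted(kept, key=lambda f: f["score"], reverse=True)
    # Aggregate only the top-ranked fragments.
    return sum(f["intensity"] for f in ranked[:n])


# Hypothetical fragments for one peptide; "y6" is intense but low-confidence.
fragments = [
    {"ion": "y4", "intensity": 1000.0, "score": 0.95},
    {"ion": "y5", "intensity": 610.0, "score": 0.90},
    {"ion": "y6", "intensity": 1200.0, "score": 0.20},
    {"ion": "b3", "intensity": 195.0, "score": 0.80},
    {"ion": "b4", "intensity": 50.0, "score": 0.60},
]

sum_all = sum(f["intensity"] for f in fragments)
top_n = quantify_top_n(fragments)
print(sum_all, top_n)
```

Summing all fragments includes the low-confidence `y6` signal, whereas the ranked top-3 sum uses only `y4`, `y5`, and `b3`. Summation is just one possible way to combine intensities; a mean or median of the selected fragments would fit the same structure.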

The results show the feasibility and effectiveness of using machine learning and algorithms to optimize fragment ion ranking, and the idea can be applied to the output of any search engine. However, more research is needed, and the method requires further testing. In the future, the features used to train the model can be further optimized, and other algorithms can be used to optimize the training dataset to further improve the existing machine learning model.

BINP51, Bioinformatics: Master's Degree Project, 45 credits
Department of Biology, Lund University

Advisor: Lars Malmström
Co-advisor: Aaron Scott
Department of Clinical Sciences, Lund University, Klinikgatan, BMC D13, SE-22184, Lund, Sweden

Please use this url to cite or link to this publication: http://lup.lub.lu.se/student-papers/record/9102925

author

Lu, Lina

supervisor

Lars Malmström (LU)

organization

Degree Projects in Bioinformatics

course

BINP51 20212

year

2022

type

H2 - Master's Degree (Two Years)

subject

Biology and Life Sciences

language

English

id

9102925

date added to LUP

2022-11-07 16:20:13

date last changed

2022-11-07 16:20:13

@misc{9102925,
  abstract     = {{In a standard mass spectrometry workflow, acquired mass spectra are searched against a library of peptides to extract peptide-spectrum matches (PSMs). Peptides are normally quantified by aggregating the intensities of fragment ions extracted from MS/MS spectra. However, quantifying peptides by summing the intensities of all fragment ions in PSMs can give inaccurate results, because multiple fragment ions can interfere with each other in complex samples. This project aims to use machine learning to enhance fragment ion ranking for every precursor to ensure optimal peptide quantification. Here, we describe a workflow that leverages machine learning to pick only the most confident fragments extracted for each potential peptide for quantification. We demonstrate the usability of the workflow on yeast standard benchmark data, showing that the average accuracy of quantification and differential expression across all optimized methods is 22.46% higher than that of the standard workflow. In addition, we investigate the performance of our workflow on existing complex and low-fold-change proteomic data containing four species and demonstrate the generalizability of our models on unrelated, diverse data sets.}},
  author       = {{Lu, Lina}},
  language     = {{eng}},
  note         = {{Student Paper}},
  title        = {{Optimal peptide quantification via machine learning enhanced fragment ion ranking in DIA-MS proteomics}},
  year         = {{2022}},
}

LUP Student Papers

LUND UNIVERSITY LIBRARIES