GhostMS: An error-controlled machine learning approach to efficient alignment and quantification of multi-sample experiments in Mass Spectrometry-based Proteomics

Scott, Aaron (2019) BINP52 20181
Degree Projects in Bioinformatics
Abstract
In recent years, the field of mass spectrometry (MS) has grown significantly, allowing shotgun proteomic experiments to be used increasingly in biomarker discovery. However, with standard MS methods, not all peptides present in a sample are necessarily identified. To combat this, a procedure termed alignment, or match-between-runs, is used to propagate identifications from one run to another to increase the depth of quantification. Some methods have made significant progress in this area, but a common reliance on matching single pairs of runs (pair-wise matching) and the lack of control for False Discovery Rate (FDR) render these methods less effective on experiments with a large number of samples. In an attempt to build on these solutions, here we present an optimized approach to alignment, which avoids pair-wise matching and leverages machine learning methods to control for FDR on a feature-by-feature basis. Using an iterative approach, we align runs to consensus identification clusters, while using shared peptide groups between runs to automate parameter optimization and correct for retention time (RT) and mass-to-charge (m/z) deviations. Machine learning methods are employed to rank peptide feature matches for quantification and to control for FDR during the alignment procedure. This solution has been shown to scale efficiently for large sample sizes without compromising accuracy, providing around a 5% increase in precision (88.7%) and identifying 91% of known differentially expressed peptides from evaluation spike-in data sets. Termed GhostMS, it is freely available (with requested access) at https://github.com/arnscott/ghost as a command line Python (Version 3.6) application.
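
As a rough illustration of how peptides shared between a run and the consensus can drive a retention time correction, the sketch below estimates a single median RT offset and subtracts it from every feature in the run. The abstract does not state how GhostMS computes its correction, so the constant-shift model, the function names and the example values are all assumptions made for illustration only.

```python
from statistics import median


def estimate_rt_shift(run_rt, consensus_rt):
    """Median RT offset (run minus consensus) over peptides found in both.

    Assumption: a single constant shift per run; this is a simplification,
    not the correction model used by GhostMS.
    """
    shared = run_rt.keys() & consensus_rt.keys()
    if not shared:
        raise ValueError("no shared peptides to estimate a shift from")
    return median(run_rt[pep] - consensus_rt[pep] for pep in shared)


def correct_run(run_rt, shift):
    """Subtract the estimated shift from every feature in the run."""
    return {pep: rt - shift for pep, rt in run_rt.items()}


# Hypothetical retention times in minutes, keyed by peptide sequence.
consensus = {"PEPTIDEA": 10.0, "PEPTIDEB": 20.0, "PEPTIDEC": 30.0}
run = {"PEPTIDEA": 10.5, "PEPTIDEB": 20.5, "PEPTIDEC": 30.4, "PEPTIDED": 41.0}

shift = estimate_rt_shift(run, consensus)
corrected = correct_run(run, shift)
```

Using the median rather than the mean keeps a few mismatched shared peptides from dominating the estimate. The same pattern could in principle be applied to m/z deviations.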
Popular Abstract
Machine learning methods for accurate quantification and alignment in computational proteomics.

When presented with a biological question, for example, why certain crops grow better in a certain climate, or what physical properties a cancer cell has compared to a healthy cell, it is important to investigate which properties of the samples are responsible for the difference in observed biology. With the advent of next-generation technologies, methods to determine the genetic blueprint of cells have dramatically increased the effectiveness of these types of experiments by providing a look into the molecules that code for the machinery of the cell (i.e., DNA and RNA). However, because these genetic codes are just the blueprint, this sort of investigation only provides insight into what could potentially be present, or built, inside the cell. To investigate what is physically present, it is possible to measure the biological machinery, or proteins, that result from these genetic blueprints.

One method for investigating the protein content of a sample that has been particularly successful in recent years is mass spectrometry. Using the physical properties of a protein, a mass spectrometer can determine the mass, the electrical charge, and the quantity of the protein present in the sample. To gain information about the possible sequence of molecules (amino acids) that make up the protein, an additional step is needed. During this step, the spectrometer selects the proteins that are the most abundant, breaks them up, and scans the resulting fragments to get a fingerprint of the amino acids that make up the protein. However, due to technological limits inherent to this process, proteins that are present in a single sample may not be selected to be fragmented and measured to determine their sequence for identification. To help combat this, a computational procedure called alignment, or match-between-runs, can be employed to merge the identifications from multiple samples and obtain a much higher overall number of proteins.

Although significant progress in processing the data has been achieved, many current methods in this discipline suffer from long processing times, a non-trivial number of user parameters to be set, and a lack of error control, making it difficult to quantify the confidence in the aligned identifications. The focus of this thesis was to create an algorithmic method that significantly reduces these processing times while simultaneously controlling the error rate of matched identifications and automatically setting parameters for alignment. To this end, the method combines pre-alignment filtering with machine learning methods to achieve accurate quantification.

The filtering methods employed here weed out erroneous protein identification matches by eliminating outliers, creating a consensus group of proteins to which all of the samples can be aligned. This cut down significantly on run time, as it avoided pair-wise matching between the different samples while ensuring that each sample could be compared to all proteins that could potentially be present. To determine whether or not a potential protein could be matched, a machine learning method known as a Support Vector Machine (SVM) was trained to categorize all potential protein matches as "good" or "bad" and rank them based on a score calculated from the SVM. From this score we could control the error rate to a degree not offered by other available methods. On an artificial test data set, with a known set of proteins present in varying concentrations, this method was able to detect 91% of these proteins with 5% fewer false positive discoveries than existing methods.
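
To make the idea of controlling the error rate from a ranking score more concrete, the sketch below keeps the largest set of top-ranked matches whose estimated false discovery rate stays at or below a chosen limit. It is a generic score-threshold procedure, not the GhostMS implementation: the scores stand in for an SVM output, the flag marking a match as known to be wrong is an assumed input (for example from a reference set), and all names and values are hypothetical.

```python
def accept_at_fdr(matches, max_fdr=0.05):
    """Return the top-ranked matches whose estimated FDR is within max_fdr.

    matches: list of (score, is_known_bad) tuples, where a higher score
    means a more confident match. The is_known_bad flag is an assumed
    input used here only to estimate the error rate.
    """
    ranked = sorted(matches, key=lambda m: m[0], reverse=True)
    bad_seen = 0
    cutoff = 0  # number of top-ranked matches to keep
    for rank, (_, is_bad) in enumerate(ranked, start=1):
        bad_seen += int(is_bad)
        # Estimated FDR among the top `rank` matches.
        if bad_seen / rank <= max_fdr:
            cutoff = rank
    return ranked[:cutoff]


# Hypothetical classifier scores for seven candidate matches.
matches = [
    (2.1, False), (1.8, False), (1.5, False), (1.2, True),
    (0.9, False), (0.4, True), (-0.3, True),
]
accepted = accept_at_fdr(matches, max_fdr=0.25)
```

With these example values, the five highest-scoring matches are kept: extending the list to a sixth match would push the estimated FDR to 2/6, above the 0.25 limit.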

Although the results here are promising, further studies are needed to compare this method with other existing quantification procedures. If successful, the implications for protein biomarker discovery could be quite significant. Notably, the possibility of high accuracy and confidence in the reported results would be particularly useful in a clinical setting, where there is less room for error in investigating potentially significant proteins.

Master's Degree Project in Bioinformatics 60 credits 2019
Department of Biology, Lund University

Advisors: Fredrik Levander, Jakob Willforss
Department of Immunotechnology, LTH
author
Scott, Aaron
supervisor
organization
course
BINP52 20181
year
type
H2 - Master's Degree (Two Years)
subject
language
English
id
8982010
date added to LUP
2019-06-12 10:32:41
date last changed
2019-06-12 10:32:41
@misc{8982010,
  abstract     = {In recent years, the field of mass spectrometry (MS) has grown significantly, allowing shotgun proteomic experiments to be used increasingly in biomarker discovery. However, with standard MS methods, not all peptides present in a sample are necessarily identified. To combat this, a procedure termed alignment, or match-between-runs, is used to propagate identifications from one run to another to increase the depth of quantification. Some methods have made significant progress in this area, but a common reliance on matching single pairs of runs (pair-wise matching) and the lack of control for False Discovery Rate (FDR) render these methods less effective on experiments with a large number of samples. In an attempt to build on these solutions, here we present an optimized approach to alignment, which avoids pair-wise matching and leverages machine learning methods to control for FDR on a feature-by-feature basis. Using an iterative approach, we align runs to consensus identification clusters, while using shared peptide groups between runs to automate parameter optimization and correct for retention time (RT) and mass-to-charge (m/z) deviations. Machine learning methods are employed to rank peptide feature matches for quantification and to control for FDR during the alignment procedure. This solution has been shown to scale efficiently for large sample sizes without compromising accuracy, providing around a 5% increase in precision (88.7%) and identifying 91% of known differentially expressed peptides from evaluation spike-in data sets. Termed GhostMS, it is freely available (with requested access) at https://github.com/arnscott/ghost as a command line Python (Version 3.6) application.},
  author       = {Scott, Aaron},
  language     = {eng},
  note         = {Student Paper},
  title        = {GhostMS: An error-controlled machine learning approach to efficient alignment and quantification of multi-sample experiments in Mass Spectrometry-based Proteomics},
  year         = {2019},
}