A novel pipeline for protein level quantification using peptide mass spectrometry data

Møller, Line K. (2017) BINP30 20171
Degree Projects in Bioinformatics
Abstract
Accurate quantitative protein analysis is crucial for identifying proteins that are differentially abundant between samples, which can provide clues about disease mechanisms or serve as biomarkers. However, in most workflows, proteins are digested to peptides, and the resulting peptides are quantified as proxies for proteins. No consensus method exists for going from peptide-level to protein-level quantification, and existing methods are limited in their workflow compatibility and user interfaces. To address this, I have developed QPLine, a pipeline for protein quantification and differential abundance analysis. It offers three algorithms: the existing workflows Diffacto and Inferno have been integrated, and a novel workflow has been developed. This 'Mixed method' combines protein grouping and factor analysis from Diffacto with peptide summarization and differential abundance analysis from Inferno. QPLine is easily included in a quantitative mass spectrometry workflow, as it accepts input data directly from the normalization tool Normalyzer. This solution provides a modularized workflow with flexibility regarding normalization and peptide summarization method, in a user-friendly command-line interface that runs on any OS.
The results show that each method has strengths and weaknesses and that no method is optimal for all types of datasets. Diffacto performed well at reporting true positives but lacked efficient control of false positives. Inferno was better at controlling false positives but sacrificed true positives for higher specificity. This situation was reversed when another reference dataset was analyzed. The Mixed method showed promising results as a compromise between the two; however, in one study it identified the fewest true positives. Analysis of a biological dataset showed that protein-level data from all three workflows clustered paired samples, in contrast to peptide-level clustering. This suggests that the workflows perform well on real biological data.
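The trade-off between true positives and false positives can be made concrete with a small sketch. It assumes a reference dataset in which the truly changed proteins are known in advance; the protein names, the two sets of calls, and the function names are invented for illustration and are not results or code from the thesis.

```python
def confusion_counts(called, truly_changed, all_proteins):
    """Count TP, FP, FN, TN for a set of differential abundance calls."""
    called = set(called)
    truly_changed = set(truly_changed)
    all_proteins = set(all_proteins)
    tp = len(called & truly_changed)                 # changed and reported
    fp = len(called - truly_changed)                 # reported but unchanged
    fn = len(truly_changed - called)                 # changed but missed
    tn = len(all_proteins - called - truly_changed)  # unchanged and not reported
    return tp, fp, fn, tn


def sensitivity_specificity(tp, fp, fn, tn):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical benchmark: 10 proteins, 4 of which truly differ between groups.
all_proteins = [f"P{i}" for i in range(1, 11)]
truly_changed = ["P1", "P2", "P3", "P4"]

# A liberal method finds every true change but adds false positives;
# a conservative method reports no false positives but misses true changes.
liberal_calls = ["P1", "P2", "P3", "P4", "P5", "P6"]
conservative_calls = ["P1", "P2"]

for name, calls in [("liberal", liberal_calls), ("conservative", conservative_calls)]:
    counts = confusion_counts(calls, truly_changed, all_proteins)
    sens, spec = sensitivity_specificity(*counts)
    print(f"{name}: TP/FP/FN/TN={counts} sensitivity={sens:.2f} specificity={spec:.2f}")
```

In this toy example the liberal method reaches full sensitivity at reduced specificity, while the conservative method does the opposite, which mirrors the kind of behaviour described for the two existing workflows on the first reference dataset.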
As no single method provides consistently good results, QPLine is especially useful for comparing protein quantity and differential abundance using methods that differ in shared peptide handling, peptide summarization, and statistical methods.
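To illustrate why shared peptide handling and peptide summarization affect protein-level values, the following is a minimal sketch of one simple strategy: a per-sample median of log2 peptide intensities, with an option to discard peptides that map to more than one protein. This is not the algorithm used by Diffacto, Inferno, or the Mixed method; the function name `summarize_proteins`, the input layout, and all values are assumptions made for the example.

```python
from collections import defaultdict
from statistics import median


def summarize_proteins(peptides, drop_shared=True):
    """Summarize peptide intensities to protein level by per-sample median.

    peptides: list of (sequence, protein_ids, log2_intensities) tuples, where
    log2_intensities holds one value per sample (None for a missing value).
    drop_shared: if True, ignore peptides that map to more than one protein.
    """
    rows_by_protein = defaultdict(list)
    for _sequence, protein_ids, intensities in peptides:
        if drop_shared and len(protein_ids) > 1:
            continue  # shared peptide: cannot be assigned unambiguously
        for protein_id in protein_ids:
            rows_by_protein[protein_id].append(intensities)

    summary = {}
    for protein_id, rows in rows_by_protein.items():
        n_samples = len(rows[0])
        values = []
        for j in range(n_samples):
            observed = [row[j] for row in rows if row[j] is not None]
            values.append(median(observed) if observed else None)
        summary[protein_id] = values
    return summary


# Hypothetical input: four peptides measured in two samples.
peptides = [
    ("AAK", ["ProtA"], [10.0, 11.0]),
    ("CCR", ["ProtA"], [12.0, None]),           # missing in sample 2
    ("DDK", ["ProtA", "ProtB"], [20.0, 20.0]),  # shared between two proteins
    ("EEK", ["ProtB"], [8.0, 9.0]),
]

print(summarize_proteins(peptides))                     # shared peptide excluded
print(summarize_proteins(peptides, drop_shared=False))  # shared peptide counted for both
```

Running both variants on the same input gives different protein-level values, which is the kind of method-dependent difference that a side-by-side comparison of workflows is meant to expose.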
Popular Abstract
Improving the method for measuring the levels of proteins

Measuring the amount of proteins in the body can help us understand why diseases occur and how to treat them. This is not as straightforward as it sounds, so scientists are working on improving the way it is done. In this project we have studied how protein data is analyzed and have created a computer program that can perform this analysis in different ways.
Proteins are a very important part of the cells in the body. They perform all the necessary functions of the cells and provide stability and structure. In human disease, proteins often play a role in the malfunction of the body. In cancer, for example, some of the proteins related to cell growth may not be removed as they should be to keep growth under control. The cells will then have too much of these proteins, which can cause the cells to grow and multiply uncontrollably.

Most proteins are too large to measure using the currently available techniques. For this reason the proteins are cut into smaller pieces before they are analyzed. This results in a large puzzle where the information from each little piece has to be put together again to understand how much of each protein was present in the sample. At present there is no 'right' way of putting these pieces together, so we have been working on a computer program that can do this in several different ways and make it easy to compare the results from each of them.

Measuring proteins is difficult for several reasons. Some proteins have identical parts. This is a problem when the small piece that was analyzed comes from such an identical part, because it is then impossible to know which of the proteins we have actually analyzed. Other proteins are difficult to analyze and will only be measured as a single piece. This provides very little information about the protein, so it is difficult to know whether the measured amount of protein is accurate.

These problems make it extremely difficult to find the best way of solving the puzzle. A good method should provide accurate and useful results and also be fast and easy to use. The steps should run automatically, without the need for people to interact with the data, because manual work is time-consuming and repetitive and it is easy to make mistakes. Our program solves the puzzle automatically in three different ways, with different approaches to handling the protein-related problems. The program is simple to use and makes it easy to compare the results from the three methods to get an overview of the proteins and a better chance of finding the important proteins that are making people sick.

Master’s Degree Project in Bioinformatics 2017 (30 credits)
Department of Biology, Lund University.
Advisors: Fredrik Levander, Jakob Willforss.
Computational Proteomics, Department of Immunotechnology, Lund University.
author: Møller, Line K.
course: BINP30 20171
year: 2017
type: H2 - Master's Degree (Two Years)
language: English
id: 8929504
date added to LUP: 2017-12-18 14:57:27
date last changed: 2017-12-18 14:57:27
@misc{8929504,
  abstract     = {Accurate quantitative protein analysis is crucial for identifying proteins that are differentially abundant between samples, which can provide clues about disease mechanisms or serve as biomarkers. However, in most workflows, proteins are digested to peptides, and the resulting peptides are quantified as proxies for proteins. No consensus method exists for going from peptide-level to protein-level quantification, and existing methods are limited in their workflow compatibility and user interfaces. To address this, I have developed QPLine, a pipeline for protein quantification and differential abundance analysis. It offers three algorithms: the existing workflows Diffacto and Inferno have been integrated, and a novel workflow has been developed. This 'Mixed method' combines protein grouping and factor analysis from Diffacto with peptide summarization and differential abundance analysis from Inferno. QPLine is easily included in a quantitative mass spectrometry workflow, as it accepts input data directly from the normalization tool Normalyzer. This solution provides a modularized workflow with flexibility regarding normalization and peptide summarization method, in a user-friendly command-line interface that runs on any OS.
The results show that each method has strengths and weaknesses and that no method is optimal for all types of datasets. Diffacto performed well at reporting true positives but lacked efficient control of false positives. Inferno was better at controlling false positives but sacrificed true positives for higher specificity. This situation was reversed when another reference dataset was analyzed. The Mixed method showed promising results as a compromise between the two; however, in one study it identified the fewest true positives. Analysis of a biological dataset showed that protein-level data from all three workflows clustered paired samples, in contrast to peptide-level clustering. This suggests that the workflows perform well on real biological data.
As no single method provides consistently good results, QPLine is especially useful for comparing protein quantity and differential abundance using methods that differ in shared peptide handling, peptide summarization and statistical methods.},
  author       = {Møller, Line K.},
  language     = {eng},
  note         = {Student Paper},
  title        = {A novel pipeline for protein level quantification using peptide mass spectrometry data},
  year         = {2017},
}