Numerical compression schemes for proteomics mass spectrometry data.

Teleman, Johan; Dowsey, Andrew W; Gonzalez-Galarza, Faviel F; Perkins, Simon; Pratt, Brian; Rost, Hannes; Malmstrom, Lars; Malmström, Johan; Jones, Andrew R; Deutsch, Eric W; Levander, Fredrik

(2014) In Molecular & Cellular Proteomics 13(6). p.1537-1542

Abstract: The open XML format mzML, used for representation of mass spectrometry (MS) data, is pivotal for the development of platform-independent MS analysis software. Although conversion from vendor formats to mzML must take place on a platform on which the vendor libraries are available (i.e. Windows), once mzML files have been generated, they can be used on any platform. However, the mzML format has turned out to be less efficient than vendor formats. In many cases, the naive mzML representation is 4-fold or even up to 18-fold larger compared to the original vendor file. In disk I/O limited setups, a larger data file also leads to longer processing times, which is a problem given the data production rates of modern mass spectrometers. In an attempt to reduce this problem, we here present a family of numerical compression algorithms called MS-Numpress, intended for efficient compression of MS data. To facilitate ease of adoption, the algorithms target the binary data in the mzML standard, and support in main proteomics tools is already available. Using a test set of 10 representative MS data files we demonstrate typical file size decreases of 90% when combined with traditional compression, as well as read time decreases of up to 50%. It is envisaged that these improvements will be beneficial for data handling within the MS community.
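The abstract's point about combining numerical compression with traditional compression can be illustrated with a minimal sketch. This is not the MS-Numpress implementation: it uses a generic fixed-point quantization followed by delta encoding, and the function names, the scale factor of 10000, and the synthetic m/z values are all assumptions made for illustration. It only shows why a numerical transform applied before a general-purpose compressor such as zlib can shrink an array of floating-point values, at the cost of a bounded rounding error.

```python
import struct
import zlib


def encode_fixed_point_delta(values, scale=10000):
    """Quantize floats to integers, then store differences between neighbours.

    Hypothetical helper for illustration; not part of MS-Numpress.
    """
    ints = [round(v * scale) for v in values]
    if not ints:
        return b""
    deltas = [ints[0]] + [b - a for a, b in zip(ints, ints[1:])]
    # 4-byte signed little-endian integers; a real codec would pack more tightly.
    return struct.pack(f"<{len(deltas)}i", *deltas)


def decode_fixed_point_delta(payload, scale=10000):
    """Invert encode_fixed_point_delta, up to the quantization error."""
    deltas = struct.unpack(f"<{len(payload) // 4}i", payload)
    ints, total = [], 0
    for d in deltas:
        total += d
        ints.append(total)
    return [i / scale for i in ints]


# Synthetic, evenly spaced m/z values (an assumption; real spectra are less regular).
mz = [400.0 + 0.0123 * i for i in range(5000)]

raw = struct.pack(f"<{len(mz)}d", *mz)   # naive 64-bit float array
packed = encode_fixed_point_delta(mz)    # numerically compressed array

raw_zlib = zlib.compress(raw)            # traditional compression only
packed_zlib = zlib.compress(packed)      # numerical + traditional compression

restored = decode_fixed_point_delta(packed)
max_error = max(abs(a - b) for a, b in zip(mz, restored))

print(len(raw), len(raw_zlib), len(packed), len(packed_zlib), max_error)
```

The scale factor sets the trade-off in this sketch: a larger value keeps more decimal places but produces larger integers, so it would have to be chosen against the precision the downstream analysis actually needs.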

Please use this url to cite or link to this publication: https://lup.lub.lu.se/record/4379548

author

Teleman, Johan ; Dowsey, Andrew W ; Gonzalez-Galarza, Faviel F ; Perkins, Simon ; Pratt, Brian ; Rost, Hannes ; Malmstrom, Lars ; Malmström, Johan ; Jones, Andrew R ; Deutsch, Eric W and Levander, Fredrik

publishing date

2014

type

Contribution to journal

publication status

published

subject

Infectious Medicine

in

Molecular & Cellular Proteomics

volume

13

issue

6

pages

1537 - 1542

publisher

American Society for Biochemistry and Molecular Biology

external identifiers

pmid:24677029
wos:000337239500011
scopus:84901931503

ISSN

1535-9484

DOI

10.1074/mcp.O114.037879

language

English

LU publication?

yes

id

ec893db5-4f8a-4acd-8d29-d234fd9c5efe (old id 4379548)

alternative location

http://www.ncbi.nlm.nih.gov/pubmed/24677029?dopt=Abstract

date added to LUP

2016-04-01 10:27:10

date last changed

2025-10-14 09:03:23

@article{ec893db5-4f8a-4acd-8d29-d234fd9c5efe,
  abstract     = {{The open XML format mzML, used for representation of mass spectrometry (MS) data, is pivotal for the development of platform-independent MS analysis software. Although conversion from vendor formats to mzML must take place on a platform on which the vendor libraries are available (i.e. Windows), once mzML files have been generated, they can be used on any platform. However, the mzML format has turned out to be less efficient than vendor formats. In many cases, the naive mzML representation is 4-fold or even up to 18-fold larger compared to the original vendor file. In disk I/O limited setups, a larger data file also leads to longer processing times, which is a problem given the data production rates of modern mass spectrometers. In an attempt to reduce this problem, we here present a family of numerical compression algorithms called MS-Numpress, intended for efficient compression of MS data. To facilitate ease of adoption, the algorithms target the binary data in the mzML standard, and support in main proteomics tools is already available. Using a test set of 10 representative MS data files we demonstrate typical file size decreases of 90% when combined with traditional compression, as well as read time decreases of up to 50%. It is envisaged that these improvements will be beneficial for data handling within the MS community.}},
  author       = {{Teleman, Johan and Dowsey, Andrew W and Gonzalez-Galarza, Faviel F and Perkins, Simon and Pratt, Brian and Rost, Hannes and Malmstrom, Lars and Malmström, Johan and Jones, Andrew R and Deutsch, Eric W and Levander, Fredrik}},
  issn         = {{1535-9484}},
  language     = {{eng}},
  number       = {{6}},
  pages        = {{1537--1542}},
  publisher    = {{American Society for Biochemistry and Molecular Biology}},
  journal      = {{Molecular \& Cellular Proteomics}},
  title        = {{Numerical compression schemes for proteomics mass spectrometry data.}},
  url          = {{https://lup.lub.lu.se/search/files/1857783/4645537.pdf}},
  doi          = {{10.1074/mcp.O114.037879}},
  volume       = {{13}},
  year         = {{2014}},
}
