Advanced

A Universal Source Coding Perspective on PPM

Åberg, Jan LU (1999)
Abstract
The PPM (Prediction by Partial Matching) family of text compression algorithms has several members that have shown to be very efficient in practice. This thesis treats PPM algorithms from an information-theoretical point of view, based on results and methods from universal source coding theory.



A number of source classes for modeling text-like data are developed, that are distinguished both by structural properties and by restrictions imposed on the possible combinations of parameter values.



Two subproblems of PPM are studied in detail; the problem of sequential multi-alphabet coding, i.e., coding for memoryless sources with unknown alphabet, and the problem of coding for memoryless sources with side... (More)
The PPM (Prediction by Partial Matching) family of text compression algorithms has several members that have shown to be very efficient in practice. This thesis treats PPM algorithms from an information-theoretical point of view, based on results and methods from universal source coding theory.



A number of source classes for modeling text-like data are developed, that are distinguished both by structural properties and by restrictions imposed on the possible combinations of parameter values.



Two subproblems of PPM are studied in detail; the problem of sequential multi-alphabet coding, i.e., coding for memoryless sources with unknown alphabet, and the problem of coding for memoryless sources with side information in the form of observations from a source with similar parameters.



For the multi-alphabet coding problem, several codes are presented, together with a sufficient condition for asymptotic optimality, which is satified by some of the introduced codes. Furthermore, the natural law of succession, a known solution of the underlying estimation problem, is analysed from the aspect of coding, and a method is given for calculation of the coding probabilities for a known multi-alphabet code.



A code using side information from a similar source is presented and analysed, and is proved to asymptotically outperform codes not using side information if the divergence between the two sources is small.



Based on the above results, modified PPM algorithms are proposed, and methods are developed for estimation of code parameters during encoding. It is experimentally verified that the introduced modifications improve the compression rate of PPM on two standard sets of test data. Several known PPM versions are shown to be characterizable in terms of the introduced source classes.



The combination of PPM and MDL (minimum description length) model class estimation is investigated. Codes using this combination of estimation techniques are given for the different source structures introduced for text-like data, and the modified PPM is shown experimentally to yield an improvement when combined with MDL estimation as well. (Less)
Please use this url to cite or link to this publication:
author
opponent
  • Professor Feder, Meir, Dept. of Electrical Engineering-Systems, Tel Aviv University, Tel Aviv, ISRAEL
organization
publishing date
type
Thesis
publication status
published
subject
keywords
coding with side information, multinomial estimation, maximum probability codes, MDL estimation, PPM, hierarchical source coding, universal source coding, text compression, lossless data compression, Automation, robotics, control engineering, Automatiska system, robotteknik, reglerteknik
pages
172 pages
publisher
Jan Åberg, Dept. of Information Technology, Lund University, P.O. Box 118, SE-221 00 Lund, SWEDEN,
defense location
E:1406, LTH
defense date
1998-10-15 10:00
external identifiers
  • Other:ISRN: LUTEDX/TEIT-99/1014-SE
ISBN
91-7167-017-3
language
English
LU publication?
yes
id
772b415e-5d11-4406-bdec-fa98258b9442 (old id 19019)
date added to LUP
2007-05-24 13:46:28
date last changed
2016-09-19 08:45:13
@misc{772b415e-5d11-4406-bdec-fa98258b9442,
  abstract     = {The PPM (Prediction by Partial Matching) family of text compression algorithms has several members that have shown to be very efficient in practice. This thesis treats PPM algorithms from an information-theoretical point of view, based on results and methods from universal source coding theory.<br/><br>
<br/><br>
A number of source classes for modeling text-like data are developed, that are distinguished both by structural properties and by restrictions imposed on the possible combinations of parameter values.<br/><br>
<br/><br>
Two subproblems of PPM are studied in detail; the problem of sequential multi-alphabet coding, i.e., coding for memoryless sources with unknown alphabet, and the problem of coding for memoryless sources with side information in the form of observations from a source with similar parameters.<br/><br>
<br/><br>
For the multi-alphabet coding problem, several codes are presented, together with a sufficient condition for asymptotic optimality, which is satified by some of the introduced codes. Furthermore, the natural law of succession, a known solution of the underlying estimation problem, is analysed from the aspect of coding, and a method is given for calculation of the coding probabilities for a known multi-alphabet code.<br/><br>
<br/><br>
A code using side information from a similar source is presented and analysed, and is proved to asymptotically outperform codes not using side information if the divergence between the two sources is small.<br/><br>
<br/><br>
Based on the above results, modified PPM algorithms are proposed, and methods are developed for estimation of code parameters during encoding. It is experimentally verified that the introduced modifications improve the compression rate of PPM on two standard sets of test data. Several known PPM versions are shown to be characterizable in terms of the introduced source classes.<br/><br>
<br/><br>
The combination of PPM and MDL (minimum description length) model class estimation is investigated. Codes using this combination of estimation techniques are given for the different source structures introduced for text-like data, and the modified PPM is shown experimentally to yield an improvement when combined with MDL estimation as well.},
  author       = {Åberg, Jan},
  isbn         = {91-7167-017-3},
  keyword      = {coding with side information,multinomial estimation,maximum probability codes,MDL estimation,PPM,hierarchical source coding,universal source coding,text compression,lossless data compression,Automation,robotics,control engineering,Automatiska system,robotteknik,reglerteknik},
  language     = {eng},
  pages        = {172},
  publisher    = {ARRAY(0xbd157d0)},
  title        = {A Universal Source Coding Perspective on PPM},
  year         = {1999},
}