A Universal Source Coding Perspective on PPM

Åberg, Jan

A Universal Source Coding Perspective on PPM

Mark

Åberg, Jan ^LU (1999)

Abstract: The PPM (Prediction by Partial Matching) family of text compression algorithms has several members that have shown to be very efficient in practice. This thesis treats PPM algorithms from an information-theoretical point of view, based on results and methods from universal source coding theory.

A number of source classes for modeling text-like data are developed, that are distinguished both by structural properties and by restrictions imposed on the possible combinations of parameter values.

Two subproblems of PPM are studied in detail; the problem of sequential multi-alphabet coding, i.e., coding for memoryless sources with unknown alphabet, and the problem of coding for memoryless sources with side... (More); The PPM (Prediction by Partial Matching) family of text compression algorithms has several members that have shown to be very efficient in practice. This thesis treats PPM algorithms from an information-theoretical point of view, based on results and methods from universal source coding theory.

A number of source classes for modeling text-like data are developed, that are distinguished both by structural properties and by restrictions imposed on the possible combinations of parameter values.

Two subproblems of PPM are studied in detail; the problem of sequential multi-alphabet coding, i.e., coding for memoryless sources with unknown alphabet, and the problem of coding for memoryless sources with side information in the form of observations from a source with similar parameters.

For the multi-alphabet coding problem, several codes are presented, together with a sufficient condition for asymptotic optimality, which is satified by some of the introduced codes. Furthermore, the natural law of succession, a known solution of the underlying estimation problem, is analysed from the aspect of coding, and a method is given for calculation of the coding probabilities for a known multi-alphabet code.

A code using side information from a similar source is presented and analysed, and is proved to asymptotically outperform codes not using side information if the divergence between the two sources is small.

Based on the above results, modified PPM algorithms are proposed, and methods are developed for estimation of code parameters during encoding. It is experimentally verified that the introduced modifications improve the compression rate of PPM on two standard sets of test data. Several known PPM versions are shown to be characterizable in terms of the introduced source classes.

The combination of PPM and MDL (minimum description length) model class estimation is investigated. Codes using this combination of estimation techniques are given for the different source structures introduced for text-like data, and the modified PPM is shown experimentally to yield an improvement when combined with MDL estimation as well. (Less)

Please use this url to cite or link to this publication: https://lup.lub.lu.se/record/19019

author

Åberg, Jan ^LU

supervisor

[unknown] [unknown]

opponent

Professor Feder, Meir, Dept. of Electrical Engineering-Systems, Tel Aviv University, Tel Aviv, ISRAEL

organization

Department of Electrical and Information Technology

publishing date

1999

type

Thesis

publication status

published

subject

Electrical Engineering, Electronic Engineering, Information Engineering

keywords

coding with side information, multinomial estimation, maximum probability codes, MDL estimation, PPM, hierarchical source coding, universal source coding, text compression, lossless data compression, Automation, robotics, control engineering, Automatiska system, robotteknik, reglerteknik

pages

172 pages

publisher

Jan Åberg, Dept. of Information Technology, Lund University, P.O. Box 118, SE-221 00 Lund, SWEDEN,

defense location

E:1406, LTH

defense date

1998-10-15 10:00:00

external identifiers

other:ISRN: LUTEDX/TEIT-99/1014-SE

ISBN

91-7167-017-3

language

English

LU publication?

yes

id

772b415e-5d11-4406-bdec-fa98258b9442 (old id 19019)

date added to LUP

2016-04-04 12:05:29

date last changed

2018-11-21 21:08:56

@phdthesis{772b415e-5d11-4406-bdec-fa98258b9442,
  abstract     = {{The PPM (Prediction by Partial Matching) family of text compression algorithms has several members that have shown to be very efficient in practice. This thesis treats PPM algorithms from an information-theoretical point of view, based on results and methods from universal source coding theory.<br/><br>
<br/><br>
A number of source classes for modeling text-like data are developed, that are distinguished both by structural properties and by restrictions imposed on the possible combinations of parameter values.<br/><br>
<br/><br>
Two subproblems of PPM are studied in detail; the problem of sequential multi-alphabet coding, i.e., coding for memoryless sources with unknown alphabet, and the problem of coding for memoryless sources with side information in the form of observations from a source with similar parameters.<br/><br>
<br/><br>
For the multi-alphabet coding problem, several codes are presented, together with a sufficient condition for asymptotic optimality, which is satified by some of the introduced codes. Furthermore, the natural law of succession, a known solution of the underlying estimation problem, is analysed from the aspect of coding, and a method is given for calculation of the coding probabilities for a known multi-alphabet code.<br/><br>
<br/><br>
A code using side information from a similar source is presented and analysed, and is proved to asymptotically outperform codes not using side information if the divergence between the two sources is small.<br/><br>
<br/><br>
Based on the above results, modified PPM algorithms are proposed, and methods are developed for estimation of code parameters during encoding. It is experimentally verified that the introduced modifications improve the compression rate of PPM on two standard sets of test data. Several known PPM versions are shown to be characterizable in terms of the introduced source classes.<br/><br>
<br/><br>
The combination of PPM and MDL (minimum description length) model class estimation is investigated. Codes using this combination of estimation techniques are given for the different source structures introduced for text-like data, and the modified PPM is shown experimentally to yield an improvement when combined with MDL estimation as well.}},
  author       = {{Åberg, Jan}},
  isbn         = {{91-7167-017-3}},
  keywords     = {{coding with side information; multinomial estimation; maximum probability codes; MDL estimation; PPM; hierarchical source coding; universal source coding; text compression; lossless data compression; Automation; robotics; control engineering; Automatiska system; robotteknik; reglerteknik}},
  language     = {{eng}},
  publisher    = {{Jan Åberg, Dept. of Information Technology, Lund University, P.O. Box 118, SE-221 00 Lund, SWEDEN,}},
  school       = {{Lund University}},
  title        = {{A Universal Source Coding Perspective on PPM}},
  year         = {{1999}},
}

Lund University Publications

LUND UNIVERSITY LIBRARIES

A Universal Source Coding Perspective on PPM