A Universal Source Coding Perspective on PPM
(1999) Abstract
 The PPM (Prediction by Partial Matching) family of text compression algorithms has several members that have shown to be very efficient in practice. This thesis treats PPM algorithms from an informationtheoretical point of view, based on results and methods from universal source coding theory.
A number of source classes for modeling textlike data are developed, that are distinguished both by structural properties and by restrictions imposed on the possible combinations of parameter values.
Two subproblems of PPM are studied in detail; the problem of sequential multialphabet coding, i.e., coding for memoryless sources with unknown alphabet, and the problem of coding for memoryless sources with side... (More)  The PPM (Prediction by Partial Matching) family of text compression algorithms has several members that have shown to be very efficient in practice. This thesis treats PPM algorithms from an informationtheoretical point of view, based on results and methods from universal source coding theory.
A number of source classes for modeling textlike data are developed, that are distinguished both by structural properties and by restrictions imposed on the possible combinations of parameter values.
Two subproblems of PPM are studied in detail; the problem of sequential multialphabet coding, i.e., coding for memoryless sources with unknown alphabet, and the problem of coding for memoryless sources with side information in the form of observations from a source with similar parameters.
For the multialphabet coding problem, several codes are presented, together with a sufficient condition for asymptotic optimality, which is satified by some of the introduced codes. Furthermore, the natural law of succession, a known solution of the underlying estimation problem, is analysed from the aspect of coding, and a method is given for calculation of the coding probabilities for a known multialphabet code.
A code using side information from a similar source is presented and analysed, and is proved to asymptotically outperform codes not using side information if the divergence between the two sources is small.
Based on the above results, modified PPM algorithms are proposed, and methods are developed for estimation of code parameters during encoding. It is experimentally verified that the introduced modifications improve the compression rate of PPM on two standard sets of test data. Several known PPM versions are shown to be characterizable in terms of the introduced source classes.
The combination of PPM and MDL (minimum description length) model class estimation is investigated. Codes using this combination of estimation techniques are given for the different source structures introduced for textlike data, and the modified PPM is shown experimentally to yield an improvement when combined with MDL estimation as well. (Less)
Please use this url to cite or link to this publication:
https://lup.lub.lu.se/record/19019
 author
 Åberg, Jan ^{LU}
 supervisor
 opponent

 Professor Feder, Meir, Dept. of Electrical EngineeringSystems, Tel Aviv University, Tel Aviv, ISRAEL
 organization
 publishing date
 1999
 type
 Thesis
 publication status
 published
 subject
 keywords
 coding with side information, multinomial estimation, maximum probability codes, MDL estimation, PPM, hierarchical source coding, universal source coding, text compression, lossless data compression, Automation, robotics, control engineering, Automatiska system, robotteknik, reglerteknik
 pages
 172 pages
 publisher
 Jan Åberg, Dept. of Information Technology, Lund University, P.O. Box 118, SE221 00 Lund, SWEDEN,
 defense location
 E:1406, LTH
 defense date
 19981015 10:00:00
 external identifiers

 other:ISRN: LUTEDX/TEIT99/1014SE
 ISBN
 9171670173
 language
 English
 LU publication?
 yes
 id
 772b415e5d114406bdecfa98258b9442 (old id 19019)
 date added to LUP
 20160404 12:05:29
 date last changed
 20181121 21:08:56
@phdthesis{772b415e5d114406bdecfa98258b9442, abstract = {{The PPM (Prediction by Partial Matching) family of text compression algorithms has several members that have shown to be very efficient in practice. This thesis treats PPM algorithms from an informationtheoretical point of view, based on results and methods from universal source coding theory.<br/><br> <br/><br> A number of source classes for modeling textlike data are developed, that are distinguished both by structural properties and by restrictions imposed on the possible combinations of parameter values.<br/><br> <br/><br> Two subproblems of PPM are studied in detail; the problem of sequential multialphabet coding, i.e., coding for memoryless sources with unknown alphabet, and the problem of coding for memoryless sources with side information in the form of observations from a source with similar parameters.<br/><br> <br/><br> For the multialphabet coding problem, several codes are presented, together with a sufficient condition for asymptotic optimality, which is satified by some of the introduced codes. Furthermore, the natural law of succession, a known solution of the underlying estimation problem, is analysed from the aspect of coding, and a method is given for calculation of the coding probabilities for a known multialphabet code.<br/><br> <br/><br> A code using side information from a similar source is presented and analysed, and is proved to asymptotically outperform codes not using side information if the divergence between the two sources is small.<br/><br> <br/><br> Based on the above results, modified PPM algorithms are proposed, and methods are developed for estimation of code parameters during encoding. It is experimentally verified that the introduced modifications improve the compression rate of PPM on two standard sets of test data. Several known PPM versions are shown to be characterizable in terms of the introduced source classes.<br/><br> <br/><br> The combination of PPM and MDL (minimum description length) model class estimation is investigated. Codes using this combination of estimation techniques are given for the different source structures introduced for textlike data, and the modified PPM is shown experimentally to yield an improvement when combined with MDL estimation as well.}}, author = {{Åberg, Jan}}, isbn = {{9171670173}}, keywords = {{coding with side information; multinomial estimation; maximum probability codes; MDL estimation; PPM; hierarchical source coding; universal source coding; text compression; lossless data compression; Automation; robotics; control engineering; Automatiska system; robotteknik; reglerteknik}}, language = {{eng}}, publisher = {{Jan Åberg, Dept. of Information Technology, Lund University, P.O. Box 118, SE221 00 Lund, SWEDEN,}}, school = {{Lund University}}, title = {{A Universal Source Coding Perspective on PPM}}, year = {{1999}}, }