Structures of String Matching and Data Compression

Larsson, N Jesper

Structures of String Matching and Data Compression

Mark

Larsson, N Jesper ^LU (1999)

Abstract: This doctoral dissertation presents a range of results concerning efficient algorithms and data structures for string processing, including several schemes contributing to sequential data compression. It comprises both theoretic results and practical implementations.

We study the suffix tree data structure, presenting an efficient representation and several generalizations. This includes augmenting the suffix tree to fully support sliding window indexing (including a practical implementation) in linear time. Furthermore, we consider a variant that indexes naturally word-partitioned data, and present a linear-time construction algorithm for a tree that represents only suffixes starting at word boundaries, requiring space... (More); This doctoral dissertation presents a range of results concerning efficient algorithms and data structures for string processing, including several schemes contributing to sequential data compression. It comprises both theoretic results and practical implementations.

We study the suffix tree data structure, presenting an efficient representation and several generalizations. This includes augmenting the suffix tree to fully support sliding window indexing (including a practical implementation) in linear time. Furthermore, we consider a variant that indexes naturally word-partitioned data, and present a linear-time construction algorithm for a tree that represents only suffixes starting at word boundaries, requiring space linear in the number of words.

By applying our sliding window indexing techniques, we achieve an efficient implementation for dictionary-based compression based on the LZ-77 algorithm. Furthermore, considering predictive source modelling, we show that a PPM* style model can be maintained in linear time using arbitrarily bounded storage space.

We also consider the related problem of suffix sorting , applicable to suffix array construction and block sorting compression . We present an algorithm that eliminates superfluous processing of previous solutions while maintaining robust worst-case behaviour. We experimentally show favourable performance for a wide range of natural and degenerate inputs, and present a complete implementation.

Block sorting compression using BWT , the Burrows-Wheeler transform , has implicit structure closely related to context trees used in predictive modelling. We show how an explicit BWT context tree can be efficiently generated as a subset of the corresponding suffix tree and explore the central problems in using this structure. We experimentally evaluate prediction capabilities of the tree and consider representing it explicitly as part of the compressed data, arguing that a conscious treatment of the context tree can combine the compression performance of predictive modelling with the computational efficiency of BWT.

Finally, we explore offline dictionary-based compression , and present a semi-static source modelling scheme that obtains excellent compression, yet is also capable of high decoding rates. The amount of memory used by the decoder is flexible, and the compressed data has the potential of supporting direct search operations. (Less)
Abstract (Swedish): Popular Abstract in Swedish

Presenterar resultat inom algoritmer och datastrukturer för att hantera sekventiella data (strängar), med användningar bland annat inom datakomprimering. Särskilt behandlas varianter av suffixträd och näraliggande ämnen, som sortering av suffix. Vad gäller komprimering behandlas modellering av källor med hjälp av både strängbaserade metoder (t.ex. Ziv-Lempel) och probabilistiska metoder (t.ex. PPM och BWT). Både praktiska och teoretiska resultat ges; innehåller implementerade algoritmer i programspråket C.

Please use this url to cite or link to this publication: https://lup.lub.lu.se/record/19255

author

Larsson, N Jesper ^LU

supervisor

[unknown] [unknown]

opponent

Prof. Storer, James A., Brandeis University, Massachusetts, USA

organization

publishing date

1999

type

Thesis

publication status

published

subject

Computer Sciences

keywords

Implementation, Burrows-Wheeler Transform, Sliding Window, Suffix Sorting, Text Compression, Algorithms, Suffix Tree, Systems engineering, computer technology, Data- och systemvetenskap

pages

130 pages

publisher

Department of Computer Science, Lund University

defense location

Room E:1406, Electrical Engineering Building, Ole Römers väg 3

defense date

1999-09-17 10:15:00

external identifiers

other:ISRN: LUNFD6/(NFCS-1015)/1-130/(1999)

ISBN

91-628-3685-4

language

English

LU publication?

yes

id

c6646122-e480-4d3f-98f0-8a1b6cefea84 (old id 19255)

date added to LUP

2016-04-04 11:00:25

date last changed

2025-04-04 14:40:54

@phdthesis{c6646122-e480-4d3f-98f0-8a1b6cefea84,
  abstract     = {{This doctoral dissertation presents a range of results concerning efficient algorithms and data structures for string processing, including several schemes contributing to sequential data compression. It comprises both theoretic results and practical implementations.<br/><br>
<br/><br>
We study the suffix tree data structure, presenting an efficient representation and several generalizations. This includes augmenting the suffix tree to fully support sliding window indexing (including a practical implementation) in linear time. Furthermore, we consider a variant that indexes naturally word-partitioned data, and present a linear-time construction algorithm for a tree that represents only suffixes starting at word boundaries, requiring space linear in the number of words.<br/><br>
<br/><br>
By applying our sliding window indexing techniques, we achieve an efficient implementation for dictionary-based compression based on the LZ-77 algorithm. Furthermore, considering predictive source modelling, we show that a PPM* style model can be maintained in linear time using arbitrarily bounded storage space.<br/><br>
<br/><br>
We also consider the related problem of suffix sorting , applicable to suffix array construction and block sorting compression . We present an algorithm that eliminates superfluous processing of previous solutions while maintaining robust worst-case behaviour. We experimentally show favourable performance for a wide range of natural and degenerate inputs, and present a complete implementation.<br/><br>
<br/><br>
Block sorting compression using BWT , the Burrows-Wheeler transform , has implicit structure closely related to context trees used in predictive modelling. We show how an explicit BWT context tree can be efficiently generated as a subset of the corresponding suffix tree and explore the central problems in using this structure. We experimentally evaluate prediction capabilities of the tree and consider representing it explicitly as part of the compressed data, arguing that a conscious treatment of the context tree can combine the compression performance of predictive modelling with the computational efficiency of BWT.<br/><br>
<br/><br>
Finally, we explore offline dictionary-based compression , and present a semi-static source modelling scheme that obtains excellent compression, yet is also capable of high decoding rates. The amount of memory used by the decoder is flexible, and the compressed data has the potential of supporting direct search operations.}},
  author       = {{Larsson, N Jesper}},
  isbn         = {{91-628-3685-4}},
  keywords     = {{Implementation; Burrows-Wheeler Transform; Sliding Window; Suffix Sorting; Text Compression; Algorithms; Suffix Tree; Systems engineering; computer technology; Data- och systemvetenskap}},
  language     = {{eng}},
  publisher    = {{Department of Computer Science, Lund University}},
  school       = {{Lund University}},
  title        = {{Structures of String Matching and Data Compression}},
  year         = {{1999}},
}

Lund University Publications

LUND UNIVERSITY LIBRARIES

Structures of String Matching and Data Compression