Advanced

Radix Sorting & Searching

Nilsson, Stefan (1996)
Abstract (Swedish)
Popular Abstract in Swedish

Effektiv sökning och sortering är byggstenar som används vid nästan all algoritmdesign. Sökrutiner och sorteringrutiner implementeras ofta slentrianmässigt med hjälp av algoritmer som bygger på parvisa jämförelser mellan element. I denna avhandling argumenterar vi för att algoritmer som också utnyttjar elementens interna struktur är överlägsna i många avseenden, både praktiskt och teoretiskt. De är oftast enkla och de kan tillämpas på de allra flesta sök- och sorteringsproblem.



Speciellt visar vi att strängsortering kan reduceras till heltalssortering på asymptotiskt optimal tid och vi visar att n heltal kan sorteras på tid proportionell mot n log log n i den allmänt... (More)
Popular Abstract in Swedish

Effektiv sökning och sortering är byggstenar som används vid nästan all algoritmdesign. Sökrutiner och sorteringrutiner implementeras ofta slentrianmässigt med hjälp av algoritmer som bygger på parvisa jämförelser mellan element. I denna avhandling argumenterar vi för att algoritmer som också utnyttjar elementens interna struktur är överlägsna i många avseenden, både praktiskt och teoretiskt. De är oftast enkla och de kan tillämpas på de allra flesta sök- och sorteringsproblem.



Speciellt visar vi att strängsortering kan reduceras till heltalssortering på asymptotiskt optimal tid och vi visar att n heltal kan sorteras på tid proportionell mot n log log n i den allmänt accepterade teoretiska modellen (RAM) av en sekventiell dator, en signifikant förbättring av den övre gränsen för hur snabbt det är möjligt att sortera. Algoritmen är enkel och förvånansvärt snabb också i praktiken.



Vi presenterar flera nya sorteringsalgoritmer. Bland annat introducerar vi en ny praktisk sorteringsalgoritm, Forward radixsort, som på ett enkelt sätt kombinerar fördelarna hos två välkända strängsorteringsalgorimer, MSD och LSD radixsort. Genom att lägga till ett förbehandlingssteg till Forward radixsort får vi en algoritm med intressanta teoretiska egenskaper. Vi implementerar några av de nya algoritmerna i programspråket C och visar att det är möjligt att sortera stora textdokument betydligt snabbare än med tidigare kända metoder.



Vi introducerar också en ny struktur, ett så kallat LC-trie, för textsökning. Strukturen har intressanta teoretiska egenskaper och den kan bland annat avändas för att bygga ett kompakt index som reducerar det totala antalet läsningar i externminne vid sökning i mycket stora textdatabaser. (Less)
Abstract
Efficient sorting and searching are corner-stones in algorithm design. In computer science it has become a deep-rooted habit to use comparison-based methods to solve these problems. In this thesis we argue that radix sorting and searching algorithms are superior in many respects, both practical and theoretical. They are often remarkably simple and many, if not most, sorting and searching problems may be cast in a form suitable for such algorithms.



In particular, we show that string sorting can be reduced to integer sorting in optimal asymptotic time and we show that a unit-cost RAM with a word length of w bits can sort n word-sized integers in O(n log log n) time for arbitrary w >= log n, a significantly improved... (More)
Efficient sorting and searching are corner-stones in algorithm design. In computer science it has become a deep-rooted habit to use comparison-based methods to solve these problems. In this thesis we argue that radix sorting and searching algorithms are superior in many respects, both practical and theoretical. They are often remarkably simple and many, if not most, sorting and searching problems may be cast in a form suitable for such algorithms.



In particular, we show that string sorting can be reduced to integer sorting in optimal asymptotic time and we show that a unit-cost RAM with a word length of w bits can sort n word-sized integers in O(n log log n) time for arbitrary w >= log n, a significantly improved upper bound for sorting. The algorithm is simple and runs surprisingly fast also in practice.



We introduce a new practical radix sorting algorithm, Forward radixsort, that combines the advantages of traditional LSD and MSD radixsort in a simple way. Adding a preprocessing step to Forward radixsort we obtain an algorithm with interesting theoretical properties. For example, if B / n = O(w), where B is the minimum number of bits that must be inspected to distinguish the strings, it is possible to sort n binary strings in O(n log(B / n log n)) time. The result can also be expressed in terms of the entropy H per bit of the input: n strings from a stationary ergodic process can be sorted in O(n log (1 / H)) time. We implement some of the new radix sorting algorithms in the C programming language and achieve sorting routines that run significantly faster than previously presented string sorting routines.



The other theme of the thesis is a new compression technique, level compression, for trie structures. A level-compressed trie, or LC-trie, is easy to implement and inherits the good properties of standard tries with respect to neighbor and range searches, while the average depth is significantly decreased. The expected average depth of an element in an LC-trie is proportional to the iterated logarithm function for uniformly distributed data and proportional to log log n for data from a Bernoulli-type distribution.



We also study the problem of string searching using suffix trees. Combining the methods of path compression, level compression, and data compression we achieve an efficient, compact, and fast implementation of a suffix tree. Based on extensive simulations, we argue that this new data structure is useful in many practical situations. For example, it can be used to reduce the number of accesses to secondary memory when searching in very large sets of data. (Less)
Please use this url to cite or link to this publication:
author
opponent
  • Prof. Gonnet, Gaston H., ETH Zürich
publishing date
type
Thesis
publication status
published
subject
keywords
Data- och systemvetenskap, Systems engineering, computer technology, LC-trie, Level Compression, Quadtree, Trie, Forward Radixsort, Adaptive Radixsort, String Sorting, Integer Sorting
pages
109 pages
publisher
Department of Computer Science, Lund University
defense location
Room E:1406, building E, Lund Institute of Technology.
defense date
1996-05-31 10:15
external identifiers
  • Other:ISRN: LUTEDX(TECS-1007)1-109/(1996)
ISBN
91-628-2089-3
language
English
LU publication?
no
id
b7683514-af28-4443-a07f-b08712bd85b7 (old id 17713)
date added to LUP
2007-05-24 09:13:02
date last changed
2016-09-19 08:45:03
@misc{b7683514-af28-4443-a07f-b08712bd85b7,
  abstract     = {Efficient sorting and searching are corner-stones in algorithm design. In computer science it has become a deep-rooted habit to use comparison-based methods to solve these problems. In this thesis we argue that radix sorting and searching algorithms are superior in many respects, both practical and theoretical. They are often remarkably simple and many, if not most, sorting and searching problems may be cast in a form suitable for such algorithms.<br/><br>
<br/><br>
In particular, we show that string sorting can be reduced to integer sorting in optimal asymptotic time and we show that a unit-cost RAM with a word length of w bits can sort n word-sized integers in O(n log log n) time for arbitrary w &gt;= log n, a significantly improved upper bound for sorting. The algorithm is simple and runs surprisingly fast also in practice.<br/><br>
<br/><br>
We introduce a new practical radix sorting algorithm, Forward radixsort, that combines the advantages of traditional LSD and MSD radixsort in a simple way. Adding a preprocessing step to Forward radixsort we obtain an algorithm with interesting theoretical properties. For example, if B / n = O(w), where B is the minimum number of bits that must be inspected to distinguish the strings, it is possible to sort n binary strings in O(n log(B / n log n)) time. The result can also be expressed in terms of the entropy H per bit of the input: n strings from a stationary ergodic process can be sorted in O(n log (1 / H)) time. We implement some of the new radix sorting algorithms in the C programming language and achieve sorting routines that run significantly faster than previously presented string sorting routines.<br/><br>
<br/><br>
The other theme of the thesis is a new compression technique, level compression, for trie structures. A level-compressed trie, or LC-trie, is easy to implement and inherits the good properties of standard tries with respect to neighbor and range searches, while the average depth is significantly decreased. The expected average depth of an element in an LC-trie is proportional to the iterated logarithm function for uniformly distributed data and proportional to log log n for data from a Bernoulli-type distribution.<br/><br>
<br/><br>
We also study the problem of string searching using suffix trees. Combining the methods of path compression, level compression, and data compression we achieve an efficient, compact, and fast implementation of a suffix tree. Based on extensive simulations, we argue that this new data structure is useful in many practical situations. For example, it can be used to reduce the number of accesses to secondary memory when searching in very large sets of data.},
  author       = {Nilsson, Stefan},
  isbn         = {91-628-2089-3},
  keyword      = {Data- och systemvetenskap,Systems engineering,computer technology,LC-trie,Level Compression,Quadtree,Trie,Forward Radixsort,Adaptive Radixsort,String Sorting,Integer Sorting},
  language     = {eng},
  pages        = {109},
  publisher    = {ARRAY(0x8bca640)},
  title        = {Radix Sorting & Searching},
  year         = {1996},
}