Pizza&Chili Corpus -- Compressed Indexes and their Testbeds

Alphabet-Friendly FM-index

Description

An Alphabet-Friendly FM-index combines an existing compression boosting technique with the wavelet tree data structure. This index scales well with the size of the input alphabet $\sigma$ . The size of the new index built on a string is bounded by $nH_k(T)+O(n \log\log n)/ \log_{\vert\Sigma\vert} n)$ bits, where is the -th order empirical entropy of .

The above bound holds simultaneously for all $k \leq \alpha \log_{\vert\Sigma\vert} n$ and $0 < \alpha < 1$ . Moreover, the index design does not depend on the parameter , which plays a role only in analysis of the space occupancy.

Using the AF-index, the counting of the occurrences of an arbitrary pattern as a substring of takes $O(p \log\vert\Sigma \vert)$ time. Locating each pattern occurrence takes $O(\log \vert\Sigma \vert (\log^2 n/ \log \log n))$ time. Reporting a text substring of length takes $O((l + \log^2 n/ \log \log n) \log \vert\Sigma \vert)$ time.

On the other hand, if we also have $\sigma=O($ polylog , the counting of the occurrences of an arbitrary pattern as a substring of takes time. Locating each pattern occurrence takes $O(\log^{1+\epsilon} n)$ time. Reporting a text substring of length takes $O(l + \log^{1+\epsilon} n)$ time.

Implementation

Rodrigo González

Build_options

free_text: The text will be freed immediately after using it.

samplerate=<number>: The index marks one text position every <number> entries. The default value is 64.

Papers

P. Ferragina, G. Manzini, V. Mäkinen and G. Navarro. An Alphabet-Friendly FM-index. In Proc. SPIRE'04, pages 150-160. LNCS 3246.

P. Ferragina, G. Manzini, V. Mäkinen and G. Navarro. Compressed Representation of Sequences and Full-Text Indexes. Technical Report 2004-05, Technische Fakultät, Universität Bielefeld, Germany, 2004.

P. Ferragina, G. Manzini, V. Mäkinen and G. Navarro. Compressed Representations of Sequences and Full-Text Indexes. To appear in ACM Transactions on Algorithms (TALG).

Downloads

ChangeLog

Pizza&Chili Corpus Compressed Indexes and their Testbeds

Alphabet-Friendly FM-index

Pizza&Chili Corpus
Compressed Indexes and their Testbeds