Pizza&Chili Corpus
Compressed Indexes and their Testbeds

The Italian mirror | The Chilean mirror

Alphabet-Friendly FM-index

Description

An Alphabet-Friendly FM-index combines an existing compression boosting technique with the wavelet tree data structure. This index scales well with the size of the input alphabet $ \sigma$ . The size of the new index built on a string $ T[1,n]$ is bounded by $ nH_k(T)+O(n \log\log n)/ \log_{\vert\Sigma\vert} n)$ bits, where $ H_k(T)$ is the $ k$ -th order empirical entropy of $ T$ .

The above bound holds simultaneously for all $ k \leq \alpha \log_{\vert\Sigma\vert} n$ and $ 0 < \alpha < 1$ . Moreover, the index design does not depend on the parameter $ k$ , which plays a role only in analysis of the space occupancy.

Using the AF-index, the counting of the occurrences of an arbitrary pattern $ P[1,p]$ as a substring of $ T$ takes $ O(p \log\vert\Sigma \vert)$ time. Locating each pattern occurrence takes $ O(\log \vert\Sigma \vert (\log^2 n/ \log \log n))$ time. Reporting a text substring of length $ l$ takes $ O((l + \log^2 n/ \log \log n) \log \vert\Sigma \vert)$ time.

On the other hand, if we also have $ \sigma=O($polylog$ (n))$ , the counting of the occurrences of an arbitrary pattern $ P[1,p]$ as a substring of $ T$ takes $ O(p)$ time. Locating each pattern occurrence takes $ O(\log^{1+\epsilon} n)$ time. Reporting a text substring of length $ l$ takes $ O(l + \log^{1+\epsilon} n)$ time.

Implementation

Rodrigo González

Build_options

free_text: The text will be freed immediately after using it.

samplerate=<number>: The index marks one text position every <number> entries. The default value is 64.

Papers

P. Ferragina, G. Manzini, V. Mäkinen and G. Navarro. An Alphabet-Friendly FM-index. In Proc. SPIRE'04, pages 150-160. LNCS 3246.

P. Ferragina, G. Manzini, V. Mäkinen and G. Navarro. Compressed Representation of Sequences and Full-Text Indexes. Technical Report 2004-05, Technische Fakultät, Universität Bielefeld, Germany, 2004.

P. Ferragina, G. Manzini, V. Mäkinen and G. Navarro. Compressed Representations of Sequences and Full-Text Indexes. To appear in ACM Transactions on Algorithms (TALG).

Downloads



Send Mail to Us | © P. Ferragina and G. Navarro, Last update: September, 2005.