Pizza&Chili Corpus
Compressed Indexes and their Testbeds

The Italian mirror | The Chilean mirror

Compressed Compact Suffix Array

Description

The Compressed Compact Suffix Array (CCSA) is a simple self-index using essentially the structures of the Compact Suffix Array (CSA). The CSA was aimed at explicitly exploiting the self-repetitions of the suffix array $ A$ in order to store it in a more compact form.

The size of the index built on a string $ T[1,n]$ is bounded by $ n(H_k(T)(1+\log n)+2+1/\epsilon)+O(n \log\log n)/ \log_{\vert\Sigma\vert} n)$ bits, where $ H_k(T)$ is the $ k$ -th order empirical entropy of $ T$ .

The above bound holds simultaneously for all $ k \leq \alpha \log_{\vert\Sigma\vert} n$ with $ 0 < \alpha < 1$ . And $ \epsilon>0$ is an arbitrary constant.

Using the CCSA, the counting of the occurrences of an arbitrary pattern $ P[1,p]$ as a substring of $ T$ takes $ O(p \log_{\vert\Sigma\vert} n)$ time. Locating each pattern occurrence takes $ O(\epsilon \log_{\vert\Sigma \vert} n)$ time. Reporting a text substring of length $ l$ takes $ O((l + \epsilon \log n)$ time.

Implementation

Veli Mäkinen, Rodrigo González

Build_options

free_text: The text will be freed immediately after using it.

samplerate=<number>: The index marks one text position every <number> entries. The default value is 64.

Paper

V. Mäkinen and G. Navarro. Compressed compact suffix arrays. In Proc. 15th Annual Symposium on Combinatorial Pattern Matching (CPM), LNCS v. 3109 (2004), pp. 420-433.

Downloads



Send Mail to Us | © P. Ferragina and G. Navarro, Last update: September, 2005.