Pizza&Chili Corpus -- Compressed Indexes and their Testbeds

Compressed Compact Suffix Array

Description

The Compressed Compact Suffix Array (CCSA) is a simple self-index using essentially the structures of the Compact Suffix Array (CSA). The CSA was aimed at explicitly exploiting the self-repetitions of the suffix array in order to store it in a more compact form.

The size of the index built on a string is bounded by $n(H_k(T)(1+\log n)+2+1/\epsilon)+O(n \log\log n)/ \log_{\vert\Sigma\vert} n)$ bits, where is the -th order empirical entropy of .

The above bound holds simultaneously for all $k \leq \alpha \log_{\vert\Sigma\vert} n$ with $0 < \alpha < 1$ . And $\epsilon>0$ is an arbitrary constant.

Using the CCSA, the counting of the occurrences of an arbitrary pattern as a substring of takes $O(p \log_{\vert\Sigma\vert} n)$ time. Locating each pattern occurrence takes $O(\epsilon \log_{\vert\Sigma \vert} n)$ time. Reporting a text substring of length takes $O((l + \epsilon \log n)$ time.

Implementation

Veli Mäkinen, Rodrigo González

Build_options

free_text: The text will be freed immediately after using it.

samplerate=<number>: The index marks one text position every <number> entries. The default value is 64.

Paper

V. Mäkinen and G. Navarro. Compressed compact suffix arrays. In Proc. 15th Annual Symposium on Combinatorial Pattern Matching (CPM), LNCS v. 3109 (2004), pp. 420-433.

Downloads

CCSA.tgz

Pizza&Chili Corpus Compressed Indexes and their Testbeds

Compressed Compact Suffix Array

Pizza&Chili Corpus
Compressed Indexes and their Testbeds