Pizza&Chili Corpus -- Compressed Indexes and their Testbeds

Compressed Suffix Array (CSA)

Description

The compressed suffix array was originally a data structure achieving $O(n\log\sigma)$ bits of space [1,2] and no self-indexing. Sadakane's version [3,4,5] is a self-index that achieves at least zero-order compression. We refer to the latter with the name CSA in this paper.

The CSA represents the suffix array by a sequence of numbers $\psi(i)$ , such that $A[\psi(i)]=A[i]+1$ . It is shown [3] that $\psi$ is piecewise monotone increasing. More precisely, $\psi$ increases in the areas of where the suffixes start with the same symbol . In addition, there are long runs where $\psi(i+1) = \psi(i)+1$ (those runs can be mapped one-to-one to the runs in $T^{bwt}$ [6]).

These properties permit a compact representation of $\psi$ . Essentially, $\psi(i)$ is differentially encoded, $\psi(i) -\psi(i-1)$ , and long runs of 1's in the differences are run-length encoded. Absolute samples are stored at regular intervals to permit the differential decoding start from the last sample preceding position . The sampling rate gives a space/time tradeoff for accesing and storing $\psi$ . In [3] it is shown that the index requires $O(nH_0(T)+n\log\log\sigma)$ bits of space, which is improved in [6] to $O(nH_k(T)+n\log\log\sigma)$ .

Implementation

Kunihiko Sadakane and refactory by Rodrigo Cánovas. Interface by Rodrigo González.

Build_options

free_text: The text will be freed immediately after using it.

samplerate=<number>: samplerate will be interval between two indices of the suffix array stored explicitly. That is, SA[i*samplerate] is stored explicitly. The default value is 16.

samplepsi=<number>: samplepsi will be interval between two indices of the psi function stored explicitly. That is, Psi[i*samplepsi] is stored explicitly. The default value is 128.

Papers

[1] R. Grossi and J. Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In Proceedings 32nd ACM Symposium on Theory of Computing (STOC), pages 397-406, 2000.

[2] R. Grossi and J. Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing, 35(2):378-407, 2006.

[3] K. Sadakane. Compressed text databases with efficient query algorithms based on the compressed suffix array. In Proceedings 11th Annual International Symposium on Algorithms and Computation (ISAAC), LNCS v. 1969, pages 410-421, 2000.

[4] K. Sadakane. Succinct representations of lcp information and improvements in the compressed suffix arrays. In Proceedings 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2002.

[5] K. Sadakane. New text indexing functionalities of the compressed suffix arrays. Journal of Algorithms, 48(2):294-313, 2003.

[6] G. Navarro and V. Mäkinen. Compressed full-text indexes. ACM Computing Surveys, 39(1):article 2, 2007.

Downloads

sada_csa.tar
csaDNA.tar.gz Faster CSA implementation by K. Sadakane supporting only DNA files.

Pizza&Chili Corpus Compressed Indexes and their Testbeds

Compressed Suffix Array (CSA)

Pizza&Chili Corpus
Compressed Indexes and their Testbeds