Pizza&Chili Corpus
Compressed Indexes and their Testbeds

The Italian mirror | The Chilean mirror

Compressed Suffix Array (CSA)

Description

The compressed suffix array was originally a data structure achieving $ O(n\log\sigma)$ bits of space [1,2] and no self-indexing. Sadakane's version [3,4,5] is a self-index that achieves at least zero-order compression. We refer to the latter with the name CSA in this paper.

The CSA represents the suffix array $ A[1,n]$ by a sequence of numbers $ \psi(i)$ , such that $ A[\psi(i)]=A[i]+1$ . It is shown [3] that $ \psi$ is piecewise monotone increasing. More precisely, $ \psi$ increases in the areas of $ A$ where the suffixes start with the same symbol $ c$ . In addition, there are long runs where $ \psi(i+1) =
\psi(i)+1$ (those runs can be mapped one-to-one to the runs in $ T^{bwt}$ [6]).

These properties permit a compact representation of $ \psi$ . Essentially, $ \psi(i)$ is differentially encoded, $ \psi(i) -\psi(i-1)$ , and long runs of 1's in the differences are run-length encoded. Absolute samples are stored at regular intervals to permit the differential decoding start from the last sample preceding position $ i$ . The sampling rate gives a space/time tradeoff for accesing and storing $ \psi$ . In [3] it is shown that the index requires $ O(nH_0(T)+n\log\log\sigma)$ bits of space, which is improved in [6] to $ O(nH_k(T)+n\log\log\sigma)$ .

Implementation

Kunihiko Sadakane and refactory by Rodrigo Cánovas. Interface by Rodrigo González.

Build_options

free_text: The text will be freed immediately after using it.

samplerate=<number>: samplerate will be interval between two indices of the suffix array stored explicitly. That is, SA[i*samplerate] is stored explicitly. The default value is 16.

samplepsi=<number>: samplepsi will be interval between two indices of the psi function stored explicitly. That is, Psi[i*samplepsi] is stored explicitly. The default value is 128.

Papers

[1] R. Grossi and J. Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In Proceedings 32nd ACM Symposium on Theory of Computing (STOC), pages 397-406, 2000.

[2] R. Grossi and J. Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing, 35(2):378-407, 2006.

[3] K. Sadakane. Compressed text databases with efficient query algorithms based on the compressed suffix array. In Proceedings 11th Annual International Symposium on Algorithms and Computation (ISAAC), LNCS v. 1969, pages 410-421, 2000.

[4] K. Sadakane. Succinct representations of lcp information and improvements in the compressed suffix arrays. In Proceedings 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2002.

[5] K. Sadakane. New text indexing functionalities of the compressed suffix arrays. Journal of Algorithms, 48(2):294-313, 2003.

[6] G. Navarro and V. Mäkinen. Compressed full-text indexes. ACM Computing Surveys, 39(1):article 2, 2007.

Downloads



Send Mail to Us | © P. Ferragina and G. Navarro, Last update: September, 2005.