Description
The compressed suffix array was originally a data structure achieving
bits of space [1,2] and no
self-indexing. Sadakane's version [3,4,5]
is a self-index that achieves at least zero-order compression. We refer to the
latter with the name CSA in this paper.
The CSA represents the suffix array
by a sequence of numbers
, such that
. It is
shown [3] that
is piecewise monotone increasing.
More precisely,
increases in the areas of
where the suffixes start
with the same symbol
. In addition, there are long runs where
(those runs can be mapped one-to-one to the runs in
[6]).
These properties permit a compact representation of
.
Essentially,
is differentially encoded,
, and long runs of 1's in
the differences are run-length encoded. Absolute samples are stored at
regular intervals to permit the differential decoding start from the
last sample preceding position
. The sampling rate gives a space/time
tradeoff for accesing and storing
. In [3] it is
shown that the index requires
bits of space,
which is improved in [6] to
.
Implementation
Kunihiko Sadakane and refactory by Rodrigo Cánovas. Interface by Rodrigo González.
Build_options
free_text: The text will be freed immediately after using it.
samplerate=<number>: samplerate will be
interval between two indices of the suffix array stored explicitly.
That is, SA[i*samplerate] is stored explicitly. The default value is 16.
samplepsi=<number>: samplepsi will be
interval between two indices of the psi function stored explicitly.
That is, Psi[i*samplepsi] is stored explicitly. The default value is 128.
Papers
[1]
R. Grossi and J. Vitter.
Compressed suffix arrays and suffix trees with applications to text
indexing and string matching.
In Proceedings 32nd ACM Symposium on Theory of Computing
(STOC), pages 397-406, 2000.
[2]
R. Grossi and J. Vitter.
Compressed suffix arrays and suffix trees with applications to text
indexing and string matching.
SIAM Journal on Computing, 35(2):378-407, 2006.
[3]
K. Sadakane.
Compressed text databases with efficient query algorithms based on
the compressed suffix array.
In Proceedings 11th Annual International Symposium on Algorithms
and Computation (ISAAC), LNCS v. 1969, pages 410-421, 2000.
[4]
K. Sadakane.
Succinct representations of lcp information and improvements in the
compressed suffix arrays.
In Proceedings 13th Annual ACM-SIAM Symposium on Discrete
Algorithms (SODA), 2002.
[5]
K. Sadakane.
New text indexing functionalities of the compressed suffix arrays.
Journal of Algorithms, 48(2):294-313, 2003.
[6]
G. Navarro and V. Mäkinen.
Compressed full-text indexes.
ACM Computing Surveys, 39(1):article 2, 2007.
Downloads
|