Compressed Suffix Array (CSA) |
Description The compressed suffix array was originally a data structure achieving bits of space [1,2] and no self-indexing. Sadakane's version [3,4,5] is a self-index that achieves at least zero-order compression. We refer to the latter with the name CSA in this paper. The CSA represents the suffix array by a sequence of numbers , such that . It is shown [3] that is piecewise monotone increasing. More precisely, increases in the areas of where the suffixes start with the same symbol . In addition, there are long runs where (those runs can be mapped one-to-one to the runs in [6]). These properties permit a compact representation of . Essentially, is differentially encoded, , and long runs of 1's in the differences are run-length encoded. Absolute samples are stored at regular intervals to permit the differential decoding start from the last sample preceding position . The sampling rate gives a space/time tradeoff for accesing and storing . In [3] it is shown that the index requires bits of space, which is improved in [6] to . Implementation Kunihiko Sadakane and refactory by Rodrigo Cánovas. Interface by Rodrigo González. Build_options free_text: The text will be freed immediately after using it. samplerate=<number>: samplerate will be interval between two indices of the suffix array stored explicitly. That is, SA[i*samplerate] is stored explicitly. The default value is 16. samplepsi=<number>: samplepsi will be interval between two indices of the psi function stored explicitly. That is, Psi[i*samplepsi] is stored explicitly. The default value is 128. Papers
Downloads
|