Pizza&Chili Corpus -- Compressed Indexes and their Testbeds

Repetitive Corpus

Here we present a corpus of repetitive texts. These texts are categorized according to the source they come from into the following: Artificial Texts, Pseudo-Real Texts, Real Texts, and and Logs. The main goal of this collection is to serve as a standard testbed for benchmarking algorithms oriented to repetitive texts.

Download

The files are compresed using p7zip and gzip for saving bandwidth.

Statistics

To understand the characteristics of the texts present in the Repetitive Corpus, we provide below some statistics about them. The statistics presented are the following:

Alphabet Size: We give the alphabet size and the inverse probability of matching (IPM), which is the inverse of the probability that two characters chosen at random match. IPM is a measure of the effective alphabet size. On a uniformly distributed text, it is precisely the alphabet size.
Compression Ratio: Since we are dealing with compressed indexes it is useful to have an idea of the compressibility of the texts using general-purpose compressors. The following compressors are used: gzip gives an idea of compressibility via dictionaries (an LZ77 parsing with limited window size); bzip2 gives an idea of block-sorting compressibility; ppmdi gives an idea of partial-match-based compressors (related to the k-th order entropy); p7zip gives an idea of LZ77 compression with virtually unlimited window; and Re-Pair gives an idea of grammar-based compression. All compressors were run with the highest compression options.
Empirical Entropy: Here we give the empirical entropy Hk of the text with k ranging from 0 to 8, measured as compression ratio. We also show, in parentheses, the number of contexts which count how many different substrings of a given size does T have.
Repetitiveness measures: We consider 5 repetitiveness measures. delta is the maximum of (number of substrings of length k) / k, for every possible k, of the text. z is the number of phrases in the Lempel-Ziv parsing of the text. v is the number of phrases in the smallest lexicographic parsing of the text. r is the number of runs of the BWT of the text. g is the size of the smallest context-free-grammar generating only the text (an approximation via Re-Pair).

The compression ratios are given as the percentage of the compressed file size over the uncompressed file size, assuming the original file uses one byte per character.

The compression statistics of all texts, as well as information of the origin of them, can be viewed in the following links:

Indexes

The following indexes are specifically oriented to repetitive texts.

Pizza&Chili CorpusCompressed Indexes and their Testbeds

Repetitive Corpus

Pizza&Chili Corpus
Compressed Indexes and their Testbeds