Here we present a corpus of repetitive texts. These texts are categorized
according to their source into the following groups: Artificial Texts, Pseudo-Real Texts, Real Texts, and Logs. The main goal of this collection is to serve as a standard testbed for benchmarking algorithms oriented to repetitive texts.
The files are compressed using p7zip and gzip to save bandwidth.
To understand the characteristics of the texts present in the Repetitive Corpus, we
provide below some statistics about them. The statistics presented are the following:
- Alphabet Size: We give the alphabet size and the inverse probability of matching (IPM), which is the inverse of the probability that two characters chosen at
random from the text match. IPM is a measure of the effective alphabet size: on a
uniformly distributed text, it is exactly the alphabet size.
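As a concrete illustration (not part of the corpus tooling), the IPM can be computed directly from the character frequencies:

```python
from collections import Counter

def inverse_match_probability(text: str) -> float:
    """Inverse of the probability that two characters chosen at random match."""
    n = len(text)
    counts = Counter(text)
    # Probability that two independently drawn characters are equal.
    p_match = sum((c / n) ** 2 for c in counts.values())
    return 1.0 / p_match

# On a uniformly distributed text, the IPM equals the alphabet size:
print(inverse_match_probability("abcdabcdabcd"))  # 4.0
```

On skewed distributions the IPM drops below the alphabet size, which is why it serves as an "effective" alphabet size.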
- Compression Ratio: Since we are dealing with compressed indexes, it is useful
to have an idea of the compressibility of the texts under general-purpose
compressors. The following compressors are used: gzip gives an idea of dictionary-based
compression (an LZ77 parsing with a limited window size); bzip2
gives an idea of block-sorting compression; ppmdi gives an idea of compression by partial matching (related
to the k-th order entropy); p7zip gives an idea of LZ77 compression with a virtually unlimited window; and Re-Pair gives an idea of grammar-based compression.
All compressors were run with their highest compression options.
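To illustrate the ratio reported here, the following sketch uses Python's zlib and bz2 modules as stand-ins for the command-line gzip and bzip2 tools actually used on the corpus:

```python
import bz2
import zlib

def compression_ratio(data: bytes, compress) -> float:
    """Compressed size as a percentage of the uncompressed size."""
    return 100.0 * len(compress(data)) / len(data)

# A highly repetitive input compresses far below 100%.
text = b"abracadabra " * 1000
print(f"zlib  (LZ77-style, as in gzip): {compression_ratio(text, lambda d: zlib.compress(d, 9)):.2f}%")
print(f"bzip2 (block sorting):          {compression_ratio(text, lambda d: bz2.compress(d, 9)):.2f}%")
```

Level 9 mirrors the "highest compression options" mentioned above; the absolute numbers will differ from those of the command-line tools, but the comparison conveys the same idea.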
- Empirical Entropy: Here we give the empirical entropy Hk of the text, with
k ranging from 0 to 8, measured as a compression ratio. We also show, in parentheses, the number of contexts,
that is, how many distinct substrings of length k occur in T.
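The k-th order empirical entropy can be computed directly from its definition; the sketch below (not the corpus's own measurement code) reports Hk in bits per symbol:

```python
import math
from collections import Counter, defaultdict

def empirical_entropy(text: str, k: int) -> float:
    """k-th order empirical entropy H_k(T), in bits per symbol."""
    n = len(text)
    if k == 0:
        counts = Counter(text)
        return sum(c / n * math.log2(n / c) for c in counts.values())
    # Group each symbol by the length-k context that precedes it.
    followers = defaultdict(list)
    for i in range(k, n):
        followers[text[i - k:i]].append(text[i])
    total = 0.0
    for syms in followers.values():
        m = len(syms)
        total += sum(c * math.log2(m / c) for c in Counter(syms).values())
    return total / n

t = "abababababab"
print(empirical_entropy(t, 0))  # 1.0: two equally frequent symbols
print(empirical_entropy(t, 1))  # 0.0: each context fully determines its successor
```

For one-byte symbols, dividing by 8 turns bits per symbol into the compression ratio used in the tables; the number of keys in `followers` is the "number of contexts" shown in parentheses.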
- Repetitiveness measures: We consider five repetitiveness measures.
delta is the maximum, over all lengths k, of (number of distinct substrings of length k) / k.
z is the number of phrases in the Lempel-Ziv parsing of the text.
v is the number of phrases in the smallest lexicographic parsing of the text.
r is the number of runs in the BWT of the text.
g is the size of the smallest context-free grammar generating only the text (approximated via Re-Pair).
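For a concrete picture of delta, a brute-force computation (feasible only for short strings, and not how the corpus values were obtained) follows directly from the definition:

```python
def delta(text: str) -> float:
    """delta = max over k of (number of distinct length-k substrings) / k."""
    n = len(text)
    return max(
        len({text[i:i + k] for i in range(n - k + 1)}) / k
        for k in range(1, n + 1)
    )

# k=1 gives 2 distinct symbols / 1; k=2 gives |{ab, ba}| / 2 = 1; the maximum is 2.0.
print(delta("abababab"))  # 2.0
```

Low delta values relative to the text length are the hallmark of a repetitive text.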
The compression ratios are given as the percentage of the compressed
file size over the uncompressed file size, assuming the original file uses one byte per symbol.
The compression statistics of all texts, as well as information on their origin, can be viewed in the following links:
The following indexes are specifically oriented to repetitive texts.
| © P. Ferragina and G. Navarro, Last update: October, 2010.