Pizza&Chili Corpus
Compressed Indexes and their Testbeds

The Italian mirror | The Chilean mirror

Statistics for Artificial Collections

This subset is composed of highly repetitive texts that do not come from any real-life source, but are artificially generated through some mathematical definition and have interesting combinatorial properties.

Fibonacci Sequence F(n)

This sequence is defined by the recurrence

F(1) = 0

F(2) = 1

F(n) = F(n - 1) + F(n - 2)

The length of the string F(n) is the Fibonacci number f(n), and the sequence is a sturmian word, which means it has i + 1 different substrings (factors) of length i.

Thue-Morse Sequence T(n)

This sequence is defined by the recurrence

T(1)=0

T(n) = T(n-1)T(n-1)

where F is the bitwise negation operator. Because of the construction scheme of this sequence, there are many substrings of the form XX, for any string X. However, there are no overlapping squares, i.e., substrings of the form 0X0X0 or 1X1X1. Furthermore, this sequence is strongly cube-free, i.e., there are no substrings of the form XXx, where x is the first character of the string X. Another interesting property of this string is that it is recurrent. That is, given any finite substring w of length n, there is some length n(w) (often much longer than n) such that w is contained in every substring of length n(w). The length of these strings is |Tn|= 2n.

Run-Rich String Sequence R(n)

A measure of string complexity, related to the regularities of the text and strongly related to the LZ77 parsing, is the number of runs.

The substring T[i, j] is a run in a string T iff the minimum period of T[i, j] = p <= |T[i, j]| / 2 and T[i, j] is not extendable to the right (j = n or T[j + 1] /= T[j - p + 1]) or left (i = 1 or T[i - 1] /= T[i + p - 1]).

The higher the number of runs in a string, the more regularities it has.

It has been shown that the maximum number of runs in a string is greater than 0.944n and lower than 1.029n. Franek et al. show a constructive and simple way to obtain strings with many runs; the n-th of those strings is denoted R(n). The ratio of the runs of their strings to the length approaches 3/(1 + sqrt(5)) = 0.92705....

Collection Size (MiB) Alphabet size Inv match prob
F41 256MiB 2 1.894
T29 256MiB 2 2.000
R13 207MiB 2 2.000

Collection p7zip bzip2 gzip ppmdi Re-Pair
F41 0.17624% 0.00572% 0.46875% 1.87500% 0.00002%
T29 0.35896% 0.01259% 0.54688% 2.18750% 0.00004%
R13 0.17172% 0.01227% 0.53140% 2.12560% 0.00009%

Collection H0 H1 H2 H3 H4 H5 H6 H7 H8
F41 11.99%
(1)
7.41%
(2)
4.58%
(3)
4.58%
(4)
2.83%
(5)
2.83%
(6)
2.83%
(7)
1.75%
(8)
1.75%
(9)
T29 12.50%
(1)
11.48%
(2)
8.34%
(4)
8.34%
(6)
4.16%
(10)
4.16%
(12)
4.16%
(16)
2.09%
(20)
2.09%
(22)
R13 12.50%
(1)
9.85%
(2)
8.51%
(4)
6.55%
(6)
2.56%
(8)
2.33%
(10)
2.33%
(12)
2.33%
(14)
2.33%
(16)

Collection delta z v r g
F41 2.00 41 4 3 79
T29 3.33 56 43 81 139
R13 2.00 52 40 76 148


Send Mail to Us | © P. Ferragina and G. Navarro, Last update: October, 2010.