Pizza&Chili Corpus
Compressed Indexes and their Testbeds


Statistics for Pseudo-Real Collections

Here we present a set of texts generated by artificially adding repetitiveness to real texts; we therefore call them pseudo-real texts.

To generate the texts, we took the 1MiB prefix of each text of the Pizza&Chili Corpus and mutated it. Each mutation picks a random character position and replaces the character there with a random character different from the original one.
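A single mutation of this kind can be sketched in Python as follows. This is only an illustration, not the generator used for the corpus: the function name `mutate_once` is hypothetical, and drawing the replacement from an explicitly supplied alphabet is an assumption, since the text only says "a random character different from the original one".

```python
import random

def mutate_once(text: bytearray, rng: random.Random, alphabet: bytes) -> int:
    """Replace one randomly chosen position of `text` (in place) with a
    different symbol and return that position.

    Assumption: the replacement is drawn uniformly from `alphabet`,
    which must contain at least two distinct symbols.
    """
    pos = rng.randrange(len(text))
    # Exclude the current symbol so the position always changes.
    choices = [c for c in alphabet if c != text[pos]]
    text[pos] = rng.choice(choices)
    return pos

rng = random.Random(0)           # fixed seed, for reproducibility only
base = bytearray(b"ACGTACGTACGT")
mutated = bytearray(base)        # work on a copy, keep the base intact
pos = mutate_once(mutated, rng, alphabet=b"ACGT")
```

After the call, `mutated` differs from `base` at exactly one position, `pos`.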

We used two different schemes for the mutations. The first, denoted (1), generates each copy by mutating the first text. The second, denoted (2), generates each copy by mutating the most recently generated text. The second scheme resembles the changes accumulated over time in a software project or across the versions of a document.
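The difference between the two schemes can be sketched as below. The names `mutate` and `build_collection` are hypothetical, and several details are assumptions not stated in the text: that the unmutated base is kept as the first copy, that replacement symbols come from the alphabet of the base text, and that the copies are later concatenated into one collection.

```python
import random

def mutate(text: bytes, n_mutations: int, rng: random.Random) -> bytes:
    """Apply n_mutations single-character changes to a copy of text.
    Assumption: replacements come from the symbols present in text
    (which must contain at least two distinct symbols)."""
    alphabet = sorted(set(text))
    out = bytearray(text)
    for _ in range(n_mutations):
        pos = rng.randrange(len(out))
        out[pos] = rng.choice([c for c in alphabet if c != out[pos]])
    return bytes(out)

def build_collection(base: bytes, copies: int, n_mutations: int,
                     scheme: int, seed: int = 0) -> list[bytes]:
    """Scheme 1: every copy is a mutation of the first text.
    Scheme 2: every copy is a mutation of the previously generated copy."""
    rng = random.Random(seed)
    versions = [base]  # assumption: the first copy is the unmutated base
    for _ in range(copies - 1):
        parent = base if scheme == 1 else versions[-1]
        versions.append(mutate(parent, n_mutations, rng))
    return versions
```

Under scheme (1) every copy stays within `n_mutations` changes of the base, whereas under scheme (2) the changes accumulate, so later copies drift further from the base. This is consistent with scheme (2) showing higher entropies and worse bzip2, gzip and ppmdi ratios at the 0.1% rate in the tables on this page.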

The mutation rate, i.e., the percentage of mutated characters, was set to 0.1%, 0.01%, and 0.001%.
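To give a feeling for these rates, the short sketch below converts them into a number of mutated characters for a 1MiB text. It assumes that the rate applies to each 1MiB copy and that the count is rounded to the nearest integer; neither detail is stated on this page, and `mutations_per_copy` is a hypothetical helper.

```python
MIB = 1024 * 1024  # bytes in one MiB

def mutations_per_copy(length: int, rate_percent: float) -> int:
    """Number of mutated characters for a text of `length` characters
    (assumption: rounded to the nearest integer)."""
    return round(length * rate_percent / 100)

for rate in (0.1, 0.01, 0.001):
    print(f"{rate}% of 1MiB -> {mutations_per_copy(MIB, rate)} mutations")
```

Under these assumptions the three rates correspond to roughly 1,049, 105 and 10 mutated characters per 1MiB copy.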

The base texts (all from the Pizza&Chili corpus) we mutated were the following:

  • Sources: This file is formed by C/Java source code obtained by concatenating all the .c, .h, .C and .java files of the linux-2.6.11.6 and gcc-4.0.0 distributions.
  • Pitches: This file is a sequence of MIDI pitch values (bytes in 0-127, plus a few extra special values) obtained from a myriad of MIDI files freely available on the Internet.
  • Proteins: This file is a sequence of newline-separated protein sequences obtained from the Swissprot database.
  • DNA: This file is a sequence of newline-separated gene DNA sequences obtained from files 01hgp10 to 21hgp10, plus 0xhgp10 and 0yhgp10, from the Gutenberg Project.
  • English: This file is the concatenation of English text files selected from the etext02 to etext05 collections of the Gutenberg Project.
  • XML: This file is an XML document that provides bibliographic information on major computer science journals and proceedings; it was obtained from http://dblp.uni-trier.de.

Collection Size (MiB) Alphabet size Inv match prob
Xml 0.001% (1) 100MiB 89 27.84
Xml 0.01% (1) 100MiB 89 27.84
Xml 0.1% (1) 100MiB 89 27.84
DNA 0.001% (1) 100MiB 5 3.98
DNA 0.01% (1) 100MiB 5 3.98
DNA 0.1% (1) 100MiB 5 3.98
English 0.001% (1) 100MiB 106 15.65
English 0.01% (1) 100MiB 106 15.65
English 0.1% (1) 100MiB 106 15.65
Pitches 0.001% (1) 100MiB 73 33.07
Pitches 0.01% (1) 100MiB 73 33.07
Pitches 0.1% (1) 100MiB 73 33.07
Proteins 0.001% (1) 100MiB 21 16.90
Proteins 0.01% (1) 100MiB 21 16.90
Proteins 0.1% (1) 100MiB 21 16.90
Sources 0.001% (1) 100MiB 98 28.86
Sources 0.01% (1) 100MiB 98 28.86
Sources 0.1% (1) 100MiB 98 28.86

Collection Size (MiB) Alphabet size Inv match prob
Xml 0.001% (2) 100MiB 89 27.84
Xml 0.01% (2) 100MiB 89 27.84
Xml 0.1% (2) 100MiB 89 27.86
DNA 0.001% (2) 100MiB 5 3.98
DNA 0.01% (2) 100MiB 5 3.98
DNA 0.1% (2) 100MiB 5 3.98
English 0.001% (2) 100MiB 106 15.65
English 0.01% (2) 100MiB 106 15.66
English 0.1% (2) 100MiB 106 15.74
Pitches 0.001% (2) 100MiB 73 33.07
Pitches 0.01% (2) 100MiB 73 33.07
Pitches 0.1% (2) 100MiB 73 33.10
Proteins 0.001% (2) 100MiB 21 16.90
Proteins 0.01% (2) 100MiB 21 16.90
Proteins 0.1% (2) 100MiB 21 16.92
Sources 0.001% (2) 100MiB 98 28.86
Sources 0.01% (2) 100MiB 98 28.86
Sources 0.1% (2) 100MiB 98 28.92
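
The two statistics in the tables above can be computed along the following lines. This is a sketch under an assumption: "Inv match prob" is read here as the inverse of the probability that two characters drawn independently at random from the text are equal, which acts as a measure of the effective alphabet size. This page does not spell out the definition, and the function names are hypothetical.

```python
from collections import Counter

def alphabet_size(text: bytes) -> int:
    """Number of distinct symbols that occur in the text."""
    return len(set(text))

def inverse_match_probability(text: bytes) -> float:
    """1 / sum_c p_c^2, where p_c is the relative frequency of symbol c.
    Assumed definition of the 'Inv match prob' column."""
    n = len(text)
    counts = Counter(text)
    p_match = sum((c / n) ** 2 for c in counts.values())
    return 1.0 / p_match
```

For a text over four equally frequent symbols both functions return 4. For skewed distributions the inverse match probability falls below the alphabet size, as in the tables, where Xml has 89 distinct symbols but an inverse match probability of about 27.84.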

Collection p7zip bzip2 gzip ppmdi Re-Pair
Xml 0.001% (1) 0.15% 11.00% 18.00% 3.50% 0.19%
Xml 0.01% (1) 0.18% 12.00% 18.00% 3.60% 0.46%
Xml 0.1% (1) 0.46% 12.00% 18.00% 4.10% 2.00%
DNA 0.001% (1) 0.27% 27.00% 28.00% 11.00% 0.34%
DNA 0.01% (1) 0.29% 27.00% 28.00% 11.00% 0.58%
DNA 0.1% (1) 0.51% 27.00% 28.00% 12.00% 2.50%
English 0.001% (1) 0.31% 28.00% 37.00% 22.00% 0.39%
English 0.01% (1) 0.35% 28.00% 37.00% 22.00% 0.65%
English 0.1% (1) 0.59% 28.00% 37.00% 22.00% 2.70%
Pitches 0.001% (1) 0.47% 54.00% 52.00% 47.00% 0.69%
Pitches 0.01% (1) 0.50% 54.00% 52.00% 47.00% 0.95%
Pitches 0.1% (1) 0.75% 54.00% 52.00% 48.00% 3.20%
Proteins 0.001% (1) 0.32% 41.00% 39.00% 31.00% 0.42%
Proteins 0.01% (1) 0.35% 41.00% 39.00% 31.00% 0.68%
Proteins 0.1% (1) 0.59% 41.00% 39.00% 32.00% 2.70%
Sources 0.001% (1) 0.20% 19.00% 25.00% 12.00% 0.28%
Sources 0.01% (1) 0.23% 19.00% 25.00% 12.00% 0.56%
Sources 0.1% (1) 0.50% 20.00% 25.00% 13.00% 2.60%

Collection p7zip bzip2 gzip ppmdi Re-Pair
Xml 0.001% (2) 0.15% 12.00% 18.00% 3.50% 0.18%
Xml 0.01% (2) 0.18% 14.00% 19.00% 4.40% 0.39%
Xml 0.1% (2) 0.39% 25.00% 29.00% 17.00% 2.20%
DNA 0.001% (2) 0.26% 27.00% 28.00% 11.00% 0.33%
DNA 0.01% (2) 0.29% 27.00% 28.00% 11.00% 0.52%
DNA 0.1% (2) 0.46% 27.00% 28.00% 13.00% 2.20%
English 0.001% (2) 0.31% 28.00% 37.00% 22.00% 0.38%
English 0.01% (2) 0.34% 29.00% 37.00% 23.00% 0.59%
English 0.1% (2) 0.55% 38.00% 43.00% 31.00% 2.50%
Pitches 0.001% (2) 0.46% 54.00% 52.00% 47.00% 0.68%
Pitches 0.01% (2) 0.49% 54.00% 53.00% 48.00% 0.89%
Pitches 0.1% (2) 0.71% 59.00% 57.00% 52.00% 2.80%
Proteins 0.001% (2) 0.31% 41.00% 39.00% 32.00% 0.41%
Proteins 0.01% (2) 0.34% 42.00% 40.00% 33.00% 0.62%
Proteins 0.1% (2) 0.54% 47.00% 46.00% 40.00% 2.50%
Sources 0.001% (2) 0.20% 20.00% 25.00% 13.00% 0.27%
Sources 0.01% (2) 0.23% 21.00% 26.00% 14.00% 0.49%
Sources 0.1% (2) 0.44% 34.00% 35.00% 26.00% 2.50%
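
Percentages like those above can be approximated by expressing the compressed size as a percentage of the original size. The sketch below uses the compressors in Python's standard library, which is an assumption for illustration: `lzma` only stands in for p7zip, the settings behind the tables are not given on this page, and ppmdi and Re-Pair have no standard-library counterpart, so the figures will not match exactly.

```python
import bz2
import gzip
import lzma

def compression_ratios(data: bytes) -> dict[str, float]:
    """Compressed size as a percentage of the original size,
    for each stand-in compressor (default settings)."""
    compressors = {
        "lzma": lzma.compress,  # stand-in for p7zip (assumption)
        "bz2": bz2.compress,
        "gzip": gzip.compress,
    }
    return {name: 100.0 * len(fn(data)) / len(data)
            for name, fn in compressors.items()}
```

A call such as `compression_ratios(open("collection.txt", "rb").read())`, with a hypothetical file name, returns one percentage per compressor. The tables show p7zip and Re-Pair reaching much lower ratios than bzip2, gzip and ppmdi on these collections.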

Collection H0 H1 H2 H3 H4 H5 H6 H7 H8
Xml 0.001% (1) 65.25% (1) 38.63% (89) 21.00% (3,325) 12.50% (20,560) 8.13% (56,120) 6.00% (98,084) 5.25% (134,897) 4.75% (168,846) 4.13% (200,451)
Xml 0.01% (1) 65.25% (1) 38.63% (89) 21.00% (4,135) 12.50% (30,975) 8.13% (79,379) 6.00% (131,811) 5.25% (177,924) 4.75% (220,923) 4.13% (261,651)
Xml 0.1% (1) 65.25% (1) 38.75% (89) 21.25% (5,251) 12.75% (67,479) 8.25% (196,554) 6.13% (326,296) 5.38% (440,199) 4.88% (550,570) 4.25% (661,284)
DNA 0.001% (1) 25.00% (1) 24.25% (5) 24.13% (18) 24.00% (67) 24.00% (260) 23.75% (1,029) 23.50% (4,102) 22.88% (16,349) 21.25% (62,437)
DNA 0.01% (1) 25.00% (1) 24.25% (5) 24.13% (18) 24.00% (67) 24.00% (260) 23.75% (1,029) 23.50% (4,102) 22.88% (16,368) 21.25% (63,204)
DNA 0.1% (1) 25.00% (1) 24.25% (5) 24.13% (19) 24.00% (70) 24.00% (264) 23.75% (1,034) 23.50% (4,109) 22.88% (16,399) 21.38% (65,168)
English 0.001% (1) 57.25% (1) 45.13% (106) 34.75% (2,659) 25.88% (18,352) 19.88% (63,299) 15.88% (145,194) 12.50% (256,838) 9.63% (379,514) 7.25% (501,400)
English 0.01% (1) 57.25% (1) 45.13% (106) 34.75% (3,243) 25.88% (24,063) 19.88% (82,896) 15.88% (180,401) 12.50% (305,292) 9.63% (439,387) 7.25% (572,056)
English 0.1% (1) 57.25% (1) 45.25% (106) 34.88% (4,491) 26.13% (46,116) 20.13% (190,765) 16.00% (439,130) 12.50% (715,127) 9.75% (983,435) 7.25% (1,237,512)
Pitches 0.001% (1) 66.13% (1) 61.00% (73) 53.50% (3,549) 37.13% (73,664) 16.38% (376,958) 6.25% (642,406) 2.88% (767,028) 1.38% (833,456) 0.75% (871,970)
Pitches 0.01% (1) 66.13% (1) 61.00% (73) 53.50% (3,581) 37.25% (76,900) 16.38% (399,435) 6.25% (684,445) 2.88% (821,533) 1.38% (898,126) 0.75% (946,219)
Pitches 0.1% (1) 66.13% (1) 61.13% (73) 53.63% (3,733) 37.38% (95,838) 16.63% (598,394) 6.38% (1,096,014) 2.88% (1,363,610) 1.50% (1,543,086) 0.88% (1,687,166)
Proteins 0.001% (1) 52.25% (1) 52.13% (21) 51.63% (422) 47.50% (8,045) 25.13% (128,975) 4.63% (463,357) 0.75% (572,530) 0.25% (589,356) 0.25% (595,906)
Proteins 0.01% (1) 52.25% (1) 52.13% (21) 51.63% (422) 47.50% (8,045) 25.13% (131,064) 4.63% (494,845) 0.75% (626,269) 0.25% (654,067) 0.25% (670,075)
Proteins 0.1% (1) 52.25% (1) 52.13% (21) 51.63% (425) 47.50% (8,076) 25.50% (143,879) 4.88% (768,510) 0.88% (1,150,595) 0.38% (1,293,347) 0.38% (1,403,589)
Sources 0.001% (1) 68.75% (1) 46.88% (98) 30.00% (4,557) 19.63% (29,667) 14.38% (75,316) 11.00% (130,527) 8.38% (194,105) 6.88% (259,413) 5.75% (320,468)
Sources 0.01% (1) 68.75% (1) 46.88% (98) 30.00% (5,621) 19.63% (42,303) 14.38% (102,977) 11.00% (170,525) 8.50% (244,755) 6.88% (320,237) 5.75% (391,260)
Sources 0.1% (1) 68.75% (1) 47.00% (98) 30.25% (7,359) 19.88% (104,679) 14.63% (299,799) 11.13% (498,046) 8.50% (687,941) 7.00% (872,189) 5.88% (1,049,051)

Collection H0 H1 H2 H3 H4 H5 H6 H7 H8
Xml 0.001% (2) 65.25% (1) 38.63% (89) 21.13% (3,325) 12.63% (20,560) 8.13% (56,120) 6.00% (98,084) 5.25% (134,897) 4.75% (168,846) 4.13% (200,451)
Xml 0.01% (2) 65.25% (1) 39.38% (89) 22.00% (4,135) 13.25% (31,042) 8.63% (79,630) 6.50% (132,163) 5.63% (178,388) 5.13% (221,499) 4.50% (262,329)
Xml 0.1% (2) 65.25% (1) 44.00% (89) 28.75% (5,255) 18.50% (72,227) 12.25% (226,418) 9.25% (378,994) 8.00% (513,539) 7.13% (645,141) 6.25% (777,226)
DNA 0.001% (2) 25.00% (1) 24.25% (5) 24.13% (18) 24.00% (67) 24.00% (260) 23.75% (1,029) 23.50% (4,102) 22.88% (16,349) 21.25% (62,436)
DNA 0.01% (2) 25.00% (1) 24.25% (5) 24.13% (18) 24.13% (67) 24.00% (260) 23.88% (1,029) 23.50% (4,102) 23.00% (16,369) 21.38% (63,242)
DNA 0.1% (2) 25.00% (1) 24.50% (5) 24.38% (19) 24.25% (70) 24.25% (264) 24.13% (1,034) 23.88% (4,109) 23.50% (16,400) 22.38% (65,387)
English 0.001% (2) 57.25% (1) 45.13% (106) 34.75% (2,659) 26.00% (18,353) 20.00% (63,300) 15.88% (145,195) 12.50% (256,838) 9.63% (379,514) 7.13% (501,400)
English 0.01% (2) 57.25% (1) 45.50% (106) 35.38% (3,243) 26.50% (24,079) 20.25% (83,037) 15.88% (180,592) 12.38% (305,458) 9.50% (439,539) 7.13% (572,186)
English 0.1% (2) 57.38% (1) 47.75% (106) 39.50% (4,482) 31.13% (47,357) 23.00% (202,366) 16.63% (466,838) 12.13% (749,065) 8.88% (1,015,587) 6.38% (1,265,447)
Pitches 0.001% (2) 66.13% (1) 61.13% (73) 53.63% (3,549) 37.25% (73,664) 16.38% (376,958) 6.25% (642,406) 2.88% (767,028) 1.38% (833,456) 0.75% (871,970)
Pitches 0.01% (2) 66.13% (1) 61.13% (73) 53.88% (3,581) 37.50% (76,917) 16.50% (399,546) 6.38% (684,518) 2.88% (821,589) 1.38% (898,152) 0.88% (946,228)
Pitches 0.1% (2) 66.13% (1) 62.00% (73) 55.88% (3,742) 40.25% (96,359) 17.38% (606,175) 6.50% (1,103,560) 3.13% (1,367,417) 1.88% (1,545,154) 1.38% (1,688,526)
Proteins 0.001% (2) 52.25% (1) 52.13% (21) 51.63% (422) 47.50% (8,045) 25.25% (128,975) 4.63% (463,357) 0.75% (572,529) 0.25% (589,356) 0.25% (595,906)
Proteins 0.01% (2) 52.25% (1) 52.13% (21) 51.63% (422) 47.63% (8,045) 25.75% (131,079) 5.00% (494,846) 0.88% (626,306) 0.50% (654,107) 0.38% (670,114)
Proteins 0.1% (2) 52.25% (1) 52.13% (21) 51.75% (426) 48.75% (8,072) 30.13% (143,924) 7.63% (771,311) 2.13% (1,154,106) 1.50% (1,297,080) 1.38% (1,407,901)
Sources 0.001% (2) 68.75% (1) 47.00% (98) 30.00% (4,557) 19.75% (29,667) 14.38% (75,316) 11.00% (130,527) 8.50% (194,105) 6.88% (259,413) 5.75% (320,468)
Sources 0.01% (2) 68.75% (1) 47.50% (98) 30.75% (5,615) 20.13% (42,337) 14.63% (103,082) 11.13% (170,646) 8.63% (244,874) 7.00% (320,346) 5.88% (391,369)
Sources 0.1% (2) 68.75% (1) 51.25% (98) 36.63% (7,372) 24.38% (108,997) 16.75% (319,310) 12.13% (525,914) 9.13% (718,657) 7.25% (904,022) 6.00% (1,080,824)
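
The H0 to H8 columns can be read as k-th order empirical entropies. The sketch below rests on two assumptions that this page does not state explicitly: the percentage is taken as Hk in bits per symbol divided by 8 bits, and the number in parentheses as the count of distinct length-k contexts. Both agree with the tables (H0 always has 1 context, and H1 has as many contexts as the alphabet size). The function name `empirical_entropy` is hypothetical.

```python
import math
from collections import Counter, defaultdict

def empirical_entropy(text: bytes, k: int) -> tuple[float, int]:
    """Return (H_k in bits per symbol, number of distinct contexts).

    H_k = (1/n) * sum over contexts w of |T_w| * H_0(T_w), where T_w
    collects the symbols that follow each occurrence of the length-k
    context w in the text.
    """
    n = len(text)
    followers = defaultdict(Counter)
    for i in range(k, n):
        # For k = 0 the context is the empty string, so all symbols
        # fall into a single context.
        followers[text[i - k:i]][text[i]] += 1
    total_bits = 0.0
    for counts in followers.values():
        m = sum(counts.values())
        total_bits -= sum(c * math.log2(c / m) for c in counts.values())
    return total_bits / n, len(followers)

def as_percentage(bits_per_symbol: float) -> float:
    """Entropy relative to 8 bits per symbol (assumed table convention)."""
    return 100.0 * bits_per_symbol / 8.0
```

For a text over four equally frequent symbols, H0 is 2 bits per symbol, i.e. 25% of 8 bits, which is the value shown in the H0 column for DNA.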

Collection delta z v r g
Xml (100MiB) 1,633,682 3,718,234 3,702,841 15,060,347 5,858,863
DNA (100MiB) 5,295,537 7,368,719 7,344,256 65,847,105 13,231,574
English (100MiB) 4,367,652 7,408,437 7,385,929 36,383,520 11,148,647
Pitches (50MiB) 3,125,803 5,782,285 5,755,922 22,527,480 10,263,313
Proteins (100MiB) 8,225,356 12,256,952 12,171,663 61,956,836 20,616,331
Sources (100MiB) 3,207,699 6,208,156 6,188,857 24,458,663 10,205,785


© P. Ferragina and G. Navarro. Last update: October 2010.