Pizza&Chili Corpus
Compressed Indexes and their Testbeds

Statistics for Real Collections

This subset consists of texts from real repetitive sources: DNA, Wikipedia articles, source code, and documents. For DNA, we concatenated the texts in random order. For the others, we concatenated the texts by creation date, from oldest to newest.

DNA

Our DNA texts come from different sources.

  • The Saccharomyces Genome Resequencing Project provides two text collections: para, which contains 36 sequences of Saccharomyces Paradoxus, and cere, which contains 37 sequences of Saccharomyces Cerevisiae.
  • From the National Center for Biotechnology Information (NCBI) we collected several DNA sequences for each of a number of bacterial species. The species, with the number of sequences collected in parentheses, are Escherichia Coli (23), Salmonella Enterica (15), Staphylococcus Aureus (14), Streptococcus Pyogenes (13), Streptococcus Pneumoniae (11) and Clostridium Botulinum (10). We chose these species as they were the only ones with 10 or more different sequences.
  • A collection composed of 78,041 sequences of Haemophilus Influenzae, also coming from the NCBI.

Although there are four bases {A, C, G, T}, DNA sequences may have alphabets of size up to 16 because some characters denote an unknown choice among the four bases. The most common character used is N, which denotes a totally unknown symbol.
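
The effect of such extra characters on the alphabet size can be checked directly. The following is a minimal sketch, not the procedure used to build the table below: the function name `alphabet_stats` and the sample string are hypothetical, and the input is assumed to be an uppercase sequence held in memory.

```python
from collections import Counter

# The four unambiguous bases; any other symbol is reported separately.
BASES = set("ACGT")

def alphabet_stats(text):
    """Return (alphabet size, counts of symbols outside {A, C, G, T})."""
    counts = Counter(text)
    extra = {c: n for c, n in counts.items() if c not in BASES}
    return len(counts), extra

# Hypothetical sample: N and R stand for unknown choices among the bases.
size, extra = alphabet_stats("ACGTNNACGTRACGT")
print(size, extra)
```

A sequence with no unknown positions yields an alphabet of size 4, while each distinct extra symbol adds one to the size.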

Wikipedia Articles

We downloaded all versions of three Wikipedia articles: Albert Einstein, Alan Turing, and Nobel Prize. We downloaded them in English (denoted en) and German (denoted de). We chose these languages because they are among the most widely used on the Internet and their alphabets can be represented using standard 1-byte encodings. The versions of all articles run up to January 12, 2010, except for the English article on Albert Einstein, which was downloaded only up to November 10, 2006 because of its massive number of versions.

Source Code

We collected all versions 5.x of the Coreutils package and removed all binary files, making a total of 9 versions. We also collected all 1.0.x and 1.1.x versions of the Linux Kernel, making a total of 36 versions.

Documents

We took all PDF files of CIA World Leaders from January 2003 to December 2009 and converted them to text using pdftotext.

Collection Size (MiB) Alphabet size Inv match prob
Cere 440 5 4.301
Para 410 5 4.096
Clostridium Botulinum 34 4 3.356
Escherichia Coli 108 15 4.000
Salmonella Enterica 66 9 3.993
Staphylococcus Aureus 38 5 3.579
Streptococcus Pneumoniae 23 8 3.836
Streptococcus Pyogenes 24 10 3.800
Influenza 148 15 3.845
Coreutils 196 236 19.553
Kernel 247 160 23.078
Einstein (en) 446 139 19.501
Einstein (de) 89 117 19.264
Nobel (en) 85 126 20.070
Nobel (de) 31 118 17.786
Turing (en) 7.7 103 21.096
Turing (de) 85 100 19.719
World Leaders 45 89 3.855
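
The page does not define the "Inv match prob" column. The following is a minimal sketch, assuming it is the inverse of the probability that two characters drawn independently and uniformly at random from the text are equal. The function name `inverse_match_probability` is hypothetical.

```python
from collections import Counter

def inverse_match_probability(text):
    """Inverse of the probability that two random characters of text match.

    Assumption: both characters are drawn independently (with replacement),
    so the match probability is the sum of squared relative frequencies.
    """
    n = len(text)
    if n == 0:
        raise ValueError("text must be non-empty")
    p_match = sum((count / n) ** 2 for count in Counter(text).values())
    return 1.0 / p_match

# A uniform text over four symbols gives exactly the alphabet size.
print(inverse_match_probability("ACGT" * 100))
```

Under this reading the value can never exceed the alphabet size and equals it only when all symbols are equally frequent, which is consistent with the DNA rows staying at or below 4 to 5 despite alphabets of up to 15 symbols.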

Collection p7zip bzip2 gzip ppmdi Re-Pair
Cere 1.14% 2.50% 26.36% 24.09% 1.86%
Para 1.46% 26.34% 27.07% 24.88% 2.80%
Clostridium Botulinum 8.53% 25.88% 26.47% 24.12% 20.00%
Escherichia Coli 4.72% 26.85% 28.70% 25.93% 9.63%
Salmonella Enterica 5.61% 27.27% 28.79% 25.76% 12.42%
Staphylococcus Aureus 2.89% 26.32% 28.95% 25.00% 5.26%
Streptococcus Pneumoniae 4.78% 26.52% 27.39% 24.78% 9.57%
Streptococcus Pyogenes 5.00% 26.25% 27.08% 25.00% 9.58%
Influenza 1.35% 6.62% 7.43% 3.78% 3.31%
Coreutils 1.94% 16.33% 24.49% 12.76% 2.55%
Kernel 0.81% 21.86% 27.13% 18.62% 1.13%
Einstein (en) 0.07% 5.38% 35.20% 1.61% 0.10%
Einstein (de) 0.11% 4.38% 31.46% 1.35% 0.16%
Nobel (en) 0.13% 2.94% 18.82% 1.76% 0.20%
Nobel (de) 0.18% 3.55% 27.74% 1.68% 0.30%
Turing (en) 1.09% 36.36% 285.71% 15.58% 1.71%
Turing (de) 0.03% 0.18% 0.10% 0.11% 0.05%
World Leaders 1.29% 7.11% 17.78% 3.56% 1.78%
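
The ratios above appear to be compressed size as a percentage of the original size. The sketch below computes such percentages with Python's standard-library codecs. It only approximates the setup and does not reproduce the tools or settings behind the table: `lzma` stands in for p7zip (an assumption), and ppmdi and Re-Pair are omitted because the standard library has no counterpart. The function name and the sample data are hypothetical.

```python
import bz2
import gzip
import lzma

def compression_ratios(data):
    """Return compressed size as a percentage of the original, per codec."""
    if not data:
        raise ValueError("data must be non-empty")
    compressors = {
        "lzma": lzma.compress,   # LZMA codec; a stand-in for p7zip, not p7zip itself
        "bzip2": bz2.compress,
        "gzip": gzip.compress,
    }
    return {name: 100.0 * len(compress(data)) / len(data)
            for name, compress in compressors.items()}

# Hypothetical, highly repetitive sample: many copies of one line.
sample = b"the quick brown fox jumps over the lazy dog\n" * 2000
for name, pct in compression_ratios(sample).items():
    print(f"{name}: {pct:.2f}%")
```

To try it on a real collection, read the file in binary mode and pass the bytes to `compression_ratios`.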

Collection H0 H1 H2 H3 H4 H5 H6 H7 H8
Cere 27.38% (1) 22.63% (5) 22.63% (25) 22.50% (125) 22.50% (610) 22.50% (2,515) 22.50% (8,697) 22.38% (28,080) 22.25% (88,624)
Para 26.50% (1) 23.50% (5) 23.38% (25) 23.38% (125) 23.38% (625) 23.38% (3,125) 23.25% (14,725) 23.25% (51,542) 23.13% (139,149)
Clostridium Botulinum 23.25% (1) 23.00% (4) 22.88% (16) 22.75% (64) 22.75% (256) 22.75% (1,024) 22.63% (4,096) 22.50% (16,383) 22.25% (65,118)
Escherichia Coli 25.00% (1) 24.75% (15) 24.50% (145) 24.38% (779) 24.25% (2,715) 24.25% (7,436) 24.13% (15,641) 24.13% (32,561) 23.88% (85,363)
Salmonella Enterica 25.00% (1) 24.75% (9) 24.50% (35) 24.38% (97) 24.25% (299) 24.13% (1,077) 24.13% (4,159) 24.00% (16,457) 23.75% (65,618)
Staphylococcus Aureus 23.88% (1) 23.75% (5) 23.75% (18) 23.63% (67) 23.63% (260) 23.63% (1,029) 23.50% (4,102) 23.25% (16,391) 22.75% (65,282)
Streptococcus Pneumoniae 24.63% (1) 24.38% (8) 24.38% (31) 24.25% (133) 24.13% (574) 24.13% (2,183) 24.00% (6,928) 23.75% (21,093) 23.13% (71,592)
Streptococcus Pyogenes 24.50% (1) 24.38% (10) 24.25% (50) 24.13% (174) 24.13% (456) 24.13% (1,291) 24.00% (4,418) 23.88% (16,758) 23.25% (65,919)
Influenza 24.63% (1) 24.13% (15) 24.13% (125) 24.00% (583) 23.88% (2,329) 23.50% (7,978) 22.00% (21,316) 18.63% (44,748) 13.25% (101,559)
Coreutils 68.38% (1) 51.25% (236) 35.88% (18,500) 23.88% (169,716) 17.00% (606,527) 12.88% (1,335,553) 10.13% (2,258,650) 8.00% (3,258,896) 6.50% (4,247,313)
Kernel 67.25% (1) 50.50% (160) 36.63% (7,122) 25.75% (90,396) 19.25% (351,918) 15.13% (773,818) 12.13% (1,305,616) 9.63% (1,912,604) 7.75% (2,553,008)
Einstein (en) 62.00% (1) 46.38% (139) 33.38% (4,546) 21.13% (28,685) 13.25% (77,333) 9.00% (142,559) 6.50% (211,506) 4.75% (276,343) 3.50% (335,151)
Einstein (de) 63.00% (1) 44.88% (117) 32.63% (3,278) 20.88% (16,765) 13.25% (39,010) 9.00% (64,884) 6.13% (89,914) 4.38% (112,043) 3.13% (130,473)
Nobel (en) 62.63% (1) 44.63% (126) 30.50% (3,566) 18.25% (18,079) 11.50% (42,334) 8.13% (69,855) 6.00% (95,644) 4.50% (119,260) 3.38% (140,401)
Nobel (de) 61.13% (1) 43.25% (118) 31.13% (2,726) 19.63% (12,959) 12.50% (30,756) 8.63% (49,695) 6.00% (66,108) 4.13% (80,467) 3.00% (92,184)
Turing (en) 63.25% (1) 45.75% (103) 32.00% (2,794) 19.13% (14,091) 11.50% (33,498) 7.63% (55,489) 5.38% (75,611) 3.88% (93,402) 2.88% (108,636)
Turing (de) 62.38% (1) 43.25% (100) 29.25% (1,806) 16.75% (7,268) 9.50% (15,407) 6.00% (23,070) 3.88% (29,038) 2.63% (33,714) 2.00% (37,335)
World Leaders 43.38% (1) 24.38% (89) 17.25% (2,526) 11.63% (23,924) 7.63% (106,573) 5.13% (246,566) 4.00% (374,668) 3.50% (468,701) 3.13% (547,040)
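
Each cell in the table above pairs a percentage with a count in parentheses. The following is a minimal sketch, assuming the percentage is the k-th order empirical entropy H_k in bits per symbol divided by 8 (that is, relative to one byte per character) and the count is the number of distinct length-k contexts occurring in the text. Both readings are inferred from the numbers and are not stated on the page. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def empirical_entropy(text, k):
    """Return (H_k in bits per symbol, number of distinct length-k contexts).

    Only contexts followed by at least one character are counted.
    """
    n = len(text)
    if n == 0:
        raise ValueError("text must be non-empty")
    # For every length-k context, count the characters that follow it.
    followers = defaultdict(Counter)
    for i in range(k, n):
        followers[text[i - k:i]][text[i]] += 1
    total_bits = 0.0
    for counts in followers.values():
        m = sum(counts.values())
        # Zeroth-order entropy of the followers, weighted by their number.
        total_bits -= sum(c * math.log2(c / m) for c in counts.values())
    return total_bits / n, len(followers)

def as_percentage(bits_per_symbol):
    """Assumed normalisation: entropy relative to 8 bits per character."""
    return 100.0 * bits_per_symbol / 8

text = "ACGT" * 4
for k in range(3):
    h, contexts = empirical_entropy(text, k)
    print(k, f"{as_percentage(h):.2f}%", contexts)
```

For k = 0 there is a single empty context, and for k = 1 the number of contexts is at most the alphabet size, which matches the first two columns of every row.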

Collection delta z v r g
Cere 1,003,280 1,700,630 1,649,448 11,574,640 4,069,452
Para 1,369,096 2,332,657 2,238,362 15,636,739 5,344,477
Escherichia Coli 1,337,977 2,078,512 2,014,012 15,044,487 4,342,874
Influenza 281,857 769,286 768,623 3,022,821 1,957,370
Coreutils 636,101 1,446,468 1,439,918 4,684,460 2,409,429
Kernel 405,643 793,915 794,058 2,791,367 1,374,651
Einstein (en) 42,884 89,467 97,442 290,238 212,902
Einstein (de) 16,309 34,572 37,721 101,369 84,499
World Leaders 68,651 175,740 179,696 573,487 399,667
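
The page does not define the columns of this last table. The following is a minimal sketch of two of them, assuming r is the number of equal-letter runs in the Burrows-Wheeler transform of the text and z is the number of phrases in a Lempel-Ziv (LZ77-style) parse. These are the usual meanings in the literature on repetitive texts, but they are assumptions here, as is the particular parse variant. The naive algorithms and function names are illustrative and only suitable for short strings.

```python
def bwt_runs(text, sentinel="\0"):
    """Number of equal-letter runs in the BWT of text plus a sentinel.

    Assumption: the sentinel is smaller than every symbol and absent from text.
    Rotations are sorted naively, so this is quadratic in space.
    """
    s = text + sentinel
    order = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    # The BWT symbol of a rotation is the character preceding its start.
    bwt = "".join(s[i - 1] for i in order)
    return sum(1 for i, c in enumerate(bwt) if i == 0 or c != bwt[i - 1])

def lz77_phrases(text):
    """Number of phrases in a greedy Lempel-Ziv parse.

    Variant assumed: each phrase is the longest prefix of the remaining text
    that also occurs starting at an earlier position (overlaps allowed), or a
    single character if no such prefix exists.
    """
    i, phrases, n = 0, 0, len(text)
    while i < n:
        length = 0
        # Extend while the candidate occurs starting strictly before i.
        while (i + length < n and
               text.find(text[i:i + length + 1], 0, i + length) != -1):
            length += 1
        i += max(length, 1)
        phrases += 1
    return phrases

print(bwt_runs("banana"), lz77_phrases("banana"))
```

Both counts grow slowly when a text consists of many near-identical copies, which is why they are used as measures of repetitiveness.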


© P. Ferragina and G. Navarro. Last update: October 2010.