This subset is composed of texts from real repetitive sources: DNA, Wikipedia
articles, source code, and documents. For DNA, we concatenated the texts in random
order. For the other sources, we concatenated the texts by creation date, from
oldest to newest.
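The two concatenation orders can be sketched as follows. This is a minimal illustration, not the script used to build the collections: the function name `concatenate`, the `(creation_date, text)` pair layout, and the fixed seed are all assumptions made for the example.

```python
import random


def concatenate(versions, chronological=True, seed=0):
    """Join (creation_date, text) pairs into a single collection text.

    chronological=True  -> oldest to newest
    chronological=False -> random order
    """
    items = list(versions)
    if chronological:
        # ISO dates such as "2006-11-10" sort correctly as plain strings.
        items.sort(key=lambda pair: pair[0])
    else:
        # A fixed seed (hypothetical choice) keeps the shuffle reproducible.
        random.Random(seed).shuffle(items)
    return "".join(text for _, text in items)
```

For example, `concatenate([("2009-01-01", "c"), ("2003-05-02", "a")])` places the 2003 text before the 2009 text.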
DNA
Our DNA texts come from different sources.
- The Saccharomyces Genome Resequencing Project provides two text collections:
para, which contains 36 sequences of Saccharomyces paradoxus, and cere,
which contains 37 sequences of Saccharomyces cerevisiae.
- From the National Center for Biotechnology Information (NCBI) we collected
sets of DNA sequences, each set belonging to a single bacterial species. The species,
with the number of sequences in parentheses, are Escherichia coli (23),
Salmonella enterica (15), Staphylococcus aureus (14), Streptococcus pyogenes (13),
Streptococcus pneumoniae (11) and Clostridium botulinum (10). We chose these species
because they were the only ones with 10 or more different sequences.
- A collection of 78,041 sequences of Haemophilus influenzae, also from the NCBI.
Although there are four bases {A, C, G, T}, DNA sequences may have
alphabets of size up to 16 because some characters denote an unknown choice
among the four bases. The most common such character is N, which denotes a
completely unknown base.
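To see how far a given sequence departs from the four-base alphabet, one can count its distinct symbols and the occurrences of anything outside {A, C, G, T}. The sketch below is only illustrative; the function name `alphabet_report` is ours, and it assumes the sequence is already a plain string with headers and line breaks removed.

```python
from collections import Counter

BASES = set("ACGT")


def alphabet_report(sequence):
    """Return (alphabet size, number of symbols outside {A, C, G, T})."""
    counts = Counter(sequence)
    # Symbols such as N (any base) or R (A or G) count as ambiguous here.
    ambiguous = sum(n for symbol, n in counts.items() if symbol not in BASES)
    return len(counts), ambiguous
```

For instance, `alphabet_report("ACGTNNRA")` reports an alphabet of 6 symbols, 3 of whose occurrences are ambiguity codes.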
Wikipedia Articles
We downloaded all versions of three Wikipedia articles: Albert Einstein, Alan Turing,
and Nobel Prize, each in English (denoted en) and German (denoted de). We chose these
languages because they are among the most widely used on the Internet and their
alphabets can be represented using standard 1-byte encodings. The versions of all
articles run up to January 12, 2010, except for the English article on Albert
Einstein, which was downloaded only up to November 10, 2006 because of its massive
number of versions.
Source Code
We collected all 5.x versions of the Coreutils package and removed all binary files,
for a total of 9 versions. We also collected all 1.0.x and 1.1.x versions of the
Linux Kernel, for a total of 36 versions.
Documents
We took all PDF files of CIA World Leaders from January 2003 to December 2009
and converted them to text using pdftotext.
Collection | Size (MiB) | Alphabet size | Inverse match probability
Cere | 440 | 5 | 4.301
Para | 410 | 5 | 4.096
Clostridium Botulinum | 34 | 4 | 3.356
Escherichia Coli | 108 | 15 | 4.000
Salmonella Enterica | 66 | 9 | 3.993
Staphylococcus Aureus | 38 | 5 | 3.579
Streptococcus Pneumoniae | 23 | 8 | 3.836
Streptococcus Pyogenes | 24 | 10 | 3.800
Influenza | 148 | 15 | 3.845
Coreutils | 196 | 236 | 19.553
Kernel | 247 | 160 | 23.078
Einstein (en) | 446 | 139 | 19.501
Einstein (de) | 89 | 117 | 19.264
Nobel (en) | 85 | 126 | 20.070
Nobel (de) | 31 | 118 | 17.786
Turing (en) | 7.7 | 103 | 21.096
Turing (de) | 85 | 100 | 19.719
World Leaders | 45 | 89 | 3.855
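The page does not spell out how the last two columns are computed. A minimal sketch, assuming that the inverse match probability is the inverse of the probability that two symbols picked at random from the text are equal (the function name `alphabet_stats` is ours):

```python
from collections import Counter


def alphabet_stats(data: bytes):
    """Return (alphabet size, inverse match probability) of a byte string."""
    if not data:
        raise ValueError("empty input")
    counts = Counter(data)
    n = len(data)
    # Probability that two randomly chosen positions hold the same symbol.
    match_prob = sum((c / n) ** 2 for c in counts.values())
    return len(counts), 1.0 / match_prob
```

Under this reading the value acts as an effective alphabet size: a text over four equally frequent bases gives exactly 4.0, and skewed frequencies give less than the number of distinct symbols.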
Collection | p7zip | bzip2 | gzip | ppmdi | Re-Pair
Cere | 1.14% | 2.50% | 26.36% | 24.09% | 1.86%
Para | 1.46% | 26.34% | 27.07% | 24.88% | 2.80%
Clostridium Botulinum | 8.53% | 25.88% | 26.47% | 24.12% | 20.00%
Escherichia Coli | 4.72% | 26.85% | 28.70% | 25.93% | 9.63%
Salmonella Enterica | 5.61% | 27.27% | 28.79% | 25.76% | 12.42%
Staphylococcus Aureus | 2.89% | 26.32% | 28.95% | 25.00% | 5.26%
Streptococcus Pneumoniae | 4.78% | 26.52% | 27.39% | 24.78% | 9.57%
Streptococcus Pyogenes | 5.00% | 26.25% | 27.08% | 25.00% | 9.58%
Influenza | 1.35% | 6.62% | 7.43% | 3.78% | 3.31%
Coreutils | 1.94% | 16.33% | 24.49% | 12.76% | 2.55%
Kernel | 0.81% | 21.86% | 27.13% | 18.62% | 1.13%
Einstein (en) | 0.07% | 5.38% | 35.20% | 1.61% | 0.10%
Einstein (de) | 0.11% | 4.38% | 31.46% | 1.35% | 0.16%
Nobel (en) | 0.13% | 2.94% | 18.82% | 1.76% | 0.20%
Nobel (de) | 0.18% | 3.55% | 27.74% | 1.68% | 0.30%
Turing (en) | 1.09% | 36.36% | 285.71% | 15.58% | 1.71%
Turing (de) | 0.03% | 0.18% | 0.10% | 0.11% | 0.05%
World Leaders | 1.29% | 7.11% | 17.78% | 3.56% | 1.78%
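Ratios of this kind (compressed size as a percentage of the original size) can be approximated with the compressors in the Python standard library. This is a rough sketch, not the setup used for the table: `lzma` stands in for p7zip, ppmdi and Re-Pair have no standard-library counterpart and are left out, and default compression levels are assumed.

```python
import bz2
import gzip
import lzma


def compression_ratios(data: bytes):
    """Compressed size as a percentage of the original size, per compressor."""
    if not data:
        raise ValueError("empty input")
    compressors = {
        "lzma": lzma.compress,   # assumption: LZMA as a stand-in for p7zip
        "bzip2": bz2.compress,
        "gzip": gzip.compress,
    }
    return {
        name: 100.0 * len(compress(data)) / len(data)
        for name, compress in compressors.items()
    }
```

One likely reason for the gap between columns is window size: gzip only finds repetitions within a 32 KiB window and bzip2 works on blocks of at most 900 KB, so neither can exploit a copy that lies many megabytes back, whereas LZMA-based tools use a much larger dictionary.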
Collection | H0 | H1 | H2 | H3 | H4 | H5 | H6 | H7 | H8
Cere | 27.38% (1) | 22.63% (5) | 22.63% (25) | 22.50% (125) | 22.50% (610) | 22.50% (2,515) | 22.50% (8,697) | 22.38% (28,080) | 22.25% (88,624)
Para | 26.50% (1) | 23.50% (5) | 23.38% (25) | 23.38% (125) | 23.38% (625) | 23.38% (3,125) | 23.25% (14,725) | 23.25% (51,542) | 23.13% (139,149)
Clostridium Botulinum | 23.25% (1) | 23.00% (4) | 22.88% (16) | 22.75% (64) | 22.75% (256) | 22.75% (1,024) | 22.63% (4,096) | 22.50% (16,383) | 22.25% (65,118)
Escherichia Coli | 25.00% (1) | 24.75% (15) | 24.50% (145) | 24.38% (779) | 24.25% (2,715) | 24.25% (7,436) | 24.13% (15,641) | 24.13% (32,561) | 23.88% (85,363)
Salmonella Enterica | 25.00% (1) | 24.75% (9) | 24.50% (35) | 24.38% (97) | 24.25% (299) | 24.13% (1,077) | 24.13% (4,159) | 24.00% (16,457) | 23.75% (65,618)
Staphylococcus Aureus | 23.88% (1) | 23.75% (5) | 23.75% (18) | 23.63% (67) | 23.63% (260) | 23.63% (1,029) | 23.50% (4,102) | 23.25% (16,391) | 22.75% (65,282)
Streptococcus Pneumoniae | 24.63% (1) | 24.38% (8) | 24.38% (31) | 24.25% (133) | 24.13% (574) | 24.13% (2,183) | 24.00% (6,928) | 23.75% (21,093) | 23.13% (71,592)
Streptococcus Pyogenes | 24.50% (1) | 24.38% (10) | 24.25% (50) | 24.13% (174) | 24.13% (456) | 24.13% (1,291) | 24.00% (4,418) | 23.88% (16,758) | 23.25% (65,919)
Influenza | 24.63% (1) | 24.13% (15) | 24.13% (125) | 24.00% (583) | 23.88% (2,329) | 23.50% (7,978) | 22.00% (21,316) | 18.63% (44,748) | 13.25% (101,559)
Coreutils | 68.38% (1) | 51.25% (236) | 35.88% (18,500) | 23.88% (169,716) | 17.00% (606,527) | 12.88% (1,335,553) | 10.13% (2,258,650) | 8.00% (3,258,896) | 6.50% (4,247,313)
Kernel | 67.25% (1) | 50.50% (160) | 36.63% (7,122) | 25.75% (90,396) | 19.25% (351,918) | 15.13% (773,818) | 12.13% (1,305,616) | 9.63% (1,912,604) | 7.75% (2,553,008)
Einstein (en) | 62.00% (1) | 46.38% (139) | 33.38% (4,546) | 21.13% (28,685) | 13.25% (77,333) | 9.00% (142,559) | 6.50% (211,506) | 4.75% (276,343) | 3.50% (335,151)
Einstein (de) | 63.00% (1) | 44.88% (117) | 32.63% (3,278) | 20.88% (16,765) | 13.25% (39,010) | 9.00% (64,884) | 6.13% (89,914) | 4.38% (112,043) | 3.13% (130,473)
Nobel (en) | 62.63% (1) | 44.63% (126) | 30.50% (3,566) | 18.25% (18,079) | 11.50% (42,334) | 8.13% (69,855) | 6.00% (95,644) | 4.50% (119,260) | 3.38% (140,401)
Nobel (de) | 61.13% (1) | 43.25% (118) | 31.13% (2,726) | 19.63% (12,959) | 12.50% (30,756) | 8.63% (49,695) | 6.00% (66,108) | 4.13% (80,467) | 3.00% (92,184)
Turing (en) | 63.25% (1) | 45.75% (103) | 32.00% (2,794) | 19.13% (14,091) | 11.50% (33,498) | 7.63% (55,489) | 5.38% (75,611) | 3.88% (93,402) | 2.88% (108,636)
Turing (de) | 62.38% (1) | 43.25% (100) | 29.25% (1,806) | 16.75% (7,268) | 9.50% (15,407) | 6.00% (23,070) | 3.88% (29,038) | 2.63% (33,714) | 2.00% (37,335)
World Leaders | 43.38% (1) | 24.38% (89) | 17.25% (2,526) | 11.63% (23,924) | 7.63% (106,573) | 5.13% (246,566) | 4.00% (374,668) | 3.50% (468,701) | 3.13% (547,040)
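The columns H0 to H8 read naturally as the k-th order empirical entropy of each text, with the number of distinct length-k contexts in parentheses; the page does not state this, so treat it as an assumption. Under that reading, a straightforward (memory-hungry) way to compute both values is sketched below. The function name `empirical_entropy` is ours.

```python
import math
from collections import Counter, defaultdict


def empirical_entropy(text, k):
    """Return (H_k in bits per symbol, number of distinct length-k contexts)."""
    n = len(text)
    if n == 0:
        raise ValueError("empty input")
    # For every context of k symbols, count the symbols that follow it.
    followers = defaultdict(Counter)
    for i in range(k, n):
        followers[text[i - k:i]][text[i]] += 1
    bits = 0.0
    for counts in followers.values():
        total = sum(counts.values())
        # Zero-order entropy of the followers, weighted by how many there are.
        bits -= sum(c * math.log2(c / total) for c in counts.values())
    return bits / n, len(followers)
```

If the percentages in the table express H_k relative to 8 bits per symbol (again an assumption), the figure to compare would be `100 * h / 8` for the returned entropy `h`. For k = 0 there is a single empty context, which matches the (1) shown in the H0 column.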
Collection | delta | z | v | r | g
Cere | 1,003,280 | 1,700,630 | 1,649,448 | 11,574,640 | 4,069,452
Para | 1,369,096 | 2,332,657 | 2,238,362 | 15,636,739 | 5,344,477
Escherichia Coli | 1,337,977 | 2,078,512 | 2,014,012 | 15,044,487 | 4,342,874
Influenza | 281,857 | 769,286 | 768,623 | 3,022,821 | 1,957,370
Coreutils | 636,101 | 1,446,468 | 1,439,918 | 4,684,460 | 2,409,429
Kernel | 405,643 | 793,915 | 794,058 | 2,791,367 | 1,374,651
Einstein (en) | 42,884 | 89,467 | 97,442 | 290,238 | 212,902
Einstein (de) | 16,309 | 34,572 | 37,721 | 101,369 | 84,499
World Leaders | 68,651 | 175,740 | 179,696 | 573,487 | 399,667
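The page does not define the column names of this last table. Reading z as the number of phrases in a Lempel-Ziv (LZ77) parse and r as the number of equal-symbol runs in the Burrows-Wheeler transform is an assumption on our part, although both are common repetitiveness measures. The naive sketch below computes the two quantities on small inputs only; the function names are ours, and the parsing variant (longest earlier-starting match, overlaps allowed, no extra trailing symbol) is one of several in use.

```python
def lz77_phrases(text):
    """Number of phrases in a greedy LZ77 parse (quadratic-time sketch)."""
    n, i, z = len(text), 0, 0
    while i < n:
        longest = 0
        # Longest prefix of text[i:] that also starts at some position j < i.
        for j in range(i):
            length = 0
            while i + length < n and text[j + length] == text[i + length]:
                length += 1
            longest = max(longest, length)
        i += max(longest, 1)  # an unseen symbol forms a phrase of length 1
        z += 1
    return z


def bwt_runs(text):
    """Number of runs in the BWT of text plus a sentinel (naive rotations)."""
    s = text + "\0"  # assumption: "\0" does not occur in text and sorts first
    n = len(s)
    order = sorted(range(n), key=lambda i: s[i:] + s[:i])
    bwt = [s[i - 1] for i in order]  # symbol preceding each sorted rotation
    return sum(1 for j in range(n) if j == 0 or bwt[j] != bwt[j - 1])
```

On the string "banana" these return 4 phrases (b, a, n, ana) and 5 runs. Collections of the sizes listed here would need suffix-array-based algorithms rather than these quadratic loops.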
© P. Ferragina and G. Navarro. Last update: October 2010.