Here we present a set of texts generated by artificially adding repetitiveness
to real texts; we therefore call them pseudo-real texts.
To generate them, we took a 1 MiB prefix of each text of the Pizza&Chili Corpus
listed below and mutated it. Each mutation picks a random character position and
replaces the character there with a random character different from the original one.
We used two different schemes for the mutations. The first, denoted by (1),
generates every text as a separate mutation of the first text. The second, denoted
by (2), generates each text by mutating the last text generated. The second scheme
resembles the changes that a software project or a document accumulates over
successive versions.
The mutation rate, i.e., the percentage of mutated characters, was set to 0.1%,
0.01% and 0.001%.
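As an illustration of the procedure described above, the following Python sketch implements the mutation step and both schemes. It is not the generator used to build the collection. The function names, drawing replacement characters from the text's own alphabet, mutating distinct positions, keeping the unmutated base as the first text, and the number of copies are all assumptions made for the example.

```python
import random

def mutate(text: bytes, rate: float, rng: random.Random) -> bytes:
    """Return a copy of text with round(rate * len(text)) distinct positions changed.

    Assumption: the replacement is drawn from the symbols that occur in text.
    """
    alphabet = sorted(set(text))
    if len(alphabet) < 2:
        raise ValueError("need at least two distinct symbols to mutate")
    out = bytearray(text)
    n_mut = round(rate * len(text))
    for pos in rng.sample(range(len(text)), n_mut):
        # Pick any symbol except the one currently at this position.
        out[pos] = rng.choice([c for c in alphabet if c != text[pos]])
    return bytes(out)

def scheme1(base: bytes, copies: int, rate: float, rng: random.Random) -> list:
    """Scheme (1): every copy is an independent mutation of the first text."""
    return [base] + [mutate(base, rate, rng) for _ in range(copies - 1)]

def scheme2(base: bytes, copies: int, rate: float, rng: random.Random) -> list:
    """Scheme (2): every copy is a mutation of the last text generated."""
    texts = [base]
    for _ in range(copies - 1):
        texts.append(mutate(texts[-1], rate, rng))
    return texts

# Small demonstration on a made-up base text (rate 0.1% written as a fraction).
base = b"the quick brown fox jumps over the lazy dog " * 200
rng = random.Random(42)
for name, scheme in (("(1)", scheme1), ("(2)", scheme2)):
    texts = scheme(base, copies=5, rate=0.001, rng=rng)
    drift = sum(a != b for a, b in zip(base, texts[-1]))
    print(name, "last copy differs from the base in", drift, "positions")
```

In this sketch every copy under scheme (1) stays one round of mutations away from the base, whereas under scheme (2) the differences from the base accumulate from copy to copy.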
The base texts (all from the Pizza&Chili corpus) we mutated were the following:
- Sources: This file is formed by C/Java source code obtained by concatenating all
the .c, .h, .C and .java files of the linux-2.6.11.6 and gcc-4.0.0 distributions.
- Pitches: This file is a sequence of MIDI pitch values (bytes in 0-127, plus a few
extra special values) obtained from a myriad of MIDI files freely available on
the Internet.
- Proteins: This file is a sequence of newline-separated protein sequences obtained
from the Swissprot database.
- DNA: This file is a sequence of newline-separated gene DNA sequences obtained
from files 01hgp10 to 21hgp10, plus 0xhgp10 and 0yhgp10, from the Gutenberg
Project.
- English: This file is the concatenation of English text files selected from the
etext02 to etext05 collections of the Gutenberg Project.
- XML: This file is an XML document that provides bibliographic information on
major computer science journals and proceedings. It was obtained from
http://dblp.uni-trier.de.
Collection | Size (MiB) | Alphabet size | Inverse match probability |
Xml 0.001% (1) | 100 | 89 | 27.84 |
Xml 0.01% (1) | 100 | 89 | 27.84 |
Xml 0.1% (1) | 100 | 89 | 27.84 |
DNA 0.001% (1) | 100 | 5 | 3.98 |
DNA 0.01% (1) | 100 | 5 | 3.98 |
DNA 0.1% (1) | 100 | 5 | 3.98 |
English 0.001% (1) | 100 | 106 | 15.65 |
English 0.01% (1) | 100 | 106 | 15.65 |
English 0.1% (1) | 100 | 106 | 15.65 |
Pitches 0.001% (1) | 100 | 73 | 33.07 |
Pitches 0.01% (1) | 100 | 73 | 33.07 |
Pitches 0.1% (1) | 100 | 73 | 33.07 |
Proteins 0.001% (1) | 100 | 21 | 16.90 |
Proteins 0.01% (1) | 100 | 21 | 16.90 |
Proteins 0.1% (1) | 100 | 21 | 16.90 |
Sources 0.001% (1) | 100 | 98 | 28.86 |
Sources 0.01% (1) | 100 | 98 | 28.86 |
Sources 0.1% (1) | 100 | 98 | 28.86 |
Collection | Size (MiB) | Alphabet size | Inverse match probability |
Xml 0.001% (2) | 100 | 89 | 27.84 |
Xml 0.01% (2) | 100 | 89 | 27.84 |
Xml 0.1% (2) | 100 | 89 | 27.86 |
DNA 0.001% (2) | 100 | 5 | 3.98 |
DNA 0.01% (2) | 100 | 5 | 3.98 |
DNA 0.1% (2) | 100 | 5 | 3.98 |
English 0.001% (2) | 100 | 106 | 15.65 |
English 0.01% (2) | 100 | 106 | 15.66 |
English 0.1% (2) | 100 | 106 | 15.74 |
Pitches 0.001% (2) | 100 | 73 | 33.07 |
Pitches 0.01% (2) | 100 | 73 | 33.07 |
Pitches 0.1% (2) | 100 | 73 | 33.10 |
Proteins 0.001% (2) | 100 | 21 | 16.90 |
Proteins 0.01% (2) | 100 | 21 | 16.90 |
Proteins 0.1% (2) | 100 | 21 | 16.92 |
Sources 0.001% (2) | 100 | 98 | 28.86 |
Sources 0.01% (2) | 100 | 98 | 28.86 |
Sources 0.1% (2) | 100 | 98 | 28.92 |
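The two statistics in the tables above can be computed from symbol counts alone. The sketch below assumes that the inverse match probability is the inverse of the probability that two characters chosen independently at random from the text are equal. The page itself does not spell out the definition, and the function names are ours.

```python
from collections import Counter

def alphabet_size(text: bytes) -> int:
    """Number of distinct symbols that occur in the text."""
    return len(set(text))

def inverse_match_probability(text: bytes) -> float:
    """1 / P(two independently chosen characters of the text are equal).

    Assumed definition: both positions are drawn uniformly, with replacement.
    """
    n = len(text)
    counts = Counter(text)
    p_match = sum((c / n) ** 2 for c in counts.values())
    return 1.0 / p_match

sample = b"ACGTACGTAACC"
print(alphabet_size(sample), inverse_match_probability(sample))
```

Under this definition a text over s equally frequent symbols has inverse match probability exactly s, so the value can be read as an effective alphabet size. This would be consistent with the DNA rows, where the alphabet has 5 symbols but the value is close to 4.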
Collection | p7zip | bzip2 | gzip | ppmdi | Re-Pair |
Xml 0.001% (1) | 0.15% | 11.00% | 18.00% | 3.50% | 0.19% |
Xml 0.01% (1) | 0.18% | 12.00% | 18.00% | 3.60% | 0.46% |
Xml 0.1% (1) | 0.46% | 12.00% | 18.00% | 4.10% | 2.00% |
DNA 0.001% (1) | 0.27% | 27.00% | 28.00% | 11.00% | 0.34% |
DNA 0.01% (1) | 0.29% | 27.00% | 28.00% | 11.00% | 0.58% |
DNA 0.1% (1) | 0.51% | 27.00% | 28.00% | 12.00% | 2.50% |
English 0.001% (1) | 0.31% | 28.00% | 37.00% | 22.00% | 0.39% |
English 0.01% (1) | 0.35% | 28.00% | 37.00% | 22.00% | 0.65% |
English 0.1% (1) | 0.59% | 28.00% | 37.00% | 22.00% | 2.70% |
Pitches 0.001% (1) | 0.47% | 54.00% | 52.00% | 47.00% | 0.69% |
Pitches 0.01% (1) | 0.50% | 54.00% | 52.00% | 47.00% | 0.95% |
Pitches 0.1% (1) | 0.75% | 54.00% | 52.00% | 48.00% | 3.20% |
Proteins 0.001% (1) | 0.32% | 41.00% | 39.00% | 31.00% | 0.42% |
Proteins 0.01% (1) | 0.35% | 41.00% | 39.00% | 31.00% | 0.68% |
Proteins 0.1% (1) | 0.59% | 41.00% | 39.00% | 32.00% | 2.70% |
Sources 0.001% (1) | 0.20% | 19.00% | 25.00% | 12.00% | 0.28% |
Sources 0.01% (1) | 0.23% | 19.00% | 25.00% | 12.00% | 0.56% |
Sources 0.1% (1) | 0.50% | 20.00% | 25.00% | 13.00% | 2.60% |
Collection | p7zip | bzip2 | gzip | ppmdi | Re-Pair |
Xml 0.001% (2) | 0.15% | 12.00% | 18.00% | 3.50% | 0.18% |
Xml 0.01% (2) | 0.18% | 14.00% | 19.00% | 4.40% | 0.39% |
Xml 0.1% (2) | 0.39% | 25.00% | 29.00% | 17.00% | 2.20% |
DNA 0.001% (2) | 0.26% | 27.00% | 28.00% | 11.00% | 0.33% |
DNA 0.01% (2) | 0.29% | 27.00% | 28.00% | 11.00% | 0.52% |
DNA 0.1% (2) | 0.46% | 27.00% | 28.00% | 13.00% | 2.20% |
English 0.001% (2) | 0.31% | 28.00% | 37.00% | 22.00% | 0.38% |
English 0.01% (2) | 0.34% | 29.00% | 37.00% | 23.00% | 0.59% |
English 0.1% (2) | 0.55% | 38.00% | 43.00% | 31.00% | 2.50% |
Pitches 0.001% (2) | 0.46% | 54.00% | 52.00% | 47.00% | 0.68% |
Pitches 0.01% (2) | 0.49% | 54.00% | 53.00% | 48.00% | 0.89% |
Pitches 0.1% (2) | 0.71% | 59.00% | 57.00% | 52.00% | 2.80% |
Proteins 0.001% (2) | 0.31% | 41.00% | 39.00% | 32.00% | 0.41% |
Proteins 0.01% (2) | 0.34% | 42.00% | 40.00% | 33.00% | 0.62% |
Proteins 0.1% (2) | 0.54% | 47.00% | 46.00% | 40.00% | 2.50% |
Sources 0.001% (2) | 0.20% | 20.00% | 25.00% | 13.00% | 0.27% |
Sources 0.01% (2) | 0.23% | 21.00% | 26.00% | 14.00% | 0.49% |
Sources 0.1% (2) | 0.44% | 34.00% | 35.00% | 26.00% | 2.50% |
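To get a feel for ratios like the ones in the two tables above, the sketch below measures compressed size as a percentage of the original size using the compressors in the Python standard library. Several things here are assumptions: that the tables report this ratio, the compressor settings, and the synthetic input (one random block repeated verbatim, a hypothetical stand-in for a real collection). The `lzma` module implements the LZMA algorithm that p7zip is based on; ppmdi and Re-Pair have no standard-library counterpart and are left out.

```python
import bz2
import lzma
import random
import zlib

def ratio_percent(data: bytes, compress) -> float:
    """Compressed size as a percentage of the original size."""
    return 100.0 * len(compress(data)) / len(data)

def repetitive_sample(base_len: int = 1 << 16, copies: int = 8, seed: int = 0) -> bytes:
    """Hypothetical stand-in for a collection: one random base repeated verbatim."""
    rng = random.Random(seed)
    base = bytes(rng.getrandbits(8) for _ in range(base_len))
    return base * copies

# Keys are labels for this example only.
COMPRESSORS = {
    "zlib": lambda d: zlib.compress(d, 9),  # DEFLATE, the algorithm used by gzip
    "bz2": lambda d: bz2.compress(d, 9),    # the algorithm used by bzip2
    "lzma": lambda d: lzma.compress(d),     # LZMA, the algorithm behind p7zip
}

data = repetitive_sample()
ratios = {name: ratio_percent(data, fn) for name, fn in COMPRESSORS.items()}
for name, value in ratios.items():
    print(f"{name}: {value:.2f}%")
```

DEFLATE can only refer back 32 KiB, so in this sample it cannot reuse an earlier copy that starts 64 KiB before the current position, while LZMA's much larger dictionary can. This may explain why gzip benefits so little from the repetitiveness in the tables, where copies are likely far further apart.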
Collection | H0 | H1 | H2 | H3 | H4 | H5 | H6 | H7 | H8 |
Xml 0.001% (1) | 65.25% (1) | 38.63% (89) | 21.00% (3,325) | 12.50% (20,560) | 8.13% (56,120) | 6.00% (98,084) | 5.25% (134,897) | 4.75% (168,846) | 4.13% (200,451) |
Xml 0.01% (1) | 65.25% (1) | 38.63% (89) | 21.00% (4,135) | 12.50% (30,975) | 8.13% (79,379) | 6.00% (131,811) | 5.25% (177,924) | 4.75% (220,923) | 4.13% (261,651) |
Xml 0.1% (1) | 65.25% (1) | 38.75% (89) | 21.25% (5,251) | 12.75% (67,479) | 8.25% (196,554) | 6.13% (326,296) | 5.38% (440,199) | 4.88% (550,570) | 4.25% (661,284) |
DNA 0.001% (1) | 25.00% (1) | 24.25% (5) | 24.13% (18) | 24.00% (67) | 24.00% (260) | 23.75% (1,029) | 23.50% (4,102) | 22.88% (16,349) | 21.25% (62,437) |
DNA 0.01% (1) | 25.00% (1) | 24.25% (5) | 24.13% (18) | 24.00% (67) | 24.00% (260) | 23.75% (1,029) | 23.50% (4,102) | 22.88% (16,368) | 21.25% (63,204) |
DNA 0.1% (1) | 25.00% (1) | 24.25% (5) | 24.13% (19) | 24.00% (70) | 24.00% (264) | 23.75% (1,034) | 23.50% (4,109) | 22.88% (16,399) | 21.38% (65,168) |
English 0.001% (1) | 57.25% (1) | 45.13% (106) | 34.75% (2,659) | 25.88% (18,352) | 19.88% (63,299) | 15.88% (145,194) | 12.50% (256,838) | 9.63% (379,514) | 7.25% (501,400) |
English 0.01% (1) | 57.25% (1) | 45.13% (106) | 34.75% (3,243) | 25.88% (24,063) | 19.88% (82,896) | 15.88% (180,401) | 12.50% (305,292) | 9.63% (439,387) | 7.25% (572,056) |
English 0.1% (1) | 57.25% (1) | 45.25% (106) | 34.88% (4,491) | 26.13% (46,116) | 20.13% (190,765) | 16.00% (439,130) | 12.50% (715,127) | 9.75% (983,435) | 7.25% (1,237,512) |
Pitches 0.001% (1) | 66.13% (1) | 61.00% (73) | 53.50% (3,549) | 37.13% (73,664) | 16.38% (376,958) | 6.25% (642,406) | 2.88% (767,028) | 1.38% (833,456) | 0.75% (871,970) |
Pitches 0.01% (1) | 66.13% (1) | 61.00% (73) | 53.50% (3,581) | 37.25% (76,900) | 16.38% (399,435) | 6.25% (684,445) | 2.88% (821,533) | 1.38% (898,126) | 0.75% (946,219) |
Pitches 0.1% (1) | 66.13% (1) | 61.13% (73) | 53.63% (3,733) | 37.38% (95,838) | 16.63% (598,394) | 6.38% (1,096,014) | 2.88% (1,363,610) | 1.50% (1,543,086) | 0.88% (1,687,166) |
Proteins 0.001% (1) | 52.25% (1) | 52.13% (21) | 51.63% (422) | 47.50% (8,045) | 25.13% (128,975) | 4.63% (463,357) | 0.75% (572,530) | 0.25% (589,356) | 0.25% (595,906) |
Proteins 0.01% (1) | 52.25% (1) | 52.13% (21) | 51.63% (422) | 47.50% (8,045) | 25.13% (131,064) | 4.63% (494,845) | 0.75% (626,269) | 0.25% (654,067) | 0.25% (670,075) |
Proteins 0.1% (1) | 52.25% (1) | 52.13% (21) | 51.63% (425) | 47.50% (8,076) | 25.50% (143,879) | 4.88% (768,510) | 0.88% (1,150,595) | 0.38% (1,293,347) | 0.38% (1,403,589) |
Sources 0.001% (1) | 68.75% (1) | 46.88% (98) | 30.00% (4,557) | 19.63% (29,667) | 14.38% (75,316) | 11.00% (130,527) | 8.38% (194,105) | 6.88% (259,413) | 5.75% (320,468) |
Sources 0.01% (1) | 68.75% (1) | 46.88% (98) | 30.00% (5,621) | 19.63% (42,303) | 14.38% (102,977) | 11.00% (170,525) | 8.50% (244,755) | 6.88% (320,237) | 5.75% (391,260) |
Sources 0.1% (1) | 68.75% (1) | 47.00% (98) | 30.25% (7,359) | 19.88% (104,679) | 14.63% (299,799) | 11.13% (498,046) | 8.50% (687,941) | 7.00% (872,189) | 5.88% (1,049,051) |
Collection | H0 | H1 | H2 | H3 | H4 | H5 | H6 | H7 | H8 |
Xml 0.001% (2) | 65.25% (1) | 38.63% (89) | 21.13% (3,325) | 12.63% (20,560) | 8.13% (56,120) | 6.00% (98,084) | 5.25% (134,897) | 4.75% (168,846) | 4.13% (200,451) |
Xml 0.01% (2) | 65.25% (1) | 39.38% (89) | 22.00% (4,135) | 13.25% (31,042) | 8.63% (79,630) | 6.50% (132,163) | 5.63% (178,388) | 5.13% (221,499) | 4.50% (262,329) |
Xml 0.1% (2) | 65.25% (1) | 44.00% (89) | 28.75% (5,255) | 18.50% (72,227) | 12.25% (226,418) | 9.25% (378,994) | 8.00% (513,539) | 7.13% (645,141) | 6.25% (777,226) |
DNA 0.001% (2) | 25.00% (1) | 24.25% (5) | 24.13% (18) | 24.00% (67) | 24.00% (260) | 23.75% (1,029) | 23.50% (4,102) | 22.88% (16,349) | 21.25% (62,436) |
DNA 0.01% (2) | 25.00% (1) | 24.25% (5) | 24.13% (18) | 24.13% (67) | 24.00% (260) | 23.88% (1,029) | 23.50% (4,102) | 23.00% (16,369) | 21.38% (63,242) |
DNA 0.1% (2) | 25.00% (1) | 24.50% (5) | 24.38% (19) | 24.25% (70) | 24.25% (264) | 24.13% (1,034) | 23.88% (4,109) | 23.50% (16,400) | 22.38% (65,387) |
English 0.001% (2) | 57.25% (1) | 45.13% (106) | 34.75% (2,659) | 26.00% (18,353) | 20.00% (63,300) | 15.88% (145,195) | 12.50% (256,838) | 9.63% (379,514) | 7.13% (501,400) |
English 0.01% (2) | 57.25% (1) | 45.50% (106) | 35.38% (3,243) | 26.50% (24,079) | 20.25% (83,037) | 15.88% (180,592) | 12.38% (305,458) | 9.50% (439,539) | 7.13% (572,186) |
English 0.1% (2) | 57.38% (1) | 47.75% (106) | 39.50% (4,482) | 31.13% (47,357) | 23.00% (202,366) | 16.63% (466,838) | 12.13% (749,065) | 8.88% (1,015,587) | 6.38% (1,265,447) |
Pitches 0.001% (2) | 66.13% (1) | 61.13% (73) | 53.63% (3,549) | 37.25% (73,664) | 16.38% (376,958) | 6.25% (642,406) | 2.88% (767,028) | 1.38% (833,456) | 0.75% (871,970) |
Pitches 0.01% (2) | 66.13% (1) | 61.13% (73) | 53.88% (3,581) | 37.50% (76,917) | 16.50% (399,546) | 6.38% (684,518) | 2.88% (821,589) | 1.38% (898,152) | 0.88% (946,228) |
Pitches 0.1% (2) | 66.13% (1) | 62.00% (73) | 55.88% (3,742) | 40.25% (96,359) | 17.38% (606,175) | 6.50% (1,103,560) | 3.13% (1,367,417) | 1.88% (1,545,154) | 1.38% (1,688,526) |
Proteins 0.001% (2) | 52.25% (1) | 52.13% (21) | 51.63% (422) | 47.50% (8,045) | 25.25% (128,975) | 4.63% (463,357) | 0.75% (572,529) | 0.25% (589,356) | 0.25% (595,906) |
Proteins 0.01% (2) | 52.25% (1) | 52.13% (21) | 51.63% (422) | 47.63% (8,045) | 25.75% (131,079) | 5.00% (494,846) | 0.88% (626,306) | 0.50% (654,107) | 0.38% (670,114) |
Proteins 0.1% (2) | 52.25% (1) | 52.13% (21) | 51.75% (426) | 48.75% (8,072) | 30.13% (143,924) | 7.63% (771,311) | 2.13% (1,154,106) | 1.50% (1,297,080) | 1.38% (1,407,901) |
Sources 0.001% (2) | 68.75% (1) | 47.00% (98) | 30.00% (4,557) | 19.75% (29,667) | 14.38% (75,316) | 11.00% (130,527) | 8.50% (194,105) | 6.88% (259,413) | 5.75% (320,468) |
Sources 0.01% (2) | 68.75% (1) | 47.50% (98) | 30.75% (5,615) | 20.13% (42,337) | 14.63% (103,082) | 11.13% (170,646) | 8.63% (244,874) | 7.00% (320,346) | 5.88% (391,369) |
Sources 0.1% (2) | 68.75% (1) | 51.25% (98) | 36.63% (7,372) | 24.38% (108,997) | 16.75% (319,310) | 12.13% (525,914) | 9.13% (718,657) | 7.25% (904,022) | 6.00% (1,080,824) |
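The Hk columns above can be reproduced in spirit with the usual definition of the k-th order empirical entropy. The sketch below returns Hk in bits per symbol together with the number of distinct length-k contexts, which is how we read the figure in parentheses. The function name is ours. The percentages in the tables look like Hk divided by 8 bits (an H0 of 25.00% for DNA would be 2 bits per base), but this is an inference rather than something the page states.

```python
import math
from collections import Counter, defaultdict

def empirical_entropy(text: bytes, k: int):
    """Return (H_k in bits per symbol, number of distinct length-k contexts).

    For every context w of length k, collect the symbols that follow w and
    add up the zero-order entropy of those followers, weighted by their count.
    """
    n = len(text)
    followers = defaultdict(Counter)
    for i in range(k, n):
        followers[text[i - k:i]][text[i]] += 1
    total_bits = 0.0
    for counts in followers.values():
        m = sum(counts.values())
        total_bits -= sum(c * math.log2(c / m) for c in counts.values())
    return total_bits / n, len(followers)

sample = b"abracadabra" * 50
for k in range(4):
    h, contexts = empirical_entropy(sample, k)
    # Percentage relative to 8 bits per symbol (assumed reading of the tables).
    print(f"H{k} = {100.0 * h / 8:.2f}% ({contexts})")
```

With k = 0 there is a single (empty) context, and with k = 1 the number of contexts is at most the alphabet size, which matches the (1) and alphabet-sized counts in the H0 and H1 columns.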
Collection | delta | z | v | r | g |
Xml (100MiB) | 1,633,682 | 3,718,234 | 3,702,841 | 15,060,347 | 5,858,863 |
DNA (100MiB) | 5,295,537 | 7,368,719 | 7,344,256 | 65,847,105 | 13,231,574 |
English (100MiB) | 4,367,652 | 7,408,437 | 7,385,929 | 36,383,520 | 11,148,647 |
Pitches (50MiB) | 3,125,803 | 5,782,285 | 5,755,922 | 22,527,480 | 10,263,313 |
Proteins (100MiB) | 8,225,356 | 12,256,952 | 12,171,663 | 61,956,836 | 20,616,331 |
Sources (100MiB) | 3,207,699 | 6,208,156 | 6,188,857 | 24,458,663 | 10,205,785 |
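The page does not define the columns of the table above. Assuming that z and r denote, as is common in the literature on repetitive collections, the number of phrases of a Lempel-Ziv (LZ77) parse and the number of equal-letter runs in the Burrows-Wheeler transform, the following naive sketch computes both for small inputs. The exact variants used for the table (for instance, whether a phrase includes a trailing fresh symbol, or how the terminator is handled) are unknown, so counts may differ slightly under other conventions. The function names are ours, and the code is quadratic or worse, so it is unsuitable for the actual collections.

```python
def lz77_phrases(text: bytes) -> int:
    """Count LZ77 phrases: longest earlier occurrence (overlap allowed) plus one fresh symbol."""
    n, i, z = len(text), 0, 0
    while i < n:
        longest = 0
        for j in range(i):  # try every earlier starting position
            length = 0
            while i + length < n and text[j + length] == text[i + length]:
                length += 1
            longest = max(longest, length)
        i += longest + 1  # the copied part plus one explicit symbol
        z += 1
    return z

def bwt_runs(text: bytes) -> int:
    """Count equal-letter runs in the BWT of text followed by a unique terminator."""
    s = list(text) + [-1]  # -1 sorts before every byte value
    n = len(s)
    order = sorted(range(n), key=lambda i: s[i:] + s[:i])  # sorted rotations
    last = [s[i - 1] for i in order]  # last column of the sorted rotations
    return 1 + sum(1 for a, b in zip(last, last[1:]) if a != b)

sample = b"abracadabra" * 20
print(len(sample), lz77_phrases(sample), bwt_runs(sample))
```

On repetitive inputs both counts grow much more slowly than the text length, which is why measures of this kind are used to quantify repetitiveness.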
© P. Ferragina and G. Navarro. Last update: October 2010.