| 
 Here we present a set of texts generated by artificially adding repetitiveness
to real texts; we therefore call them pseudo-real texts.

To generate them, we took the 1MiB prefix of each text in the Pizza&Chili Corpus
and mutated it. A mutation picks a random character position and changes
the character there to a random character different from the original one.

We used two different mutation schemes. The first, denoted by (1),
generates each new text as a separate mutation of the first text. The second, denoted by (2), generates
each new text by mutating the last text generated. The second scheme resembles the changes that accumulate over
time in a software project or across the versions of a document.

The mutation rate, i.e., the percentage of mutated characters, was set to 0.1%, 0.01%
and 0.001%.

The base texts we mutated (all from the Pizza&Chili Corpus) were the following:
Sources: This file is formed by C/Java source code obtained by concatenating all the .c, .h, .C and .java files of the linux-2.6.11.6 and gcc-4.0.0 distributions.

Pitches: This file is a sequence of MIDI pitch values (bytes in 0-127, plus a few extra special values) obtained from a myriad of MIDI files freely available on the Internet.

Proteins: This file is a sequence of newline-separated protein sequences obtained from the Swissprot database.

DNA: This file is a sequence of newline-separated gene DNA sequences obtained from files 01hgp10 to 21hgp10, plus 0xhgp10 and 0yhgp10, from the Gutenberg Project.

English: This file is the concatenation of English text files selected from the etext02 to etext05 collections of the Gutenberg Project.

XML: This file is an XML file that provides bibliographic information on major computer science journals and proceedings; it was obtained from http://dblp.uni-trier.de.
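
The two mutation schemes described above can be illustrated with a minimal Python sketch. It is not the generator used to build the collections: the function names `mutate` and `pseudo_real`, the `copies` and `seed` parameters, the choice to draw replacement characters from the alphabet of the base text, the use of distinct mutation positions, and the decision to keep the first copy unmutated are all assumptions made for illustration.

```python
import random


def mutate(text: bytes, rate: float, alphabet: list, rng: random.Random) -> bytes:
    """Return a copy of `text` with round(len(text) * rate) characters changed.

    Assumptions (not stated in the source): positions are distinct, and the
    replacement is drawn uniformly from `alphabet` minus the original symbol.
    """
    out = bytearray(text)
    n_mut = round(len(out) * rate)
    for pos in rng.sample(range(len(out)), n_mut):
        old = out[pos]
        # Any symbol of the alphabet except the one already at this position.
        out[pos] = rng.choice([c for c in alphabet if c != old])
    return bytes(out)


def pseudo_real(base: bytes, copies: int, rate: float, scheme: int, seed: int = 0) -> bytes:
    """Concatenate `copies` versions of `base` (hypothetical helper).

    scheme 1: every new version is a mutation of the first text.
    scheme 2: every new version is a mutation of the last text generated.
    """
    rng = random.Random(seed)
    alphabet = sorted(set(base))
    versions = [base]  # assumption: the first copy is left unmutated
    for _ in range(copies - 1):
        source = base if scheme == 1 else versions[-1]
        versions.append(mutate(source, rate, alphabet, rng))
    return b"".join(versions)


if __name__ == "__main__":
    base = b"abcdefgh" * 100  # stand-in for a 1MiB prefix
    star = pseudo_real(base, copies=5, rate=0.01, scheme=1)
    chain = pseudo_real(base, copies=5, rate=0.01, scheme=2)
    print(len(star), len(chain))
```

Under scheme (1) every version stays within one round of mutations of the base text, whereas under scheme (2) differences from the base text can accumulate from one version to the next.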
  
    | Collection | Size | Alphabet size | Inverse match probability |  
    | Xml 0.001% (1) | 100MiB | 89 | 27.84 |  
    | Xml 0.01% (1) | 100MiB | 89 | 27.84 |  
    | Xml 0.1% (1) | 100MiB | 89 | 27.84 |  
    | DNA 0.001% (1) | 100MiB | 5 | 3.98 |  
    | DNA 0.01% (1) | 100MiB | 5 | 3.98 |  
    | DNA 0.1% (1) | 100MiB | 5 | 3.98 |  
    | English 0.001% (1) | 100MiB | 106 | 15.65 |  
    | English 0.01% (1) | 100MiB | 106 | 15.65 |  
    | English 0.1% (1) | 100MiB | 106 | 15.65 |  
    | Pitches 0.001% (1) | 100MiB | 73 | 33.07 |  
    | Pitches 0.01% (1) | 100MiB | 73 | 33.07 |  
    | Pitches 0.1% (1) | 100MiB | 73 | 33.07 |  
    | Proteins 0.001% (1) | 100MiB | 21 | 16.90 |  
    | Proteins 0.01% (1) | 100MiB | 21 | 16.90 |  
    | Proteins 0.1% (1) | 100MiB | 21 | 16.90 |  
    | Sources 0.001% (1) | 100MiB | 98 | 28.86 |  
    | Sources 0.01% (1) | 100MiB | 98 | 28.86 |  
    | Sources 0.1% (1) | 100MiB | 98 | 28.86 |  
 
  
    | Collection | Size | Alphabet size | Inverse match probability |  
    | Xml 0.001% (2) | 100MiB | 89 | 27.84 |  
    | Xml 0.01% (2) | 100MiB | 89 | 27.84 |  
    | Xml 0.1% (2) | 100MiB | 89 | 27.86 |  
    | DNA 0.001% (2) | 100MiB | 5 | 3.98 |  
    | DNA 0.01% (2) | 100MiB | 5 | 3.98 |  
    | DNA 0.1% (2) | 100MiB | 5 | 3.98 |  
    | English 0.001% (2) | 100MiB | 106 | 15.65 |  
    | English 0.01% (2) | 100MiB | 106 | 15.66 |  
    | English 0.1% (2) | 100MiB | 106 | 15.74 |  
    | Pitches 0.001% (2) | 100MiB | 73 | 33.07 |  
    | Pitches 0.01% (2) | 100MiB | 73 | 33.07 |  
    | Pitches 0.1% (2) | 100MiB | 73 | 33.10 |  
    | Proteins 0.001% (2) | 100MiB | 21 | 16.90 |  
    | Proteins 0.01% (2) | 100MiB | 21 | 16.90 |  
    | Proteins 0.1% (2) | 100MiB | 21 | 16.92 |  
    | Sources 0.001% (2) | 100MiB | 98 | 28.86 |  
    | Sources 0.01% (2) | 100MiB | 98 | 28.86 |  
    | Sources 0.1% (2) | 100MiB | 98 | 28.92 |  
 
  
    | Collection | p7zip | bzip2 | gzip | ppmdi | Re-Pair |  
    | Xml 0.001% (1) | 0.15% | 11.00% | 18.00% | 3.50% | 0.19% |  
    | Xml 0.01% (1) | 0.18% | 12.00% | 18.00% | 3.60% | 0.46% |  
    | Xml 0.1% (1) | 0.46% | 12.00% | 18.00% | 4.10% | 2.00% |  
    | DNA 0.001% (1) | 0.27% | 27.00% | 28.00% | 11.00% | 0.34% |  
    | DNA 0.01% (1) | 0.29% | 27.00% | 28.00% | 11.00% | 0.58% |  
    | DNA 0.1% (1) | 0.51% | 27.00% | 28.00% | 12.00% | 2.50% |  
    | English 0.001% (1) | 0.31% | 28.00% | 37.00% | 22.00% | 0.39% |  
    | English 0.01% (1) | 0.35% | 28.00% | 37.00% | 22.00% | 0.65% |  
    | English 0.1% (1) | 0.59% | 28.00% | 37.00% | 22.00% | 2.70% |  
    | Pitches 0.001% (1) | 0.47% | 54.00% | 52.00% | 47.00% | 0.69% |  
    | Pitches 0.01% (1) | 0.50% | 54.00% | 52.00% | 47.00% | 0.95% |  
    | Pitches 0.1% (1) | 0.75% | 54.00% | 52.00% | 48.00% | 3.20% |  
    | Proteins 0.001% (1) | 0.32% | 41.00% | 39.00% | 31.00% | 0.42% |  
    | Proteins 0.01% (1) | 0.35% | 41.00% | 39.00% | 31.00% | 0.68% |  
    | Proteins 0.1% (1) | 0.59% | 41.00% | 39.00% | 32.00% | 2.70% |  
    | Sources 0.001% (1) | 0.20% | 19.00% | 25.00% | 12.00% | 0.28% |  
    | Sources 0.01% (1) | 0.23% | 19.00% | 25.00% | 12.00% | 0.56% |  
    | Sources 0.1% (1) | 0.50% | 20.00% | 25.00% | 13.00% | 2.60% |  
 
  
    | Collection | p7zip | bzip2 | gzip | ppmdi | Re-Pair |  
    | Xml 0.001% (2) | 0.15% | 12.00% | 18.00% | 3.50% | 0.18% |  
    | Xml 0.01% (2) | 0.18% | 14.00% | 19.00% | 4.40% | 0.39% |  
    | Xml 0.1% (2) | 0.39% | 25.00% | 29.00% | 17.00% | 2.20% |  
    | DNA 0.001% (2) | 0.26% | 27.00% | 28.00% | 11.00% | 0.33% |  
    | DNA 0.01% (2) | 0.29% | 27.00% | 28.00% | 11.00% | 0.52% |  
    | DNA 0.1% (2) | 0.46% | 27.00% | 28.00% | 13.00% | 2.20% |  
    | English 0.001% (2) | 0.31% | 28.00% | 37.00% | 22.00% | 0.38% |  
    | English 0.01% (2) | 0.34% | 29.00% | 37.00% | 23.00% | 0.59% |  
    | English 0.1% (2) | 0.55% | 38.00% | 43.00% | 31.00% | 2.50% |  
    | Pitches 0.001% (2) | 0.46% | 54.00% | 52.00% | 47.00% | 0.68% |  
    | Pitches 0.01% (2) | 0.49% | 54.00% | 53.00% | 48.00% | 0.89% |  
    | Pitches 0.1% (2) | 0.71% | 59.00% | 57.00% | 52.00% | 2.80% |  
    | Proteins 0.001% (2) | 0.31% | 41.00% | 39.00% | 32.00% | 0.41% |  
    | Proteins 0.01% (2) | 0.34% | 42.00% | 40.00% | 33.00% | 0.62% |  
    | Proteins 0.1% (2) | 0.54% | 47.00% | 46.00% | 40.00% | 2.50% |  
    | Sources 0.001% (2) | 0.20% | 20.00% | 25.00% | 13.00% | 0.27% |  
    | Sources 0.01% (2) | 0.23% | 21.00% | 26.00% | 14.00% | 0.49% |  
    | Sources 0.1% (2) | 0.44% | 34.00% | 35.00% | 26.00% | 2.50% |  
 
  
    | Collection | H0 | H1 | H2 | H3 | H4 | H5 | H6 | H7 | H8 |  
    | Xml 0.001% (1) | 65.25% (1) | 38.63% (89) | 21.00% (3,325) | 12.50% (20,560) | 8.13% (56,120) | 6.00% (98,084) | 5.25% (134,897) | 4.75% (168,846) | 4.13% (200,451) |  
    | Xml 0.01% (1) | 65.25% (1) | 38.63% (89) | 21.00% (4,135) | 12.50% (30,975) | 8.13% (79,379) | 6.00% (131,811) | 5.25% (177,924) | 4.75% (220,923) | 4.13% (261,651) |  
    | Xml 0.1% (1) | 65.25% (1) | 38.75% (89) | 21.25% (5,251) | 12.75% (67,479) | 8.25% (196,554) | 6.13% (326,296) | 5.38% (440,199) | 4.88% (550,570) | 4.25% (661,284) |  
    | DNA 0.001% (1) | 25.00% (1) | 24.25% (5) | 24.13% (18) | 24.00% (67) | 24.00% (260) | 23.75% (1,029) | 23.50% (4,102) | 22.88% (16,349) | 21.25% (62,437) |  
    | DNA 0.01% (1) | 25.00% (1) | 24.25% (5) | 24.13% (18) | 24.00% (67) | 24.00% (260) | 23.75% (1,029) | 23.50% (4,102) | 22.88% (16,368) | 21.25% (63,204) |  
    | DNA 0.1% (1) | 25.00% (1) | 24.25% (5) | 24.13% (19) | 24.00% (70) | 24.00% (264) | 23.75% (1,034) | 23.50% (4,109) | 22.88% (16,399) | 21.38% (65,168) |  
    | English 0.001% (1) | 57.25% (1) | 45.13% (106) | 34.75% (2,659) | 25.88% (18,352) | 19.88% (63,299) | 15.88% (145,194) | 12.50% (256,838) | 9.63% (379,514) | 7.25% (501,400) |  
    | English 0.01% (1) | 57.25% (1) | 45.13% (106) | 34.75% (3,243) | 25.88% (24,063) | 19.88% (82,896) | 15.88% (180,401) | 12.50% (305,292) | 9.63% (439,387) | 7.25% (572,056) |  
    | English 0.1% (1) | 57.25% (1) | 45.25% (106) | 34.88% (4,491) | 26.13% (46,116) | 20.13% (190,765) | 16.00% (439,130) | 12.50% (715,127) | 9.75% (983,435) | 7.25% (1,237,512) |  
    | Pitches 0.001% (1) | 66.13% (1) | 61.00% (73) | 53.50% (3,549) | 37.13% (73,664) | 16.38% (376,958) | 6.25% (642,406) | 2.88% (767,028) | 1.38% (833,456) | 0.75% (871,970) |  
    | Pitches 0.01% (1) | 66.13% (1) | 61.00% (73) | 53.50% (3,581) | 37.25% (76,900) | 16.38% (399,435) | 6.25% (684,445) | 2.88% (821,533) | 1.38% (898,126) | 0.75% (946,219) |  
    | Pitches 0.1% (1) | 66.13% (1) | 61.13% (73) | 53.63% (3,733) | 37.38% (95,838) | 16.63% (598,394) | 6.38% (1,096,014) | 2.88% (1,363,610) | 1.50% (1,543,086) | 0.88% (1,687,166) |  
    | Proteins 0.001% (1) | 52.25% (1) | 52.13% (21) | 51.63% (422) | 47.50% (8,045) | 25.13% (128,975) | 4.63% (463,357) | 0.75% (572,530) | 0.25% (589,356) | 0.25% (595,906) |  
    | Proteins 0.01% (1) | 52.25% (1) | 52.13% (21) | 51.63% (422) | 47.50% (8,045) | 25.13% (131,064) | 4.63% (494,845) | 0.75% (626,269) | 0.25% (654,067) | 0.25% (670,075) |  
    | Proteins 0.1% (1) | 52.25% (1) | 52.13% (21) | 51.63% (425) | 47.50% (8,076) | 25.50% (143,879) | 4.88% (768,510) | 0.88% (1,150,595) | 0.38% (1,293,347) | 0.38% (1,403,589) |  
    | Sources 0.001% (1) | 68.75% (1) | 46.88% (98) | 30.00% (4,557) | 19.63% (29,667) | 14.38% (75,316) | 11.00% (130,527) | 8.38% (194,105) | 6.88% (259,413) | 5.75% (320,468) |  
    | Sources 0.01% (1) | 68.75% (1) | 46.88% (98) | 30.00% (5,621) | 19.63% (42,303) | 14.38% (102,977) | 11.00% (170,525) | 8.50% (244,755) | 6.88% (320,237) | 5.75% (391,260) |  
    | Sources 0.1% (1) | 68.75% (1) | 47.00% (98) | 30.25% (7,359) | 19.88% (104,679) | 14.63% (299,799) | 11.13% (498,046) | 8.50% (687,941) | 7.00% (872,189) | 5.88% (1,049,051) |  
 
  
    | Collection | H0 | H1 | H2 | H3 | H4 | H5 | H6 | H7 | H8 |  
    | Xml 0.001% (2) | 65.25% (1) | 38.63% (89) | 21.13% (3,325) | 12.63% (20,560) | 8.13% (56,120) | 6.00% (98,084) | 5.25% (134,897) | 4.75% (168,846) | 4.13% (200,451) |  
    | Xml 0.01% (2) | 65.25% (1) | 39.38% (89) | 22.00% (4,135) | 13.25% (31,042) | 8.63% (79,630) | 6.50% (132,163) | 5.63% (178,388) | 5.13% (221,499) | 4.50% (262,329) |  
    | Xml 0.1% (2) | 65.25% (1) | 44.00% (89) | 28.75% (5,255) | 18.50% (72,227) | 12.25% (226,418) | 9.25% (378,994) | 8.00% (513,539) | 7.13% (645,141) | 6.25% (777,226) |  
    | DNA 0.001% (2) | 25.00% (1) | 24.25% (5) | 24.13% (18) | 24.00% (67) | 24.00% (260) | 23.75% (1,029) | 23.50% (4,102) | 22.88% (16,349) | 21.25% (62,436) |  
    | DNA 0.01% (2) | 25.00% (1) | 24.25% (5) | 24.13% (18) | 24.13% (67) | 24.00% (260) | 23.88% (1,029) | 23.50% (4,102) | 23.00% (16,369) | 21.38% (63,242) |  
    | DNA 0.1% (2) | 25.00% (1) | 24.50% (5) | 24.38% (19) | 24.25% (70) | 24.25% (264) | 24.13% (1,034) | 23.88% (4,109) | 23.50% (16,400) | 22.38% (65,387) |  
    | English 0.001% (2) | 57.25% (1) | 45.13% (106) | 34.75% (2,659) | 26.00% (18,353) | 20.00% (63,300) | 15.88% (145,195) | 12.50% (256,838) | 9.63% (379,514) | 7.13% (501,400) |  
    | English 0.01% (2) | 57.25% (1) | 45.50% (106) | 35.38% (3,243) | 26.50% (24,079) | 20.25% (83,037) | 15.88% (180,592) | 12.38% (305,458) | 9.50% (439,539) | 7.13% (572,186) |  
    | English 0.1% (2) | 57.38% (1) | 47.75% (106) | 39.50% (4,482) | 31.13% (47,357) | 23.00% (202,366) | 16.63% (466,838) | 12.13% (749,065) | 8.88% (1,015,587) | 6.38% (1,265,447) |  
    | Pitches 0.001% (2) | 66.13% (1) | 61.13% (73) | 53.63% (3,549) | 37.25% (73,664) | 16.38% (376,958) | 6.25% (642,406) | 2.88% (767,028) | 1.38% (833,456) | 0.75% (871,970) |  
    | Pitches 0.01% (2) | 66.13% (1) | 61.13% (73) | 53.88% (3,581) | 37.50% (76,917) | 16.50% (399,546) | 6.38% (684,518) | 2.88% (821,589) | 1.38% (898,152) | 0.88% (946,228) |  
    | Pitches 0.1% (2) | 66.13% (1) | 62.00% (73) | 55.88% (3,742) | 40.25% (96,359) | 17.38% (606,175) | 6.50% (1,103,560) | 3.13% (1,367,417) | 1.88% (1,545,154) | 1.38% (1,688,526) |  
    | Proteins 0.001% (2) | 52.25% (1) | 52.13% (21) | 51.63% (422) | 47.50% (8,045) | 25.25% (128,975) | 4.63% (463,357) | 0.75% (572,529) | 0.25% (589,356) | 0.25% (595,906) |  
    | Proteins 0.01% (2) | 52.25% (1) | 52.13% (21) | 51.63% (422) | 47.63% (8,045) | 25.75% (131,079) | 5.00% (494,846) | 0.88% (626,306) | 0.50% (654,107) | 0.38% (670,114) |  
    | Proteins 0.1% (2) | 52.25% (1) | 52.13% (21) | 51.75% (426) | 48.75% (8,072) | 30.13% (143,924) | 7.63% (771,311) | 2.13% (1,154,106) | 1.50% (1,297,080) | 1.38% (1,407,901) |  
    | Sources 0.001% (2) | 68.75% (1) | 47.00% (98) | 30.00% (4,557) | 19.75% (29,667) | 14.38% (75,316) | 11.00% (130,527) | 8.50% (194,105) | 6.88% (259,413) | 5.75% (320,468) |  
    | Sources 0.01% (2) | 68.75% (1) | 47.50% (98) | 30.75% (5,615) | 20.13% (42,337) | 14.63% (103,082) | 11.13% (170,646) | 8.63% (244,874) | 7.00% (320,346) | 5.88% (391,369) |  
    | Sources 0.1% (2) | 68.75% (1) | 51.25% (98) | 36.63% (7,372) | 24.38% (108,997) | 16.75% (319,310) | 12.13% (525,914) | 9.13% (718,657) | 7.25% (904,022) | 6.00% (1,080,824) |  
 
  
    | Collection | delta | z | v | r | g |  
    | Xml (100MiB) | 1,633,682 | 3,718,234 | 3,702,841 | 15,060,347 | 5,858,863 |  
    | DNA (100MiB) | 5,295,537 | 7,368,719 | 7,344,256 | 65,847,105 | 13,231,574 |  
    | English (100MiB) | 4,367,652 | 7,408,437 | 7,385,929 | 36,383,520 | 11,148,647 |  
    | Pitches (50MiB) | 3,125,803 | 5,782,285 | 5,755,922 | 22,527,480 | 10,263,313 |  
    | Proteins (100MiB) | 8,225,356 | 12,256,952 | 12,171,663 | 61,956,836 | 20,616,331 |  
    | Sources (100MiB) | 3,207,699 | 6,208,156 | 6,188,857 | 24,458,663 | 10,205,785 |  
 
  
    © P. Ferragina and G. Navarro. Last update: October, 2010.
   
 |