Here we present a corpus of repetitive texts. The texts are grouped by origin into three categories: Artificial Texts, Pseudo-Real Texts, and Real Texts. The main goal of this collection is to serve as a standard testbed for benchmarking algorithms designed for repetitive texts.
Download
The files are compressed with p7zip and gzip to save bandwidth.
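As a convenience, the gzip-compressed files can be unpacked without external tools. The following is a minimal sketch using only the Python standard library; the function name `decompress_gzip` and the file names in the usage line are hypothetical, not part of the corpus. It does not cover the p7zip archives, which the standard library cannot read and which would need the p7zip tool itself.

```python
import gzip
import shutil
from pathlib import Path


def decompress_gzip(src: str, dst: str) -> int:
    """Decompress the gzip file `src` into `dst`.

    Returns the size of the decompressed file in bytes.
    """
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        # Copy in chunks so a large corpus file never has to fit in memory.
        shutil.copyfileobj(fin, fout)
    return Path(dst).stat().st_size
```

Assuming a downloaded file named `example.gz`, calling `decompress_gzip("example.gz", "example")` would write the plain text to `example`.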
Statistics
Compression statistics for all texts, along with information about their origin, are available in the following PDF file.
Indexes
The following indexes are designed specifically for repetitive texts.
© P. Ferragina and G. Navarro. Last update: October 2010.