Repetitive Corpus

Here we present a corpus of repetitive texts. The texts are grouped by source into three categories: Artificial Texts, Pseudo-Real Texts, and Real Texts. The main goal of this collection is to serve as a standard testbed for benchmarking algorithms designed for repetitive texts.


The files are compressed with p7zip and gzip to save bandwidth.


The compression statistics for all texts, along with information about their origin, can be viewed in the following PDF file.


The following indexes are designed specifically for repetitive texts.

