Statistics for Artificial Collections 
This subset is composed of highly repetitive texts that do not come from any reallife source, but are artificially generated through some mathematical definition and have interesting combinatorial properties. Fibonacci Sequence F(n) This sequence is defined by the recurrence F(1) = 0 F(2) = 1 F(n) = F(n  1) + F(n  2) The length of the string F(n) is the Fibonacci number f(n), and the sequence is a sturmian word, which means it has i + 1 different substrings (factors) of length i. ThueMorse Sequence T(n) This sequence is defined by the recurrence T(1)=0 T(n) = T(n1)T(n1) where F is the bitwise negation operator. Because of the construction scheme of this sequence, there are many substrings of the form XX, for any string X. However, there are no overlapping squares, i.e., substrings of the form 0X0X0 or 1X1X1. Furthermore, this sequence is strongly cubefree, i.e., there are no substrings of the form XXx, where x is the first character of the string X. Another interesting property of this string is that it is recurrent. That is, given any finite substring w of length n, there is some length n(w) (often much longer than n) such that w is contained in every substring of length n(w). The length of these strings is Tn= 2n. RunRich String Sequence R(n) A measure of string complexity, related to the regularities of the text and strongly related to the LZ77 parsing, is the number of runs. The substring T[i, j] is a run in a string T iff the minimum period of T[i, j] = p <= T[i, j] / 2 and T[i, j] is not extendable to the right (j = n or T[j + 1] /= T[j  p + 1]) or left (i = 1 or T[i  1] /= T[i + p  1]). The higher the number of runs in a string, the more regularities it has. It has been shown that the maximum number of runs in a string is greater than 0.944n and lower than 1.029n. Franek et al. show a constructive and simple way to obtain strings with many runs; the nth of those strings is denoted R(n). The ratio of the runs of their strings to the length approaches 3/(1 + sqrt(5)) = 0.92705....
Send Mail to Us  © P. Ferragina and G. Navarro, Last update: October, 2010.
