We recommend using one text of each type provided in
the corpus, to cover a representative set of relevant applications,
unless you are interested in a specific application. We give one text
of each type, but also prefixes of 50MB, 100MB and 200MB of it. This
way the user can choose the size that best fits its domain of
application of the compressed index. We point out that the English text
has also a prefix of 1GB, in case you have trouble to handle the full
>2GB text.
In case you are interested in monitoring the
performance for increasing text sizes, you will prefer to test
increasing prefixes of a text collection. In the utilities there is a
program to cut text prefixes of any length.
The performance of a compressed index may be evaluated either at construction time or at query time.
At construction time we are interested in:
- Time to build the index, considering user+system time.
- Internal memory working space:
amount of RAM necessary to build the index (peak usage along the
construction process). Note that we do not consider the temporary disk
usage, as this is difficult to measure. This means that a construction
algorithm that uses secondary memory can improve this measure, in
exchange for higher construction time (I/Os).
- Permanent space on disk:
amount of disk space used to store the index. Note that the index can
be stored in a way that is so compact that it cannot be just loaded and
used, but needs some uncompress operation. This is permissible, and
will be paid in index load time (see next). In an extreme case, one
could just compress the text and declare it as the index on disk, and
build the index from scratch at load time. This will look good in terms
of permanent space on disk, but will incur in a very high load time.
Note that this is not cheating in terms of index size, because the real
index size is measured as the internal memory usage to solve queries
At query time we are interested in:
- Index load time: user+system time to read the index from disk so that it becomes ready for querying. See the previous item permanent space on disk.
- Internal memory usage to solve queries:
amount of RAM necessary to answer a substring query (peak usage along
the query process). This is a measure of the actual space needed by the
index to operate. To ensure that an index uses only the internal
memory, you can run the query process without any write permission on
disk. The indexes usually have parameters that trade query time for
index space (disk or internal working space), so this has to be
measured as a function of these parameters.
- Query time to solve queries:
user+system time to carry out queries of different type. The same
comments on space usage above (parameter dependence and expected
behavior depending on queries) apply for time.
The last two measures, namely internal memory usage and query time,
are relevant for the following queries:
- Counting: Usually one is interested in
the performance as a function of the pattern length, so the experiment
should report the results in this way, or just in terms of time per
pattern character if the function turns out to be linear. For each
length, a sufficient number of searches must be averaged so as to take
care of the variance due to pattern characteristics and to computer
time.
- Locating: As the process has to allocate an
array of occurrences to return, this array is included in its memory
usage (although, as we report peak memory usage, it is not necessarily
clear how to discount it). Time results can be given normalized as time
per occurrence.
- Displaying: As the process has to
allocate an array of snippets to return, this array is included in its
memory usage. As for time, it can be normalized as time per character
displayed.
- Extracting: As the process has to
allocate a text string to return, the space for this string is included
in its memory usage. As for time, it can be normalized as time per
character extracted.
We are aware of the fact that performing all these
measures is a repetitive process and could take much time. Since all
indexes offer the same API we provide some utility programs to automate the testing phase. Please download them before proceeding further in your experimentation (in the link you will find each file separately as well as utils.tar.gz with them all except the compressors). Below we comment some of their main features.
We provide a bunch of programs to manipulate the texts in the collections and to extract some statistics from them.
- cut <filename> <X>
that produces
<filename>.<X>MB, a prefix of <filename> of <X>
megabytes. Files are cut.c (source) and cut (executable).
- alph <filename>
gives basic alphabet statistics on <filename>. Files are alph.c (source) and alph (executable).
- entrop <filename> <k>
computes
the empirical entropy of order <k> of <filename>. Files are
entrop.c (source) and entrop (executable). There are variants entrop2 and entrop3 with different time/memory tradeoffs.
- gentext <filename> <n> <s>
generates in <filename> a random text of <n> megabytes,
uniformly distributed over <s> characters, choosing ASCII values
1 to <s>. Files are gentext.c (source) and gentext (executable).
- gzip sources and executables. Files are gzip-1.3.3.tar.gz (sources), gzip-1.3.3 (compressor), and gunzip (decompressor).
- bzip2 sources and executables. Files are bzip2-1.0.3.tar.gz (sources), bzip2-1.0.3 (compressor), and bunzip2 (decompressor).
- ppmdi sources and executables (the version we used for our
statistics). Files are ppmdi.tar.gz (sources), ppmdi (compressor), and
unppmdi (decompressor).
We provide a bunch of programs to generate random
query patterns to be searched, and random snippets to be extracted.
These will constitute the material for our experiments over the
compressed indexes.
- genpatterns <filename> <length> <number> <patterns file> [<forbidden>]
It generates in <patterns file> a <number> of
substrings consisting of <length> characters chosen at random
from <filename>. Strings with characters in <forbidden> are
discarded (for nonprintable
chracters <forbidden> uses \n, \t, etc. or \cN where N is a ASCII code written using 3 digits). The output file starts with a line of the form
# number=<number> length=<length> file=<filename>
forbidden=<forbidden>
The <number> patterns then come one after the other without any
separator. Files are genpatterns.c (source) and genpatterns (executable).
- genintervals <filename> <length> <number> <intervals file>
It generates in <intervals file> a <number> of intervals
having <length> chosen at random from <filename>. The
output file starts with a line of the form
# number=<number> length=<length> file=<filename>
Then follow <number> lines having the form <from>,<to>.
Files are genintervals.c (source) and genintervals (executable).
- interface.h
It is the interface file providing the functions of the library. File is interface.h.
- build_index <source file> <index file> [<parameters>]
It builds the index for the text in file <source file>,
storing it in <index file> (possibly with extensions on that file
name). Any additional <parameters> will be passed to the
construction function. At the end, the program sends to the standard
error performance measures on time to build the index. File is build_index.c (source).
- run_queries <index file> <type> [<length>] [V]
It loads <index> and then executes over it the queries it
receives from the standard input. The standard input comes in the
format of the files written by genpatterns or genintervals. The query types are given by <type>. Value C means counting queries, L locating queries, D displaying queries, and E extracting queries. For the first three queries the standard input must follow the format of the file produced by genpatterns. For the last query (namely E), the format of the standard input must be the one of genintervals.
Parameter <length> must be provided in case of displaying queries
(D), and denotes the number of characters to display before and after
each pattern occurrence. With the option V the program reports the results of a set of queries on the stantard output (in binary format). This output may be used with compare for checking the correctness of the query results. The program also reports on the standard error some time statistics regarding the query execution. File is run_queries.c (source).
-
compare <results file1> <results file2>
<results file1> and <results file2> are binary files produced by run_queries with the option V. The program finds the differences between the two files. It is useful to check the correctness and completness of an index by comparing it against a trusted index (e.g. suffix array, which is made available). Files are compare.c (source) and compare (executable).
-
memusage [option] <program>
We suggest to use the program by Ulrich Drepper for taking the peaks of internal memory usage by a given executable (i.e. <program>). This program reports on the standard output the peak of both stack and heap usage. It can also generate a graphical output, for which you need the gd-devel package. You can run memusage --help for more details. Files are memusage (source) and gd-devel.
We note that build_index.c and run_queries.c are source programs that can be compiled to build_index.o and run_queries.o. To obtain the executables, they must be linked with each different index implementation. They make use of the index interface interface.h provided above. This interface is also useful to write other programs that use the index. A makefile is provided that compiles all the programs (except the compressors).
|