Pizza&Chili Corpus
Compressed Indexes and their Testbeds

The Italian mirror | The Chilean mirror

Api

We are particularly interested in self-indexes, namely compressed indexes that encapsulate sufficient information to reproduce any substring of the indexed text, and thus possibly the text itself. If a compressed index is not a self-index, then one must keep the text together with the index and report the text size plus the index size.

To use a compressed index over a text, we first have to build it, and then we can either query it to count or locate the occurrences of the queried pattern, or we can access some snippets of the indexed text for displaying the context of a pattern occurrence, or for retrieving some text substrings (possibly the whole text).

Indexes are used through the following API interface, written in the C/C++ language. We actually use uchar for denoting unsigned char and ulong for denoting unsigned long. The interface assumes that each text symbol is represented in one byte. The integer e returned by any procedure indicates an error code, if different of zero. The error message can be accessed by calling the procedure char *error_index(e). We further recall that text and pattern indexes start at zero. Below you find a schematic summary of the API interface offered by all the compressed indexes available for downloading. Please read carefully the COPYRIGHT information that comes with each of them.

 

Building the index

Function  Parameters  Comment 

int build_index

uchar *text,
ulong length,
char *build_options,
void **index

Creates index from text[0..length-1]. Note that the index is an opaque data type. Any build option must be passed in string build_options, whose syntax depends on the index. The index must always work with some default parameters if build_options is NULL. The returned index is ready to be queried.

int save_index

void *index,
char *filename

Saves index on disk by using single or multiple files, having proper extensions.

int load_index

char *filename,
void **index

Loads index from one or more file(s) named filename, possibly adding the proper extensions.

int free_index

void *index

Frees the memory occupied by index.

int index_size

void *index,
ulong *size

Writes in size the size, in bytes, of index. This should be the size needed by the index to perform any of the operations it implements.

 

Querying the index

Function  Parameters  Comment 

int count

void *index,
uchar *pattern,
ulong length,
ulong *numocc

Writes in numocc the number of occurrences of the substring pattern[0..length-1] found in the text indexed by index.

int locate

void *index,
uchar *pattern,
ulong length,
ulong **occ,
ulong *numocc

Writes in numocc the number of occurrences of the substring pattern[0..length-1] in the text indexed by index. It also allocates occ (which must be freed by the caller) and writes the locations of the numocc occurrences in occ, in arbitrary order.

 


Accessing the indexed text

Function  Parameters  Comment 

int extract

void *index,
ulong from,
ulong to,
uchar **snippet,
ulong *snippet_length

Allocates snippet (which must be freed by the caller) and writes the substring text[from..to] into it. Returns in snippet_length the length of the text snippet actually extracted (that could be less than to-from+1 if to is larger than the text size).

int display

void *index,
uchar *pattern,
ulong length,
ulong numc,
ulong *numocc,
uchar **snippet_text,
ulong **snippet_lengths

Displays the text (snippet) surrounding any occurrence of the substring pattern[0..length-1] within the text indexed by index. The snippet must include numc characters before and after the pattern occurrence, totalizing length+2*numc characters, or less if the text boundaries are reached. Writes in numocc the number of occurrences, and allocates the arrays snippet_text and snippet_lengths (which must be freed by the caller). The first is a character array of numocc*(length+2*numc) characters, with a new snippet starting at every multiple of length+2*numc. The second gives the real length of each of the numocc snippets.

int length

void *index,
ulong *length

Obtains the length of the text indexed by index.



Send Mail to Us | © P. Ferragina and G. Navarro, Last update: September, 2005.