
Chapter 12. Text Indexing and Lookup

Besides the “basic” indexing capabilities explained in Chapter 11, eXist also supports full-text indexes based on the Apache Lucene text search engine library. Lucene allows eXist to offer search capabilities such as looking for words near each other, finding words similar to other words, using Boolean text comparison operators, and more. Full-text indexes allow you to do much more with your content than you can with straight XPath expressions.

If your application needs to support searches based on human input, such as searching documentation or the like, full-text indexes can really help. But things get even better: on top of the full-text index searches, eXist offers keywords in context (KWIC) functionality. This makes it extremely easy to display the results of your searches in context, showing the search results within the surrounding text. We’ll examine this further in “Using Keywords in Context”.

Full-Text Index and KWIC Example

The examples for this book include a simple full-text search example. This example searches, using the full-text index, over some ancient Encyclopedia Britannica entries. Important components of the example are:

§ A full-text index on tei:p elements, defined in /db/system/config/db/apps/exist-book/indexing/data/collection.xconf:

<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index xmlns:tei="http://www.tei-c.org/ns/1.0">

        <!-- other indexes -->

        <lucene>
            <text qname="tei:p"/>
        </lucene>
    </index>
</collection>

§ An extremely simple HTML form that allows you to enter a search expression, in /db/apps/exist-book/indexing/search-demo.xq

§ A script that uses this expression to perform a search, in /db/apps/exist-book/indexing/search-demo-result.xq

If you look at the code in search-demo-result.xq that actually performs the search and displays the results, there is surprisingly little there:

{
    for $hit in doc($doc-with-indexes)//tei:p[ft:query(., $search-expression)]  1
    let $score as xs:float := ft:score($hit)  2
    order by $score descending
    return (
        <p>Score: {$score}:</p>,
        kwic:summarize($hit, <config width="40"/>)  3
    )
}

1. First we do a full-text query using the ft:query function on tei:p elements. This works because we have a Lucene index defined on these elements.

2. Then we get the score for every search result using ft:score. This returns a floating-point number; the higher the number, the more relevant Lucene thinks the match is. We order the results by score, so the most relevant come first.

3. The kwic:summarize function summarizes the search results with a bit of text before and after the actual match; the second parameter specifies that this must be 40 characters. It outputs an HTML fragment with span elements that carry different CSS classes for the leading part, the match, and the trailing part of the output. You can use this to create pretty layouts for the search results (as attempted in the example).

If you run the example and search on, for instance, distinguish, the results look like Figure 12-1.

Configuring Full-Text Indexes

Configuring a full-text index is done in the same collection.xconf document we used for the other indexes. For more information on where to locate such a document and its basic syntax, please refer to “Configuring Indexes”.

Figure 12-1. Example output for a search on “distinguish”

The full-text index definition is a child of the index element and, as is true of everything in a collection.xconf file, it is in the http://exist-db.org/collection-config/1.0 namespace. It has the following structure:

<lucene>
    analyzer*
    text+
    ( inline | ignore )*
</lucene>

§ The optional analyzer element allows you to change the analyzer class(es) Lucene uses to analyze the text and/or pass it parameters. This is an advanced topic, explained in “Defining and Configuring the Lucene Analyzer”.

§ The text element defines which elements/attributes Lucene creates an index for. See more about this in “Choosing the correct context”.

§ The inline and ignore elements are important when you’re indexing mixed content. They can be defined either globally (as a child of lucene) or for a specific text element. Read more about this in “Handling Mixed Content”.

Configuring the Search Context

The lucene/text element defines the context for the full-text index. This is usually an element, but could just as well be an attribute. It has the following structure:

<text qname = string | match = string
      boost? = float
      analyzer? = NCName >
    ( inline | ignore )*
</text>

§ qname is the qualified name of the element or attribute (if you start it with an @ character) for which you want the text to be indexed. It can have a namespace prefix (which must, of course, be defined).

Examples of qname usage are qname="tei:p", qname="mytextelement", and qname="@title".

§ With match you can define the search context using a limited XPath-like expression. Only the / and // operators are allowed, plus the wildcard * to match an arbitrary element.

For instance, match="//tei:div/*" will put a full-text index on all direct child elements of tei:div.

You may use either a qname or a match attribute, but not both.

§ The boost attribute gives you the ability to influence the scoring of matches during indexing. It multiplies the default search score with this floating-point value. There’s more about search scores in “Scoring Searches”.

§ analyzer allows you to change the Lucene analyzer class for this index. This is explained in “Defining and Configuring the Lucene Analyzer”.

Additionally, you can define how inline elements must be treated for this particular index using nested inline and/or ignore elements. This is explained in “Handling Mixed Content”.
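Putting these pieces together, here is a sketch of a text index definition that combines these attributes. The element names (title, p, footnote) are hypothetical, and the a2 analyzer merely reuses the standard analyzer class; defining analyzers is covered in “Defining and Configuring the Lucene Analyzer”:

<lucene>
    <analyzer id="a2"
        class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
    <!-- matches in title score twice as high and use the a2 analyzer -->
    <text qname="title" boost="2.0" analyzer="a2"/>
    <!-- p is indexed, but any footnote content inside it is left out -->
    <text qname="p">
        <ignore qname="footnote"/>
    </text>
</lucene>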

Choosing the correct context

Defining the correct context for your full-text index is critical. For example, take the following XML fragment:

<Text>
    <Heading>eXist index configuration</Heading>
    <Content>eXist index configuration is done in
        the collection.xconf file</Content>
</Text>

Assume we’ve indexed this with the following index configuration in collection.xconf:

<lucene>
    <text qname="Text"/>
</lucene>

Passed to the indexer is the text value of the nodes identified by the qname attribute. So, in this example the indexer will see "eXist index configuration eXist index configuration is done in the collection.xconf file". This is linked to the Text element only; Lucene preserves no knowledge about the child elements of Text.

If you use this configuration, the following query will return the expected Text elements:

//Text[ft:query(., 'index')]

However, the following query will return nothing (i.e., an empty sequence) because Lucene has no index on Heading elements:

//Text[ft:query(Heading, 'index')]

Searching only within the contents of headings is often desirable, so you may in fact want this to work. Luckily, nothing stops you from defining two overlapping indexes:

<lucene>
    <text qname="Text"/>
    <text qname="Heading"/>
</lucene>

Here is another useful example. Assume we’ve marked up filenames separately, as in:

<Text>
    <Heading>eXist index configuration</Heading>
    <Content>eXist index configuration is done in the
        <Filename>collection.xconf</Filename> file</Content>
</Text>

To give the user the option to search within the full text or within the filenames only, we define two indexes:

<lucene>
    <text qname="Text"/>
    <text qname="Filename"/>
</lucene>

Now the following two queries will return the same Text element, even though the search context for the second one is much narrower:

//Text[ft:query(., 'collection.xconf')]

//Text[ft:query(Filename, 'collection.xconf')]

Search context and performance

Full-text indexing is an expensive operation and can have a huge impact on the performance of storing and updating documents, so use it wisely.

It’s important not to define full-text indexes too broadly. For instance, a classical mistake is defining your full-text indexes as:

<text match="//*" />

This will create an index on all elements anywhere in your document. That may sound simple and attractive, but it will cost you dearly in terms of performance. Remember that the index is created over the text contents of a node, so an index on the root element will index all text in the document. The same is true for all other elements, so every piece of text is indexed multiple times. Depending on how deeply the text is nested in the document, this may be slow and create a huge number of index files.

So, the best strategy for full-text indexes is to define them as narrowly as you can. And be careful using wildcards, because they can quickly get out of hand!
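For instance, instead of the match="//*" definition above, a narrow configuration that indexes only the elements users actually search against (the element names here are hypothetical) would look like this, so each piece of text is indexed only once, at the level where searches happen:

<lucene>
    <text qname="title"/>
    <text qname="p"/>
</lucene>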

Handling Mixed Content

You can decide how to handle mixed content by using the inline and ignore elements. These elements can appear globally (as children of the lucene element) or per index (as children of the text element). inline also has an effect on how Lucene treats whitespace. They have the following format:

<inline qname = string />

<ignore qname = string />

qname holds the qualified name (with an optional namespace prefix) of the inline element.

Inline content and whitespace

By default, Lucene treats inline elements as token separators, which may or may not be what you want. For instance, assume we have an XML fragment like:

<p>This is <b>un</b>clear.</p>

Because of the b inline element, Lucene will see this as "This is un clear." (notice the space between un and clear)—probably not what you intended! To address this, use an index definition like:

<lucene>
    <text qname="p">
        <inline qname="b"/>
    </text>
</lucene>

Or, if the b element is always an inline element in all of the collection’s documents:

<lucene>
    <text qname="p"/>
    <!-- other text indexes -->
    <inline qname="b"/>
</lucene>

Ignoring inline content

You can tell eXist to completely ignore inline content by using the ignore element. This is useful when, for instance, your content contains editorial notes like:

<p>Columbus discovered Finland in 1492
    <note>I don't think the year is correct, could someone check this?</note></p>

Ignoring the note elements within the p elements can be done with:

<lucene>
    <text qname="p">
        <ignore qname="note"/>
    </text>
</lucene>

Or, when note should be ignored in all documents in the collection:

<lucene>
    <text qname="p"/>
    <!-- other text indexes -->
    <ignore qname="note"/>
</lucene>

An ignore element only ignores descendants of the indexed element. This means that a seemingly contradictory index definition like this one is perfectly valid:

<lucene>
    <text qname="p"/>
    <text qname="note"/>
    <ignore qname="note"/>
</lucene>

With this definition, the following query would return nothing:

//p[ft:query(., 'check this')]

But an editor searching on notes within paragraphs would get a result by using:

//p[ft:query(note, 'check this')]

Maintaining the Full-Text Index

Basic maintenance of the full-text index is the same as for the other indexes (see “Maintaining Indexes”): once defined, eXist maintains them automatically for the most part. But when you create a new one or change a configuration, you have to reindex manually.

NOTE

In previous versions of eXist it was necessary to call ft:optimize now and then for optimal performance. This is no longer the case.
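If you prefer to trigger such a reindex from XQuery rather than from the admin tools, a minimal sketch (assuming the example data collection from earlier in this chapter, run as a user with sufficient rights) uses the xmldb extension module:

(: reindex the collection after changing its index configuration :)
xmldb:reindex('/db/apps/exist-book/indexing/data')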

Searching with the Full-Text Index

Using the full-text index to search for words and phrases is done through the extension functions in the lucene extension module (see ft). The default namespace prefix for this module is ft.

Basic Search Operations

The basic function for using full-text indexes is ft:query. We’ve already seen some examples of its use in “Full-Text Index and KWIC Example”. Its full definition is:

ft:query($nodes, $query, [$options])

where:

$nodes

Contains the node set to search.

$query

Contains the search query. If this is a string, it is assumed to be in Lucene’s native query syntax (described in the next section). For more complex queries, you may provide an XML fragment as described in “The full-text query XML specification”.

$options

An optional parameter that contains additional query options. See “Additional search parameters”.

Lucene’s native query syntax

Lucene has a native query syntax for defining full-text searches. Its full definition can be found at http://lucene.apache.org/core/3_6_1/queryparsersyntax.html. Here are some examples:

§ exist database searches for text with the terms exist and/or database. This is equivalent to writing exist OR database.

§ You can use wildcards like data* for multiple unknown characters, or database? for a single unknown character.

§ If you want to search on a phrase (multiple words), use quotes: "exist database".

§ You can also do a proximity search. "exist database"~10 means that the words “eXist” and “database” must occur within 10 words of each other.

§ For a fuzzy search (words like the search term), add a tilde character at the end, as in database~.

§ Add a + in front of words and phrases that must occur. +exist database means the text must include the word exist but may include the word database.

§ Add a - in front of words and phrases that should not occur in the text.

§ Boolean operators (AND, OR, and NOT) are supported.

§ You can group expressions using parentheses.
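As a sketch of how these features combine, the following query (the query string is illustrative) requires the word exist, excludes the word relational, and optionally matches any word starting with data, searching the tei:p elements indexed in the example at the start of this chapter:

//tei:p[ft:query(., '+exist -relational data*')]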

The full-text query XML specification

The ft:query function also accepts an XML fragment that allows you to build a query using Lucene’s internal API indirectly. The XML is transformed into an internal representation used by Lucene and then executed. This fragment takes the following form:

<query>
    ( term | wildcard | regex | phrase | near | bool )+
</query>

All subelements accept an optional boost attribute of type xs:float to specify a boost value for the score (see “Scoring Searches”). The query element can contain the following subelements:

§ The term element defines a single term to search for. The following example searches for text containing exist and/or database:

<query>
    <term>exist</term>
    <term>database</term>
</query>

§ The wildcard element is the same as the term element but can contain a wildcard * or ? operator. To search for all text with words starting with data, use:

<query>
    <wildcard>data*</wildcard>
</query>

§ The regex element contains a regular expression used for the search.

§ The phrase element searches for a group of terms in the correct order. It can contain text (which is tokenized into terms), or a number of term child elements. The following two examples are equivalent:

<query>
    <phrase>exist database</phrase>
</query>

<query>
    <phrase>
        <term>exist</term>
        <term>database</term>
    </phrase>
</query>

§ With the near element, you can build even more specific phrase queries. Its syntax is:

<near slop? = integer
      ordered? = "yes" | "no" >
    #PCDATA | ( term | first | near )+
</near>

§ The optional slop attribute allows you to define the “slop” for the matching. Slop is the maximum number of other words between the words searched upon.

§ The optional ordered attribute defines whether or not the terms must be in the defined order. The default is "yes".

§ If the near element contains character data, this is tokenized. The effect is the same as using the phrase element with character data.

§ Instead of tokenized character data, you can use nested term elements.

§ The first element allows you to search against the start of the text. It has an optional end attribute to specify the maximum distance (in words) allowed from the start of the text:

<first end? = integer >
    #PCDATA | ( term | near )+
</first>

§ To allow even more complex search expressions, you can nest near elements within one another, or within first elements.

For instance, the following expression will search for nodes with the word exist somewhere in the first 4 words of the text and the word database within 10 other words from this:

<query>
    <near slop="10">
        <first end="4">exist</first>
        <term>database</term>
    </near>
</query>

§ The bool element allows you to combine the other elements into a Boolean expression. For this, all elements accept an occur = "must" | "should" | "not" attribute:

§ "must" means that this part of the query must be matched.

§ "should" means that this part of the query should be matched, but doesn’t necessarily need to be.

§ "not" means that this part of the query must not be matched.

For instance, searching for text that contains the term exist but not the term database can be expressed by:

<query>
    <bool>
        <term occur="must">exist</term>
        <term occur="not">database</term>
    </bool>
</query>

Additional search parameters

The optional third parameter to ft:query contains an XML fragment that sets a number of miscellaneous parameters for the search operation (all elements are optional):

<options>
    <default-operator> or | and </default-operator>?
    <phrase-slop> integer </phrase-slop>?
    <leading-wildcard> yes | no </leading-wildcard>?
    <filter-rewrite> yes | no </filter-rewrite>?
</options>

§ The default-operator element sets the default operator with which multiple terms are combined. The default is or.

§ The phrase-slop element sets the maximum distance (measured in words) between terms within phrases. The default is 0.

§ The leading-wildcard element sets whether the wildcard characters * and ? are allowed as the first character of a wildcard expression. The default is no.

§ The filter-rewrite element determines how terms are expanded for wildcard or regular expression searches. If set to yes, Lucene will use a filter to preprocess matching terms. If set to no, all matching terms will be added to a single Boolean query, which is then executed. This may generate a “too many clauses” exception when applied to large datasets. The default is yes.
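As a sketch, here is a call that combines two of these options, making and the default operator and allowing a leading wildcard. The tei:p context is the one from the chapter’s example; the query string is illustrative:

//tei:p[ft:query(., '*base exist',
    <options>
        <default-operator>and</default-operator>
        <leading-wildcard>yes</leading-wildcard>
    </options>)]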

Scoring Searches

Lucene tries to attach a relevance score to the search results. This is always a positive floating-point number. The higher the number, the more relevant Lucene thinks the result is. You can retrieve the score for a search result by calling the ft:score function. Here is an example:

for $hit in doc('/db/myapp/doc.xml')//p[ft:query(., 'exist database')]
let $score as xs:float := ft:score($hit)
order by $score descending
return
    (: results code here :)

How exactly Lucene computes these scores is a complex topic in its own right; you can read more about the specifics of Lucene’s approach here: http://lucene.apache.org/core/3_6_1/scoring.html. In many cases, the specifics are not important; it is enough to trust that the Lucene score is a good approximation of what we, mere humans, consider relevant.

Locating Matches

When you perform a full-text search like //p[ft:query(., 'database')], the results you get are the matching p elements. Some applications, however, need to know where in the text of the resulting elements the actual matches were. For example, if you offer a documentation search, it would be nice to show in the results which pieces of text matched the query.

NOTE

Although used mostly for full-text search results, locating matches also works for NGram search results.

To enable this, eXist not only returns the results of the query, but also invisibly remembers where the matches were. Nothing happens if you don’t use this information, but if you need it, it’s there.

As an example, let’s assume we’ve done a full-text query on the word database that resulted in a single p element:

<p>eXist is a native XML database.</p>

To find out where the matches were, you can call the extension function util:expand on the search result. This will wrap the matches in exist:match elements (the exist namespace prefix is bound to http://exist.sourceforge.net/NS/exist). A call to util:expand on this search result would therefore return:

<p>eXist is a native XML <exist:match
    xmlns:exist="http://exist.sourceforge.net/NS/exist">database</exist:match>.</p>

By default, util:expand will also expand any XIncludes (xi:include elements; see “XInclude”) in the search result. If you don’t want XInclude expansion, you can specify an optional second argument to the function, which accepts serialization parameters (as defined in “Serialization Options”) that you can use to control this.
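For instance, a minimal sketch that expands the matches in a search result while leaving XIncludes untouched (assuming $hit holds a single result node, and using eXist’s expand-xincludes serialization option) would be:

(: wrap matches in exist:match, but do not expand xi:include elements :)
util:expand($hit, 'expand-xincludes=no')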

Using Keywords in Context

As we saw in the previous section, eXist remembers where the matches were for full-text (and NGram) queries. This allows you to use a feature called “keywords in context,” or KWIC, that can show these matches to the user, surrounded by limited parts of the text. If you followed the example explained in “Full-Text Index and KWIC Example”, you’ve seen this in action already.

You can generate KWIC output using the kwic extension module. This is an XQuery module, and thus (as fully explained in Chapter 7) you’ll have to import it explicitly in your query’s prolog:

import module namespace kwic="http://exist-db.org/xquery/kwic";

If you look at the documentation for the kwic module, you’ll see lots of functions; most of these, however, are internal.

The easiest way to use the kwic module is by calling kwic:summarize on a search result. This will return the matches, surrounded with customizable chunks of text, in HTML, ready for display. To find out where these matches are, it uses the match locating functionality as explained in “Locating Matches”. We’ve already seen this in action, in the example at the beginning of this chapter.

The full definition of the kwic:summarize function is:

kwic:summarize($search-result, $options)

The $options parameter accepts a small XML fragment that allows you to customize the function’s output:

<config width = integer
        table? = "yes" | "no"
        link? = string />

§ width (mandatory) tells KWIC how much text (expressed in characters) to keep before and after the match.

§ Omitting table or setting it to "no" causes the output to be wrapped in a p element:

<p>
    <span class="previous">... text before the </span>
    <span class="hi">match</span>
    <span class="following"> and after the match...</span>
</p>

Setting table to "yes" causes the output to be returned in an HTML table row format:

<tr>
    <td class="previous">... text before the </td>
    <td class="hi">match</td>
    <td class="following"> and after the match...</td>
</tr>

§ If you specify link, the match will be enclosed in an a element with the value of this attribute as its target. For example, specifying link="otherpage" will change the output for the match to:

<span class="hi"><a href="otherpage">match</a></span>

Defining and Configuring the Lucene Analyzer

Lucene allows its users to specify how text is analyzed. Analyzers are Java classes, with each one defining a different way of tokenizing and/or filtering text. There are several prebaked analyzers available. If you’re indexing a language other than English, it might be worthwhile to change the analyzer to one especially tailored for your language. Other reasons might include changing the list of stopwords (words ignored by the analyzer).

A list of available analyzers can be found in the Lucene JavaDocs; the list of direct subclasses of the Analyzer class tells you which analyzers are available.

By default, eXist uses the standard analyzer org.apache.lucene.analysis.standard.StandardAnalyzer. Although called “standard,” it is actually an English analyzer (and contains a list of the most-often-used English stopwords).

You can define and configure a different Lucene analyzer in the lucene section of the collection.xconf document. The analyzer element defines the Lucene analyzer to use:

<analyzer class = string  1
          id? = NCName >  2
    param*
</analyzer>

1. class holds the name of the Java class to use for tokenizing and filtering the text; for instance, "org.apache.lucene.analysis.WhitespaceAnalyzer".

2. id defines the identifier for this analyzer. This is for referencing the analyzer (in text elements using the analyzer attribute). If you don’t specify an id, this changes the default analyzer.

An analyzer definition can contain parameters to pass to the analyzer using param elements. These parameters are passed to the constructor of the analyzer class:

<param name = string
       type? = string
       value? = string >
    value*
</param>

§ name is the name of the parameter.

§ type is the (Java) type of the parameter. Several types are currently supported:

java.lang.String

A string that may be either a literal value, the name of a class, or the fully qualified name of an enumeration value, depending on the parameter context.

java.io.File

A path to a file on the filesystem; it must be in the appropriate Java path syntax for the operating system in use.

java.util.Set

Assumed to be a set of java.lang.String. When this is used, we can provide multiple values; for example:

<param name="stopwords" type="java.util.Set">

<value>and</value>

<value>or</value>

<value>the</value>

<value>a</value>

<value>an</value>

<value>this</value>

<value>there</value>

</param>

java.lang.Integer (or int)

An integer.

java.lang.Boolean (or boolean)

A Boolean.

java.lang.reflect.Field

Used to reference a static field from another class. For example, if there were a static field named STOPWORDS in the class org.something.text.Common:

<param
    name="stopwords"
    type="java.lang.reflect.Field"
    value="org.something.text.Common.STOPWORDS"/>

When no type is specified, the default is assumed to be java.lang.String.

§ value contains the value of the parameter, using either the <param name="a" value="b"/> or the <value>a</value> form (when type is a java.util.Set).

If a parameter needs more than one value, use embedded value elements instead of the value attribute (not both).

A simple example of changing the analyzer would be to tell Lucene that the text we’re going to index is in Dutch:

<lucene>
    <analyzer class="org.apache.lucene.analysis.nl.DutchAnalyzer"/>
    <text qname="p"/>
</lucene>

For a more advanced example of defining analyzers and passing parameters, we use the ability of the standard analyzer to define a set of stopwords (as mentioned, these are words to be ignored, like the, a, an, etc.). The following example changes the default analyzer and passes it a set of stopwords in a text file:

<lucene>
    <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer">
        <param
            name="stopwords"
            type="java.io.File"
            value="/usr/local/exist/webapp/WEB-INF/data/stopwords.txt"/>
    </analyzer>
    <text qname="p"/>
</lucene>

Now assume you need some other element indexed also, but with a much more limited set of stopwords. This could be accomplished by:

<lucene>
    <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer">
        <param name="stopwords" type="java.io.File"
            value="/usr/local/exist/webapp/WEB-INF/data/stopwords.txt"/>
    </analyzer>
    <analyzer id="a2"
        class="org.apache.lucene.analysis.standard.StandardAnalyzer">
        <param
            name="stopwords"
            type="java.util.Set">
            <value>the</value>
            <value>a</value>
            <value>an</value>
        </param>
    </analyzer>
    <text qname="p"/>
    <text qname="h1" analyzer="a2"/>
</lucene>

Now the h1 element is indexed with the stopwords the, a, and an only.

Manual Full-Text Indexing

There is yet another way to use the Lucene full-text indexer inside eXist. You can manually (through your own XQuery code) create an index associated with a resource in the database. You can then use this index to query the contents of this resource. Interestingly enough, the resource does not have to be an XML document, so, in conjunction with the contentextraction extension module (see contentextraction), you can create indexes to search binary content!

Here is how it works:

1. For some resource in your database (XML or otherwise), extract (or create) the text fragments you want to index. For instance, assume we have an XHTML document for which we want to index all the p and h3 elements. We also want to be able to search the p and h3 elements separately.

2. Create an XML fragment with root element doc in which you list all these text fragments and add them to so-called fields. A field can be seen as a subindex on a document, so in our case we create two fields: one for the h3 elements, called headers, and one for the p elements, called paras. Here is the code that does this:

declare namespace xhtml="http://www.w3.org/1999/xhtml";

let $resource := '/db/path/to/your/xhtml/document'
let $index-def :=
    <doc>
    {
        for $header in doc($resource)//xhtml:h3
        return
            <field name="headers" store="yes">{ string($header) }</field>
    }
    {
        for $para in doc($resource)//xhtml:p
        return
            <field name="paras" store="yes">{ string($para) }</field>
    }
    </doc>

3. Call the ft:index function to create the index for this specific resource:

ft:index($resource, $index-def)

Now Lucene creates an index with two subindexes (fields). It indexes all the text fragments passed in the doc/field elements and stores this information, together with the indexed text (because the store attribute is set to "yes").

If you don’t store the text (by setting the store attribute to "no"), the text is indexed but cannot be retrieved. The only thing you can do with an index without stored text is find out whether or not a certain phrase is present; you can’t get its context.

4. Use the index via the extension function ft:search. The search expression passed must contain the field name as a prefix. So, for instance, to search the paragraphs for the word eXist, you would do something like this:

ft:search($resource, 'paras:eXist')

This would return an XML fragment like:

<Indexing file="/db/path/to/your/xhtml/document">
    <results>
        <search uri="/db/path/to/your/xhtml/document" score="0.5260675">
            <field name="paras">Please use
                <exist:match xmlns:exist="http://exist.sourceforge.net/NS/exist">
                eXist</exist:match>
                for storing your information. You know why!
            </field>
        </search>
    </results>
</Indexing>

Information about the meaning of the score attribute can be found in “Scoring Searches”.

An interesting use case for this manual index creation is that of indexing binary content. You do so by first extracting the content from the binary resource using the contentextraction extension module (see contentextraction), then creating an index for it as just described. The book’s sample code contains a short example of how to do this in the chapters/indexing/index-binary.xq file (or in the /db/apps/exist-book/chapters/indexing/index-binary.xq file if you have installed the XAR package). There is also an interesting article about content extraction and binary resource indexing available on the eXist wiki.