Proximity Matching - Search in Depth - Elasticsearch: The Definitive Guide (2015)


Part II. Search in Depth

Chapter 15. Proximity Matching

Standard full-text search with TF/IDF treats documents, or at least each field within a document, as a big bag of words. The match query can tell us whether that bag contains our search terms, but that is only part of the story. It can’t tell us anything about the relationship between words.

Consider the difference between these sentences:

§ Sue ate the alligator.

§ The alligator ate Sue.

§ Sue never goes anywhere without her alligator-skin purse.

A match query for sue alligator would match all three documents, but it doesn’t tell us whether the two words form part of the same idea, or even the same paragraph.

Understanding how words relate to each other is a complicated problem, and we can’t solve it by just using another type of query, but we can at least find words that appear to be related because they appear near each other or even right next to each other.

Each document may be much longer than the examples we have presented: Sue and alligator may be separated by paragraphs of other text. Perhaps we still want to return these documents in which the words are widely separated, but we want to give documents in which the words are close together a higher relevance score.

This is the province of phrase matching, or proximity matching.

TIP

In this chapter, we are using the same example documents that we used for the match query.

Phrase Matching

In the same way that the match query is the go-to query for standard full-text search, the match_phrase query is the one you should reach for when you want to find words that are near each other:

GET /my_index/my_type/_search
{
    "query": {
        "match_phrase": {
            "title": "quick brown fox"
        }
    }
}

Like the match query, the match_phrase query first analyzes the query string to produce a list of terms. It then searches for all the terms, but keeps only documents that contain all of the search terms, in the same positions relative to each other. A query for the phrase quick fox would not match any of our documents, because no document contains the word quick immediately followed by fox.

TIP

The match_phrase query can also be written as a match query with type phrase:

"match": {
    "title": {
        "query": "quick brown fox",
        "type":  "phrase"
    }
}

Term Positions

When a string is analyzed, the analyzer returns not only a list of terms, but also the position, or order, of each term in the original string:

GET /_analyze?analyzer=standard
Quick brown fox

This returns the following:

{
   "tokens": [
      {
         "token": "quick",
         "start_offset": 0,
         "end_offset": 5,
         "type": "<ALPHANUM>",
         "position": 1 (1)
      },
      {
         "token": "brown",
         "start_offset": 6,
         "end_offset": 11,
         "type": "<ALPHANUM>",
         "position": 2 (1)
      },
      {
         "token": "fox",
         "start_offset": 12,
         "end_offset": 15,
         "type": "<ALPHANUM>",
         "position": 3 (1)
      }
   ]
}

(1) The position of each term in the original string.

Positions can be stored in the inverted index, and position-aware queries like the match_phrase query can use them to match only documents that contain all the words in exactly the order specified, with no words in-between.

What Is a Phrase

For a document to be considered a match for the phrase “quick brown fox,” the following must be true:

§ quick, brown, and fox must all appear in the field.

§ The position of brown must be 1 greater than the position of quick.

§ The position of fox must be 2 greater than the position of quick.

If any of these conditions is not met, the document is not considered a match.
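The three conditions above can be sketched in a few lines of code. This is a toy illustration, not Elasticsearch's actual implementation: it builds a term-to-positions map (a miniature positional inverted index) and then tries to anchor the phrase at each occurrence of its first term.

```python
def index_positions(doc_tokens):
    """Map each term to the list of positions at which it occurs."""
    positions = {}
    for pos, term in enumerate(doc_tokens, start=1):
        positions.setdefault(term, []).append(pos)
    return positions

def phrase_match(doc_tokens, phrase_tokens):
    """True if the phrase terms occur at consecutive positions."""
    positions = index_positions(doc_tokens)
    if any(term not in positions for term in phrase_tokens):
        return False
    # Try to anchor the phrase at every position of its first term.
    for start in positions[phrase_tokens[0]]:
        if all(start + i in positions[term]
               for i, term in enumerate(phrase_tokens)):
            return True
    return False

doc = ["the", "quick", "brown", "fox"]
print(phrase_match(doc, ["quick", "brown", "fox"]))  # True
print(phrase_match(doc, ["quick", "fox"]))           # False: not consecutive
```

A real engine reads the positions from the inverted index instead of tokenizing the document at query time, but the consecutive-positions check is the same idea.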

TIP

Internally, the match_phrase query uses the low-level span query family to do position-aware matching. Span queries are term-level queries, so they have no analysis phase; they search for the exact term specified.

Thankfully, most people never need to use the span queries directly, as the match_phrase query is usually good enough. However, certain specialized fields, like patent searches, use these low-level queries to perform very specific, carefully constructed positional searches.

Mixing It Up

Requiring exact-phrase matches may be too strict a constraint. Perhaps we do want documents that contain “quick brown fox” to be considered a match for the query “quick fox,” even though the positions aren’t exactly equivalent.

We can introduce a degree of flexibility into phrase matching by using the slop parameter:

GET /my_index/my_type/_search
{
    "query": {
        "match_phrase": {
            "title": {
                "query": "quick fox",
                "slop":  1
            }
        }
    }
}

The slop parameter tells the match_phrase query how far apart terms are allowed to be while still considering the document a match. By how far apart we mean: how many times do you need to move a term in order to make the query and the document match?

We’ll start with a simple example. To make the query quick fox match a document containing quick brown fox we need a slop of just 1:

            Pos 1         Pos 2         Pos 3
-----------------------------------------------
Doc:        quick         brown         fox
-----------------------------------------------
Query:      quick         fox
Slop 1:     quick         ↳             fox

Although all words need to be present in phrase matching, even when using slop, the words don’t necessarily need to be in the same sequence in order to match. With a high enough slop value, words can be arranged in any order.

To make the query fox quick match our document, we need a slop of 3:

            Pos 1         Pos 2         Pos 3
-----------------------------------------------
Doc:        quick         brown         fox
-----------------------------------------------
Query:      fox           quick
Slop 1:     fox|quick (1)
Slop 2:     quick  ↳      fox
Slop 3:     quick         ↳             fox

(1) Note that fox and quick occupy the same position in this step. Switching word order from fox quick to quick fox thus requires two steps, or a slop of 2.
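One simple way to reproduce the move counts in the examples above is to slide the query terms, as a group, over the document and count how far each term still has to travel. This is only an illustrative metric that agrees with the examples here (and it assumes each query term occurs exactly once in the document); Lucene's actual sloppy-phrase matching is more involved.

```python
def min_slop(doc_tokens, query_tokens):
    """Minimal number of single-position term moves needed to align
    the query terms with their positions in the document.
    Assumes each query term occurs exactly once in the document."""
    doc_pos = {t: i for i, t in enumerate(doc_tokens, start=1)}
    if any(t not in doc_pos for t in query_tokens):
        return None  # a missing term can never match, whatever the slop

    def moves(offset):
        # Total distance between each term's document position and its
        # query position shifted by `offset`.
        return sum(abs(doc_pos[t] - (i + offset))
                   for i, t in enumerate(query_tokens, start=1))

    n = len(doc_tokens)
    return min(moves(k) for k in range(-n, n + 1))

doc = ["quick", "brown", "fox"]
print(min_slop(doc, ["quick", "fox"]))  # 1
print(min_slop(doc, ["fox", "quick"]))  # 3: reversed order costs more moves
```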

Multivalue Fields

A curious thing can happen when you try to use phrase matching on multivalue fields. Imagine that you index this document:

PUT /my_index/groups/1
{
    "names": [ "John Abraham", "Lincoln Smith"]
}

Then run a phrase query for Abraham Lincoln:

GET /my_index/groups/_search
{
    "query": {
        "match_phrase": {
            "names": "Abraham Lincoln"
        }
    }
}

Surprisingly, our document matches, even though Abraham and Lincoln belong to two different people in the names array. The reason for this comes down to the way arrays are indexed in Elasticsearch.

When John Abraham is analyzed, it produces this:

§ Position 1: john

§ Position 2: abraham

Then when Lincoln Smith is analyzed, it produces this:

§ Position 3: lincoln

§ Position 4: smith

In other words, Elasticsearch produces exactly the same list of tokens as it would have for the single string John Abraham Lincoln Smith. Our example query looks for abraham directly followed by lincoln, and these two terms do indeed exist, and they are right next to each other, so the query matches.

Fortunately, there is a simple workaround for cases like these, called the position_offset_gap, which we need to configure in the field mapping:

DELETE /my_index/groups/ (1)

PUT /my_index/_mapping/groups (2)
{
    "properties": {
        "names": {
            "type": "string",
            "position_offset_gap": 100
        }
    }
}

(1) First delete the groups mapping and all documents of that type.
(2) Then create a new groups mapping with the correct values.

The position_offset_gap setting tells Elasticsearch that it should increase the current term position by the specified value for every new array element. So now, when we index the array of names, the terms are emitted with the following positions:

§ Position 1: john

§ Position 2: abraham

§ Position 103: lincoln

§ Position 104: smith

Our phrase query would no longer match a document like this because abraham and lincoln are now 100 positions apart. You would have to add a slop value of 100 in order for this document to match.
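The position bookkeeping described above is easy to simulate. This sketch (not Elasticsearch's code) assigns positions to the tokens of a multivalue field, inserting `gap` extra positions between consecutive array elements, the way position_offset_gap does:

```python
def positions_with_gap(values, gap):
    """Assign token positions across a multivalue field, inserting
    `gap` extra positions between consecutive array elements."""
    positions = {}
    pos = 0
    for i, value in enumerate(values):
        if i > 0:
            pos += gap  # simulates position_offset_gap
        for token in value.lower().split():
            pos += 1
            positions.setdefault(token, []).append(pos)
    return positions

names = ["John Abraham", "Lincoln Smith"]
# Without a gap, abraham (2) and lincoln (3) look adjacent:
print(positions_with_gap(names, gap=0))
# With a gap of 100, lincoln jumps to position 103:
print(positions_with_gap(names, gap=100))
```

With `gap=0` the tokens come out at positions 1-4, exactly as if the field held the single string "John Abraham Lincoln Smith"; with `gap=100`, lincoln and smith land at positions 103 and 104, so the phrase abraham lincoln no longer matches.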

Closer Is Better

Whereas a phrase query simply excludes documents that don’t contain the exact query phrase, a proximity query—a phrase query where slop is greater than 0—incorporates the proximity of the query terms into the final relevance _score. By setting a high slop value like 50 or 100, you can exclude documents in which the words are really too far apart, but give a higher score to documents in which the words are closer together.

The following proximity query for quick dog matches both documents that contain the words quick and dog, but gives a higher score to the document in which the words are nearer to each other:

POST /my_index/my_type/_search
{
    "query": {
        "match_phrase": {
            "title": {
                "query": "quick dog",
                "slop":  50 (1)
            }
        }
    }
}

(1) Note the high slop value.

{
  "hits": [
     {
        "_id": "3",
        "_score": 0.75, (1)
        "_source": {
           "title": "The quick brown fox jumps over the quick dog"
        }
     },
     {
        "_id": "2",
        "_score": 0.28347334, (2)
        "_source": {
           "title": "The quick brown fox jumps over the lazy dog"
        }
     }
  ]
}

(1) Higher score because quick and dog are close together.
(2) Lower score because quick and dog are further apart.

Proximity for Relevance

Although proximity queries are useful, the fact that they require all terms to be present can make them overly strict. It’s the same issue that we discussed in “Controlling Precision” in Chapter 13: if six out of seven terms match, a document is probably relevant enough to be worth showing to the user, but the match_phrase query would exclude it.

Instead of using proximity matching as an absolute requirement, we can use it as a signal—as one of potentially many queries, each of which contributes to the overall score for each document (see “Most Fields”).

The fact that we want to add together the scores from multiple queries implies that we should combine them by using the bool query.

We can use a simple match query as a must clause. This is the query that will determine which documents are included in our result set. We can trim the long tail with the minimum_should_match parameter. Then we can add other, more specific queries as should clauses. Every one that matches will increase the relevance of the matching docs.

GET /my_index/my_type/_search
{
  "query": {
    "bool": {
      "must": {
        "match": { (1)
          "title": {
            "query": "quick brown fox",
            "minimum_should_match": "30%"
          }
        }
      },
      "should": {
        "match_phrase": { (2)
          "title": {
            "query": "quick brown fox",
            "slop":  50
          }
        }
      }
    }
  }
}

(1) The must clause includes or excludes documents from the result set.
(2) The should clause increases the relevance score of those documents that match.

We could, of course, include other queries in the should clause, where each query targets a specific aspect of relevance.

Improving Performance

Phrase and proximity queries are more expensive than simple match queries. Whereas a match query just has to look up terms in the inverted index, a match_phrase query has to calculate and compare the positions of multiple possibly repeated terms.

The Lucene nightly benchmarks show that a simple term query is about 10 times as fast as a phrase query, and about 20 times as fast as a proximity query (a phrase query with slop). And of course, this cost is paid at search time instead of at index time.

TIP

Usually the extra cost of phrase queries is not as scary as these numbers suggest. Really, the difference in performance is a testimony to just how fast a simple term query is. Phrase queries on typical full-text data usually complete within a few milliseconds, and are perfectly usable in practice, even on a busy cluster.

In certain pathological cases, phrase queries can be costly, but this is unusual. An example of a pathological case is DNA sequencing, where many identical terms are repeated across many positions. Using higher slop values in this case results in a huge growth in the number of position calculations.

So what can we do to limit the performance cost of phrase and proximity queries? One useful approach is to reduce the total number of documents that need to be examined by the phrase query.

Rescoring Results

In the preceding section, we discussed using proximity queries just for relevance purposes, not to include or exclude results from the result set. A query may match millions of results, but chances are that our users are interested in only the first few pages of results.

A simple match query will already have ranked documents that contain all search terms near the top of the list. Really, we just want to rerank the top results to give an extra relevance bump to those documents that also match the phrase query.

The search API supports exactly this functionality via rescoring. The rescore phase allows you to apply a more expensive scoring algorithm—like a phrase query—to just the top K results from each shard. These top results are then resorted according to their new scores.

The request looks like this:

GET /my_index/my_type/_search
{
    "query": {
        "match": { (1)
            "title": {
                "query": "quick brown fox",
                "minimum_should_match": "30%"
            }
        }
    },
    "rescore": {
        "window_size": 50, (2)
        "query": { (3)
            "rescore_query": {
                "match_phrase": {
                    "title": {
                        "query": "quick brown fox",
                        "slop":  50
                    }
                }
            }
        }
    }
}

(1) The match query decides which results will be included in the final result set and ranks results according to TF/IDF.
(2) The window_size is the number of top results to rescore, per shard.
(3) The only rescoring algorithm currently supported is another query, but there are plans to add more algorithms later.
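The mechanics of rescoring reduce to a simple idea: rank everything with a cheap scorer, then re-sort only the top window with a more expensive one. A minimal sketch, with hypothetical `cheap` and `deep` scores standing in for the TF/IDF and phrase scores:

```python
def rescore(docs, cheap_score, expensive_score, window_size):
    """Rank all docs by the cheap scorer, then re-sort only the top
    `window_size` docs by the expensive scorer."""
    ranked = sorted(docs, key=cheap_score, reverse=True)
    window = sorted(ranked[:window_size], key=expensive_score, reverse=True)
    return window + ranked[window_size:]

docs = [
    {"id": 1, "cheap": 3.0, "deep": 0.1},
    {"id": 2, "cheap": 2.0, "deep": 0.9},
    {"id": 3, "cheap": 1.0, "deep": 0.5},
]
result = rescore(docs, lambda d: d["cheap"],
                 lambda d: d["deep"], window_size=2)
print([d["id"] for d in result])  # [2, 1, 3]
```

Document 3 never pays the cost of the expensive scorer: it falls outside the window, so it keeps its cheap-scorer rank. That is exactly the saving the rescore phase buys you.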

Finding Associated Words

As useful as phrase and proximity queries can be, they still have a downside. They are overly strict: all terms must be present for a phrase query to match, even when using slop.

The flexibility in word ordering that you gain with slop also comes at a price, because you lose the association between word pairs. While you can identify documents in which sue, alligator, and ate occur close together, you can’t tell whether Sue ate or the alligator ate.

When words are used in conjunction with each other, they express an idea that is bigger or more meaningful than each word in isolation. The two clauses I’m not happy I’m working and I’m happy I’m not working contain the same words, in close proximity, but have quite different meanings.

If, instead of indexing each word independently, we were to index pairs of words, then we could retain more of the context in which the words were used.

For the sentence Sue ate the alligator, we would not only index each word (or unigram) as a term

["sue", "ate", "the", "alligator"]

but also each word and its neighbor as single terms:

["sue ate", "ate the", "the alligator"]

These word pairs (or bigrams) are known as shingles.

TIP

Shingles are not restricted to being pairs of words; you could index word triplets (trigrams) as well:

["sue ate the", "ate the alligator"]

Trigrams give you a higher degree of precision, but greatly increase the number of unique terms in the index. Bigrams are sufficient for most use cases.

Of course, shingles are useful only if the user enters the query in the same order as in the original document; a query for sue alligator would match the individual words but none of our shingles.

Fortunately, users tend to express themselves using constructs similar to those that appear in the data they are searching. But this point is an important one: it is not enough to index just bigrams; we still need unigrams, but we can use matching bigrams as a signal to increase the relevance score.
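Producing shingles from a token stream is a simple sliding-window operation. A minimal sketch (the real work is done by the shingle token filter, shown in the next section):

```python
def shingles(tokens, size=2):
    """Emit word n-grams ("shingles") from a token stream."""
    return [" ".join(tokens[i:i + size])
            for i in range(len(tokens) - size + 1)]

tokens = ["sue", "ate", "the", "alligator"]
print(shingles(tokens))     # ['sue ate', 'ate the', 'the alligator']
print(shingles(tokens, 3))  # ['sue ate the', 'ate the alligator']
```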

Producing Shingles

Shingles need to be created at index time as part of the analysis process. We could index both unigrams and bigrams into a single field, but it is cleaner to keep unigrams and bigrams in separate fields that can be queried independently. The unigram field would form the basis of our search, with the bigram field being used to boost relevance.

First, we need to create an analyzer that uses the shingle token filter:

DELETE /my_index

PUT /my_index
{
    "settings": {
        "number_of_shards": 1, (1)
        "analysis": {
            "filter": {
                "my_shingle_filter": {
                    "type": "shingle",
                    "min_shingle_size": 2, (2)
                    "max_shingle_size": 2, (2)
                    "output_unigrams": false (3)
                }
            },
            "analyzer": {
                "my_shingle_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "my_shingle_filter" (4)
                    ]
                }
            }
        }
    }
}

(1) See “Relevance Is Broken!”.
(2) The default min/max shingle size is 2, so we don’t really need to set these.
(3) The shingle token filter outputs unigrams by default, but we want to keep unigrams and bigrams separate.
(4) The my_shingle_analyzer uses our custom my_shingle_filter token filter.

First, let’s test that our analyzer is working as expected with the analyze API:

GET /my_index/_analyze?analyzer=my_shingle_analyzer
Sue ate the alligator

Sure enough, we get back three terms:

§ sue ate

§ ate the

§ the alligator

Now we can proceed to setting up a field to use the new analyzer.

Multifields

We said that it is cleaner to index unigrams and bigrams separately, so we will create the title field as a multifield (see “String Sorting and Multifields”):

PUT /my_index/_mapping/my_type
{
    "my_type": {
        "properties": {
            "title": {
                "type": "string",
                "fields": {
                    "shingles": {
                        "type": "string",
                        "analyzer": "my_shingle_analyzer"
                    }
                }
            }
        }
    }
}

With this mapping, values from our JSON document in the field title will be indexed both as unigrams (title) and as bigrams (title.shingles), meaning that we can query these fields independently.

And finally, we can index our example documents:

POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "title": "Sue ate the alligator" }
{ "index": { "_id": 2 }}
{ "title": "The alligator ate Sue" }
{ "index": { "_id": 3 }}
{ "title": "Sue never goes anywhere without her alligator skin purse" }

Searching for Shingles

To understand the benefit that the shingles field adds, let’s first look at the results from a simple match query for “The hungry alligator ate Sue”:

GET /my_index/my_type/_search
{
   "query": {
        "match": {
           "title": "the hungry alligator ate sue"
        }
   }
}

This query returns all three documents, but note that documents 1 and 2 have the same relevance score because they contain the same words:

{
  "hits": [
     {
        "_id": "1",
        "_score": 0.44273707, (1)
        "_source": {
           "title": "Sue ate the alligator"
        }
     },
     {
        "_id": "2",
        "_score": 0.44273707, (1)
        "_source": {
           "title": "The alligator ate Sue"
        }
     },
     {
        "_id": "3", (2)
        "_score": 0.046571054,
        "_source": {
           "title": "Sue never goes anywhere without her alligator skin purse"
        }
     }
  ]
}

(1) Both documents contain the, alligator, and ate and so have the same score.
(2) We could have excluded document 3 by setting the minimum_should_match parameter. See “Controlling Precision”.

Now let’s add the shingles field into the query. Remember that we want matches on the shingles field to act as a signal—to increase the relevance score—so we still need to include the query on the main title field:

GET /my_index/my_type/_search
{
   "query": {
      "bool": {
         "must": {
            "match": {
               "title": "the hungry alligator ate sue"
            }
         },
         "should": {
            "match": {
               "title.shingles": "the hungry alligator ate sue"
            }
         }
      }
   }
}

We still match all three documents, but document 2 has now been bumped into first place because it matched the shingled term ate sue.

{
  "hits": [
     {
        "_id": "2",
        "_score": 0.4883322,
        "_source": {
           "title": "The alligator ate Sue"
        }
     },
     {
        "_id": "1",
        "_score": 0.13422975,
        "_source": {
           "title": "Sue ate the alligator"
        }
     },
     {
        "_id": "3",
        "_score": 0.014119488,
        "_source": {
           "title": "Sue never goes anywhere without her alligator skin purse"
        }
     }
  ]
}

Even though our query included the word hungry, which doesn’t appear in any of our documents, we still managed to use word proximity to return the most relevant document first.

Performance

Not only are shingles more flexible than phrase queries, but they perform better as well. Instead of paying the price of a phrase query every time you search, queries for shingles are just as efficient as a simple match query. A small price is paid at index time, because more terms need to be indexed, which also means that fields with shingles use more disk space. However, most applications write once and read many times, so it makes sense to optimize for fast queries.

This is a theme that you will encounter frequently in Elasticsearch: it enables you to achieve a lot at search time, without requiring any up-front setup. Once you understand your requirements more clearly, you can achieve better results with better performance by modeling your data correctly at index time.