Structured Search - Search in Depth - Elasticsearch: The Definitive Guide (2015)


Part II. Search in Depth

In Part I we covered the basic tools in just enough detail to allow you to start searching your data with Elasticsearch. It won’t take long, though, before you find that you want more: more flexibility when matching user queries, more-accurate ranking of results, more-specific searches to cover different problem domains.

To move to the next level, it is not enough to just use the match query. You need to understand your data and how you want to be able to search it. The chapters in this part explain how to index and query your data to allow you to take advantage of word proximity, partial matching, fuzzy matching, and language awareness.

Understanding how each query contributes to the relevance _score will help you to tune your queries: to ensure that the documents you consider to be the best results appear on the first page, and to trim the “long tail” of barely relevant results.

Search is not just about full-text search: a large portion of your data will be structured values like dates and numbers. We will start by explaining how to combine structured search with full-text search in the most efficient way.

Chapter 12. Structured Search

Structured search is about interrogating data that has inherent structure. Dates, times, and numbers are all structured: they have a precise format that you can perform logical operations on. Common operations include comparing ranges of numbers or dates, or determining which of two values is larger.

Text can be structured too. A box of crayons has a discrete set of colors: red, green, blue. A blog post may be tagged with keywords distributed and search. Products in an ecommerce store have Universal Product Codes (UPCs) or some other identifier that requires strict and structured formatting.

With structured search, the answer to your question is always a yes or no; something either belongs in the set or it does not. Structured search does not worry about document relevance or scoring; it simply includes or excludes documents.

This should make sense logically. A number can’t be more in a range than any other number that falls in the same range. It is either in the range—or it isn’t. Similarly, for structured text, a value is either equal or it isn’t. There is no concept of more similar.

Finding Exact Values

When working with exact values, you will be working with filters. Filters are important because they are very, very fast. Filters do not calculate relevance (avoiding the entire scoring phase) and are easily cached. We’ll talk about the performance benefits of filters later in “All About Caching”, but for now, just keep in mind that you should use filters as often as you can.

term Filter with Numbers

We are going to explore the term filter first because you will use it often. This filter is capable of handling numbers, Booleans, dates, and text.

Let’s look at an example using numbers first by indexing some products. These documents have a price and a productID:

POST /my_store/products/_bulk

{ "index": { "_id": 1 }}

{ "price" : 10, "productID" : "XHDK-A-1293-#fJ3" }

{ "index": { "_id": 2 }}

{ "price" : 20, "productID" : "KDKE-B-9947-#kL5" }

{ "index": { "_id": 3 }}

{ "price" : 30, "productID" : "JODL-X-1937-#pV7" }

{ "index": { "_id": 4 }}

{ "price" : 30, "productID" : "QQPX-R-3956-#aD8" }

Our goal is to find all products with a certain price. If you are coming from a relational database background, you may be familiar with SQL. Expressed as SQL, the query would look like this:

SELECT document

FROM products

WHERE price = 20

In the Elasticsearch query DSL, we use a term filter to accomplish the same thing. The term filter will look for the exact value that we specify. By itself, a term filter is simple. It accepts a field name and the value that we wish to find:

{

"term" : {

"price" : 20

}

}

The term filter isn’t very useful on its own, though. As discussed in “Query DSL”, the search API expects a query, not a filter. To use our term filter, we need to wrap it with a filtered query:

GET /my_store/products/_search

{

"query" : {

"filtered" : { 1

"query" : {

"match_all" : {} 2

},

"filter" : {

"term" : { 3

"price" : 20

}

}

}

}

}

1

The filtered query accepts both a query and a filter.

2

A match_all is used to return all matching documents. This is the default behavior, so in future examples we will simply omit the query section.

3

The term filter that we saw previously. Notice how it is placed inside the filter clause.

Once executed, the search results from this query are exactly what you would expect: only document 2 is returned as a hit (because only 2 had a price of 20):

"hits" : [

{

"_index" : "my_store",

"_type" : "products",

"_id" : "2",

"_score" : 1.0, 1

"_source" : {

"price" : 20,

"productID" : "KDKE-B-9947-#kL5"

}

}

]

1

Filters do not perform scoring or relevance. The score comes from the match_all query, which treats all docs as equal, so all results receive a neutral score of 1.

term Filter with Text

As mentioned at the top of this section, the term filter can match strings just as easily as numbers. Instead of price, let’s try to find products that have a certain UPC identification code. To do this with SQL, we might use a query like this:

SELECT product

FROM products

WHERE productID = "XHDK-A-1293-#fJ3"

Translated into the query DSL, we can try a similar query with the term filter, like so:

GET /my_store/products/_search

{

"query" : {

"filtered" : {

"filter" : {

"term" : {

"productID" : "XHDK-A-1293-#fJ3"

}

}

}

}

}

Except there is a little hiccup: we don’t get any results back! Why is that? The problem isn’t with the term filter; it is with the way the data has been indexed. If we use the analyze API (“Testing Analyzers”), we can see that our UPC has been tokenized into smaller tokens:

GET /my_store/_analyze?field=productID

XHDK-A-1293-#fJ3

{

"tokens" : [ {

"token" : "xhdk",

"start_offset" : 0,

"end_offset" : 4,

"type" : "<ALPHANUM>",

"position" : 1

}, {

"token" : "a",

"start_offset" : 5,

"end_offset" : 6,

"type" : "<ALPHANUM>",

"position" : 2

}, {

"token" : "1293",

"start_offset" : 7,

"end_offset" : 11,

"type" : "<NUM>",

"position" : 3

}, {

"token" : "fj3",

"start_offset" : 13,

"end_offset" : 16,

"type" : "<ALPHANUM>",

"position" : 4

} ]

}

There are a few important points here:

§ We have four distinct tokens instead of a single token representing the UPC.

§ All letters have been lowercased.

§ We lost the hyphen and the hash (#) sign.

So when our term filter looks for the exact value XHDK-A-1293-#fJ3, it doesn’t find anything, because that token does not exist in our inverted index. Instead, there are the four tokens listed previously.

Obviously, this is not what we want to happen when dealing with identification codes, or any kind of precise enumeration.

To prevent this from happening, we need to tell Elasticsearch that this field contains an exact value by setting it to be not_analyzed. We saw this originally in “Customizing Field Mappings”. To do this, we need to first delete our old index (because it has the incorrect mapping) and create a new one with the correct mappings:

DELETE /my_store 1

PUT /my_store 2

{

"mappings" : {

"products" : {

"properties" : {

"productID" : {

"type" : "string",

"index" : "not_analyzed" 3

}

}

}

}

}

1

Deleting the index first is required, since we cannot change mappings that already exist.

2

With the index deleted, we can re-create it with our custom mapping.

3

Here we explicitly say that we don’t want productID to be analyzed.

Now we can go ahead and reindex our documents:

POST /my_store/products/_bulk

{ "index": { "_id": 1 }}

{ "price" : 10, "productID" : "XHDK-A-1293-#fJ3" }

{ "index": { "_id": 2 }}

{ "price" : 20, "productID" : "KDKE-B-9947-#kL5" }

{ "index": { "_id": 3 }}

{ "price" : 30, "productID" : "JODL-X-1937-#pV7" }

{ "index": { "_id": 4 }}

{ "price" : 30, "productID" : "QQPX-R-3956-#aD8" }

Only now will our term filter work as expected. Let’s try it again on the newly indexed data (notice, the query and filter have not changed at all, just how the data is mapped):

GET /my_store/products/_search

{

"query" : {

"filtered" : {

"filter" : {

"term" : {

"productID" : "XHDK-A-1293-#fJ3"

}

}

}

}

}

Since the productID field is not analyzed, and the term filter performs no analysis, the query finds the exact match and returns document 1 as a hit. Success!

Internal Filter Operation

Internally, Elasticsearch is performing several operations when executing a filter:

1. Find matching docs.

The term filter looks up the term XHDK-A-1293-#fJ3 in the inverted index and retrieves the list of documents that contain that term. In this case, only document 1 has the term we are looking for.

2. Build a bitset.

The filter then builds a bitset (an array of 1s and 0s) that describes which documents contain the term. Matching documents receive a 1 bit. In our example, the bitset would be [1,0,0,0].

3. Cache the bitset.

Last, the bitset is stored in memory, so that it can be reused in the future, skipping steps 1 and 2. This caching is a big performance win and is a large part of what makes filters so fast.

When executing a filtered query, the filter is executed before the query. The resulting bitset is given to the query, which uses it to simply skip over any documents that have already been excluded by the filter. This is one of the ways that filters can improve performance. Fewer documents evaluated by the query means faster response times.
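These three steps can be sketched in a few lines of Python. This is a conceptual model only, not how Elasticsearch is actually implemented (Lucene uses compressed bitset structures internally), but it illustrates the mechanics:

```python
# Conceptual model of filter execution -- not Elasticsearch's real code,
# just an illustration of the bitset mechanics described above.

# A tiny inverted index mapping terms to the documents that contain them
# (positions 0-3 correspond to documents 1-4 from the bulk index above).
inverted_index = {
    "XHDK-A-1293-#fJ3": [0],
    "KDKE-B-9947-#kL5": [1],
    "JODL-X-1937-#pV7": [2],
    "QQPX-R-3956-#aD8": [3],
}
num_docs = 4

def term_filter_bitset(term):
    """Step 1 + 2: look up the term, then build a bitset of matching docs."""
    matching = inverted_index.get(term, [])
    return [1 if i in matching else 0 for i in range(num_docs)]

# Step 3 would cache this bitset, keyed by the filter definition.
print(term_filter_bitset("XHDK-A-1293-#fJ3"))  # [1, 0, 0, 0] -- only doc 1
```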

Combining Filters

The previous two examples showed a single filter in use. In practice, you will probably need to filter on multiple values or fields. For example, how would you express this SQL in Elasticsearch?

SELECT product

FROM products

WHERE (price = 20 OR productID = "XHDK-A-1293-#fJ3")

AND (price != 30)

In these situations, you will need the bool filter. This is a compound filter that accepts other filters as arguments, combining them in various Boolean combinations.

Bool Filter

The bool filter is composed of three sections:

{

"bool" : {

"must" : [],

"should" : [],

"must_not" : []

}

}

must

All of these clauses must match. The equivalent of AND.

must_not

All of these clauses must not match. The equivalent of NOT.

should

At least one of these clauses must match. The equivalent of OR.

And that’s it! When you need multiple filters, simply place them into the different sections of the bool filter.

NOTE

Each section of the bool filter is optional (for example, you can have a must clause and nothing else), and each section can contain a single filter or an array of filters.
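As a rough sketch (conceptual Python, not Elasticsearch internals), the three sections amount to bitwise combinations of the child filters’ bitsets. The document bitsets below correspond to the four products indexed earlier:

```python
# Simplified model of how a bool filter combines child bitsets:
# AND for must, OR for should, AND NOT for must_not.

def combine_bool(must=(), should=(), must_not=(), num_docs=0):
    result = [1] * num_docs
    for bits in must:                      # every must clause has to match
        result = [a & b for a, b in zip(result, bits)]
    if should:                             # at least one should clause matches
        any_should = [0] * num_docs
        for bits in should:
            any_should = [a | b for a, b in zip(any_should, bits)]
        result = [a & b for a, b in zip(result, any_should)]
    for bits in must_not:                  # must_not clauses exclude docs
        result = [a & (1 - b) for a, b in zip(result, bits)]
    return result

# Docs 1-4 have prices [10, 20, 30, 30]; doc 1 holds the XHDK... UPC.
price_20  = [0, 1, 0, 0]
upc_match = [1, 0, 0, 0]
price_30  = [0, 0, 1, 1]

hits = combine_bool(should=[price_20, upc_match],
                    must_not=[price_30], num_docs=4)
print(hits)  # [1, 1, 0, 0] -- documents 1 and 2, as in the SQL example
```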

To replicate the preceding SQL example, we will take the two term filters that we used previously and place them inside the should clause of a bool filter, and add another clause to deal with the NOT condition:

GET /my_store/products/_search

{

"query" : {

"filtered" : { 1

"filter" : {

"bool" : {

"should" : [

{ "term" : {"price" : 20}}, 2

{ "term" : {"productID" : "XHDK-A-1293-#fJ3"}} 2

],

"must_not" : {

"term" : {"price" : 30} 3

}

}

}

}

}

}

1

Note that we still need to use a filtered query to wrap everything.

2

These two term filters are children of the bool filter, and since they are placed inside the should clause, at least one of them needs to match.

3

If a product has a price of 30, it is automatically excluded because it matches a must_not clause.

Our search results return two hits, each document satisfying a different clause in the bool filter:

"hits" : [

{

"_id" : "1",

"_score" : 1.0,

"_source" : {

"price" : 10,

"productID" : "XHDK-A-1293-#fJ3" 1

}

},

{

"_id" : "2",

"_score" : 1.0,

"_source" : {

"price" : 20, 2

"productID" : "KDKE-B-9947-#kL5"

}

}

]

1

Matches the term filter for productID = "XHDK-A-1293-#fJ3"

2

Matches the term filter for price = 20

Nesting Boolean Filters

Even though bool is a compound filter and accepts children filters, it is important to understand that bool is just a filter itself. This means you can nest bool filters inside other bool filters, giving you the ability to make arbitrarily complex Boolean logic.

Given this SQL statement:

SELECT document

FROM products

WHERE productID = "KDKE-B-9947-#kL5"

OR ( productID = "JODL-X-1937-#pV7"

AND price = 30 )

We can translate it into a pair of nested bool filters:

GET /my_store/products/_search

{

"query" : {

"filtered" : {

"filter" : {

"bool" : {

"should" : [

{ "term" : {"productID" : "KDKE-B-9947-#kL5"}}, 1

{ "bool" : { 1

"must" : [

{ "term" : {"productID" : "JODL-X-1937-#pV7"}}, 2

{ "term" : {"price" : 30}} 2

]

}}

]

}

}

}

}

}

1

Because the term and the bool are sibling clauses inside the first Boolean should, at least one of these filters must match for a document to be a hit.

2

These two term clauses are siblings in a must clause, so they both have to match for a document to be returned as a hit.

The results show us two documents, one matching each of the should clauses:

"hits" : [

{

"_id" : "2",

"_score" : 1.0,

"_source" : {

"price" : 20,

"productID" : "KDKE-B-9947-#kL5" 1

}

},

{

"_id" : "3",

"_score" : 1.0,

"_source" : {

"price" : 30, 2

"productID" : "JODL-X-1937-#pV7" 2

}

}

]

1

This productID matches the term in the first bool.

2

These two fields match the term filters in the nested bool.

This was a simple example, but it demonstrates how Boolean filters can be used as building blocks to construct complex logical conditions.

Finding Multiple Exact Values

The term filter is useful for finding a single value, but often you’ll want to search for multiple values. What if you want to find documents that have a price of $20 or $30?

Rather than using multiple term filters, you can instead use a single terms filter (note the s at the end). The terms filter is simply the plural version of the singular term filter.

It looks nearly identical to a vanilla term too. Instead of specifying a single price, we are now specifying an array of values:

{

"terms" : {

"price" : [20, 30]

}

}

And like the term filter, we will place it inside a filtered query to use it:

GET /my_store/products/_search

{

"query" : {

"filtered" : {

"filter" : {

"terms" : { 1

"price" : [20, 30]

}

}

}

}

}

1

The terms filter as seen previously, but placed inside the filtered query

The query will return the second, third, and fourth documents:

"hits" : [

{

"_id" : "2",

"_score" : 1.0,

"_source" : {

"price" : 20,

"productID" : "KDKE-B-9947-#kL5"

}

},

{

"_id" : "3",

"_score" : 1.0,

"_source" : {

"price" : 30,

"productID" : "JODL-X-1937-#pV7"

}

},

{

"_id": "4",

"_score": 1.0,

"_source": {

"price": 30,

"productID": "QQPX-R-3956-#aD8"

}

}

]

Contains, but Does Not Equal

It is important to understand that term and terms are contains operations, not equals. What does that mean?

If you have a term filter for { "term" : { "tags" : "search" } }, it will match both of the following documents:

{ "tags" : ["search"] }

{ "tags" : ["search", "open_source"] } 1

1

This document is returned, even though it has terms other than search.

Recall how the term filter works: it checks the inverted index for all documents that contain a term, and then constructs a bitset. In our simple example, we have the following inverted index:

Token

DocIDs

open_source

2

search

1,2

When a term filter is executed for the token search, it goes straight to the corresponding entry in the inverted index and extracts the associated doc IDs. As you can see, both document 1 and document 2 contain the token in the inverted index. Therefore, they are both returned as a result.

NOTE

The nature of an inverted index also means that entire field equality is rather difficult to calculate. How would you determine whether a particular document contains only your request term? You would have to find the term in the inverted index, extract the document IDs, and then scan every row in the inverted index, looking for those IDs to see whether a doc has any other terms.

As you might imagine, that would be tremendously inefficient and expensive. For that reason, term and terms are must contain operations, not must equal exactly.
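This cost difference can be illustrated with a small Python sketch of the inverted index above. The equals_exactly function is hypothetical, purely to show the full index scan that whole-field equality would require:

```python
# Conceptual illustration: "contains" is a single lookup, while
# whole-field equality would require scanning the entire inverted index.

inverted_index = {
    "open_source": {2},
    "search": {1, 2},
}

def contains(term):
    """term filter semantics: one lookup in the inverted index."""
    return inverted_index.get(term, set())

def equals_exactly(term):
    """Hypothetical equality check: start from the contains set, then
    remove any doc that also appears under some *other* term -- which
    means walking every entry in the index."""
    candidates = set(contains(term))
    for other_term, doc_ids in inverted_index.items():
        if other_term != term:
            candidates -= doc_ids
    return candidates

print(contains("search"))        # {1, 2} -- both docs contain the token
print(equals_exactly("search"))  # {1} -- only doc 1 has *just* that token
```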

Equals Exactly

If you do want that behavior—entire field equality—the best way to accomplish it involves indexing a secondary field. In this field, you index the number of values that your field contains. Using our two previous documents, we now include a field that maintains the number of tags:

{ "tags" : ["search"], "tag_count" : 1 }

{ "tags" : ["search", "open_source"], "tag_count" : 2 }

Once you have the count information indexed, you can construct a bool filter that enforces the appropriate number of terms:

GET /my_index/my_type/_search

{

"query": {

"filtered" : {

"filter" : {

"bool" : {

"must" : [

{ "term" : { "tags" : "search" } }, 1

{ "term" : { "tag_count" : 1 } } 2

]

}

}

}

}

}

1

Find all documents that have the term search.

2

But make sure the document has only one tag.

This query will now match only the document that has a single tag that is search, rather than any document that contains search.

Ranges

When dealing with numbers in this chapter, we have so far searched for only exact numbers. In practice, filtering on ranges is often more useful. For example, you might want to find all products with a price greater than $20 and less than $40.

In SQL terms, a range can be expressed as follows:

SELECT document

FROM products

WHERE price BETWEEN 20 AND 40

Elasticsearch has a range filter, which, unsurprisingly, allows you to filter ranges:

"range" : {

"price" : {

"gt" : 20,

"lt" : 40

}

}

The range filter supports both inclusive and exclusive ranges, through combinations of the following options:

§ gt: > greater than

§ lt: < less than

§ gte: >= greater than or equal to

§ lte: <= less than or equal to

GET /my_store/products/_search

{

"query" : {

"filtered" : {

"filter" : {

"range" : {

"price" : {

"gte" : 20,

"lt" : 40

}

}

}

}

}

}

If you need an unbounded range (for example, just >20), omit one of the boundaries:

"range" : {

"price" : {

"gt" : 20

}

}

Ranges on Dates

The range filter can be used on date fields too:

"range" : {

"timestamp" : {

"gt" : "2014-01-01 00:00:00",

"lt" : "2014-01-07 00:00:00"

}

}

When used on date fields, the range filter supports date math operations. For example, if we want to find all documents that have a timestamp sometime in the last hour:

"range" : {

"timestamp" : {

"gt" : "now-1h"

}

}

Because now is re-evaluated every time the filter is executed, this filter will always match documents with a timestamp greater than the current time minus 1 hour, making the filter a sliding window across your documents.

Date math can also be applied to actual dates, rather than a placeholder like now. Just add a double pipe (||) after the date and follow it with a date math expression:

"range" : {

"timestamp" : {

"gt" : "2014-01-01 00:00:00",

"lt" : "2014-01-01 00:00:00||+1M" 1

}

}

1

Less than January 1, 2014 plus one month

Date math is calendar aware, so it knows the number of days in each month, days in a year, and so forth. More details about working with dates can be found in the date format reference documentation.
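To build intuition for the sliding window, here is a small Python sketch using the standard library’s datetime. Elasticsearch evaluates date math itself, server side; this is only an analogy:

```python
from datetime import datetime, timedelta

# Sketch of the "now-1h" sliding window: each time the filter runs,
# "now" is re-evaluated, so the window slides forward with the clock.
def last_hour_filter(timestamps, now):
    cutoff = now - timedelta(hours=1)   # equivalent of "now-1h"
    return [ts for ts in timestamps if ts > cutoff]

now = datetime(2014, 1, 7, 12, 0, 0)
docs = [
    datetime(2014, 1, 7, 11, 30),   # 30 minutes ago -> inside the window
    datetime(2014, 1, 7, 10, 30),   # 90 minutes ago -> outside
]
print(last_hour_filter(docs, now))  # only the 11:30 timestamp survives
```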

Ranges on Strings

The range filter can also operate on string fields. String ranges are calculated lexicographically or alphabetically. For example, these values are sorted in lexicographic order:

§ 5, 50, 6, B, C, a, ab, abb, abc, b

NOTE

Terms in the inverted index are sorted in lexicographical order, which is why string ranges use this order.
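Python’s default string sort is also lexicographic (by code point), so the ordering above can be reproduced directly:

```python
# Lexicographic (code point) ordering: digits sort before uppercase
# letters, which sort before lowercase letters.
values = ["b", "abc", "a", "5", "C", "ab", "B", "50", "6", "abb"]
print(sorted(values))
# ['5', '50', '6', 'B', 'C', 'a', 'ab', 'abb', 'abc', 'b']
```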

If we want a range from a up to but not including b, we can use the same range filter syntax:

"range" : {

"title" : {

"gte" : "a",

"lt" : "b"

}

}

BE CAREFUL OF CARDINALITY

Numeric and date fields are indexed in such a way that ranges are efficient to calculate. This is not the case for string fields, however. To perform a range on a string field, Elasticsearch is effectively performing a term filter for every term that falls in the range. This is much slower than a date or numeric range.

String ranges are fine on a field with low cardinality—a small number of unique terms. But the more unique terms you have, the slower the string range will be.

Dealing with Null Values

Think back to our earlier example, where documents have a field named tags. This is a multivalue field. A document may have one tag, many tags, or potentially no tags at all. If a field has no values, how is it stored in an inverted index?

That’s a trick question, because the answer is, it isn’t stored at all. Let’s look at that inverted index from the previous section:

Token

DocIDs

open_source

2

search

1,2

How would you store a field that doesn’t exist in that data structure? You can’t! An inverted index is simply a list of tokens and the documents that contain them. If a field doesn’t exist, it doesn’t hold any tokens, which means it won’t be represented in an inverted index data structure.

Ultimately, this means that null, [] (an empty array), and [null] are all equivalent. They simply don’t exist in the inverted index!

Obviously, the world is not simple, and data is often missing fields, or contains explicit nulls or empty arrays. To deal with these situations, Elasticsearch has a few tools to work with null or missing values.

exists Filter

The first tool in your arsenal is the exists filter. This filter will return documents that have any value in the specified field. Let’s use the tagging example and index some example documents:

POST /my_index/posts/_bulk

{ "index": { "_id": "1" }}

{ "tags" : ["search"] } 1

{ "index": { "_id": "2" }}

{ "tags" : ["search", "open_source"] } 2

{ "index": { "_id": "3" }}

{ "other_field" : "some data" } 3

{ "index": { "_id": "4" }}

{ "tags" : null } 4

{ "index": { "_id": "5" }}

{ "tags" : ["search", null] } 5

1

The tags field has one value.

2

The tags field has two values.

3

The tags field is missing altogether.

4

The tags field is set to null.

5

The tags field has one value and a null.

The resulting inverted index for our tags field will look like this:

Token

DocIDs

open_source

2

search

1,2,5

Our objective is to find all documents where a tag is set. We don’t care what the tag is, so long as it exists within the document. In SQL parlance, we would use an IS NOT NULL query:

SELECT tags

FROM posts

WHERE tags IS NOT NULL

In Elasticsearch, we use the exists filter:

GET /my_index/posts/_search

{

"query" : {

"filtered" : {

"filter" : {

"exists" : { "field" : "tags" }

}

}

}

}

Our query returns three documents:

"hits" : [

{

"_id" : "1",

"_score" : 1.0,

"_source" : { "tags" : ["search"] }

},

{

"_id" : "5",

"_score" : 1.0,

"_source" : { "tags" : ["search", null] } 1

},

{

"_id" : "2",

"_score" : 1.0,

"_source" : { "tags" : ["search", "open_source"] }

}

]

1

Document 5 is returned even though it contains a null value. The field exists because a real value (search) was indexed, so the null had no impact on the filter.

The results are easy to understand. Any document that has terms in the tags field was returned as a hit. The only two documents that were excluded were documents 3 and 4.

missing Filter

The missing filter is essentially the inverse of exists: it returns documents where there is no value for a particular field, much like this SQL:

SELECT tags

FROM posts

WHERE tags IS NULL

Let’s swap the exists filter for a missing filter from our previous example:

GET /my_index/posts/_search

{

"query" : {

"filtered" : {

"filter": {

"missing" : { "field" : "tags" }

}

}

}

}

And, as you would expect, we get back the two docs that have no real values in the tags field—documents 3 and 4:

"hits" : [

{

"_id" : "3",

"_score" : 1.0,

"_source" : { "other_field" : "some data" }

},

{

"_id" : "4",

"_score" : 1.0,

"_source" : { "tags" : null }

}

]

WHEN NULL MEANS NULL

Sometimes you need to be able to distinguish between a field that doesn’t have a value, and a field that has been explicitly set to null. With the default behavior that we saw previously, this is impossible; the data is lost. Luckily, there is an option that we can set that replaces explicit null values with a placeholder value of our choosing.

When specifying the mapping for a string, numeric, Boolean, or date field, you can also set a null_value that will be used whenever an explicit null value is encountered. A field without a value will still be excluded from the inverted index.

When choosing a suitable null_value, ensure the following:

§ It matches the field’s type. You can’t use a string null_value in a field of type date.

§ It is different from the normal values that the field may contain, to avoid confusing real values with null values.
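For example, a numeric field could map explicit nulls to a sentinel value in its mapping. The field name and sentinel below are illustrative, not part of the earlier examples:

```
PUT /my_index
{
    "mappings" : {
        "my_type" : {
            "properties" : {
                "quantity" : {
                    "type" : "integer",
                    "null_value" : -1
                }
            }
        }
    }
}
```

With this mapping, a document indexed with "quantity" : null is stored as if it had the value -1, so it can be found with a term filter for -1, while a document missing the field entirely is still absent from the inverted index.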

exists/missing on Objects

The exists and missing filters also work on inner objects, not just core types. With the following document

{

"name" : {

"first" : "John",

"last" : "Smith"

}

}

you can check for the existence of name.first and name.last but also just name. However, in “Types and Mappings”, we said that an object like the preceding one is flattened internally into a simple field-value structure, much like this:

{

"name.first" : "John",

"name.last" : "Smith"

}

So how can we use an exists or missing filter on the name field, which doesn’t really exist in the inverted index?

The reason that it works is that a filter like

{

"exists" : { "field" : "name" }

}

is really executed as

{

"bool": {

"should": [

{ "exists": { "field": "name.first" }},

{ "exists": { "field": "name.last" }}

]

}

}

That also means that if first and last were both empty, the name namespace would not exist.

All About Caching

Earlier in this chapter (“Internal Filter Operation”), we briefly discussed how filters are calculated. At their heart is a bitset representing which documents match the filter. Elasticsearch aggressively caches these bitsets for later use. Once cached, these bitsets can be reused wherever the same filter is used, without having to reevaluate the entire filter again.

These cached bitsets are “smart”: they are updated incrementally. As you index new documents, only those new documents need to be added to the existing bitsets, rather than having to recompute the entire cached filter over and over. Filters are real-time like the rest of the system; you don’t need to worry about cache expiry.

Independent Filter Caching

Each filter is calculated and cached independently, regardless of where it is used. If two different queries use the same filter, the same filter bitset will be reused. Likewise, if a single query uses the same filter in multiple places, only one bitset is calculated and then reused.

Let’s look at this example query, which looks for emails that are either of the following:

§ In the inbox and have not been read

§ Not in the inbox but have been marked as important

"bool": {

"should": [

{ "bool": {

"must": [

{ "term": { "folder": "inbox" }}, 1

{ "term": { "read": false }}

]

}},

{ "bool": {

"must_not": {

"term": { "folder": "inbox" } 1

},

"must": {

"term": { "important": true }

}

}}

]

}

1

These two filters are identical and will use the same bitset.

Even though one of the inbox clauses is a must clause and the other is a must_not clause, the two clauses themselves are identical. This means that the bitset is calculated once for the first clause that is executed, and then the cached bitset is used for the other clause. By the time this query is run a second time, the inbox filter is already cached and so both clauses will use the cached bitset.

This ties in nicely with the composability of the query DSL. It is easy to move filters around, or reuse the same filter in multiple places within the same query. This isn’t just convenient to the developer—it has direct performance benefits.

Controlling Caching

Most leaf filters—those dealing directly with fields like the term filter—are cached, while compound filters, like the bool filter, are not.

NOTE

Leaf filters have to consult the inverted index on disk, so it makes sense to cache them. Compound filters, on the other hand, use fast bit logic to combine the bitsets resulting from their inner clauses, so it is efficient to recalculate them every time.

Certain leaf filters, however, are not cached by default, because it doesn’t make sense to do so:

Script filters

The results from script filters cannot be cached because the meaning of the script is opaque to Elasticsearch.

Geo-filters

The geolocation filters, which we cover in more detail in Part V, are usually used to filter results based on the geolocation of a specific user. Since each user has a unique geolocation, it is unlikely that geo-filters will be reused, so it makes no sense to cache them.

Date ranges

Date ranges that use the now function (for example, "now-1h") result in values accurate to the millisecond. Every time the filter is run, now returns a new time. Older filters will never be reused, so caching is disabled by default. However, when using now with rounding (for example, now/d rounds to the nearest day), caching is enabled by default.

Sometimes the default caching strategy is not correct. Perhaps you have a complicated bool expression that is reused several times in the same query. Or you have a filter on a date field that will never be reused. The default caching strategy can be overridden on almost any filter by setting the _cache flag:

{

"range" : {

"timestamp" : {

"gt" : "2014-01-02 16:15:14" 1

},

"_cache": false 2

}

}

1

It is unlikely that we will reuse this exact timestamp.

2

Disable caching of this filter.

Later chapters provide examples of when it can make sense to override the default caching strategy.

Filter Order

The order of filters in a bool clause is important for performance. More-specific filters should be placed before less-specific filters in order to exclude as many documents as possible, as early as possible.

If Clause A could match 10 million documents, and Clause B could match only 100 documents, then Clause B should be placed before Clause A.
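A back-of-the-envelope cost model (conceptual Python, assuming each clause examines only the documents that survived the previous clauses) shows why ordering matters:

```python
# Conceptual cost model: applying the most selective filter first means
# that later filters only have to examine the surviving documents.

def docs_examined(clauses, num_docs):
    """clauses: list of (name, matching_fraction) pairs.
    Returns the total number of document checks performed."""
    examined = 0
    surviving = num_docs
    for name, fraction in clauses:
        examined += surviving              # each clause checks the survivors
        surviving = int(surviving * fraction)
    return examined

num_docs = 10_000_000
broad  = ("clause_a", 1.0)       # matches nearly everything
narrow = ("clause_b", 0.00001)   # matches ~100 docs

print(docs_examined([broad, narrow], num_docs))   # 20000000 checks
print(docs_examined([narrow, broad], num_docs))   # 10000100 checks
```

Placing the narrow clause first roughly halves the work in this toy model; real execution is more sophisticated, but the intuition carries over.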

Cached filters are very fast, so they should be placed before filters that are not cacheable. Imagine that we have an index that contains one month’s worth of log events. However, we’re mostly interested only in log events from the previous hour:

GET /logs/2014-01/_search

{

"query" : {

"filtered" : {

"filter" : {

"range" : {

"timestamp" : {

"gt" : "now-1h"

}

}

}

}

}

}

This filter is not cached because it uses the now function, the value of which changes every millisecond. That means that we have to examine one month’s worth of log events every time we run this query!

We could make this much more efficient by combining it with a cached filter: we can exclude most of the month’s data by adding a filter that uses a fixed point in time, such as midnight last night:

"bool": {

"must": [

{ "range" : {

"timestamp" : {

"gt" : "now-1h/d" 1

}

}},

{ "range" : {

"timestamp" : {

"gt" : "now-1h" 2

}

}}

]

}

1

This filter is cached because it uses now rounded to midnight.

2

This filter is not cached because it uses now without rounding.

The now-1h/d clause rounds to the previous midnight and so excludes all documents created before today. The resulting bitset is cached because now is used with rounding, which means that it is executed only once a day, when the value for midnight-last-night changes. The now-1h clause isn’t cached because now produces a time accurate to the nearest millisecond. However, thanks to the first filter, this second filter need only check documents that have been created since midnight.

The order of these clauses is important. This approach works only because the since-midnight clause comes before the last-hour clause. If they were the other way around, then the last-hour clause would need to examine all documents in the index, instead of just documents created since midnight.