Full-Body Search - Getting Started - Elasticsearch: The Definitive Guide (2015)

Elasticsearch: The Definitive Guide (2015)

Part I. Getting Started

Chapter 7. Full-Body Search

Search lite—a query-string search—is useful for ad hoc queries from the command line. To harness the full power of search, however, you should use the request body search API, so called because most parameters are passed in the HTTP request body instead of in the query string.

Request body search—henceforth known as search—not only handles the query itself, but also allows you to return highlighted snippets from your results, aggregate analytics across all results or subsets of results, and return did-you-mean suggestions, which will help guide your users to the best results quickly.

Empty Search

Let’s start with the simplest form of the search API, the empty search, which returns all documents in all indices:

GET /_search

{} 1

1

This is an empty request body.

Just as with a query-string search, you can search on one, many, or _all indices, and one, many, or all types:

GET /index_2014*/type1,type2/_search

{}

And you can use the from and size parameters for pagination:

GET /_search

{

"from": 30,

"size": 10

}

A GET REQUEST WITH A BODY?

The HTTP libraries of certain languages (notably JavaScript) don’t allow GET requests to have a request body. In fact, some users are suprised that GET requests are ever allowed to have a body.

The truth is that RFC 7231—the RFC that deals with HTTP semantics and content—does not define what should happen to a GET request with a body! As a result, some HTTP servers allow it, and some—especially caching proxies—don’t.

The authors of Elasticsearch prefer using GET for a search request because they feel that it describes the action—retrieving information—better than the POST verb. However, because GET with a request body is not universally supported, the search API also accepts POST requests:

POST /_search

{

"from": 30,

"size": 10

}

The same rule applies to any other GET API that requires a request body.

We present aggregations in depth in Part IV, but for now, we’re going to focus just on the query.

Instead of the cryptic query-string approach, a request body search allows us to write queries by using the query domain-specific language, or query DSL.

Query DSL

The query DSL is a flexible, expressive search language that Elasticsearch uses to expose most of the power of Lucene through a simple JSON interface. It is what you should be using to write your queries in production. It makes your queries more flexible, more precise, easier to read, and easier to debug.

To use the Query DSL, pass a query in the query parameter:

GET /_search

{

"query": YOUR_QUERY_HERE

}

The empty search—{}—is functionally equivalent to using the match_all query clause, which, as the name suggests, matches all documents:

GET /_search

{

"query": {

"match_all": {}

}

}

Structure of a Query Clause

A query clause typically has this structure:

{

QUERY_NAME: {

ARGUMENT: VALUE,

ARGUMENT: VALUE,...

}

}

If it references one particular field, it has this structure:

{

QUERY_NAME: {

FIELD_NAME: {

ARGUMENT: VALUE,

ARGUMENT: VALUE,...

}

}

}

For instance, you can use a match query clause to find tweets that mention elasticsearch in the tweet field:

{

"match": {

"tweet": "elasticsearch"

}

}

The full search request would look like this:

GET /_search

{

"query": {

"match": {

"tweet": "elasticsearch"

}

}

}

Combining Multiple Clauses

Query clauses are simple building blocks that can be combined with each other to create complex queries. Clauses can be as follows:

§ Leaf clauses (like the match clause) that are used to compare a field (or fields) to a query string.

§ Compound clauses that are used to combine other query clauses. For instance, a bool clause allows you to combine other clauses that either must match, must_not match, or should match if possible:

{

"bool": {

"must": { "match": { "tweet": "elasticsearch" }},

"must_not": { "match": { "name": "mary" }},

"should": { "match": { "tweet": "full text" }}

}

}

It is important to note that a compound clause can combine any other query clauses, including other compound clauses. This means that compound clauses can be nested within each other, allowing the expression of very complex logic.

As an example, the following query looks for emails that contain business opportunity and should either be starred, or be both in the Inbox and not marked as spam:

{

"bool": {

"must": { "match": { "email": "business opportunity" }},

"should": [

{ "match": { "starred": true }},

{ "bool": {

"must": { "folder": "inbox" }},

"must_not": { "spam": true }}

}}

],

"minimum_should_match": 1

}

}

Don’t worry about the details of this example yet; we will explain in full later. The important thing to take away is that a compound query clause can combine multiple clauses—both leaf clauses and other compound clauses—into a single query.

Queries and Filters

Although we refer to the query DSL, in reality there are two DSLs: the query DSL and the filter DSL. Query clauses and filter clauses are similar in nature, but have slightly different purposes.

A filter asks a yes|no question of every document and is used for fields that contain exact values:

§ Is the created date in the range 2013 - 2014?

§ Does the status field contain the term published?

§ Is the lat_lon field within 10km of a specified point?

A query is similar to a filter, but also asks the question: How well does this document match?

A typical use for a query is to find documents

§ Best matching the words full text search

§ Containing the word run, but maybe also matching runs, running, jog, or sprint

§ Containing the words quick, brown, and fox—the closer together they are, the more relevant the document

§ Tagged with lucene, search, or java—the more tags, the more relevant the document

A query calculates how relevant each document is to the query, and assigns it a relevance _score, which is later used to sort matching documents by relevance. This concept of relevance is well suited to full-text search, where there is seldom a completely “correct” answer.

Performance Differences

The output from most filter clauses—a simple list of the documents that match the filter—is quick to calculate and easy to cache in memory, using only 1 bit per document. These cached filters can be reused efficiently for subsequent requests.

Queries have to not only find matching documents, but also calculate how relevant each document is, which typically makes queries heavier than filters. Also, query results are not cachable.

Thanks to the inverted index, a simple query that matches just a few documents may perform as well or better than a cached filter that spans millions of documents. In general, however, a cached filter will outperform a query, and will do so consistently.

The goal of filters is to reduce the number of documents that have to be examined by the query.

When to Use Which

As a general rule, use query clauses for full-text search or for any condition that should affect the relevance score, and use filter clauses for everything else.

Most Important Queries and Filters

While Elasticsearch comes with many queries and filters, you will use just a few frequently. We discuss them in much greater detail in Part II but next we give you a quick introduction to the most important queries and filters.

term Filter

The term filter is used to filter by exact values, be they numbers, dates, Booleans, or not_analyzed exact-value string fields:

{ "term": { "age": 26 }}

{ "term": { "date": "2014-09-01" }}

{ "term": { "public": true }}

{ "term": { "tag": "full_text" }}

terms Filter

The terms filter is the same as the term filter, but allows you to specify multiple values to match. If the field contains any of the specified values, the document matches:

{ "terms": { "tag": [ "search", "full_text", "nosql" ] }}

range Filter

The range filter allows you to find numbers or dates that fall into a specified range:

{

"range": {

"age": {

"gte": 20,

"lt": 30

}

}

}

The operators that it accepts are as follows:

gt

Greater than

gte

Greater than or equal to

lt

Less than

lte

Less than or equal to

exists and missing Filters

The exists and missing filters are used to find documents in which the specified field either has one or more values (exists) or doesn’t have any values (missing). It is similar in nature to IS_NULL (missing) and NOT IS_NULL (exists)in SQL:

{

"exists": {

"field": "title"

}

}

These filters are frequently used to apply a condition only if a field is present, and to apply a different condition if it is missing.

bool Filter

The bool filter is used to combine multiple filter clauses using Boolean logic. It accepts three parameters:

must

These clauses must match, like and.

must_not

These clauses must not match, like not.

should

At least one of these clauses must match, like or.

Each of these parameters can accept a single filter clause or an array of filter clauses:

{

"bool": {

"must": { "term": { "folder": "inbox" }},

"must_not": { "term": { "tag": "spam" }},

"should": [

{ "term": { "starred": true }},

{ "term": { "unread": true }}

]

}

}

match_all Query

The match_all query simply matches all documents. It is the default query that is used if no query has been specified:

{ "match_all": {}}

This query is frequently used in combination with a filter—for instance, to retrieve all emails in the inbox folder. All documents are considered to be equally relevant, so they all receive a neutral _score of 1.

match Query

The match query should be the standard query that you reach for whenever you want to query for a full-text or exact value in almost any field.

If you run a match query against a full-text field, it will analyze the query string by using the correct analyzer for that field before executing the search:

{ "match": { "tweet": "About Search" }}

If you use it on a field containing an exact value, such as a number, a date, a Boolean, or a not_analyzed string field, then it will search for that exact value:

{ "match": { "age": 26 }}

{ "match": { "date": "2014-09-01" }}

{ "match": { "public": true }}

{ "match": { "tag": "full_text" }}

TIP

For exact-value searches, you probably want to use a filter instead of a query, as a filter will be cached.

Unlike the query-string search that we showed in “Search Lite, the match query does not use a query syntax like +user_id:2 +tweet:search. It just looks for the words that are specified. This means that it is safe to expose to your users via a search field; you control what fields they can query, and it is not prone to throwing syntax errors.

multi_match Query

The multi_match query allows to run the same match query on multiple fields:

{

"multi_match": {

"query": "full text search",

"fields": [ "title", "body" ]

}

}

bool Query

The bool query, like the bool filter, is used to combine multiple query clauses. However, there are some differences. Remember that while filters give binary yes/no answers, queries calculate a relevance score instead. The bool query combines the _score from each must or shouldclause that matches. This query accepts the following parameters:

must

Clauses that must match for the document to be included.

must_not

Clauses that must not match for the document to be included.

should

If these clauses match, they increase the _score; otherwise, they have no effect. They are simply used to refine the relevance score for each document.

The following query finds documents whose title field matches the query string how to make millions and that are not marked as spam. If any documents are starred or are from 2014 onward, they will rank higher than they would have otherwise. Documents that match bothconditions will rank even higher:

{

"bool": {

"must": { "match": { "title": "how to make millions" }},

"must_not": { "match": { "tag": "spam" }},

"should": [

{ "match": { "tag": "starred" }},

{ "range": { "date": { "gte": "2014-01-01" }}}

]

}

}

TIP

If there are no must clauses, at least one should clause has to match. However, if there is at least one must clause, no should clauses are required to match.

Combining Queries with Filters

Queries can be used in query context, and filters can be used in filter context. Throughout the Elasticsearch API, you will see parameters with query or filter in the name. These expect a single argument containing either a single query or filter clause respectively. In other words, they establish the outer context as query context or filter context.

Compound query clauses can wrap other query clauses, and compound filter clauses can wrap other filter clauses. However, it is often useful to apply a filter to a query or, less frequently, to use a full-text query as a filter.

To do this, there are dedicated query clauses that wrap filter clauses, and vice versa, thus allowing us to switch from one context to another. It is important to choose the correct combination of query and filter clauses to achieve your goal in the most efficient way.

Filtering a Query

Let’s say we have this query:

{ "match": { "email": "business opportunity" }}

We want to combine it with the following term filter, which will match only documents that are in our inbox:

{ "term": { "folder": "inbox" }}

The search API accepts only a single query parameter, so we need to wrap the query and the filter in another query, called the filtered query:

{

"filtered": {

"query": { "match": { "email": "business opportunity" }},

"filter": { "term": { "folder": "inbox" }}

}

}

We can now pass this query to the query parameter of the search API:

GET /_search

{

"query": {

"filtered": {

"query": { "match": { "email": "business opportunity" }},

"filter": { "term": { "folder": "inbox" }}

}

}

}

Just a Filter

While in query context, if you need to use a filter without a query (for instance, to match all emails in the inbox), you can just omit the query:

GET /_search

{

"query": {

"filtered": {

"filter": { "term": { "folder": "inbox" }}

}

}

}

If a query is not specified it defaults to using the match_all query, so the preceding query is equivalent to the following:

GET /_search

{

"query": {

"filtered": {

"query": { "match_all": {}},

"filter": { "term": { "folder": "inbox" }}

}

}

}

A Query as a Filter

Occasionally, you will want to use a query while you are in filter context. This can be achieved with the query filter, which just wraps a query. The following example shows one way we could exclude emails that look like spam:

GET /_search

{

"query": {

"filtered": {

"filter": {

"bool": {

"must": { "term": { "folder": "inbox" }},

"must_not": {

"query": { 1

"match": { "email": "urgent business proposal" }

}

}

}

}

}

}

}

1

Note the query filter, which is allowing us to use the match query inside a bool filter.

NOTE

You seldom need to use a query as a filter, but we have included it for completeness’ sake. The only time you may need it is when you need to use full-text matching while in filter context.

Validating Queries

Queries can become quite complex and, especially when combined with different analyzers and field mappings, can become a bit difficult to follow. The validate-query API can be used to check whether a query is valid.

GET /gb/tweet/_validate/query

{

"query": {

"tweet" : {

"match" : "really powerful"

}

}

}

The response to the preceding validate request tells us that the query is invalid:

{

"valid" : false,

"_shards" : {

"total" : 1,

"successful" : 1,

"failed" : 0

}

}

Understanding Errors

To find out why it is invalid, add the explain parameter to the query string:

GET /gb/tweet/_validate/query?explain 1

{

"query": {

"tweet" : {

"match" : "really powerful"

}

}

}

1

The explain flag provides more information about why a query is invalid.

Apparently, we’ve mixed up the type of query (match) with the name of the field (tweet):

{

"valid" : false,

"_shards" : { ... },

"explanations" : [ {

"index" : "gb",

"valid" : false,

"error" : "org.elasticsearch.index.query.QueryParsingException:

[gb] No query registered for [tweet]"

} ]

}

Understanding Queries

Using the explain parameter has the added advantage of returning a human-readable description of the (valid) query, which can be useful for understanding exactly how your query has been interpreted by Elasticsearch:

GET /_validate/query?explain

{

"query": {

"match" : {

"tweet" : "really powerful"

}

}

}

An explanation is returned for each index that we query, because each index can have different mappings and analyzers:

{

"valid" : true,

"_shards" : { ... },

"explanations" : [ {

"index" : "us",

"valid" : true,

"explanation" : "tweet:really tweet:powerful"

}, {

"index" : "gb",

"valid" : true,

"explanation" : "tweet:realli tweet:power"

} ]

}

From the explanation, you can see how the match query for the query string really powerful has been rewritten as two single-term queries against the tweet field, one for each term.

Also, for the us index, the two terms are really and powerful, while for the gb index, the terms are realli and power. The reason for this is that we changed the tweet field in the gb index to use the english analyzer.