
Elasticsearch: The Definitive Guide (2015)

Part I. Getting Started

Chapter 5. Searching—The Basic Tools

So far, we have learned how to use Elasticsearch as a simple NoSQL-style distributed document store. We can throw JSON documents at Elasticsearch and retrieve each one by ID. But the real power of Elasticsearch lies in its ability to make sense out of chaos — to turn Big Data into Big Information.

This is the reason that we use structured JSON documents, rather than amorphous blobs of data. Elasticsearch not only stores the document, but also indexes the content of the document in order to make it searchable.

Every field in a document is indexed and can be queried. And it’s not just that: during a single query, Elasticsearch can use all of these indices to return results at breathtaking speed. That’s something that you could never consider doing with a traditional database.

A search can be any of the following:

§ A structured query on concrete fields like gender or age, sorted by a field like join_date, similar to the type of query that you could construct in SQL

§ A full-text query, which finds all documents matching the search keywords, and returns them sorted by relevance

§ A combination of the two

While many searches will just work out of the box, to use Elasticsearch to its full potential, you need to understand three subjects:

Mapping

How the data in each field is interpreted

Analysis

How full text is processed to make it searchable

Query DSL

The flexible, powerful query language used by Elasticsearch

Each of these is a big subject in its own right, and we explain them in detail in Part II. The chapters in this section introduce the basic concepts of all three—just enough to help you to get an overall understanding of how search works.

We will start by explaining the search API in its simplest form.

TEST DATA

The documents that we will use for test purposes in this chapter can be found in this gist: https://gist.github.com/clintongormley/8579281.

You can copy the commands and paste them into your shell in order to follow along with this chapter.


The Empty Search

The most basic form of the search API is the empty search, which doesn’t specify any query but simply returns all documents in all indices in the cluster:

GET /_search

The response (edited for brevity) looks something like this:

{
    "hits" : {
        "total" : 14,
        "hits" : [
            {
                "_index": "us",
                "_type": "tweet",
                "_id": "7",
                "_score": 1,
                "_source": {
                    "date": "2014-09-17",
                    "name": "John Smith",
                    "tweet": "The Query DSL is really powerful and flexible",
                    "user_id": 2
                }
            },
            ... 9 RESULTS REMOVED ...
        ],
        "max_score" : 1
    },
    "took" : 4,
    "_shards" : {
        "failed" : 0,
        "successful" : 10,
        "total" : 10
    },
    "timed_out" : false
}

hits

The most important section of the response is hits, which contains the total number of documents that matched our query, and a hits array containing the first 10 of those matching documents—the results.

Each result in the hits array contains the _index, _type, and _id of the document, plus the _source field. This means that the whole document is immediately available to us directly from the search results. This is unlike other search engines, which return just the document ID, requiring you to fetch the document itself in a separate step.

Each element also has a _score. This is the relevance score, which is a measure of how well the document matches the query. By default, results are returned with the most relevant documents first; that is, in descending order of _score. In this case, we didn’t specify any query, so all documents are equally relevant, hence the neutral _score of 1 for all results.

The max_score value is the highest _score of any document that matches our query.
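As a rough illustration of how a client might read the hits section, here is a minimal Python sketch. It assumes the response has already been parsed into a dictionary with the shape shown above; the response literal is a trimmed copy of that example, and summarize_hits is a hypothetical helper name, not part of any Elasticsearch client.

```python
# A trimmed-down copy of the example response, already parsed from JSON.
response = {
    "hits": {
        "total": 14,
        "max_score": 1,
        "hits": [
            {
                "_index": "us",
                "_type": "tweet",
                "_id": "7",
                "_score": 1,
                "_source": {
                    "date": "2014-09-17",
                    "name": "John Smith",
                    "tweet": "The Query DSL is really powerful and flexible",
                    "user_id": 2,
                },
            },
        ],
    },
    "took": 4,
    "_shards": {"failed": 0, "successful": 10, "total": 10},
    "timed_out": False,
}


def summarize_hits(resp):
    """Return (total, rows); each row is (index, type, id, score, source)."""
    hits = resp["hits"]
    rows = [
        (h["_index"], h["_type"], h["_id"], h["_score"], h["_source"])
        for h in hits["hits"]
    ]
    return hits["total"], rows


total, rows = summarize_hits(response)
print(f"{len(rows)} of {total} matching documents returned")
for index, doc_type, doc_id, score, source in rows:
    # The whole document is in _source, so no second fetch is needed.
    print(f"/{index}/{doc_type}/{doc_id} score={score} tweet={source['tweet']!r}")
```

Note that total counts every matching document, while the hits array holds only the page of results that was actually returned.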

took

The took value tells us how many milliseconds the entire search request took to execute.

shards

The _shards element tells us the total number of shards that were involved in the query and, of them, how many were successful and how many failed. We wouldn’t normally expect shards to fail, but it can happen. If we were to suffer a major disaster in which we lost both the primary and the replica copy of the same shard, there would be no copies of that shard available to respond to search requests. In this case, Elasticsearch would report the shard as failed, but continue to return results from the remaining shards.

timeout

The timed_out value tells us whether the query timed out. By default, search requests do not time out. If low response times are more important to you than complete results, you can specify a timeout as 10 or 10ms (10 milliseconds), or 1s (1 second):

GET /_search?timeout=10ms

Elasticsearch will return any results that it has managed to gather from each shard before the request timed out.

WARNING

This timeout does not halt the execution of the query; it merely tells the coordinating node to return the results collected so far and to close the connection. In the background, other shards may still be processing the query even though results have been sent.

Use the timeout because it is important to your SLA, not because you want to abort the execution of long-running queries.
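Both timed_out and the failed count in _shards indicate that the results may be incomplete, so a client may want to check them before trusting a response. The following Python sketch shows one way to do that on an already parsed response; describe_completeness is a hypothetical helper name, and the sample dictionaries are made up for illustration.

```python
def describe_completeness(resp):
    """List the reasons why a search response may be missing results."""
    problems = []
    if resp.get("timed_out"):
        problems.append("request timed out")
    shards = resp.get("_shards", {})
    if shards.get("failed", 0) > 0:
        problems.append(f"{shards['failed']} of {shards['total']} shards failed")
    return problems


# Illustrative responses, reduced to the fields that matter here.
complete = {
    "timed_out": False,
    "_shards": {"failed": 0, "successful": 10, "total": 10},
}
partial = {
    "timed_out": True,
    "_shards": {"failed": 2, "successful": 8, "total": 10},
}

for name, resp in (("complete", complete), ("partial", partial)):
    problems = describe_completeness(resp)
    print(name, "->", problems or "all shards answered in time")
```

An empty list means that every shard responded before any timeout expired.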

Multi-index, Multitype

Did you notice that the results from the preceding empty search contained documents of different types—user and tweet—from two different indices—us and gb?

By not limiting our search to a particular index or type, we have searched across all documents in the cluster. Elasticsearch forwarded the search request in parallel to a primary or replica of every shard in the cluster, gathered the results to select the overall top 10, and returned them to us.

Usually, however, you will want to search within one or more specific indices, and probably one or more specific types. We can do this by specifying the index and type in the URL, as follows:

/_search

Search all types in all indices

/gb/_search

Search all types in the gb index

/gb,us/_search

Search all types in the gb and us indices

/g*,u*/_search

Search all types in any indices beginning with g or beginning with u

/gb/user/_search

Search type user in the gb index

/gb,us/user,tweet/_search

Search types user and tweet in the gb and us indices

/_all/user,tweet/_search

Search types user and tweet in all indices
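The URL patterns above follow a simple rule: an optional comma-separated list of indices, an optional comma-separated list of types, and then _search. A small Python sketch of that rule might look like the following; search_path is a hypothetical helper written for this illustration, and it only builds the path without sending any request.

```python
def search_path(indices=None, types=None):
    """Build a search path such as /gb,us/user,tweet/_search."""
    indices = list(indices or [])
    types = list(types or [])
    parts = []
    if indices:
        parts.append(",".join(indices))
    elif types:
        # A type segment must follow an index segment, so use _all.
        parts.append("_all")
    if types:
        parts.append(",".join(types))
    parts.append("_search")
    return "/" + "/".join(parts)


print(search_path())                                # all types, all indices
print(search_path(["gb"]))                          # all types in gb
print(search_path(["g*", "u*"]))                    # wildcard index names
print(search_path(["gb", "us"], ["user", "tweet"]))
print(search_path(types=["user", "tweet"]))         # falls back to _all
```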

When you search within a single index, Elasticsearch forwards the search request to a primary or replica of every shard in that index, and then gathers the results from each shard. Searching within multiple indices works in exactly the same way—there are just more shards involved.

TIP

Searching one index that has five primary shards is exactly equivalent to searching five indices that have one primary shard each.

Later, you will see how this simple fact makes it easy to scale flexibly as your requirements change.

Pagination

Our preceding empty search told us that 14 documents in the cluster match our (empty) query. But there were only 10 documents in the hits array. How can we see the other documents?

In the same way as SQL uses the LIMIT keyword to return a single “page” of results, Elasticsearch accepts the from and size parameters:

size

Indicates the number of results that should be returned (defaults to 10)

from

Indicates the number of initial results that should be skipped (defaults to 0)

If you wanted to show five results per page, then pages 1 to 3 could be requested as follows:

GET /_search?size=5

GET /_search?size=5&from=5

GET /_search?size=5&from=10
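The relationship between a page number and the from parameter is simple arithmetic: from is the page size multiplied by the number of pages already shown. A minimal Python sketch, assuming 1-based page numbers and a hypothetical helper called page_params:

```python
from urllib.parse import urlencode


def page_params(page, size=10):
    """Return the size/from parameters for a 1-based page number."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be at least 1")
    return {"size": size, "from": (page - 1) * size}


for page in (1, 2, 3):
    # Page 1 yields from=0, which is the default and could be omitted.
    print("GET /_search?" + urlencode(page_params(page, size=5)))
```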

Beware of paging too deep or requesting too many results at once. Results are sorted before being returned. But remember that a search request usually spans multiple shards. Each shard generates its own sorted results, which then need to be sorted centrally to ensure that the overall order is correct.

DEEP PAGING IN DISTRIBUTED SYSTEMS

To understand why deep paging is problematic, let’s imagine that we are searching within a single index with five primary shards. When we request the first page of results (results 1 to 10), each shard produces its own top 10 results and returns them to the requesting node, which then sorts all 50 results in order to select the overall top 10.

Now imagine that we ask for page 1,001 (results 10,001 to 10,010). Everything works in the same way except that each shard has to produce its top 10,010 results. The requesting node then sorts through all 50,050 results and discards 50,040 of them!

You can see that, in a distributed system, the cost of sorting results grows steadily the deeper we page: for every request, the requesting node has to sort number_of_shards * (from + size) results. There is a good reason that web search engines don’t return more than 1,000 results for any query.
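The arithmetic in this sidebar can be reproduced with a toy simulation. The Python sketch below is only a model of the behavior described here, not Elasticsearch's actual implementation: each "shard" is a list of random scores, every shard returns its best from + size entries, and a coordinating function merges them and keeps one page. The names shard_top and coordinate are hypothetical.

```python
import heapq
import random


def shard_top(scores, n):
    """One shard's contribution: its n best scores, best first."""
    return heapq.nlargest(n, scores)


def coordinate(shards, from_, size):
    """Merge per-shard results; return (page, gathered, discarded)."""
    per_shard = from_ + size
    partials = [shard_top(scores, per_shard) for scores in shards]
    gathered = sum(len(p) for p in partials)
    # Each partial list is sorted best-first, so a k-way merge suffices.
    merged = list(heapq.merge(*partials, reverse=True))
    page = merged[from_:from_ + size]
    return page, gathered, gathered - len(page)


random.seed(0)
# Five shards holding 20,000 documents each, with made-up scores.
shards = [[random.random() for _ in range(20_000)] for _ in range(5)]

first_page, first_gathered, first_discarded = coordinate(shards, from_=0, size=10)
page, gathered, discarded = coordinate(shards, from_=10_000, size=10)

print(f"results 1 to 10: gathered {first_gathered}, discarded {first_discarded}")
print(f"results 10,001 to 10,010: gathered {gathered}, discarded {discarded}")
```

For results 10,001 to 10,010 the coordinating function gathers 50,050 entries and discards 50,040 of them, matching the figures in the text, while the first page needs only 50.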

TIP

In “Reindexing Your Data” we explain how you can retrieve large numbers of documents efficiently.

Search Lite

There are two forms of the search API: a “lite” query-string version that expects all its parameters to be passed in the query string, and the full request body version that expects a JSON request body and uses a rich search language called the query DSL.

The query-string search is useful for running ad hoc queries from the command line. For instance, this query finds all documents of type tweet that contain the word elasticsearch in the tweet field:

GET /_all/tweet/_search?q=tweet:elasticsearch

The next query looks for john in the name field and mary in the tweet field. The actual query is just

+name:john +tweet:mary

but the percent encoding needed for query-string parameters makes it appear more cryptic than it really is:

GET /_search?q=%2Bname%3Ajohn+%2Btweet%3Amary

The + prefix indicates conditions that must be satisfied for our query to match. Similarly a - prefix would indicate conditions that must not match. All conditions without a + or - are optional—the more that match, the more relevant the document.
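Encoding such queries by hand is error prone, so it is usually easier to let a library do it. As a small sketch, Python's standard urllib.parse.urlencode produces the same encoded string as the request above; only the variable names are our own.

```python
from urllib.parse import urlencode

# The readable query: both clauses are required because of the + prefix.
query = "+name:john +tweet:mary"

# urlencode percent-encodes + and :, and turns spaces into +.
encoded = urlencode({"q": query})
request_line = "GET /_search?" + encoded

print(request_line)
```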

The _all Field

This simple search returns all documents that contain the word mary:

GET /_search?q=mary

In the previous examples, we searched for words in the tweet or name fields. However, the results from this query mention mary in three fields:

§ A user whose name is Mary

§ Six tweets by Mary

§ One tweet directed at @mary

How has Elasticsearch managed to find results in three different fields?

When you index a document, Elasticsearch takes the string values of all of its fields and concatenates them into one big string, which it indexes as the special _all field. For example, when we index this document:

{
    "tweet": "However did I manage before Elasticsearch?",
    "date": "2014-09-14",
    "name": "Mary Jones",
    "user_id": 1
}

it’s as if we had added an extra field called _all with this value:

"However did I manage before Elasticsearch? 2014-09-14 Mary Jones 1"

The query-string search uses the _all field unless another field name has been specified.
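To make the idea concrete, the following Python sketch mimics the concatenation described above for a flat document. It is purely conceptual: simulated_all_field is a hypothetical function, it ignores nested objects and arrays, and it says nothing about how Elasticsearch builds or analyzes the real _all field internally.

```python
def simulated_all_field(doc):
    """Join the string form of every field value, in field order."""
    return " ".join(str(value) for value in doc.values())


doc = {
    "tweet": "However did I manage before Elasticsearch?",
    "date": "2014-09-14",
    "name": "Mary Jones",
    "user_id": 1,
}

all_value = simulated_all_field(doc)
print(all_value)

# A search for "mary" with no field name behaves as if it looked here.
print("mary" in all_value.lower())
```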

TIP

The _all field is a useful feature while you are getting started with a new application. Later, you will find that you have more control over your search results if you query specific fields instead of the _all field. When the _all field is no longer useful to you, you can disable it, as explained in “Metadata: _all Field”.

More Complicated Queries

The next query searches for tweets, using the following criteria:

§ The name field contains mary or john

§ The date is greater than 2014-09-10

§ The _all field contains either of the words aggregations or geo

+name:(mary john) +date:>2014-09-10 +(aggregations geo)

As a properly encoded query string, this looks like the slightly less readable result:

?q=%2Bname%3A(mary+john)+%2Bdate%3A%3E2014-09-10+%2B(aggregations+geo)
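When you come across an encoded query string like this one in a log or a script, decoding it back into its readable form is often the quickest way to understand it. A minimal sketch using Python's standard urllib.parse.parse_qs (the variable names are our own):

```python
from urllib.parse import parse_qs

# The encoded query string from above, without the leading "?".
encoded = "q=%2Bname%3A(mary+john)+%2Bdate%3A%3E2014-09-10+%2B(aggregations+geo)"

# parse_qs maps each parameter name to a list of decoded values.
decoded = parse_qs(encoded)["q"][0]

print(decoded)
```

The decoded value is the readable query shown before the encoded version.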

As you can see from the preceding examples, this lite query-string search is surprisingly powerful. Its query syntax, which is explained in detail in the Query String Syntax reference docs, allows us to express quite complex queries succinctly. This makes it great for throwaway queries from the command line or during development.

However, you can also see that its terseness can make it cryptic and difficult to debug. It is also fragile: a slight syntax error in the query string, such as a misplaced -, :, /, or ", will return an error instead of results.

Finally, the query-string search allows any user to run potentially slow, heavy queries on any field in your index, possibly exposing private information or even bringing your cluster to its knees!

TIP

For these reasons, we don’t recommend exposing query-string searches directly to your users, unless they are power users who can be trusted with your data and with your cluster.

Instead, in production we usually rely on the full-featured request body search API, which does all of this, plus a lot more. Before we get there, though, we first need to take a look at how our data is indexed in Elasticsearch.