Elasticsearch: The Definitive Guide (2015)

Part I. Getting Started

Chapter 8. Sorting and Relevance

By default, results are returned sorted by relevance—with the most relevant docs first. Later in this chapter, we explain what we mean by relevance and how it is calculated, but let’s start by looking at the sort parameter and how to use it.

Sorting

In order to sort by relevance, we need to represent relevance as a value. In Elasticsearch, the relevance score is represented by the floating-point number returned in the search results as the _score, so the default sort order is _score descending.

Sometimes, though, you don’t have a meaningful relevance score. For instance, the following query just returns all tweets whose user_id field has the value 1:

GET /_search
{
    "query" : {
        "filtered" : {
            "filter" : {
                "term" : {
                    "user_id" : 1
                }
            }
        }
    }
}

Filters have no bearing on _score, and the missing-but-implied match_all query just sets the _score to a neutral value of 1 for all documents. In other words, all documents are considered to be equally relevant.
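
To make this explicit, the preceding request is equivalent to the following, with the implied match_all query spelled out:

GET /_search
{
    "query" : {
        "filtered" : {
            "query" :  { "match_all" : {}},
            "filter" : { "term" : { "user_id" : 1 }}
        }
    }
}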

Sorting by Field Values

In this case, it probably makes sense to sort tweets by recency, with the most recent tweets first. We can do this with the sort parameter:

GET /_search
{
    "query" : {
        "filtered" : {
            "filter" : { "term" : { "user_id" : 1 }}
        }
    },
    "sort": { "date": { "order": "desc" }}
}

You will notice two differences in the results:

"hits" : {

"total" : 6,

"max_score" : null, 1

"hits" : [ {

"_index" : "us",

"_type" : "tweet",

"_id" : "14",

"_score" : null, 1

"_source" : {

"date": "2014-09-24",

...

},

"sort" : [ 1411516800000 ] 2

},

...

}

(1) The _score is not calculated, because it is not being used for sorting.

(2) The value of the date field, expressed as milliseconds since the epoch, is returned in the sort values.

The first is that we have a new element in each result called sort, which contains the value or values that were used for sorting. In this case, we sorted on date, which internally is indexed as milliseconds since the epoch. The long number 1411516800000 is equivalent to the date string 2014-09-24 00:00:00 UTC.

The second is that the _score and max_score are both null. Calculating the _score can be quite expensive, and usually its only purpose is for sorting; we’re not sorting by relevance, so it doesn’t make sense to keep track of the _score. If you want the _score to be calculated regardless, you can set the track_scores parameter to true.
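
For instance, to keep the date sort from the preceding request but still have the _score calculated, you could write (a small sketch built on the same query):

GET /_search
{
    "query" : {
        "filtered" : {
            "filter" : { "term" : { "user_id" : 1 }}
        }
    },
    "track_scores" : true,
    "sort": { "date": { "order": "desc" }}
}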

TIP

As a shortcut, you can specify just the name of the field to sort on:

"sort": "number_of_children"

Fields will be sorted in ascending order by default, and the _score value in descending order.

Multilevel Sorting

Perhaps we want to combine the _score from a query with the date, and show all matching results sorted first by date, then by relevance:

GET /_search
{
    "query" : {
        "filtered" : {
            "query":  { "match": { "tweet": "manage text search" }},
            "filter": { "term": { "user_id": 2 }}
        }
    },
    "sort": [
        { "date":   { "order": "desc" }},
        { "_score": { "order": "desc" }}
    ]
}

Order is important. Results are sorted by the first criterion first. Only results whose first sort value is identical will then be sorted by the second criterion, and so on.

Multilevel sorting doesn’t have to involve the _score. You could sort on several different fields, on geo-distance, or on a custom value calculated in a script.
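
As a sketch of what a geo-distance sort might look like, assuming documents with a geo_point field named location (which our tweet data does not have):

GET /_search
{
    "query" : { "match_all" : {}},
    "sort" : [
        {
            "_geo_distance" : {
                "location" : { "lat" : 40.715, "lon" : -74.011 },
                "order" : "asc",
                "unit" : "km"
            }
        }
    ]
}

Results would then be ordered nearest first, with the computed distance returned in each hit’s sort values.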

NOTE

Query-string search also supports custom sorting, using the sort parameter in the query string:

GET /_search?sort=date:desc&sort=_score&q=search

Sorting on Multivalue Fields

When sorting on fields with more than one value, remember that the values do not have any intrinsic order; a multivalue field is just a bag of values. Which one do you choose to sort on?

For numbers and dates, you can reduce a multivalue field to a single value by using the min, max, avg, or sum sort modes. For instance, you could sort on the earliest date in each dates field by using the following:

"sort": {

"dates": {

"order": "asc",

"mode": "min"

}

}
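
Similarly, to put the document with the most recent date first, you could sort descending on the latest value in each dates field:

"sort": {
    "dates": {
        "order": "desc",
        "mode":  "max"
    }
}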

String Sorting and Multifields

Analyzed string fields are also multivalue fields, but sorting on them seldom gives you the results you want. If you analyze a string like fine old art, it results in three terms. We probably want to sort alphabetically on the first term, then the second term, and so forth, but Elasticsearch doesn’t have this information at its disposal at sort time.

You could use the min or max sort mode (min is the default), but that would result in sorting on either art or old, neither of which was the intent.

In order to sort on a string field, that field should contain one term only: the whole not_analyzed string. But of course we still need the field to be analyzed in order to be able to query it as full text.

The naive approach to indexing the same string in two ways would be to include two separate fields in the document: one that is analyzed for searching, and one that is not_analyzed for sorting.

But storing the same string twice in the _source field is a waste of space. What we really want to do is to pass in a single field but to index it in two different ways. All of the core field types (strings, numbers, Booleans, dates) accept a fields parameter that allows you to transform a simple mapping like

"tweet": {

"type": "string",

"analyzer": "english"

}

into a multifield mapping like this:

"tweet": { 1

"type": "string",

"analyzer": "english",

"fields": {

"raw": { 2

"type": "string",

"index": "not_analyzed"

}

}

}

(1) The main tweet field is just the same as before: an analyzed full-text field.

(2) The new tweet.raw subfield is not_analyzed.
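
One way to put this mapping in place is at index-creation time; the index name my_index below is just a placeholder for illustration:

PUT /my_index
{
    "mappings": {
        "tweet": {
            "properties": {
                "tweet": {
                    "type":     "string",
                    "analyzer": "english",
                    "fields": {
                        "raw": {
                            "type":  "string",
                            "index": "not_analyzed"
                        }
                    }
                }
            }
        }
    }
}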

Now, or at least as soon as we have reindexed our data, we can use the tweet field for search and the tweet.raw field for sorting:

GET /_search
{
    "query": {
        "match": {
            "tweet": "elasticsearch"
        }
    },
    "sort": "tweet.raw"
}

WARNING

Sorting on a full-text analyzed field can use a lot of memory. See “Fielddata” for more information.

What Is Relevance?

We’ve mentioned that, by default, results are returned in descending order of relevance. But what is relevance? How is it calculated?

The relevance score of each document is represented by a positive floating-point number called the _score. The higher the _score, the more relevant the document.

A query clause generates a _score for each document. How that score is calculated depends on the type of query clause. Different query clauses are used for different purposes: a fuzzy query might determine the _score by calculating how similar the spelling of the found word is to the original search term; a terms query would incorporate the percentage of terms that were found. However, what we usually mean by relevance is the algorithm that we use to calculate how similar the contents of a full-text field are to a full-text query string.

The standard similarity algorithm used in Elasticsearch is known as term frequency/inverse document frequency, or TF/IDF, which takes the following factors into account:

Term frequency

How often does the term appear in the field? The more often, the more relevant. A field containing five mentions of the same term is more likely to be relevant than a field containing just one mention.

Inverse document frequency

How often does each term appear in the index? The more often, the less relevant. Terms that appear in many documents have a lower weight than more-uncommon terms.

Field-length norm

How long is the field? The longer it is, the less likely it is that words in the field will be relevant. A term appearing in a short title field carries more weight than the same term appearing in a long content field.
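
For a single term in a single field, these three factors are combined by straightforward multiplication; the same product appears in the explain output later in this chapter:

weight(term in doc) = tf(term in doc) × idf(term) × fieldNorm(field)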

Individual queries may combine the TF/IDF score with other factors such as the term proximity in phrase queries, or term similarity in fuzzy queries.

Relevance is not just about full-text search, though. It can equally be applied to yes/no clauses, where the more clauses that match, the higher the _score.

When multiple query clauses are combined using a compound query like the bool query, the _score from each of these query clauses is combined to calculate the overall _score for the document.
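
For instance, in a bool query like this sketch, each clause that matches contributes its own _score, and those scores are combined into the document’s overall _score:

GET /_search
{
    "query": {
        "bool": {
            "must":   { "match": { "tweet": "full text search" }},
            "should": [
                { "match": { "tweet": "elasticsearch" }},
                { "match": { "tweet": "lucene" }}
            ]
        }
    }
}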

TIP

We have a whole chapter dedicated to relevance calculations and how to bend them to your will: Chapter 17.

Understanding the Score

When debugging a complex query, it can be difficult to understand exactly how a _score has been calculated. Elasticsearch has the option of producing an explanation with every search result, by setting the explain parameter to true.

GET /_search?explain    (1)
{
    "query" : { "match" : { "tweet" : "honeymoon" }}
}

(1) The explain parameter adds an explanation of how the _score was calculated to every result.

NOTE

Adding explain produces a lot of output for every hit, which can look overwhelming, but it is worth taking the time to understand what it all means. Don’t worry if it doesn’t all make sense now; you can refer to this section when you need it. We’ll work through the output for one hit bit by bit.

First, we have the metadata that is returned on normal search requests:

{
    "_index" :  "us",
    "_type" :   "tweet",
    "_id" :     "12",
    "_score" :  0.076713204,
    "_source" : { ... trimmed ... },

It adds information about the shard and the node that the document came from, which is useful to know because term and document frequencies are calculated per shard, rather than per index:

"_shard" : 1,

"_node" : "mzIVYCsqSWCG_M_ZffSs9Q",

Then it provides the _explanation. Each entry contains a description that tells you what type of calculation is being performed, a value that gives you the result of the calculation, and the details of any subcalculations that were required:

"_explanation": { 1

"description": "weight(tweet:honeymoon in 0)

[PerFieldSimilarity], result of:",

"value": 0.076713204,

"details": [

{

"description": "fieldWeight in 0, product of:",

"value": 0.076713204,

"details": [

{ 2

"description": "tf(freq=1.0), with freq of:",

"value": 1,

"details": [

{

"description": "termFreq=1.0",

"value": 1

}

]

},

{ 3

"description": "idf(docFreq=1, maxDocs=1)",

"value": 0.30685282

},

{ 4

"description": "fieldNorm(doc=0)",

"value": 0.25,

}

]

}

]

}

(1) Summary of the score calculation for honeymoon

(2) Term frequency

(3) Inverse document frequency

(4) Field-length norm

WARNING

Producing the explain output is expensive. It is a debugging tool only. Don’t leave it turned on in production.

The first part is the summary of the calculation. It tells us that it has calculated the weight—the TF/IDF—of the term honeymoon in the field tweet, for document 0. (This is an internal document ID and, for our purposes, can be ignored.)

It then provides details of how the weight was calculated:

Term frequency

How many times did the term honeymoon appear in the tweet field in this document?

Inverse document frequency

How many times did the term honeymoon appear in the tweet field of all documents in the index?

Field-length norm

How long is the tweet field in this document? The longer the field, the smaller this number.
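
As a quick check, the numbers in the preceding explanation multiply out to the final _score. Under Lucene’s default TF/IDF similarity, tf = √freq and idf = 1 + ln(maxDocs / (docFreq + 1)):

tf        = √1.0                     = 1.0
idf       = 1 + ln(1 / (1 + 1))      ≈ 0.30685282
fieldNorm                            = 0.25
_score    = 1.0 × 0.30685282 × 0.25  ≈ 0.076713204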

Explanations for more-complicated queries can appear to be very complex, but really they just contain more of the same calculations that appear in the preceding example. This information can be invaluable for debugging why search results appear in the order that they do.

TIP

The output from explain can be difficult to read in JSON, but it is easier when it is formatted as YAML. Just add format=yaml to the query string.
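
For example:

GET /_search?explain&format=yaml
{
    "query" : { "match" : { "tweet" : "honeymoon" }}
}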

Understanding Why a Document Matched

While the explain option adds an explanation for every result, you can use the explain API to understand why one particular document matched or, more important, why it didn’t match.

The path for the request is /index/type/id/_explain, as in the following:

GET /us/tweet/12/_explain
{
    "query" : {
        "filtered" : {
            "filter" : { "term" : { "user_id" : 2 }},
            "query" :  { "match" : { "tweet" : "honeymoon" }}
        }
    }
}

Along with the full explanation that we saw previously, we also now have a description element, which tells us this:

"failure to match filter: cache(user_id:[2 TO 2])"

In other words, our user_id filter clause is preventing the document from matching.

Fielddata

Our final topic in this chapter is about an internal aspect of Elasticsearch. While we don’t demonstrate any new techniques here, fielddata is an important topic that we will refer to repeatedly, and is something that you should be aware of.

When you sort on a field, Elasticsearch needs access to the value of that field for every document that matches the query. The inverted index, which performs very well when searching, is not the ideal structure for sorting on field values:

- When searching, we need to be able to map a term to a list of documents.

- When sorting, we need to map a document to its terms. In other words, we need to “uninvert” the inverted index, as sketched below.
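
A toy illustration of the two orientations, for two documents containing the terms shown:

Inverted index (term → docs)        Fielddata (doc → terms)
brown → [1]                         1 → [brown, fox]
fox   → [1, 2]                      2 → [fox, quick]
quick → [2]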

To make sorting efficient, Elasticsearch loads all the values for the field that you want to sort on into memory. This is referred to as fielddata.

WARNING

Elasticsearch doesn’t just load the values for the documents that matched a particular query. It loads the values from every document in your index, regardless of the document type.

The reason that Elasticsearch loads all values into memory is that uninverting the index from disk is slow. Even though you may need the values for only a few docs for the current request, you will probably need access to the values for other docs on the next request, so it makes sense to load all the values into memory at once, and to keep them there.

Fielddata is used in several places in Elasticsearch:

- Sorting on a field

- Aggregations on a field

- Certain filters (for example, geolocation filters)

- Scripts that refer to fields

Clearly, this can consume a lot of memory, especially for high-cardinality string fields—string fields that have many unique values—like the body of an email. Fortunately, insufficient memory is a problem that can be solved by horizontal scaling, by adding more nodes to your cluster.

For now, all you need to know is what fielddata is, and to be aware that it can be memory hungry. Later, we will show you how to determine the amount of memory that fielddata is using, how to limit the amount of memory that is available to it, and how to preload fielddata to improve the user experience.