Elasticsearch: The Definitive Guide (2015)
Part VI. Modeling Your Data
Chapter 41. Nested Objects
Because creating, deleting, and updating a single document in Elasticsearch is atomic, it makes sense to store closely related entities within the same document. For instance, we could store an order and all of its order lines in one document, or we could store a blog post and all of its comments together, by passing an array of comments:
PUT /my_index/blogpost/1
{
"title": "Nest eggs",
"body": "Making your money work...",
"tags": [ "cash", "shares" ],
"comments": [
{
"name": "John Smith",
"comment": "Great article",
"age": 28,
"stars": 4,
"date": "2014-09-01"
},
{
"name": "Alice White",
"comment": "More like this please",
"age": 31,
"stars": 5,
"date": "2014-10-22"
}
]
}
If we rely on dynamic mapping, the comments field will be autocreated as an object field.
Because all of the content is in the same document, there is no need to join blog posts and comments at query time, so searches perform well.
The problem is that the preceding document would match a query like this:
GET /_search
{
"query": {
"bool": {
"must": [
{ "match": { "comments.name": "Alice" }},
{ "match": { "comments.age": 28 }}
]
}
}
}
Alice is 31, not 28!
The reason for this cross-object matching, as discussed in “Arrays of Inner Objects”, is that our beautifully structured JSON document is flattened into a simple key-value format in the index that looks like this:
{
"title": [ eggs, nest ],
"body": [ making, money, work, your ],
"tags": [ cash, shares ],
"comments.name": [ alice, john, smith, white ],
"comments.comment": [ article, great, like, more, please, this ],
"comments.age": [ 28, 31 ],
"comments.stars": [ 4, 5 ],
"comments.date": [ 2014-09-01, 2014-10-22 ]
}
The correlation between Alice and 31, or between John and 2014-09-01, has been irretrievably lost. While fields of type object (see “Multilevel Objects”) are useful for storing a single object, they are useless, from a search point of view, for storing an array of objects.
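The flattening described above can be imitated in a few lines of Python. This is only a rough sketch and not how Lucene actually stores data: the flatten helper is hypothetical, and lowercasing plus whitespace splitting is a crude stand-in for real analysis. It shows why a search for a 28-year-old Alice succeeds against the flattened fields:

```python
from collections import defaultdict

# Two comments from the example blog post (only the fields we query on).
comments = [
    {"name": "John Smith", "age": 28},
    {"name": "Alice White", "age": 31},
]

def flatten(objects, prefix):
    """Merge the values of every object into one multi-valued field per key."""
    flat = defaultdict(set)
    for obj in objects:
        for key, value in obj.items():
            # Crude stand-in for analysis: lowercase strings and split on whitespace.
            terms = value.lower().split() if isinstance(value, str) else [value]
            flat[f"{prefix}.{key}"].update(terms)
    return dict(flat)

flat = flatten(comments, "comments")

# Each clause is checked against the merged field, so both succeed even
# though no single comment was written by a 28-year-old Alice.
cross_match = "alice" in flat["comments.name"] and 28 in flat["comments.age"]
print(cross_match)  # True
```

Once the values are merged into per-field sets, nothing records which name belonged with which age.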
This is the problem that nested objects are designed to solve. By mapping the comments field as type nested instead of type object, each nested object is indexed as a hidden separate document, something like this:
{
"comments.name": [ john, smith ],
"comments.comment": [ article, great ],
"comments.age": [ 28 ],
"comments.stars": [ 4 ],
"comments.date": [ 2014-09-01 ]
}
{
"comments.name": [ alice, white ],
"comments.comment": [ like, more, please, this ],
"comments.age": [ 31 ],
"comments.stars": [ 5 ],
"comments.date": [ 2014-10-22 ]
}
{
"title": [ eggs, nest ],
"body": [ making, money, work, your ],
"tags": [ cash, shares ]
}
The first two documents are the first and second nested objects. The third is the root or parent document.
By indexing each nested object separately, the fields within the object maintain their relationships. We can run queries that will match only if the match occurs within the same nested object.
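To illustrate the difference, here is a minimal Python sketch of same-object matching. The helper names matches and nested_match are hypothetical, and the term matching is a simplification of what a match query does. Every clause must hold within a single nested document before the root document counts as a match:

```python
# The same two comments, now treated as separate hidden documents.
comments = [
    {"name": "John Smith", "age": 28},
    {"name": "Alice White", "age": 31},
]

def matches(nested_doc, name_term, age):
    """True only if both clauses hold for this one nested document."""
    return name_term in nested_doc["name"].lower().split() and nested_doc["age"] == age

def nested_match(nested_docs, name_term, age):
    """The root document matches if any single nested document satisfies every clause."""
    return any(matches(doc, name_term, age) for doc in nested_docs)

alice_28 = nested_match(comments, "alice", 28)
john_28 = nested_match(comments, "john", 28)
print(alice_28, john_28)  # False True
```

Alice and 28 no longer match, because the two values live in different nested documents.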
Not only that, because of the way that nested objects are indexed, joining the nested documents to the root document at query time is fast—almost as fast as if they were a single document.
These extra nested documents are hidden; we can’t access them directly. To update, add, or remove a nested object, we have to reindex the whole document. Note also that the result returned by a search request is not the nested object alone; it is the whole document.
Nested Object Mapping
Setting up a nested field is simple—where you would normally specify type object, make it type nested instead:
PUT /my_index
{
"mappings": {
"blogpost": {
"properties": {
"comments": {
"type": "nested",
"properties": {
"name": { "type": "string" },
"comment": { "type": "string" },
"age": { "type": "short" },
"stars": { "type": "short" },
"date": { "type": "date" }
}
}
}
}
}
}
A nested field accepts the same parameters as a field of type object.
That’s all that is required. Any comments objects would now be indexed as separate nested documents. See the nested type reference docs for more.
Querying a Nested Object
Because nested objects are indexed as separate hidden documents, we can’t query them directly. Instead, we have to use the nested query or nested filter to access them:
GET /my_index/blogpost/_search
{
"query": {
"bool": {
"must": [
{ "match": { "title": "eggs" }},
{
"nested": {
"path": "comments",
"query": {
"bool": {
"must": [
{ "match": { "comments.name": "john" }},
{ "match": { "comments.age": 28 }}
]
}}}}
]
}}}
The title clause operates on the root document.
The nested clause “steps down” into the nested comments field. It no longer has access to fields in the root document, nor fields in any other nested document.
The comments.name and comments.age clauses operate on the same nested document.
TIP
A nested field can contain other nested fields. Similarly, a nested query can contain other nested queries. The nesting hierarchy is applied as you would expect.
Of course, a nested query could match several nested documents. Each matching nested document would have its own relevance score, but these multiple scores need to be reduced to a single score that can be applied to the root document.
By default, the nested query averages the scores of the matching nested documents. This can be controlled by setting the score_mode parameter to avg, max, sum, or even none (in which case the root document gets a constant score of 1.0).
GET /my_index/blogpost/_search
{
"query": {
"bool": {
"must": [
{ "match": { "title": "eggs" }},
{
"nested": {
"path": "comments",
"score_mode": "max",
"query": {
"bool": {
"must": [
{ "match": { "comments.name": "john" }},
{ "match": { "comments.age": 28 }}
]
}}}}
]
}}}
Give the root document the _score from the best-matching nested document.
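The way score_mode collapses several nested scores into one root score can be sketched in Python as follows. The reduce_scores function and the sample scores are hypothetical and only mirror the four modes described above; real relevance scores would come from the query itself:

```python
def reduce_scores(scores, score_mode="avg"):
    """Collapse the scores of the matching nested documents into one root score."""
    if not scores:
        raise ValueError("at least one nested document must match")
    if score_mode == "avg":
        return sum(scores) / len(scores)
    if score_mode == "max":
        return max(scores)
    if score_mode == "sum":
        return sum(scores)
    if score_mode == "none":
        return 1.0  # constant score for the root document, as described in the text
    raise ValueError(f"unknown score_mode: {score_mode}")

# Made-up scores for three matching nested documents.
nested_scores = [1.0, 2.0, 3.0]
for mode in ("avg", "max", "sum", "none"):
    print(mode, reduce_scores(nested_scores, mode))
```

With these sample scores, avg gives 2.0, max gives 3.0, sum gives 6.0, and none gives 1.0.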
NOTE
A nested filter behaves much like a nested query, except that it doesn’t accept the score_mode parameter. It can be used only in filter context—such as inside a filtered query—and it behaves like any other filter: it includes or excludes, but it doesn’t score.
While the results of the nested filter itself are not cached, the usual caching rules apply to the filter inside the nested filter.
Sorting by Nested Fields
It is possible to sort by the value of a nested field, even though the value exists in a separate nested document. To make the result more interesting, we will add another record:
PUT /my_index/blogpost/2
{
"title": "Investment secrets",
"body": "What they don't tell you ...",
"tags": [ "shares", "equities" ],
"comments": [
{
"name": "Mary Brown",
"comment": "Lies, lies, lies",
"age": 42,
"stars": 1,
"date": "2014-10-18"
},
{
"name": "John Smith",
"comment": "You're making it up!",
"age": 28,
"stars": 2,
"date": "2014-10-16"
}
]
}
Imagine that we want to retrieve blog posts that received comments in October, ordered by the lowest number of stars that each blog post received. The search request would look like this:
GET /_search
{
"query": {
"nested": {
"path": "comments",
"filter": {
"range": {
"comments.date": {
"gte": "2014-10-01",
"lt": "2014-11-01"
}
}
}
}
},
"sort": {
"comments.stars": {
"order": "asc",
"mode": "min",
"nested_filter": {
"range": {
"comments.date": {
"gte": "2014-10-01",
"lt": "2014-11-01"
}
}
}
}
}
}
The nested query limits the results to blog posts that received a comment in October.
Results are sorted in ascending (asc) order by the lowest value (min) in the comments.stars field in any matching comments.
The nested_filter in the sort clause is the same as the nested query in the main query clause. The reason is explained next.
Why do we need to repeat the query conditions in the nested_filter? The reason is that sorting happens after the query has been executed. The query matches blog posts that received comments in October, but it returns blog post documents as the result. If we didn’t include the nested_filter clause, we would end up sorting based on any comments that the blog post has ever received, not just those received in October.
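The effect of the nested_filter on the sort value can be checked by hand with a small Python sketch. It uses the dates and stars from the two example blog posts; the sort_value and in_october helpers are hypothetical and only imitate the min sort mode:

```python
# (date, stars) for each comment, keyed by blog post ID.
posts = {
    1: [("2014-09-01", 4), ("2014-10-22", 5)],
    2: [("2014-10-18", 1), ("2014-10-16", 2)],
}

def in_october(date):
    # ISO-formatted dates compare correctly as plain strings.
    return "2014-10-01" <= date < "2014-11-01"

def sort_value(comments, nested_filter=None):
    """Lowest stars among the comments that pass the optional filter."""
    stars = [s for date, s in comments if nested_filter is None or nested_filter(date)]
    return min(stars)

with_filter = {post_id: sort_value(c, in_october) for post_id, c in posts.items()}
without_filter = {post_id: sort_value(c) for post_id, c in posts.items()}
print(with_filter)     # {1: 5, 2: 1}
print(without_filter)  # {1: 4, 2: 1}
```

Without the filter, blog post 1 would be sorted on the 4 stars from its September comment rather than the 5 stars from its only October comment.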
Nested Aggregations
In the same way as we need to use the special nested query to gain access to nested objects at search time, the dedicated nested aggregation allows us to aggregate fields in nested objects:
GET /my_index/blogpost/_search?search_type=count
{
"aggs": {
"comments": {
"nested": {
"path": "comments"
},
"aggs": {
"by_month": {
"date_histogram": {
"field": "comments.date",
"interval": "month",
"format": "yyyy-MM"
},
"aggs": {
"avg_stars": {
"avg": {
"field": "comments.stars"
}
}
}
}
}
}
}
}
The nested aggregation “steps down” into the nested comments object.
Comments are bucketed into months based on the comments.date field.
The average number of stars is calculated for each bucket.
The results show that aggregation has happened at the nested document level:
...
"aggregations": {
"comments": {
"doc_count": 4,
"by_month": {
"buckets": [
{
"key_as_string": "2014-09",
"key": 1409529600000,
"doc_count": 1,
"avg_stars": {
"value": 4
}
},
{
"key_as_string": "2014-10",
"key": 1412121600000,
"doc_count": 3,
"avg_stars": {
"value": 2.6666666666666665
}
}
]
}
}
}
...
There are a total of four comments: one in September and three in October.
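The bucket counts and averages above can be reproduced with a short Python sketch over the four example comments. It is a plain re-computation that assumes bucketing by the yyyy-MM prefix of each date, not a description of how the aggregation framework works internally:

```python
from collections import defaultdict

# (date, stars) for all four comments across both blog posts.
comments = [
    ("2014-09-01", 4),
    ("2014-10-22", 5),
    ("2014-10-18", 1),
    ("2014-10-16", 2),
]

# Bucket each comment by month, using the yyyy-MM prefix of its date.
buckets = defaultdict(list)
for date, stars in comments:
    buckets[date[:7]].append(stars)

result = {
    month: {"doc_count": len(stars), "avg_stars": sum(stars) / len(stars)}
    for month, stars in sorted(buckets.items())
}
print(result)
```

The doc_count values count comments, not blog posts, which is what we expect from an aggregation running at the nested document level.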
reverse_nested Aggregation
A nested aggregation can access only the fields within the nested document. It can’t see fields in the root document or in a different nested document. However, we can step out of the nested scope back into the parent with a reverse_nested aggregation.
For instance, we can find out which tags our commenters are interested in, based on the age of the commenter. The comments.age field is a nested field, while the tags are in the root document:
GET /my_index/blogpost/_search?search_type=count
{
"aggs": {
"comments": {
"nested": {
"path": "comments"
},
"aggs": {
"age_group": {
"histogram": {
"field": "comments.age",
"interval": 10
},
"aggs": {
"blogposts": {
"reverse_nested": {},
"aggs": {
"tags": {
"terms": {
"field": "tags"
}
}
}
}
}
}
}
}
}
}
The nested agg steps down into the comments object.
The histogram agg groups on the comments.age field, in buckets of 10 years.
The reverse_nested agg steps back up to the root document.
The terms agg counts popular terms per age group of the commenter.
The abbreviated results show us the following:
...
"aggregations": {
"comments": {
"doc_count": 4,
"age_group": {
"buckets": [
{
"key": 20,
"doc_count": 2,
"blogposts": {
"doc_count": 2,
"tags": {
"doc_count_error_upper_bound": 0,
"buckets": [
{ "key": "shares", "doc_count": 2 },
{ "key": "cash", "doc_count": 1 },
{ "key": "equities", "doc_count": 1 }
]
}
}
},
...
There are four comments.
There are two comments by commenters between the ages of 20 and 30.
Two blog posts are associated with those comments.
The popular tags in those blog posts are shares, cash, and equities.
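The same numbers can be derived by hand with a Python sketch that imitates stepping down into the comments and back up to their parent posts. The data comes from the two example blog posts; the groups structure and the tag_counts helper are hypothetical:

```python
from collections import Counter, defaultdict

# Tags live on the root document; ages live on the nested comments.
posts = {
    1: {"tags": ["cash", "shares"], "ages": [28, 31]},
    2: {"tags": ["shares", "equities"], "ages": [42, 28]},
}

# "Step down": bucket each comment by decade, remembering its parent post.
groups = defaultdict(lambda: {"comments": 0, "posts": set()})
for post_id, post in posts.items():
    for age in post["ages"]:
        bucket = age // 10 * 10
        groups[bucket]["comments"] += 1
        groups[bucket]["posts"].add(post_id)

def tag_counts(post_ids):
    """'Step back up': count tags once per distinct parent post in the bucket."""
    return Counter(tag for post_id in post_ids for tag in posts[post_id]["tags"])

twenties = groups[20]
print(twenties["comments"], len(twenties["posts"]), tag_counts(twenties["posts"]))
```

Because each parent post is stored once in a set, its tags are counted once per post no matter how many of its comments fall into the bucket.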
When to Use Nested Objects
Nested objects are useful when there is one main entity, like our blogpost, with a limited number of closely related but less important entities, such as comments. It is useful to be able to find blog posts based on the content of the comments, and the nested query and filter provide for fast query-time joins.
The disadvantages of the nested model are as follows:
§ To add, change, or delete a nested document, the whole document must be reindexed. This becomes more costly the more nested documents there are.
§ Search requests return the whole document, not just the matching nested documents. Although there are plans afoot to support returning the best-matching nested documents with the root document, this is not yet supported.
Sometimes you need a complete separation between the main document and its associated entities. This separation is provided by the parent-child relationship.