Elasticsearch: The Definitive Guide (2015)

Part VI. Modeling Your Data

Chapter 42. Parent-Child Relationship

The parent-child relationship is similar in nature to the nested model: both allow you to associate one entity with another. The difference is that, with nested objects, all entities live within the same document while, with parent-child, the parent and children are completely separate documents.

The parent-child functionality allows you to associate one document type with another, in a one-to-many relationship—one parent to many children. The advantages that parent-child has over nested objects are as follows:

§ The parent document can be updated without reindexing the children.

§ Child documents can be added, changed, or deleted without affecting either the parent or other children. This is especially useful when child documents are large in number and need to be added or changed frequently.

§ Child documents can be returned as the results of a search request.

Elasticsearch maintains a map of which parents are associated with which children. It is thanks to this map that query-time joins are fast, but it does place a limitation on the parent-child relationship: the parent document and all of its children must live on the same shard.

NOTE

At the time of going to press, the parent-child ID map is held in memory as part of fielddata. There are plans afoot to use doc values by default instead.

Parent-Child Mapping

All that is needed in order to establish the parent-child relationship is to specify which document type should be the parent of a child type. This must be done at index creation time, or with the update-mapping API before the child type has been created.

As an example, let’s say that we have a company that has branches in many cities. We would like to associate employees with the branch where they work. We need to be able to search for branches, individual employees, and employees who work for particular branches, so the nested model will not help. We could, of course, use application-side-joins or data denormalization here instead, but for demonstration purposes we will use parent-child.

All that we have to do is to tell Elasticsearch that the employee type has the branch document type as its _parent, which we can do when we create the index:

PUT /company
{
  "mappings": {
    "branch": {},
    "employee": {
      "_parent": {
        "type": "branch" <1>
      }
    }
  }
}

<1> Documents of type employee are children of type branch.

Indexing Parents and Children

Indexing parent documents is no different from any other document. Parents don’t need to know anything about their children:

POST /company/branch/_bulk
{ "index": { "_id": "london" }}
{ "name": "London Westminster", "city": "London", "country": "UK" }
{ "index": { "_id": "liverpool" }}
{ "name": "Liverpool Central", "city": "Liverpool", "country": "UK" }
{ "index": { "_id": "paris" }}
{ "name": "Champs Élysées", "city": "Paris", "country": "France" }

When indexing child documents, you must specify the ID of the associated parent document:

PUT /company/employee/1?parent=london <1>
{
  "name": "Alice Smith",
  "dob": "1970-10-24",
  "hobby": "hiking"
}

<1> This employee document is a child of the london branch.

This parent ID serves two purposes: it creates the link between the parent and the child, and it ensures that the child document is stored on the same shard as the parent.

In “Routing a Document to a Shard”, we explained how Elasticsearch uses a routing value, which defaults to the _id of the document, to decide which shard a document should belong to. The routing value is plugged into this simple formula:

shard = hash(routing) % number_of_primary_shards

However, if a parent ID is specified, it is used as the routing value instead of the _id. In other words, both the parent and the child use the same routing value—the _id of the parent—and so they are both stored on the same shard.

The parent ID needs to be specified on all single-document requests: when retrieving a child document with a GET request, or when indexing, updating, or deleting a child document. Unlike a search request, which is forwarded to all shards in an index, these single-document requests are forwarded only to the shard that holds the document—if the parent ID is not specified, the request will probably be forwarded to the wrong shard.
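The routing rule can be sketched in a few lines of Python. This is only an illustration: the crc32 hash and the shard count of 5 are stand-ins chosen for the example, not the hash function or settings that Elasticsearch actually uses, and shard_for is a hypothetical helper.

```python
import zlib

NUM_PRIMARY_SHARDS = 5  # assumed value, for illustration only


def shard_for(doc_id, parent=None, num_shards=NUM_PRIMARY_SHARDS):
    """Pick a shard; a parent ID, when given, replaces _id as the routing value."""
    routing = parent if parent is not None else doc_id
    # crc32 is a stand-in hash; Elasticsearch uses its own hash function.
    return zlib.crc32(str(routing).encode("utf-8")) % num_shards


parent_shard = shard_for("london")
child_shard = shard_for("1", parent="london")
print(parent_shard == child_shard)  # True: both were routed on "london"

# A GET for the child without the parent ID is routed on "1" instead,
# so it may be sent to a shard that does not hold the document.
lookup_shard = shard_for("1")
print(lookup_shard, child_shard)
```

Because the child is routed on the parent ID, the two always agree. A lookup that omits the parent hashes a different value and lands on the right shard only by coincidence.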

The parent ID should also be specified when using the bulk API:

POST /company/employee/_bulk
{ "index": { "_id": 2, "parent": "london" }}
{ "name": "Mark Thomas", "dob": "1982-05-16", "hobby": "diving" }
{ "index": { "_id": 3, "parent": "liverpool" }}
{ "name": "Barry Smith", "dob": "1979-04-01", "hobby": "hiking" }
{ "index": { "_id": 4, "parent": "paris" }}
{ "name": "Adrien Grand", "dob": "1987-05-11", "hobby": "horses" }

WARNING

If you want to change the parent value of a child document, it is not sufficient to just reindex or update the child document—the new parent document may be on a different shard. Instead, you must first delete the old child, and then index the new child.

Finding Parents by Their Children

The has_child query and filter can be used to find parent documents based on the contents of their children. For instance, we could find all branches that have employees born in 1980 or later with a query like this:

GET /company/branch/_search
{
  "query": {
    "has_child": {
      "type": "employee",
      "query": {
        "range": {
          "dob": {
            "gte": "1980-01-01"
          }
        }
      }
    }
  }
}

Like the nested query, the has_child query could match several child documents, each with a different relevance score. How these scores are reduced to a single score for the parent document depends on the score_mode parameter. The default setting is none, which ignores the child scores and assigns a score of 1.0 to the parents, but it also accepts avg, min, max, and sum.

The following query will return both london and liverpool, but london will get a better score because Alice Smith is a better match than Barry Smith:

GET /company/branch/_search
{
  "query": {
    "has_child": {
      "type": "employee",
      "score_mode": "max",
      "query": {
        "match": {
          "name": "Alice Smith"
        }
      }
    }
  }
}

TIP

The default score_mode of none is significantly faster than the other modes because Elasticsearch doesn’t need to calculate the score for each child document. Set it to avg, min, max, or sum only if you care about the score.
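To make the score_mode options concrete, here is a minimal Python sketch of how several child scores could be reduced to a single parent score. The parent_score function and the sample scores are invented for illustration; this is not how Elasticsearch implements the reduction internally.

```python
def parent_score(child_scores, score_mode="none"):
    """Reduce the scores of a parent's matching children to one parent score."""
    if not child_scores:
        raise ValueError("a parent matches only if at least one child matches")
    if score_mode == "none":
        return 1.0  # child scores are ignored entirely
    reducers = {
        "avg": lambda scores: sum(scores) / len(scores),
        "min": min,
        "max": max,
        "sum": sum,
    }
    return reducers[score_mode](child_scores)


scores = [0.5, 2.0, 1.5]  # made-up relevance scores of three matching children
for mode in ("none", "avg", "min", "max", "sum"):
    print(mode, parent_score(scores, mode))
```

With none, the child scores never need to be computed at all, which is where the speed advantage comes from.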

min_children and max_children

The has_child query and filter both accept the min_children and max_children parameters, which will return the parent document only if the number of matching children is within the specified range.

This query will match only branches that have at least two employees:

GET /company/branch/_search
{
  "query": {
    "has_child": {
      "type": "employee",
      "min_children": 2, <1>
      "query": {
        "match_all": {}
      }
    }
  }
}

<1> A branch must have at least two employees in order to match.

The performance of a has_child query or filter with the min_children or max_children parameters is much the same as a has_child query with scoring enabled.
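The effect of min_children and max_children can be illustrated with a small in-memory sketch in Python, using the employees indexed earlier. The has_child function below is a hypothetical stand-in that only mimics the counting logic; it assumes that a parent needs at least one matching child when min_children is not given.

```python
def has_child(children_by_parent, matches, min_children=1, max_children=None):
    """Return parent IDs whose number of matching children is within range."""
    result = []
    for parent_id, children in children_by_parent.items():
        count = sum(1 for child in children if matches(child))
        if count < min_children:
            continue
        if max_children is not None and count > max_children:
            continue
        result.append(parent_id)
    return result


def match_all(child):
    """Stand-in for the match_all query: every child matches."""
    return True


employees = {
    "london": [{"name": "Alice Smith"}, {"name": "Mark Thomas"}],
    "liverpool": [{"name": "Barry Smith"}],
    "paris": [{"name": "Adrien Grand"}],
}

print(has_child(employees, match_all, min_children=2))  # ['london']
```

Every matching child has to be counted before the parent can be accepted or rejected. That counting is why the cost resembles a query with scoring enabled.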

HAS_CHILD FILTER

The has_child filter works in the same way as the has_child query, except that it doesn’t support the score_mode parameter. It can be used only in filter context—such as inside a filtered query—and behaves like any other filter: it includes or excludes, but doesn’t score.

While the results of a has_child filter are not cached, the usual caching rules apply to the filter inside the has_child filter.

Finding Children by Their Parents

Whereas a nested query can only ever return the root document as a result, parent and child documents are independent and each can be queried independently. The has_child query allows us to return parents based on data in their children, and the has_parent query returns children based on data in their parents.

The has_parent query looks very similar to the has_child query. This example returns employees who work in the UK:

GET /company/employee/_search
{
  "query": {
    "has_parent": {
      "type": "branch", <1>
      "query": {
        "match": {
          "country": "UK"
        }
      }
    }
  }
}

<1> Returns children who have parents of type branch.

The has_parent query also supports the score_mode parameter, but it accepts only two settings: none (the default) and score. Each child can have only one parent, so there is no need to reduce multiple scores into a single score for the child. The choice is simply between using the score (score) or not (none).
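As a rough illustration, the has_parent logic can be sketched in Python over the sample data. The has_parent and in_uk functions are hypothetical, the parent score of 2.5 is made up, and the constant score of 1.0 for none is an assumption that mirrors the has_child behavior.

```python
branches = {
    "london": {"country": "UK"},
    "liverpool": {"country": "UK"},
    "paris": {"country": "France"},
}
employees = [
    {"name": "Alice Smith", "parent": "london"},
    {"name": "Mark Thomas", "parent": "london"},
    {"name": "Barry Smith", "parent": "liverpool"},
    {"name": "Adrien Grand", "parent": "paris"},
]


def in_uk(branch):
    """Hypothetical parent scorer: 2.5 stands in for a real relevance score."""
    return 2.5 if branch["country"] == "UK" else None


def has_parent(children, parents, parent_scorer, score_mode="none"):
    """Return (child name, score) pairs for children whose one parent matches."""
    hits = []
    for child in children:
        score = parent_scorer(parents[child["parent"]])
        if score is None:  # the parent did not match, so the child is excluded
            continue
        # With a single parent per child there is nothing to reduce:
        # either pass the parent's score through or use a constant.
        hits.append((child["name"], score if score_mode == "score" else 1.0))
    return hits


print(has_parent(employees, branches, in_uk))
print(has_parent(employees, branches, in_uk, score_mode="score"))
```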

HAS_PARENT FILTER

The has_parent filter works in the same way as the has_parent query, except that it doesn’t support the score_mode parameter. It can be used only in filter context—such as inside a filtered query—and behaves like any other filter: it includes or excludes, but doesn’t score.

While the results of a has_parent filter are not cached, the usual caching rules apply to the filter inside the has_parent filter.

Children Aggregation

Parent-child supports a children aggregation as a direct analog to the nested aggregation discussed in “Nested Aggregations”. A parent aggregation (the equivalent of reverse_nested) is not supported.

This example demonstrates how we could determine the favorite hobbies of our employees by country:

GET /company/branch/_search?search_type=count
{
  "aggs": {
    "country": {
      "terms": { <1>
        "field": "country"
      },
      "aggs": {
        "employees": {
          "children": { <2>
            "type": "employee"
          },
          "aggs": {
            "hobby": {
              "terms": { <3>
                "field": "employee.hobby"
              }
            }
          }
        }
      }
    }
  }
}

<1> The country field in the branch documents.

<2> The children aggregation joins the parent documents with their associated children of type employee.

<3> The hobby field from the employee child documents.
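To show what this aggregation computes, here is a small Python sketch that produces the same kind of buckets from the sample branches and employees indexed earlier. It is an in-memory illustration of the result only; the variable names are invented and nothing here reflects how Elasticsearch executes the join.

```python
from collections import Counter, defaultdict

# Branch _id -> country, taken from the sample branch documents.
branches = {"london": "UK", "liverpool": "UK", "paris": "France"}

# (parent branch _id, hobby) pairs, taken from the sample employee documents.
employees = [
    ("london", "hiking"),
    ("london", "diving"),
    ("liverpool", "hiking"),
    ("paris", "horses"),
]

# Outer bucket: country of the parent branch. Inner bucket: hobby of the child.
hobbies_by_country = defaultdict(Counter)
for branch_id, hobby in employees:
    hobbies_by_country[branches[branch_id]][hobby] += 1

for country, hobbies in hobbies_by_country.items():
    print(country, hobbies.most_common())
```

For the four sample employees, hiking is counted twice in the UK bucket, while diving and horses are counted once each in their respective countries.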

Grandparents and Grandchildren

The parent-child relationship can extend across more than one generation—grandchildren can have grandparents—but it requires an extra step to ensure that documents from all generations are indexed on the same shard.

Let’s change our previous example to make the country type a parent of the branch type:

PUT /company
{
  "mappings": {
    "country": {},
    "branch": {
      "_parent": {
        "type": "country" <1>
      }
    },
    "employee": {
      "_parent": {
        "type": "branch" <2>
      }
    }
  }
}

<1> branch is a child of country.

<2> employee is a child of branch.

Countries and branches have a simple parent-child relationship, so we use the same process as we used in “Indexing Parents and Children”:

POST /company/country/_bulk
{ "index": { "_id": "uk" }}
{ "name": "UK" }
{ "index": { "_id": "france" }}
{ "name": "France" }

POST /company/branch/_bulk
{ "index": { "_id": "london", "parent": "uk" }}
{ "name": "London Westminster" }
{ "index": { "_id": "liverpool", "parent": "uk" }}
{ "name": "Liverpool Central" }
{ "index": { "_id": "paris", "parent": "france" }}
{ "name": "Champs Élysées" }

The parent ID has ensured that each branch document is routed to the same shard as its parent country document. However, look what would happen if we were to use the same technique with the employee grandchildren:

PUT /company/employee/1?parent=london
{
  "name": "Alice Smith",
  "dob": "1970-10-24",
  "hobby": "hiking"
}

The shard routing of the employee document would be decided by the parent ID—london—but the london document was routed to a shard by its own parent ID—uk. It is very likely that the grandchild would end up on a different shard from its parent and grandparent, which would prevent the same-shard parent-child mapping from functioning.

Instead, we need to add an extra routing parameter, set to the ID of the grandparent, to ensure that all three generations are indexed on the same shard. The indexing request should look like this:

PUT /company/employee/1?parent=london&routing=uk <1>
{
  "name": "Alice Smith",
  "dob": "1970-10-24",
  "hobby": "hiking"
}

<1> For shard selection, the routing value overrides the parent value.

The parent parameter is still used to link the employee document with its parent, but the routing parameter ensures that it is stored on the same shard as its parent and grandparent. The routing value needs to be provided for all single-document requests.
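The three-generation case can be sketched the same way in Python. As before, crc32 and the shard count of 5 are stand-ins for illustration rather than the hash and settings Elasticsearch uses, and shard_for is a hypothetical helper.

```python
import zlib


def shard_for(doc_id, parent=None, routing=None, num_shards=5):
    """Explicit routing wins over parent, which wins over the document's _id."""
    if routing is not None:
        value = routing
    elif parent is not None:
        value = parent
    else:
        value = doc_id
    # crc32 is a stand-in hash; Elasticsearch uses its own hash function.
    return zlib.crc32(str(value).encode("utf-8")) % num_shards


country_shard = shard_for("uk")
branch_shard = shard_for("london", parent="uk")

# Routed on "london": nothing ties this to the shard chosen for "uk".
employee_by_parent = shard_for("1", parent="london")

# Routed on "uk": guaranteed to sit with the branch and the country.
employee_by_routing = shard_for("1", parent="london", routing="uk")

print(country_shard, branch_shard, employee_by_routing)  # three equal numbers
print(employee_by_parent)  # may or may not be the same shard
```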

Querying and aggregating across generations works, as long as you step through each generation. For instance, to find countries where employees enjoy hiking, we need to join countries with branches, and branches with employees:

GET /company/country/_search
{
  "query": {
    "has_child": {
      "type": "branch",
      "query": {
        "has_child": {
          "type": "employee",
          "query": {
            "match": {
              "hobby": "hiking"
            }
          }
        }
      }
    }
  }
}

Practical Considerations

Parent-child joins can be a useful technique for managing relationships when index-time performance is more important than search-time performance, but they come at a significant cost. Parent-child queries can be 5 to 10 times slower than the equivalent nested query!

Memory Use

At the time of going to press, the parent-child ID map is still held in memory. There are plans to change the map to use doc values instead, which will be a big memory saving. Until that happens, you need to be aware of the following: the string _id field of every parent document has to be held in memory, and every child document requires 8 bytes (a long value) of memory. Actually, it’s a bit less thanks to compression, but this gives you a rough idea.
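The figures above lend themselves to a back-of-the-envelope estimate. The Python sketch below simply adds up the two costs named in the text; the function name and the sample numbers (one million parents with 20-byte IDs and fifty million children) are invented for illustration, and the result ignores both compression and any per-entry overhead.

```python
def id_map_estimate_bytes(num_parents, avg_parent_id_bytes, num_children):
    """Rough estimate: every parent _id string plus 8 bytes (a long) per child."""
    return num_parents * avg_parent_id_bytes + num_children * 8


# Hypothetical index: 1 million parents, 20-byte IDs, 50 million children.
estimate = id_map_estimate_bytes(1_000_000, 20, 50_000_000)
print(f"{estimate:,} bytes")  # 420,000,000 bytes
```

In this made-up case the children account for most of the total, although shorter parent IDs would still trim the remainder.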

You can check how much memory is being used by the parent-child cache by consulting the indices-stats API (for a summary at the index level) or the node-stats API (for a summary at the node level):

GET /_nodes/stats/indices/id_cache?human <1>

<1> Returns memory use of the ID cache summarized by node in a human-friendly format.

Global Ordinals and Latency

Parent-child uses global ordinals to speed up joins. Regardless of whether the parent-child map uses an in-memory cache or on-disk doc values, global ordinals still need to be rebuilt after any change to the index.

The more parents in a shard, the longer global ordinals will take to build. Parent-child is best suited to situations where there are many children for each parent, rather than many parents and few children.

Global ordinals, by default, are built lazily: the first parent-child query or aggregation after a refresh will trigger building of global ordinals. This can introduce a significant latency spike for your users. You can use eager_global_ordinals to shift the cost of building global ordinals from query time to refresh time, by mapping the _parent field as follows:

PUT /company
{
  "mappings": {
    "branch": {},
    "employee": {
      "_parent": {
        "type": "branch",
        "fielddata": {
          "loading": "eager_global_ordinals" <1>
        }
      }
    }
  }
}

<1> Global ordinals for the _parent field will be built before a new segment becomes visible to search.

With many parents, global ordinals can take several seconds to build. In this case, it makes sense to increase the refresh_interval so that refreshes happen less often and global ordinals remain valid for longer. This will greatly reduce the CPU cost of rebuilding global ordinals every second.

Multigenerations and Concluding Thoughts

The ability to join multiple generations (see “Grandparents and Grandchildren”) sounds attractive until you think of the costs involved:

§ The more joins you have, the worse performance will be.

§ Each generation of parents needs to have its string _id fields stored in memory, which can consume a lot of RAM.

As you consider your relationship schemes and whether parent-child is right for you, consider this advice about parent-child relationships:

§ Use parent-child relationships sparingly, and only when there are many more children than parents.

§ Avoid using multiple parent-child joins in a single query.

§ Avoid scoring by using the has_child filter, or the has_child query with score_mode set to none.

§ Keep the parent IDs short, so that they require less memory.

Above all: think about the other relationship techniques that we have discussed before reaching for parent-child.