Extending Your Index Structure - Elasticsearch Server, Second Edition (2014)

Chapter 4. Extending Your Index Structure

In the previous chapter, we learned many things about querying Elasticsearch. We saw how to choose fields that will be returned and learned how querying works in Elasticsearch. In addition to that, we now know the basic queries that are available and how to filter our data. What's more, we saw how to highlight the matches in our documents and how to validate our queries. In the end, we saw the compound queries of Elasticsearch and learned how to sort our data. By the end of this chapter, you will have learned the following topics:

· Indexing tree-like structured data

· Indexing data that is not flat

· Modifying your index structure when possible

· Indexing data with relationships by using nested documents

· Indexing data with relationships between them by using the parent-child functionality

Indexing tree-like structures

Trees are everywhere. If you develop a shop application, you would probably have categories. If you look at the filesystem, the files and directories are arranged in tree-like structures. This book can also be represented as a tree: chapters contain topics and topics are divided into subtopics. As you can imagine, Elasticsearch is also capable of indexing tree-like structures. Let's check how we can navigate through this type of data using path_analyzer.

Data structure

First, let's create a simple index structure by using the following lines of code:

curl -XPUT 'localhost:9200/path' -d '{
  "settings" : {
    "index" : {
      "analysis" : {
        "analyzer" : {
          "path_analyzer" : { "tokenizer" : "path_hierarchy" }
        }
      }
    }
  },
  "mappings" : {
    "category" : {
      "properties" : {
        "category" : {
          "type" : "string",
          "fields" : {
            "name" : { "type" : "string","index" : "not_analyzed" },
            "path" : { "type" : "string","analyzer" : "path_analyzer","store" : true }
          }
        }
      }
    }
  }
}'

As you can see, we have a single type created: the category type. We will use it to store the information about the location of our document in the tree structure. The idea is simple: we can show the location of the document as a path, in exactly the same manner as files and directories are presented on your hard disk drive. For example, in an automotive shop we can have /cars/passenger/sport, /cars/passenger/camper, or /cars/delivery_truck/. However, we need to index this path in three ways. The name subfield (category.name) is not analyzed, so it holds the whole path as a single term. The path subfield (category.path) uses the path_analyzer that we defined. Finally, the category field itself is indexed with the default settings, just in case we want to search it.

Analysis

Now, let's see what Elasticsearch will do with the category path during the analysis process. To see this, we will use the following command line, which uses the analysis API described in the Understanding field analysis section in Chapter 5, Make Your Search Better:

curl -XGET 'localhost:9200/path/_analyze?field=category.path&pretty' -d '/cars/passenger/sport'

The following results were returned by Elasticsearch:

{
  "tokens" : [ {
    "token" : "/cars",
    "start_offset" : 0,
    "end_offset" : 5,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "/cars/passenger",
    "start_offset" : 0,
    "end_offset" : 15,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "/cars/passenger/sport",
    "start_offset" : 0,
    "end_offset" : 21,
    "type" : "word",
    "position" : 1
  } ]
}

As we can see, our category path /cars/passenger/sport was processed by Elasticsearch and divided into three tokens. Thanks to this, we can simply find every document that belongs to a given category or its subcategories using the term filter. An example of using filters is as follows:

{
  "filter" : {
    "term" : { "category.path" : "/cars" }
  }
}

Note that we also have the original value indexed in the category.name field. This is handy when we want to find documents from a particular path, ignoring documents that are deeper in the hierarchy.
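To make the effect of the path_hierarchy tokenizer more tangible, the following is a minimal Python sketch that mimics the token stream shown above and the two kinds of lookups it enables. It is only an illustration and not how Elasticsearch implements the tokenizer; the function name path_hierarchy_tokens, the document identifiers, and the extra sample paths are assumptions made for this example.

```python
def path_hierarchy_tokens(path, delimiter="/"):
    """Return every ancestor prefix of `path`, followed by the path itself."""
    tokens = []
    for i, ch in enumerate(path):
        # Every delimiter except the leading one closes an ancestor prefix.
        if ch == delimiter and i > 0:
            tokens.append(path[:i])
    if path:
        tokens.append(path)
    return tokens


# Hypothetical documents: identifier -> category path.
docs = {
    1: "/cars/passenger/sport",
    2: "/cars/passenger/camper",
    3: "/bikes/mountain",
    4: "/cars/passenger",
}

# Stands in for category.path: one list of prefix tokens per document.
path_index = {doc_id: path_hierarchy_tokens(p) for doc_id, p in docs.items()}

# Like a term filter on category.path: the category and everything below it.
under_passenger = sorted(
    doc_id for doc_id, tokens in path_index.items() if "/cars/passenger" in tokens
)

# Like a term filter on category.name: only the exact path.
exact_passenger = sorted(
    doc_id for doc_id, p in docs.items() if p == "/cars/passenger"
)

print(path_hierarchy_tokens("/cars/passenger/sport"))
print(under_passenger)
print(exact_passenger)
```

The lookup against the prefix tokens corresponds to filtering on category.path, while the exact comparison corresponds to filtering on category.name.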

Indexing data that is not flat

Not all data is flat like the data we have been using so far in this book. If we are building the system that Elasticsearch will be a part of, we can create a structure that is convenient for Elasticsearch. However, the structure can't always be flat, because not all use cases allow that. Let's see how to create mappings that use fully structured JSON objects.

Data

Let's assume that we have the following data (we will store it in the file named structured_data.json):

{
  "book" : {
    "author" : {
      "name" : {
        "firstName" : "Fyodor",
        "lastName" : "Dostoevsky"
      }
    },
    "isbn" : "123456789",
    "englishTitle" : "Crime and Punishment",
    "year" : 1886,
    "characters" : [
      {
        "name" : "Raskolnikov"
      },
      {
        "name" : "Sofia"
      }
    ],
    "copies" : 0
  }
}

As you can see in the preceding code, the data is not flat; it contains arrays and nested objects. If we would like to create mappings and use the knowledge that we've obtained so far, we will have to flatten the data. However, Elasticsearch allows some degree of structure to be present in the documents and we should be able to create mappings that will be able to handle the preceding example.

Objects

The preceding example shows the structured JSON file. As you can see, the root object in our example file is book. The book object has some additional, simple properties, such as englishTitle. Those will be indexed as normal fields. In addition to that, it has the characters array, which we will discuss in the next paragraph. For now, let's focus on author. As you can see, author is an object that has another object nested within it: the name object, which has two properties, firstName and lastName.

Arrays

We already used the array type data, but we didn't discuss it in detail. By default, all fields in Lucene and thus in Elasticsearch are multivalued, which means that they can store multiple values. In order to send such fields to be indexed, we use the JSON array type, which is nested within opening and closing square brackets []. As you can see in the preceding example, we used the array type for characters within the book.

Mappings

To index arrays, we just need to specify the properties for such fields inside the array name. So, in our case, in order to index the characters data, we would need to add the following mappings:

"characters" : {
  "properties" : {
    "name" : {"type" : "string", "store" : "yes"}
  }
}

Nothing strange, we just nest the properties section inside the array's name (which is characters in our case) and we define the fields there. As a result of the preceding mappings, we would get characters.name as a multivalued field in the index.

Similarly, for the author object, we will call the section with the same name as it is present in the data, but in addition to the properties section, we also inform Elasticsearch that it should expect an object type by adding the type property with the value as object. We have the author object, but it also has the name object nested within it, so we just nest another object inside it. So, our mappings for the author field would look like the following:

"author" : {
  "type" : "object",
  "properties" : {
    "name" : {
      "type" : "object",
      "properties" : {
        "firstName" : {"type" : "string", "index" : "analyzed"},
        "lastName" : {"type" : "string", "index" : "analyzed"}
      }
    }
  }
}

The firstName and lastName fields appear in the index as author.name.firstName and author.name.lastName.

The rest of the fields are simple core types, so I'll skip discussing them as they were already discussed in the Mappings configuration section of Chapter 2, Indexing Your Data.
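The way objects and arrays end up as dotted, multivalued field names can be pictured with a short Python sketch. This is a simplified model of the naming scheme described above and not the Elasticsearch implementation; the flatten helper is a hypothetical name introduced only for this illustration, and it is given the contents of the book object from the example data.

```python
def flatten(value, prefix="", out=None):
    """Collect leaf values under dotted field names; every field is a list."""
    if out is None:
        out = {}
    if isinstance(value, dict):
        # Objects contribute their key to the dotted path.
        for key, child in value.items():
            flatten(child, f"{prefix}.{key}" if prefix else key, out)
    elif isinstance(value, list):
        # Arrays add no path segment; their items share one field name.
        for item in value:
            flatten(item, prefix, out)
    else:
        out.setdefault(prefix, []).append(value)
    return out


book = {
    "author": {"name": {"firstName": "Fyodor", "lastName": "Dostoevsky"}},
    "isbn": "123456789",
    "englishTitle": "Crime and Punishment",
    "year": 1886,
    "characters": [{"name": "Raskolnikov"}, {"name": "Sofia"}],
    "copies": 0,
}

fields = flatten(book)
for name in sorted(fields):
    print(name, fields[name])
```

Note how the two character objects collapse into the single multivalued field characters.name; the association between values that belong to the same array element is not kept.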

Final mappings

So, our final mappings file, which we've named structured_mapping.json, looks as follows:

{
  "book" : {
    "properties" : {
      "author" : {
        "type" : "object",
        "properties" : {
          "name" : {
            "type" : "object",
            "properties" : {
              "firstName" : {"type" : "string", "store": "yes"},
              "lastName" : {"type" : "string", "store": "yes"}
            }
          }
        }
      },
      "isbn" : {"type" : "string", "store": "yes"},
      "englishTitle" : {"type" : "string", "store": "yes"},
      "year" : {"type" : "integer", "store": "yes"},
      "characters" : {
        "properties" : {
          "name" : {"type" : "string", "store": "yes"}
        }
      },
      "copies" : {"type" : "integer", "store": "yes"}
    }
  }
}

As you can see, we set the store property to yes for all of the fields. This is just to show you that the fields were properly indexed.

Sending the mappings to Elasticsearch

Now that we have done our mappings, we would like to test if all of them actually work. This time we will use a slightly different technique to create an index and put the mappings. First, let's create the library index using the following command line:

curl -XPUT 'localhost:9200/library'

Now, let's send our mappings for the book type, using the following command line:

curl -XPUT 'localhost:9200/library/book/_mapping' -d @structured_mapping.json

We can now index our example data using the following command line:

curl -XPOST 'localhost:9200/library/book/1' -d @structured_data.json

To be or not to be dynamic

As we already know, Elasticsearch is schemaless, which means it can index data without the need to create the mappings upfront. The dynamic behavior of Elasticsearch is turned on by default, but there may be situations where you may want to turn it off for some parts of your index. In order to do that, you should add the dynamic property to the given field and set it to false. This should be done on the same level of nesting as the type property for objects that shouldn't be dynamic. For example, if we would like our author and name objects to not be dynamic, we should modify the relevant part of the mappings file so that it looks similar to the following lines of code:

"author" : {
  "type" : "object",
  "dynamic" : false,
  "properties" : {
    "name" : {
      "type" : "object",
      "dynamic" : false,
      "properties" : {
        "firstName" : {"type" : "string", "index" : "analyzed"},
        "lastName" : {"type" : "string", "index" : "analyzed"}
      }
    }
  }
}

However, please remember that in order to add new fields for such objects we will have to update the mappings.

Note

You can also turn off the dynamic mappings functionality by adding the index.mapper.dynamic property to your elasticsearch.yml configuration file and setting it to false.
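The effect of setting dynamic to false on an object can be pictured with the following Python sketch. It is a simplified model: index_fields and the middleName field are hypothetical, and the real mapping logic handles many more cases. In this model, a field that is missing from the mappings of a non-dynamic object is left out of the index (the value still remains in the _source of the document), while a dynamic object adds the field to its mappings.

```python
def index_fields(source, mapped_fields, dynamic=True):
    """Return (indexed fields, resulting set of mapped field names)."""
    mapped = set(mapped_fields)
    indexed = {}
    for field, value in source.items():
        if field in mapped:
            indexed[field] = value
        elif dynamic:
            # Dynamic object: the unknown field is added to the mappings.
            mapped.add(field)
            indexed[field] = value
        # Non-dynamic object: the unknown field is skipped.
    return indexed, mapped


name_mapping = {"firstName", "lastName"}
name_object = {
    "firstName": "Fyodor",
    "lastName": "Dostoevsky",
    "middleName": "Mikhailovich",  # hypothetical field, absent from the mappings
}

dynamic_result, dynamic_mapping = index_fields(name_object, name_mapping, dynamic=True)
static_result, static_mapping = index_fields(name_object, name_mapping, dynamic=False)

print(sorted(dynamic_result))
print(sorted(static_result))
```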

Using nested objects

Nested objects can come in handy in certain situations. Basically, with nested objects, Elasticsearch allows us to connect multiple documents together: one main document and multiple dependent ones. The main document and the nested ones will be indexed together and they will be placed in the same segment of the index (actually, in the same block), which guarantees the best performance we can get for such a data structure. The same goes for changing the document; unless you are using the update API, you need to index the parent document and all the other nested documents at the same time.

Note

If you would like to read more about how nested objects work on the Lucene level, there is a very good blog post by Mike McCandless at http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html.

Now, let's get to our example use case. Imagine that we have a clothes shop and we store the size and color of each t-shirt. Our standard, non-nested mappings will look similar to the following lines of code (stored in cloth.json):

{
  "cloth" : {
    "properties" : {
      "name" : {"type" : "string"},
      "size" : {"type" : "string", "index" : "not_analyzed"},
      "color" : {"type" : "string", "index" : "not_analyzed"}
    }
  }
}

Imagine that we have a red t-shirt only in the XXL size and a black one only in the XL size in our shop. So our example document will look like the following code:

{
  "name" : "Test shirt",
  "size" : [ "XXL", "XL" ],
  "color" : [ "red", "black" ]
}

However, there is a problem with this data structure. What if one of our clients searches our shop in order to find the XXL t-shirt in black? Let's check that by running the following query (we assume that we've used our mappings to create the index and we've indexed our example document):

curl -XGET 'localhost:9200/shop/cloth/_search?pretty=true' -d '{
  "query" : {
    "bool" : {
      "must" : [
        {
          "term" : { "size" : "XXL" }
        },
        {
          "term" : { "color" : "black" }
        }
      ]
    }
  }
}'

We should get no results, right? But, in fact, Elasticsearch returned the following document:

{
  (…)
  "hits" : {
    "total" : 1,
    "max_score" : 0.4339554,
    "hits" : [ {
      "_index" : "shop",
      "_type" : "cloth",
      "_id" : "1",
      "_score" : 0.4339554,
      "_source" : { "name" : "Test shirt",
        "size" : [ "XXL", "XL" ],
        "color" : [ "red", "black" ]}
    } ]
  }
}

This is because the document was matched as a whole: the value XXL is present in its size field and the value black is present in its color field, even though they don't describe the same t-shirt. Of course, this is not what we would like to get.

So, let's modify our mappings to use nested objects to separate color and size to different, nested documents. The final mapping looks like the following (we store these mappings in the cloth_nested.json file):

{
  "cloth" : {
    "properties" : {
      "name" : {"type" : "string", "index" : "analyzed"},
      "variation" : {
        "type" : "nested",
        "properties" : {
          "size" : {"type" : "string", "index" : "not_analyzed"},
          "color" : {"type" : "string", "index" : "not_analyzed"}
        }
      }
    }
  }
}

As you can see, we've introduced a new object, variation, inside our cloth type, which is a nested one (the type property set to nested). It basically says that we will want to index nested documents. Now, let's modify our document. We will add the variation object to it and that object will store objects with two properties: size and color. So, our example product will look as follows:

{
  "name" : "Test shirt",
  "variation" : [
    { "size" : "XXL", "color" : "red" },
    { "size" : "XL", "color" : "black" }
  ]
}

We've structured the document so that each size and its matching color is a separate document. However, if you ran our previous query, it wouldn't return any documents. This is because in order to query for nested documents, we need to use a specialized query. So, now our query looks as follows (of course we've created our index and type again):

curl -XGET 'localhost:9200/shop/cloth/_search?pretty=true' -d '{
  "query" : {
    "nested" : {
      "path" : "variation",
      "query" : {
        "bool" : {
          "must" : [
            { "term" : { "variation.size" : "XXL" } },
            { "term" : { "variation.color" : "black" } }
          ]
        }
      }
    }
  }
}'

And now, the preceding query wouldn't return the indexed document, because we don't have a nested document that has a size equal to XXL and the color black.
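The difference between the two document layouts can be reproduced with a few lines of Python. This is only a model of the matching logic described in this section, not of how Elasticsearch executes the queries; the helper names flat_match and nested_match are made up for the example.

```python
flat_doc = {
    "name": "Test shirt",
    "size": ["XXL", "XL"],
    "color": ["red", "black"],
}

nested_doc = {
    "name": "Test shirt",
    "variation": [
        {"size": "XXL", "color": "red"},
        {"size": "XL", "color": "black"},
    ],
}


def flat_match(doc, size, color):
    # Each condition is checked against the whole multivalued field,
    # so the two values may come from different t-shirts.
    return size in doc["size"] and color in doc["color"]


def nested_match(doc, size, color):
    # Both conditions must hold inside the same nested document.
    return any(v["size"] == size and v["color"] == color for v in doc["variation"])


print(flat_match(flat_doc, "XXL", "black"))      # the unwanted hit
print(nested_match(nested_doc, "XXL", "black"))  # no such variation
print(nested_match(nested_doc, "XXL", "red"))    # a real variation
```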

Let's get back to the query for a second to discuss it briefly. As you can see, we've used the nested query in order to search in the nested documents. The path property specifies the name of the nested object (yes, we can have multiple). As you can see, we just included a standard query section under the nested type. Please also note that we specified the full path for the field names in the nested objects, which is handy when you have multilevel nesting, which is also possible.

Note

If you would like to filter your data on the basis of nested objects, you can do it—there is a nested filter, which has the same functionality as the nested query. Please refer to the Filtering your results section in Chapter 3, Searching Your Data, for more information about filtering.

Scoring and nested queries

There is an additional property when it comes to handling nested documents during queries. In addition to the path property, there is the score_mode property, which allows us to define how the score is calculated from the nested queries. Elasticsearch allows us to set this property to one of the following values:

· avg: This is the default value. Elasticsearch takes the average of the scores of the defined nested queries and includes it in the score of the main query.

· total: Elasticsearch takes the sum of the scores for each nested query and includes it in the score of the main query.

· max: Elasticsearch takes the score of the maximum scoring nested query and includes it in the score of the main query.

· none: No score is taken from the nested query.
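As a rough illustration of the four modes listed above, the following Python sketch combines a list of scores coming from nested matches into a single value. The function combine_nested_scores and the sample scores are hypothetical, and returning 0.0 for none is an assumption made for this sketch; the actual scoring is done inside Elasticsearch and Lucene.

```python
def combine_nested_scores(scores, score_mode="avg"):
    """Combine the scores of nested matches according to score_mode."""
    if not scores:
        raise ValueError("at least one nested match is required")
    if score_mode == "avg":
        return sum(scores) / len(scores)
    if score_mode == "total":
        return sum(scores)
    if score_mode == "max":
        return max(scores)
    if score_mode == "none":
        # Assumption for this sketch: the nested matches contribute nothing.
        return 0.0
    raise ValueError("unknown score_mode: " + score_mode)


nested_scores = [0.5, 1.5]  # hypothetical scores of two nested matches
for mode in ("avg", "total", "max", "none"):
    print(mode, combine_nested_scores(nested_scores, mode))
```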

Using the parent-child relationship

In the previous section, we discussed the ability to index nested documents along with the parent one. Even though the nested documents are indexed as separate documents in the index, we can't change a single nested document (unless we use the update API). However, Elasticsearch allows us to have a real parent-child relationship and we will look at it in the following section.

Index structure and data indexing

Let's use the same example that we used when discussing the nested documents—the hypothetical cloth store. However, what we would like to have is the ability to update sizes and colors without the need to index the whole document after each change.

Parent mappings

The only field we need to have in our parent document is name. We don't need anything more than that. So, in order to create our cloth type in the shop index, we will run the following commands:

curl -XPOST 'localhost:9200/shop'

curl -XPUT 'localhost:9200/shop/cloth/_mapping' -d '{
  "cloth" : {
    "properties" : {
      "name" : {"type" : "string"}
    }
  }
}'

Child mappings

To create child mappings, we need to add the _parent property with the name of the parent type—cloth, in our case. So, the command that will create the variation type would look as follows:

curl -XPUT 'localhost:9200/shop/variation/_mapping' -d '{
  "variation" : {
    "_parent" : { "type" : "cloth" },
    "properties" : {
      "size" : {"type" : "string", "index" : "not_analyzed"},
      "color" : {"type" : "string", "index" : "not_analyzed"}
    }
  }
}'

And, that's all. You don't need to specify which field will be used to connect child documents to the parent ones because, by default, Elasticsearch will use the unique identifier for that. If you remember from the previous chapters, the information about a unique identifier is present in the index by default.

The parent document

Now, we are going to index our parent document. It's very simple; to do that, we just run the usual indexing command, for example, the one as follows:

curl -XPOST 'localhost:9200/shop/cloth/1' -d '{
  "name" : "Test shirt"
}'

If you look at the preceding command, you'll notice that our document will be given the identifier 1.

The child documents

To index child documents, we need to provide information about the parent document with the use of the parent request parameter and set that parameter value to the identifier of the parent document. So, to index two child documents to our parent document, we would need to run the following command lines:

curl -XPOST 'localhost:9200/shop/variation/1000?parent=1' -d '{
  "color" : "red",
  "size" : "XXL"
}'

Also, we need to run the following command lines to index the second child document:

curl -XPOST 'localhost:9200/shop/variation/1001?parent=1' -d '{
  "color" : "black",
  "size" : "XL"
}'

And that's all. We've indexed two additional documents, which are of a new type, but we've specified that our documents have a parent—the document with an identifier of 1.

Querying

We've indexed our data and now we need to use appropriate queries to match documents with the data stored in their children. Of course, we can also run queries against the child documents and check their parent's existence. However, please note that when running queries against parents, child documents won't be returned, and vice versa.

Querying data in the child documents

So, if we would like to get clothes that are of the XXL size and in red, we would run the following command lines:

curl -XGET 'localhost:9200/shop/_search?pretty' -d '{
  "query" : {
    "has_child" : {
      "type" : "variation",
      "query" : {
        "bool" : {
          "must" : [
            { "term" : { "size" : "XXL" } },
            { "term" : { "color" : "red" } }
          ]
        }
      }
    }
  }
}'

The query is quite simple; it is of the has_child type, which tells Elasticsearch that we want to search in the child documents. In order to specify which type of children we are interested in, we specify the type property with the name of the child type. Then we have a standard bool query, which we've already discussed. The result of the query will contain only parent documents, which in our case will look as follows:

{
  (...)
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "shop",
      "_type" : "cloth",
      "_id" : "1",
      "_score" : 1.0, "_source" : { "name" : "Test shirt" }
    } ]
  }
}

The top children query

In addition to the has_child query, Elasticsearch exposes one additional query that returns parent documents, but is run against the child documents—the top_children query. That query can be used to run against a specified number of child documents. Let's look at the following query:

{
  "query" : {
    "top_children" : {
      "type" : "variation",
      "query" : {
        "term" : { "size" : "XXL" }
      },
      "score" : "max",
      "factor" : 10,
      "incremental_factor" : 2
    }
  }
}

The preceding query will be run first against a total of 100 child documents (factor multiplied by the default size parameter of 10). If there are 10 parent documents found (because of the default size parameter being equal to 10), then those will be returned and the query execution will end. However, if fewer parents are returned and there are still child documents that were not queried, another 20 documents will be queried (the incremental_factor parameter multiplied by the result's size), and so on, until the requested number of parent documents is found or there are no child documents left to be queried.

The top_children query offers the ability to specify how the score should be calculated with the use of the score parameter, with the value of max (maximum of all the scores of child queries), sum (sum of all the scores of child queries), or avg (average of all the scores of child queries) as the possible ones.

Querying data in the parent documents

If you would like to return child documents that match given data in the parent document, you should use the has_parent query. It is similar to the has_child query; however, instead of the type property, we specify the parent_type property with the value of the parent document type. For example, the following query will return both the child documents that we've indexed, but not the parent document:

curl -XGET 'localhost:9200/shop/_search?pretty' -d '{
  "query" : {
    "has_parent" : {
      "parent_type" : "cloth",
      "query" : {
        "term" : { "name" : "test" }
      }
    }
  }
}'

The response from Elasticsearch should be similar to the following one:

{
  (...)
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "shop",
      "_type" : "variation",
      "_id" : "1000",
      "_score" : 1.0, "_source" : {"color" : "red","size" : "XXL"}
    }, {
      "_index" : "shop",
      "_type" : "variation",
      "_id" : "1001",
      "_score" : 1.0, "_source" : {"color" : "black","size" : "XL"}
    } ]
  }
}
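The direction of the two queries discussed in this section, has_child returning parents and has_parent returning children, can be summarized with a small in-memory Python model. It only mirrors the join semantics using the example documents; the dictionaries and the helper functions are assumptions for this sketch and say nothing about how Elasticsearch executes these queries internally.

```python
# The example documents, keyed by identifier.
parents = {"1": {"name": "Test shirt"}}
children = {
    "1000": {"parent": "1", "color": "red", "size": "XXL"},
    "1001": {"parent": "1", "color": "black", "size": "XL"},
}


def has_child(child_predicate):
    """Return identifiers of parents with at least one matching child."""
    matched = {c["parent"] for c in children.values() if child_predicate(c)}
    return sorted(pid for pid in matched if pid in parents)


def has_parent(parent_predicate):
    """Return identifiers of children whose parent matches."""
    matched = {pid for pid, p in parents.items() if parent_predicate(p)}
    return sorted(cid for cid, c in children.items() if c["parent"] in matched)


# Parents that have a red XXL variation.
print(has_child(lambda c: c["size"] == "XXL" and c["color"] == "red"))

# Parents that have a black XXL variation: both values must be in one child.
print(has_child(lambda c: c["size"] == "XXL" and c["color"] == "black"))

# Children of parents whose name contains the term "test"
# (lowercasing and splitting stands in for the analysis of the name field).
print(has_parent(lambda p: "test" in p["name"].lower().split()))
```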

The parent-child relationship and filtering

If you would like to use the parent-child queries as filters, you can; there are has_child and has_parent filters that have the same functionality as queries with corresponding names. Actually, Elasticsearch wraps those filters in the constant score query to allow them to be used as queries.

Performance considerations

When using the Elasticsearch parent-child functionality, you have to be aware of the performance impact that it has. The first thing you need to remember is that the parent and the child documents need to be stored in the same shard in order for the queries to work. If you happen to have a high number of children for a single parent, you may end up with shards not having a similar number of documents. Because of that, your query performance can be lower on one of the nodes, resulting in the whole query being slower. Also, please remember that the parent-child queries will be slower than the ones that run against documents that don't have a relationship between them.

The second very important thing is that when running queries, like the has_child query, Elasticsearch needs to preload and cache the document identifiers. Those identifiers will be stored in the memory and you have to be sure that you have given Elasticsearch enough memory to store those identifiers. Otherwise, you can expect OutOfMemory exceptions to be thrown and your nodes, or even the whole cluster, to stop being operational.

Finally, as we mentioned, the first query will preload and cache the document identifiers. This takes time. In order to improve the performance of initial queries that use the parent-child relationship, the Warmer API can be used. You can find more information about how to add warming queries to Elasticsearch in the Warming up section of Chapter 8, Administrating Your Cluster.
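The same-shard requirement mentioned at the beginning of this section follows from how documents are routed: the shard is derived from a routing value, and for a child document the identifier of the parent is used as that value. The Python sketch below illustrates the idea under loud assumptions: zlib.crc32 is only a stand-in for the hash function that Elasticsearch really uses, and the number of shards and the helper names are invented for the example.

```python
import zlib


def shard_for(routing_value, number_of_shards):
    # Stand-in hash: crc32 keeps the sketch deterministic; it is NOT
    # the hash function used by Elasticsearch.
    return zlib.crc32(routing_value.encode("utf-8")) % number_of_shards


def shard_for_document(doc_id, number_of_shards, parent=None):
    # A child document is routed by its parent's identifier,
    # any other document by its own identifier.
    routing_value = parent if parent is not None else doc_id
    return shard_for(routing_value, number_of_shards)


NUMBER_OF_SHARDS = 5  # assumed value for the example

parent_shard = shard_for_document("1", NUMBER_OF_SHARDS)
child_shards = [
    shard_for_document(child_id, NUMBER_OF_SHARDS, parent="1")
    for child_id in ("1000", "1001")
]

print(parent_shard, child_shards)
```

Because every child inherits the routing of its parent, a parent with a very large number of children concentrates all of them in one shard, which is the imbalance described above.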

Modifying your index structure with the update API

In the previous chapters, we discussed how to create index mappings and index the data. But what if you already have the mappings created and data indexed, but want to modify the structure of the index? This is possible to some extent. For example, by default, if we index a document with a new field, Elasticsearch will add that field to the index structure. Let's now look at how to modify the index structure manually.

The mappings

Let's assume that we have the following mappings for our users index stored in the user.json file:

{
  "user" : {
    "properties" : {
      "name" : {"type" : "string"}
    }
  }
}

As you can see, it is very simple. It just has a single property that will hold the username. Now, let's create an index called users, and use the previous mappings to create our own type. To do that, we will run the following commands:

curl -XPOST 'localhost:9200/users'

curl -XPUT 'localhost:9200/users/user/_mapping' -d @user.json

If everything functions correctly, we will have our index and type created. So now, let's try to add a new field to the mappings.

Adding a new field

In order to illustrate how to add a new field to our mappings, we assume that we want to add a phone number to the data stored for each user. In order to do that, we need to send an HTTP PUT command to the /index_name/type_name/_mapping REST endpoint with the proper body that will include our new field. For example, to add the phone field, we would run the following command:

curl -XPUT 'http://localhost:9200/users/user/_mapping' -d '{
  "user" : {
    "properties" : {
      "phone" : {"type" : "string","store" : "yes","index" : "not_analyzed"}
    }
  }
}'

And again, if everything functions correctly, we should have a new field added to our index structure. To ensure everything is all right, we can run the GET HTTP request to the _mapping REST endpoint and Elasticsearch will return the appropriate mappings. An example command to get the mappings for our user type in the users index could look as follows:

curl -XGET 'localhost:9200/users/user/_mapping?pretty'

Note

After adding a new field to the existing type, we need to index all the documents again, because Elasticsearch didn't update them automatically. This is crucial to remember. You can use your primary source of data to do that or use the _source field to get the original data from it and index it once again.

Modifying fields

So now, our index structure contains two fields: name and phone. We indexed some data, but after a while, we decided that we want to search on the phone field and we would like to change the index property from not_analyzed to analyzed. So, we run the following command:

curl -XPUT 'http://localhost:9200/users/user/_mapping' -d '{
  "user" : {
    "properties" : {
      "phone" : {"type" : "string","store" : "yes","index" : "analyzed"}
    }
  }
}'

After running the preceding command lines, Elasticsearch returns the following output:

{"error":"MergeMappingException[Merge failed with failures {[mapper [phone] has different index values, mapper [phone] has different 'norms.enabled' values, mapper [phone] has different tokenize values, mapper [phone] has different index_analyzer]}]","status":400}

This is because we can't change the not_analyzed field to analyzed. And not only that, in most cases you won't be able to update the mappings of existing fields. This is a good thing, because if we were allowed to change such settings, we would confuse Elasticsearch and Lucene. Imagine that we already have many documents with the phone field set to not_analyzed and we are allowed to change the mappings to analyzed. Elasticsearch wouldn't change the data that was already indexed, but the queries that are analyzed would be processed with a different logic and thus you wouldn't be able to properly find your data.
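The mismatch described here can be shown with a small Python sketch. It assumes a hypothetical phone value and uses simple_analyzer, a stand-in that lowercases the text and splits it on whitespace; the analyzer actually configured for a field would produce different tokens, so treat this only as an illustration of the principle.

```python
def not_analyzed(value):
    # The whole value is kept as a single term.
    return [value]


def simple_analyzer(value):
    # Stand-in analyzer: lowercase and split on whitespace.
    return value.lower().split()


phone = "555 123 456"  # hypothetical phone number

# Terms written to the index while the field was still not_analyzed.
indexed_terms = not_analyzed(phone)

# Terms produced from the same text once queries are analyzed.
query_terms = simple_analyzer(phone)

# None of the query terms equals the single term stored in the index.
hits = [term for term in query_terms if term in indexed_terms]

print(indexed_terms)
print(query_terms)
print(hits)
```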

To give you some examples of what is prohibited and what is not, we will mention some of the operations for both cases. For example, the following modifications can be safely made:

· Adding a new type definition

· Adding a new field

· Adding a new analyzer

The following modifications are prohibited or will not work:

· Changing the type of the field (for example from text to numeric)

· Changing a stored field to not be stored and vice versa

· Changing the value of the indexed property

· Changing the analyzer of already indexed documents

Please remember that the preceding examples of allowed and prohibited updates do not cover all of the possibilities of the Update API usage, and you have to try for yourself whether the update you are planning will work.

Note

If you want to ignore conflicts and just put the new mappings, you can set the ignore_conflicts parameter to true. This will cause Elasticsearch to overwrite your mappings with the one you send. So, our preceding command with the additional parameter would look as follows:

curl -XPUT 'http://localhost:9200/users/user/_mapping?ignore_conflicts=true' -d '...'

Summary

In this chapter, we learned how to index tree-like structures using Elasticsearch. In addition to that, we indexed data that is not flat and modified the structure of already-created indices. Finally, we learned how to handle relationships by using nested documents and by using the Elasticsearch parent-child functionality.

In the next chapter, we'll focus on making our search even better. We will see how Apache Lucene scoring works and why it matters so much. We will learn how to use the Elasticsearch function-score query to adjust the importance of our documents using different functions and we'll leverage the provided scripting capabilities. We will search the content in different languages and discuss when index time-boosting makes sense. We'll use synonyms to match words with the same meaning and we'll learn how to check why a given document was found by a query. Finally, we'll influence queries with boosts, and we will learn how to understand the score calculation done by Elasticsearch.