Low-level Index Control - Mastering Elasticsearch, Second Edition (2015)

Mastering Elasticsearch, Second Edition (2015)

Chapter 6. Low-level Index Control

In the previous chapter, we talked about general shards and the index architecture. We started by learning how to choose the right amount of shards and replicas, and we used routing during indexing and querying, and in conjunction with aliases. We also discussed shard allocation behavior adjustments, and finally, we looked at what query execution preference can bring us.

In this chapter, we will take a deeper dive into more low-level aspects of handling shards in Elasticsearch. By the end of this chapter, you will have learned:

· Altering the Apache Lucene scoring by using different similarity models

· Altering index writing by using codes

· Near real-time indexing and querying

· Data flushing, index refresh, and transaction log handling

· I/O throttling

· Segment merge control and visualization

· Elasticsearch caching

Altering Apache Lucene scoring

With the release of Apache Lucene 4.0 in 2012, all the users of this great full text search library were given the opportunity to alter the default TF/IDF-based algorithm. The Lucene API was changed to allow easier modification and extension of the scoring formula. However, this was not the only change that was made to Lucene when it comes to documents' score calculation. Lucene 4.0 was shipped with additional similarity models, which basically allows us to use a different scoring formula for our documents. In this section, we will take a deeper look at what Lucene 4.0 brings and how these features were incorporated into Elasticsearch.

Available similarity models

As already mentioned, the original and default similarity model available before Apache Lucene 4.0 was the TF/IDF model. We already discussed it in detail in the Default Apache Lucene scoring explained section in Chapter 2, Power User Query DSL.

The five new similarity models that we can use are:

· Okapi BM25: This similarity model is based on a probabilistic model that estimates the probability of finding a document for a given query. In order to use this similarity in Elasticsearch, you need to use the BM25 name. The Okapi BM25 similarity is said to perform best when dealing with short text documents where term repetitions are especially hurtful to the overall document score.

· Divergence from randomness (DFR): This similarity model is based on the probabilistic model of the same name. In order to use this similarity in Elasticsearch, you need to use the DFR name. It is said that the divergence from the randomness similarity model performs well on text similar to natural language text.

· Information-based: This is very similar to the model used by Divergence from randomness. In order to use this similarity in Elasticsearch, you need to use the IB name. Similar to the DFR similarity, it is said that the information-based model performs well on data similar to natural language text.

· LM Dirichlet: This similarity model uses Bayesian smoothing with Dirichlet priors. To use this similarity, we need to use the LMDirichlet name. More information about it can be found athttps://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/similarities/LMDirichletSimilarity.html.

· LM Jelinek Mercer: This similarity model is based on the Jelinek Mercer smoothing method. To use this similarity, we need to use the LMJelinekMercer name. More information about it can be found athttps://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/similarities/LMJelinekMercerSimilarity.html.

Note

All the mentioned similarity models require mathematical knowledge to fully understand them and a deep explanation of these models is far beyond the scope of this book. However, if you would like to explore these models and increase your knowledge about them, please go to http://en.wikipedia.org/wiki/Okapi_BM25 for the Okapi BM25 similarity and http://terrier.org/docs/v3.5/dfr_description.html for divergence from the randomness similarity.

Setting a per-field similarity

Since Elasticsearch 0.90, we are allowed to set a different similarity for each of the fields we have in our mappings. For example, let's assume that we have the following simple mappings that we use in order to index blog posts (stored in theposts_no_similarity.json file):

{

"mappings" : {

"post" : {

"properties" : {

"id" : { "type" : "long", "store" : "yes" },

"name" : { "type" : "string", "store" : "yes", "index" : "analyzed" },

"contents" : { "type" : "string", "store" : "no", "index" : "analyzed" }

}

}

}

}

What we would like to do is use the BM25 similarity model for the name field and the contents field. In order to do this, we need to extend our field definitions and add the similarity property with the value of the chosen similarity name. Our changed mappings (stored in the posts_similarity.json file) would look like this:

{

"mappings" : {

"post" : {

"properties" : {

"id" : { "type" : "long", "store" : "yes" },

"name" : { "type" : "string", "store" : "yes", "index" : "analyzed", "similarity" : "BM25" },

"contents" : { "type" : "string", "store" : "no", "index" : "analyzed", "similarity" : "BM25" }

}

}

}

}

That's all; nothing more is needed. After the preceding change, Apache Lucene will use the BM25 similarity to calculate the score factor for the name and contents fields.

Note

Please note that in the case of the Divergence from randomness and Information-based similarities, we need to configure some additional properties to specify these similarities' behavior. How to do that is covered in the next part of the current section.

Similarity model configuration

As we now know how to set the desired similarity for each field in our index, it's time to see how to configure them if we need them, which is actually pretty easy. What we need to do is use the index settings section to provide an additional similarity section, for example, like this (this example is stored in the posts_custom_similarity.json file):

{

"settings" : {

"index" : {

"similarity" : {

"mastering_similarity" : {

"type" : "default",

"discount_overlaps" : false

}

}

}

},

"mappings" : {

"post" : {

"properties" : {

"id" : { "type" : "long", "store" : "yes" },

"name" : { "type" : "string", "store" : "yes", "index" : "analyzed", "similarity" : "mastering_similarity" },

"contents" : { "type" : "string", "store" : "no", "index" : "analyzed" }

}

}

}

}

You can, of course, have more than one similarity configuration, but let's focus on the preceding example. We've defined a new similarity model named mastering_similarity, which is based on the default similarity, which is the TF/IDF one. We've set thediscount_overlaps property to false for this similarity, and we've used it as the similarity for the name field. We'll talk about what properties can be used for different similarities further in this section. Now, let's see how to change the default similarity model Elasticsearch will use.

Choosing the default similarity model

In order to change the similarity model used by default, we need to provide a configuration of a similarity model that will be called default. For example, if we would like to use our mastering_similarity "name" as the default one, we would have to change the preceding configuration to the following one (the whole example is stored in the posts_default_similarity.json file):

{

"settings" : {

"index" : {

"similarity" : {

"default" : {

"type" : "default",

"discount_overlaps" : false

}

}

}

},

...

}

Because of the fact that the query norm and coordination factors (which were explained in the Default Apache Lucene scoring explained section in Chapter 2, Power User Query DSL) are used by all similarity models globally and are taken from the default similarity, Elasticsearch allows us to change them when needed. To do this, we need to define another similarity—one called base. It is defined exactly the same as what we've shown previously, but instead of setting its name to default, we set it to base, just like this (the whole example is stored in the posts_base_similarity.json file):

{

"settings" : {

"index" : {

"similarity" : {

"base" : {

"type" : "default",

"discount_overlaps" : false

}

}

}

},

...

}

If the base similarity is present in the index configuration, Elasticsearch will use it to calculate the query norm and coord factors when calculating the score using other similarity models.

Configuring the chosen similarity model

Each of the newly introduced similarity models can be configured to match our needs. Elasticsearch allows us to use the default and BM25 similarities without any configuration, because they are preconfigured for us. In the case of DFR and IB, we need to provide the configuration in order to use them. Let's now see what properties each of the similarity models' implementation provides.

Configuring the TF/IDF similarity

In the case of the TF/IDF similarity, we are allowed to set only a single parameter—discount_overlaps, which defaults to true. By default, the tokens that have their position increment set to 0 (and therefore, are placed at the same position as the one before them) will not be taken into consideration when calculating the score. If we want them to be taken into consideration, we need to configure the similarity with the discount_overlaps property set to false.

Configuring the Okapi BM25 similarity

In the case of the Okapi BM25 similarity, we have these parameters: we can configure k1 (controls the saturation—nonlinear term frequency normalization) as a float value, b (controls how the document length affects the term frequency values) as a float value, anddiscount_overlaps, which is exactly the same as in TF/IDF similarity.

Configuring the DFR similarity

In the case of the DFR similarity, we have these parameters that we can configure: basic_model (which can take the value be, d, g, if, in, or ine), after_effect (with values of no, b, and l), and the normalization (which can be no, h1, h2, h3, or z). If we choose a normalization other than no, we need to set the normalization factor. Depending on the chosen normalization, we should use normalization.h1.c (the float value) for the h1 normalization, normalization.h2.c (the float value) for the h2 normalization, normalization.h3.c(the float value) for the h3 normalization, and normalization.z.z (the float value) for the z normalization. For example, this is what the example similarity configuration could look like:

"similarity" : {

"esserverbook_dfr_similarity" : {

"type" : "DFR",

"basic_model" : "g",

"after_effect" : "l",

"normalization" : "h2",

"normalization.h2.c" : "2.0"

}

}

Configuring the IB similarity

In the case of the IB similarity, we have these parameters that we can configure: the distribution property (which can take the value of ll or spl) and the lambda property (which can take the value of df or tff). In addition to this, we can choose the normalization factor, which is the same as the one used for the DFR similarity, so we'll omit describing it for the second time. This is what the example IB similarity configuration could look like:

"similarity" : {

"esserverbook_ib_similarity" : {

"type" : "IB",

"distribution" : "ll",

"lambda" : "df",

"normalization" : "z",

"normalization.z.z" : "0.25"

}

}

Configuring the LM Dirichlet similarity

In the case of the LM Dirichlet similarity, we have the mu property that we can configure the mu property, which is by default set to 2000. An example configuration of this could look as follows:

"similarity" : {

"esserverbook_lm_dirichlet_similarity" : {

"type" : "LMDirichlet",

"mu" : "1000"

}

}

Configuring the LM Jelinek Mercer similarity

When it comes to the LM Jelinek Mercer similarity, we can configure the lambda property, which is set to 0.1 by default. An example configuration of this could look as follows:

"similarity" : {

"esserverbook_lm_jelinek_mercer_similarity" : {

"type" : "LMJelinekMercer",

"lambda" : "0.7"

}

}

Note

It is said that for short fields (like the document title) the optimal lambda value is around 0.1, while for long fields the lambda should be set to 0.7.

Choosing the right directory implementation – the store module

The store module is one of the modules that we usually don't pay much attention to when configuring our cluster; however, it is very important. It is an abstraction between the I/O subsystem and Apache Lucene itself. All the operation that Lucene does with the hard disk drive is done using the store module. Most of the store types in Elasticsearch are mapped to an appropriate Apache Lucene Directory class (http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/store/Directory.html). The directory is used to access all the files the index is built of, so it is crucial to properly configure it.

The store type

Elasticsearch exposes five store types that we can use. Let's see what they provide and how we can leverage their features.

The simple filesystem store

The simplest implementation of the Directory class that is available is implemented using a random access file (Java RandomAccessFile—http://docs.oracle.com/javase/7/docs/api/java/io/RandomAccessFile.html) and maps to SimpleFSDirectory(http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/store/SimpleFSDirectory.html) in Apache Lucene. It is sufficient for very simple applications. However, the main bottleneck will be multithreaded access, which has poor performance. In the case of Elasticsearch, it is usually better to use the new I/O-based system store instead of the Simple filesystem store. However, if you would like to use this system store, you should set index.store.type to simplefs.

The new I/O filesystem store

This store type uses the Directory class implementation based on the FileChannel class (http://docs.oracle.com/javase/7/docs/api/java/nio/channels/FileChannel.html) from java.nio package and maps to NIOFSDirectory in Apache Lucene (http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/store/NIOFSDirectory.html). The discussed implementation allows multiple threads to access the same files concurrently without performance degradation. In order to use this store, one should setindex.store.type to niofs.

Note

Please remember that because of some bugs that exist in the JVM machine for Microsoft Windows, it is very probable that the new I/O filesystem store will suffer from performance problems when running on Microsoft Windows. More information about this bug can be found at http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6265734.

The MMap filesystem store

This store type uses Apache Lucene's MMapDirectory (http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/store/MMapDirectory.html) implementation. It uses the mmap system call (http://en.wikipedia.org/wiki/Mmap) for reading, and it uses random access files for writing. It uses a portion of the available virtual memory address space in the process equal to the size of the file being mapped. It doesn't have any locking, so it is scalable when it comes to multithread access. When using mmap to read index files for the operating system, it looks like it is already cached (it was mapped to the virtual space). Because of this, when reading a file from the Apache Lucene index, this file doesn't need to be loaded into the operating system cache and thus, the access is faster. This basically allows Lucene and thus Elasticsearch to directly access the I/O cache, which should result in fast access to index files.

It is worth noting that the MMap filesystem store works best on 64-bit environments and should only be used on 32-bit machines when you are sure that the index is small enough and the virtual address space is sufficient. In order to use this store, one should setindex.store.type to mmapfs.

The hybrid filesystem store

Introduced in Elasticsearch 1.3.0, the hybrid file store uses both NIO and MMap access depending on the file type. A the time of writing this, only term dictionary and doc values were read and written using MMap, and all the other files of the index were opened using NIOFSDirectory. In order to use this store, one should set index.store.type to default.

The memory store

This is the second store type that is not based on the Apache Lucene Directory (the first one is the hybrid filesystem store). The memory store allows us to store all the index files in the memory, so the files are not stored on the disk. This is crucial, because it means that the index data is not persistent—it will be removed whenever a full cluster restart will happen. However, if you need a small, very fast index that can have multiple shards and replicas and can be rebuilt very fast, the memory store type may be the thing you are looking for. In order to use this store, one should set index.store.type to memory.

Note

The data stored in the memory store, like all the other stores, is replicated among all the nodes that can hold data.

Additional properties

When using the memory store type, we also have some degree of control over the caches, which are very important when using the memory store. Please remember that all the following settings are set per node:

· cache.memory.direct: This defaults to true and specifies whether the memory store should be allocated outside of the JVM heap memory. It is usually a good idea to leave it to the default value so that the heap is not overloaded with data.

· cache.memory.small_buffer_size: This defaults to 1kb and defines a small buffer size—the internal memory structure used to hold segments' information and deleted documents' information.

· cache.memory.large_buffer_size: This defaults to 1mb and defines a large buffer size—the internal memory structure used to hold index files other than segments' information and deleted documents.

· cache.memory.small_cache_size: The objects' small cache size—the internal memory structure used for the caching of index segments' information and deleted documents' information. It defaults to 10mb.

· cache.memory.large_cache_size: The objects' large cache size—the internal memory structure used to cache information about the index other than the index segments' information and deleted documents' information. It defaults to 500mb.

The default store type

There are differences when it comes to the default store of Elasticsearch 1.3.0 and the newer and older versions.

The default store type for Elasticsearch 1.3.0 and higher

Starting from Elasticsearch 1.3.0, the new default Elasticsearch store type is the hybrid one that we can choose by setting index.store.type to default.

The default store type for Elasticsearch versions older than 1.3.0

By default, Elasticsearch versions older than 1.3.0 use filesystem-based storage. However different store types are chosen for different operating systems. For example, for a 32-bit Microsoft Windows system, the simplefs type will be used; mmapfs will be used when Elasticsearch is running on Solaris and Microsoft Windows 64 bit, and niofs will be used for the rest of the world.

Note

If you are looking for some information from experts on how they see which Directory implementation to use, please look at the http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html post written by Uwe Schindler andhttp://jprante.github.io/lessons/2012/07/26/Mmap-with-Lucene.html by Jörg Prante.

Usually, the default store type will be the one that you want to use. However, sometimes, it is worth considering using the MMap file system store type, especially when you have plenty of memory and your indices are big. This is because when using mmap to access the index file, it will cause the index files to be cached only once and be reused both by Apache Lucene and the operating system.

NRT, flush, refresh, and transaction log

In an ideal search solution, when new data is indexed, it is instantly available for searching. When you start Elasticsearch, this is exactly how it works even in distributed environments. However, this is not the whole truth, and we will show you why it is like this.

Let's start by indexing an example document to the newly created index using the following command:

curl -XPOST localhost:9200/test/test/1 -d '{ "title": "test" }'

Now, let's replace this document, and let's try to find it immediately. In order to do this, we'll use the following command chain:

curl -XPOST localhost:9200/test/test/1 -d '{ "title": "test2" }' ; curl -XGET 'localhost:9200/test/test/_search?pretty'

The preceding command will probably result in a response that is very similar to the following one:

{"_index":"test","_type":"test","_id":"1","_version":2,"created":false}{

"took" : 1,

"timed_out" : false,

"_shards" : {

"total" : 5,

"successful" : 5,

"failed" : 0

},

"hits" : {

"total" : 1,

"max_score" : 1.0,

"hits" : [ {

"_index" : "test",

"_type" : "test",

"_id" : "1",

"_score" : 1.0,

"_source":{ "title": "test" }

} ]

}

}

We see two responses glued together. The first line starts with a response to the indexing command—the first command we've sent. As you can see, everything is correct—we've updated the document (look at _version). With the second command, our search query should return the document with the title field set to test2; however, as you can see, it returned the first document. What happened? Before we give you the answer to this question, we will take a step back and discuss how the underlying Apache Lucene library makes the newly indexed documents available for searching.

Updating the index and committing changes

As we already know from the Introducing Apache Lucene section in Chapter 1, Introduction to Elasticsearch, during the indexing process, new documents are written into segments. The segments are independent indices, which means that queries that are run in parallel to indexing should add newly created segments from time to time to the set of these segments that are used for searching. Apache Lucene does this by creating subsequent (because of the write-once nature of the index) segments_N files, which list segments in the index. This process is called committing. Lucene can do this in a secure way—we are sure that all changes or none of them hit the index. If a failure happens, we can be sure that the index will be in a consistent state.

Let's return to our example. The first operation adds the document to the index but doesn't run the commit command to Lucene. This is exactly how it works. However, a commit is not enough for the data to be available for searching. The Lucene library uses an abstraction class called Searcher to access the index, and this class needs to be refreshed.

After a commit operation, the Searcher object should be reopened in order for it to be able to see the newly created segments. This whole process is called refresh. For performance reasons, Elasticsearch tries to postpone costly refreshes and, by default, refresh is not performed after indexing a single document (or a batch of them), but the Searcher is refreshed every second. This happens quite often, but sometimes, applications require the refresh operation to be performed more often than once every second. When this happens, you can consider using another technology, or the requirements should be verified. If required, there is a possibility of forcing the refresh by using the Elasticsearch API. For example, in our example, we can add the following command:

curl -XGET localhost:9200/test/_refresh

If we add the preceding command before the search, Elasticsearch would respond as we had expected.

Changing the default refresh time

The time between automatic Searcher refresh operations can be changed by using the index.refresh_interval parameter either in the Elasticsearch configuration file or by using the Update Settings API, for example:

curl -XPUT localhost:9200/test/_settings -d '{

"index" : {

"refresh_interval" : "5m"

}

}'

The preceding command will change the automatic refresh to be performed every 5 minutes. Please remember that the data that is indexed between refreshes won't be visible by queries.

Note

As we said, the refresh operation is costly when it comes to resources. The longer the period of the refresh, the faster your indexing will be. If you are planning for a very high indexing procedure when you don't need your data to be visible until the indexing ends, you can consider disabling the refresh operation by setting the index.refresh_interval parameter to -1 and setting it back to its original value after the indexing is done.

The transaction log

Apache Lucene can guarantee index consistency and all or nothing indexing, which is great. However, this fact cannot ensure us that there will be no data loss when failure happens while writing data to the index (for example, when there isn't enough space on the device, the device is faulty, or there aren't enough file handlers available to create new index files). Another problem is that frequent commit is costly in terms of performance (as you may recall, a single commit will trigger a new segment creation, and this can trigger the segments to merge). Elasticsearch solves these issues by implementing the transaction log. The transaction log holds all uncommitted transactions and, from time to time, Elasticsearch creates a new log for subsequent changes. When something goes wrong, the transaction log can be replayed to make sure that none of the changes were lost. All of these tasks are happening automatically, so the user may not be aware of the fact that the commit was triggered at a particular moment. In Elasticsearch, the moment where the information from the transaction log is synchronized with the storage (which is the Apache Lucene index) and the transaction log is cleared is called flushing.

Note

Please note the difference between flush and refresh operations. In most of the cases, refresh is exactly what you want. It is all about making new data available for searching. On the other hand, the flush operation is used to make sure that all the data is correctly stored in the index and the transaction log can be cleared.

In addition to automatic flushing, it can be forced manually using the flush API. For example, we can run a command to flush all the data stored in the transaction log for all indices by running the following command:

curl -XGET localhost:9200/_flush

Or, we can run the flush command for the particular index, which in our case is the one called library:

curl -XGET localhost:9200/library/_flush

curl -XGET localhost:9200/library/_refresh

In the second example, we used it together with the refresh, which after flushing the data, opens a new searcher.

The transaction log configuration

If the default behavior of the transaction log is not enough, Elasticsearch allows us to configure its behavior when it comes to the transaction log handling. The following parameters can be set in the elasticsearch.yml file as well as using index settings' Update API to control the transaction log 'margin-left:18.0pt;text-indent:-18.0pt;line-height: normal'>· index.translog.flush_threshold_period: This defaults to 30 minutes (30m). It controls the time after which the flush will be forced automatically even if no new data was being written to it. In some cases, this can cause a lot of I/O operation, so sometimes it's better to perform the flush more often with less data stored in it.

· index.translog.flush_threshold_ops: This specifies the maximum number of operations after which the flush operation will be performed. By default, Elasticsearch does not limit these operations.

· index.translog.flush_threshold_size: This specifies the maximum size of the transaction log. If the size of the transaction log is equal to or greater than the parameter, the flush operation will be performed. It defaults to 200 MB.

· index.translog.interval: This defaults to 5s and describes the period between consecutive checks if the flush is needed. Elasticsearch randomizes this value to be greater than the defined one and less than double of it.

· index.gateway.local.sync: This defines how often the transaction log should be sent to the disk using the fsync system call. The default is 5s.

· index.translog.disable_flush: This option allows us to disable the automatic flush. By default, flushing is enabled, but sometimes, it is handy to disable it temporarily, for example, during the import of a large amount of documents.

Note

All of the mentioned parameters are specified for an index of our choice, but they define the behavior of the transaction log for each of the index shards.

In addition to setting the previously mentioned properties in the elasticsearch.yml file, we can also set them by using the Settings Update API. For example, the following command will result in disabling flushing for the test index:

curl -XPUT localhost:9200/test/_settings -d '{

"index" : {

"translog.disable_flush" : true

}

}'

The previous command was run before the import of a large amount of data, which gave us a performance boost for indexing. However, one should remember to turn on flushing when the import is done.

Near real-time GET

Transaction logs give us one more feature for free, that is, the real-time GET operation, which provides us with the possibility of returning the previous version of the document, including noncommitted versions. The real-time GET operation fetches data from the index, but first, it checks whether a newer version of this document is available in the transaction log. If there is no flushed document, the data from the index is ignored and a newer version of the document is returned—the one from the transaction log.

In order to see how it works, you can replace the search operation in our example with the following command:

curl -XGET localhost:9200/test/test/1?pretty

Elasticsearch should return a result similar to the following:

{

"_index" : "test",

"_type" : "test",

"_id" : "1",

"_version" : 2,

"exists" : true, "_source" : { "title": "test2" }

}

If you look at the result, you would see that, again, the result was just as we expected and no trick with refresh was required to obtain the newest version of the document.

Segment merging under control

As you already know (we've discussed it throughout Chapter 1, Introduction to Elasticsearch), every Elasticsearch index is built out of one or more shards and can have zero or more replicas. You also know that each of the shards and replicas are actual Apache Lucene indices that are built of multiple segments (at least one segment). If you recall, the segments are written once and read many times, and data structures, apart from the information about the deleted documents that are held in one of the files, can be changed. After some time, when certain conditions are met, the contents of some segments can be copied to a bigger segment, and the original segments are discarded and thus deleted from the disk. Such an operation is called segment merging.

You may ask yourself, why bother about segment merging? There are a few reasons. First of all, the more segments the index is built of, the slower the search will be and the more memory Lucene will need. In addition to this, segments are immutable, so the information is not deleted from it. If you happen to delete many documents from your index, until the merge happens, these documents are only marked as deleted and are not deleted physically. So, when segment merging happens, the documents that are marked as deleted are not written into the new segment, and this way, they are removed, which decreases the final segment size.

Note

Many small changes can result in a large number of small segments, which can lead to problems with a large number of opened files. We should always be prepared to handle such situations, for example, by having the appropriate opened files' limit set.

So, just to quickly summarize, segments merging takes place and from the user's point of view, it will result in two effects:

· It will reduce the number of segments in order to allow faster searching when a few segments are merged into a single one

· It will reduce the size of the index because of removing the deleted documents when the merge is finalized

However, you have to remember that segment merging comes with a price: the price of I/O operations, which can affect performance on slower systems. Because of this, Elasticsearch allows us to choose the merge policy and the store level throttling.

Choosing the right merge policy

Although segment merging is Apache Lucene's duty, Elasticsearch allows us to configure which merge policy we would like to use. There are three policies that we are currently allowed to use:

· tiered (the default one)

· log_byte_size

· log_doc

Each of the preceding mentioned policies have their own parameters, which define their behavior and the default values that we can override (please look at the section dedicated to the policy of your choice to see what those parameters are).

In order to tell Elasticsearch which merge policy we want to use, we should set index.merge.policy.type to the desired type, shown as follows:

index.merge.policy.type: tiered

Note

Once the index is created with the specified merge policy type, it can't be changed. However, all the properties defining the merge policy behavior can be changed using the Index Update API.

Let's now look at the different merge policies and the functionality that they provide. After this, we will discuss all the configuration options provided by the policies.

The tiered merge policy

The tiered merge policy is the default merge policy that Elasticsearch uses. It merges segments of approximately similar size, taking into account the maximum number of segments allowed per tier. It is also possible to differentiate the number of segments that are allowed to be merged at once from how many segments are allowed to be present per tier. During indexing, this merge policy will compute how many segments are allowed to be present in the index, which is called budget. If the number of segments the index is built of is higher than the computed budget, the tiered policy will first sort the segments by the decreasing order of their size (taking into account the deleted documents). After that, it will find the merge that has the lowest cost. The merge cost is calculated in a way that merges are reclaiming more deletions, and having a smaller size is favored.

If the merge produces a segment that is larger than the value specified by the index.merge.policy.max_merged_segment property, the policy will merge fewer segments to keep the segment size under the budget. This means that for indices that have large shards, the default value of the index.merge.policy.max_merged_segment property may be too low and will result in the creation of many segments, slowing down your queries. Depending on the volume of your data, you should monitor your segments and adjust the merge policy setting to match your needs.

The log byte size merge policy

The log byte size merge policy is a merge policy, which over time, will produce an index that will be built of a logarithmic size of indices. There will be a few large segments, then there will be a few merge factor smaller segments, and so on. You can imagine that there will be a few segments of the same level of size when the number of segments will be lower than the merge factor. When an extra segment is encountered, all the segments within that level are merged. The number of segments an index will contain is proportional to the logarithm of the next size in bytes. This merge policy is generally able to keep the low number of segments in your index while minimizing the cost of segments merging.

The log doc merge policy

The log doc merge policy is similar to the log_byte_size merge policy, but instead of operating on the actual segment size in bytes, it operates on the number of documents in the index. This merge policy will perform well when the documents are similar in terms of size or if you want segments of similar sizes in terms of the number of documents.

Merge policies' configuration

We now know how merge policies work, but we lack the knowledge about the configuration options. So now, let's discuss each of the merge policies and see what options are exposed to us. Please remember that the default values will usually be OK for most of the deployments and they should be changed only when needed.

The tiered merge policy

When using the tiered merge policy, the following options can be altered:

· index.merge.policy.expunge_deletes_allowed: This defaults to 10 and specifies the percentage of deleted documents in a segment in order for it to be considered to be merged when running expungeDeletes.

· index.merge.policy.floor_segment: This is a property that enables us to prevent the frequent flushing of very small segments. Segments smaller than the size defined by this property are treated by the merge mechanism, as they would have the size equal to the value of this property. It defaults to 2MB.

· index.merge.policy.max_merge_at_once: This specifies the maximum number of segments that will be merged at the same time during indexing. By default, it is set to 10. Setting the value of this property to higher values can result in multiple segments being merged at once, which will need more I/O resources.

· index.merge.policy.max_merge_at_once_explicit: This specifies the maximum number of segments that will be merged at the same time during the optimize operation or expungeDeletes. By default, this is set to 30. This setting will not affect the maximum number of segments that will be merged during indexing.

· index.merge.policy.max_merged_segment: This defaults to 5GB and specifies the maximum size of a single segment that will be produced during segment merging when indexing. This setting is an approximate value, because the merged segment size is calculated by summing the size of segments that are going to be merged minus the size of the deleted documents in these segments.

· index.merge.policy.segments_per_tier: This specifies the allowed number of segments per tier. Smaller values of this property result in less segments, which means more merging and lower indexing performance. It defaults to 10 and should be set to a value higher than or equal to index.merge.policy.max_merge_at_once, or you'll be facing too many merges and performance issues.

· index.reclaim_deletes_weight: This defaults to 2.0 and specifies how many merges that reclaim deletes are favored. When setting this value to 0.0, the reclaim deletes will not affect the merge selection. The higher the value, the more favored the merge that reclaims deletes will be.

· index.compund_format: This is a Boolean value that specifies whether the index should be stored in a compound format or not. It defaults to false. If set to true, Lucene will store all the files that build the index in a single file. Sometimes, this is useful for systems running constantly out of file handlers, but it will decrease the searching and indexing performance.

The log byte size merge policy

When using the log_byte_size merge policy, the following options can be used to configure its 'margin-left:18.0pt;text-indent:-18.0pt;line-height: normal'>· merge_factor: This specifies how often segments are merged during indexing. With a smaller merge_factor value, the searches are faster and less memory is used, but this comes with the cost of slower indexing. With larger merge_factor values, it is the opposite—the indexing is faster (because of less merging being done), but the searches are slower and more memory is used. By default, merge_factor is given the value of 10. It is advised to use larger values of merge_factor for batch indexing and lower values of this parameter for normal index maintenance.

· min_merge_size: This defines the size (the total size of the segment files in bytes) of the smallest segment possible. If a segment is lower in size than the number specified by this property, it will be merged if the merge_factor property allows us to do that. This property defaults to 1.6MB and is very useful in order to avoid having many very small segments. However, one should remember that setting this property to a large value will increase the merging cost.

· max_merge_size: This defines the maximum size (the total size of the segment files in bytes) of the segment that can be merged with other segments. By default, it is not set, so there is no limit on the maximum size a segment can be in order to be merged.

· maxMergeDocs: This defines the maximum number of documents a segment can have in order to be merged with other segments. By default, it is not set, so there is no limit to the maximum number of documents a segment can have.

· calibrate_size_by_deletes: This is a Boolean value, which is set to true and specifies whether the size of the deleted documents should be taken into consideration when calculating the segment size.

The mentioned properties we just saw should be prefixed with the index.merge.policy prefix. So if we would like to set the min_merge_docs property, we should use the index.merge.policy.min_merge_docs property.

In addition to this, the log_byte_size merge policy accepts the index.merge.async and index.merge.async_interval properties just like the tiered merge policy does.

The log doc merge policy

When using the log_doc merge policy, the following options can be used to configure its 'margin-left:18.0pt;text-indent:-18.0pt;line-height: normal'>· merge_factor: This is same as the property that is present in the log_byte_size merge policy, so please refer to this policy for the explanation.

· min_merge_docs: This defines the minimum number of documents for the smallest segment. If a segment contains a lower document count than the number specified by this property, it will be merged if the merge_factor property allows this. This property defaults to 1000 and is very useful in order to avoid having many very small segments. However, one should remember that setting this property to a large value will increase the merging cost.

· max_merge_docs: This defines the maximum number of documents a segment can have in order to be merged with other segments. By default, it is not set, so there is no limit to the maximum number of documents a segment can have.

· calibrate_size_by_deletes: This is a Boolean value that defaults to true and specifies whether the size of deleted documents should be taken into consideration when calculating the segment size.

Similar to the previous merge policy, the previously mentioned properties should be prefixed with the index.merge.policy prefix. So if we would like to set the min_merge_docs property, we should use the index.merge.policy.min_merge_docs property.

Scheduling

In addition to having control over how the merge policy is behaving, Elasticsearch allows us to define the execution of the merge policy once a merge is needed. There are two merge schedulers available, with the default being ConcurrentMergeScheduler.

The concurrent merge scheduler

This is a merge scheduler that will use multiple threads in order to perform segments' merging. This scheduler will create a new thread until the maximum number of threads is reached. If the maximum number of threads is reached and a new thread is needed (because segments' merge needs to be performed), all the indexing will be paused until at least one merge is completed.

In order to control the maximum threads allowed, we can alter the index.merge.scheduler.max_thread_count property. By default, it is set to the value calculated by the following equation:

maximum_value(1, minimum_value(3, available_processors / 2)

So, if our system has eight processors available, the maximum number of threads that the concurrent merge scheduler is allowed to use will be equal to four.

You should also remember that this is especially not good for spinning disks. You want to be sure that merging won't saturate your disks' throughput. Because of this, if you see extensive merging, you should lower the number of merging threads. It is usually said that for spinning disks, the number of threads used by the concurrent merge scheduler should be set to 1.

The serial merge scheduler

A simple merge scheduler uses the same thread for merging. It results in a merge that stops all the other document processing that was happening on the same thread, which in this case, means the stopping of indexing. This merge scheduler is only provided for backwards compatibility and, in fact, uses the concurrent merge scheduler with the number of threads equal to one.

Setting the desired merge scheduler

In order to set the desired merge scheduler, one should set the index.merge.scheduler.type property to the value of concurrent or serial. For example, in order to use the concurrent merge scheduler, one should set the following property:

index.merge.scheduler.type: concurrent

In order to use the serial merge scheduler, one should set the following property:

index.merge.scheduler.type: serial

Note

When talking about the merge policy and merge schedulers, it would be nice to visualize them. If one needs to see how the merges are done in the underlying Apache Lucene library, we suggest that you visit Mike McCandless' blog post athttp://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html.

In addition to this, there is a plugin that allows us to see what is happening to the segments called SegmentSpy. Refer to the following URL for more information:

https://github.com/polyfractal/elasticsearch-segmentspy

When it is too much for I/O – throttling explained

In the Choosing the right directory implementation section, we've talked about the store type, which means we are now able to configure the store module to match our needs. However, we didn't write everything about the store module—we didn't write about throttling.

Controlling I/O throttling

As you remember from the Segment merging under control section, Apache Lucene stores the data in immutable segment files that can be read many times but can be written only once. The merge process is asynchronous and, in general, it should not interfere with indexing and searching, looking from the Lucene point of view. However, problems may occur because merging is expensive when it comes to I/O—it requires you to read the segments that are going to be merged and write new ones. If searching and indexing happen concurrently, this can be too much for the I/O subsystem, especially on systems with low I/O. This is where throttling kicks in—we can control how much I/O Elasticsearch will use.

Configuration

Throttling can be configured both on a node-level and on the index-level, so you can either configure how many resources a node will use or how many resources will be used for the index.

The throttling type

In order to configure the throttling type on the node-level, one should use the indices.store.throttle.type property, which can take the value of none, merge, and all. The none value will tell Elasticsearch that no limiting should take place. The merge value tells Elasticsearch that we want to limit the I/O usage for the merging of nodes (and it is the default value) and the all value specifies that we want to limit all store module-based operations.

In order to configure the throttling type on the index-level, one should use the index.store.throttle.type property, which can take the same values as the indices.store.throttle.type property with an additional one— node. The node value will tell Elasticsearch that instead of using per-index throttling limiting, we will use the node-level configuration. This is the default value.

Maximum throughput per second

In both cases, when using index or node-level throttling, we are able to set the maximum bytes per second that I/O can use. For the value of this property, we can use 10mb, 500mb, or anything that we need. For the index-level configuration, we should use theindex.store.throttle.max_bytes_per_sec property and for the node-level configuration, we should use indices.store.throttle.max_bytes_per_sec.

Note

The previously mentioned properties can be set both in the elasticsearch.yml file and can also be updated dynamically using the cluster update settings for the node-level configuration and using the index update settings for the index-level configuration.

Node throttling defaults

On the node-level, since Elasticsearch 0.90.1, throttling is enabled by default. The indices.store.throttle.type property is set to merge and the indices.store.throttle.max_bytes_per_sec property is set to 20mb. Elasticsearch versions before 0.90.1 don't have throttling enabled by default.

Performance considerations

When using SSD (solid state drives) or when query speed matters only a little (or you are not searching when you index your data), it is worth considering disabling throttling completely. We can do this by setting the indices.store.throttle.type property to none. This causes Elasticsearch to not use any store-level throttling and use full disk throughput for store-based operations.

The configuration example

Now, let's imagine that we have a cluster that consists of four Elasticsearch nodes and we want to configure throttling for the whole cluster. By default, we want the merge operation not to process more than 50 megabytes per second for a node. We know that we can handle such operations without affecting the search performance, and this is what we are aiming at. In order to achieve this, we would run the following request:

curl -XPUT 'localhost:9200/_cluster/settings' -d '{

"persistent" : {

"indices.store.throttle.type" : "merge",

"indices.store.throttle.max_bytes_per_sec" : "50mb"

}

}'

In addition to this, we have a single index called payments that is very rarely used, and we've placed it in the smallest machine in the cluster. This index doesn't have replicas and is built of a single shard. What we would like to do for this index is limit the merges to process a maximum of 10 megabytes per second. So, in addition to the preceding command, we would run one like this:

curl -XPUT 'localhost:9200/payments/_settings' -d '{

"index.store.throttle.type" : "merge",

"index.store.throttle.max_bytes_per_sec" : "10mb"

}'

After running the preceding commands, we can check our index settings by running the following command:

curl -XGET 'localhost:9200/payments/_settings?pretty'

In response, we should get the following JSON:

{

"payments" : {

"settings" : {

"index" : {

"creation_date" : "1414072648520",

"store" : {

"throttle" : {

"type" : "merge",

"max_bytes_per_sec" : "10mb"

}

},

"number_of_shards" : "5",

"number_of_replicas" : "1",

"version" : {

"created" : "1040001"

},

"uuid" : "M3lePTOvSN2jnDz1J0t4Uw"

}

}

}

}

As you can see, after updating the index setting, closing the index, and opening it again, we've finally got our settings working.

Understanding Elasticsearch caching

One of the very important parts of Elasticsearch, although not always visible to the users, is caching. It allows Elasticsearch to store commonly used data in memory and reuse it on demand. Of course, we can't cache everything—we usually have way more data than we have memory, and creating caches may be quite expensive when it comes to performance. In this chapter, we will look at the different caches exposed by Elasticsearch, and we will discuss how they are used and how we can control their usage. Hopefully, such information will allow you to better understand how this great search server works internally.

The filter cache

The filter cache is the simplest of all the caches available in Elasticsearch. It is used during query time to cache the results of filters that are used in queries. We already talked about filters in section Handling filters and why it matters of Chapter 2, Power User Query DSL, but let's look at a simple example. Let's assume that we have the following query:

{

"query" : {

"filtered" : {

"query" : {

"match_all" : {}

},

"filter" : {

"term" : {

"category" : "romance"

}

}

}

}

}

The preceding query will return all the documents that have the romance term in the category field. As you can see, we've used the match_all query combined with a filter. Now, after the initial query, every query with the same filter present in it will reuse the results of our filter and save the precious I/O and CPU resources.

Filter cache types

There are two types of filter caches available in Elasticsearch: node-level and index-level filter caches. This gives us the possibility of choosing the filter cache to be dependent on the index or on a node (which is the default behavior). As we can't always predict where the given index will be allocated (actually, its shards and replicas), it is not recommended that you use the index-level filter cache because we can't predict the memory usage in such cases.

Node-level filter cache configuration

The default and recommended filter cache type is configured for all shards allocated to a given node (set using the index.cache.filter.type property to the node value or not setting that property at all). Elasticsearch allows us to use the indices.cache.filter.sizeproperty to configure the size of this cache. We can either use a percentage value as 10% (which is the default value), or a static memory value as 1024mb. If we use the percentage value, Elasticsearch will calculate it as a percentage of the maximum heap memory given to a node.

The node-level filter cache is a Least Recently Used cache type (LRU), which means that while removing cache entries, the ones that were used the least number of times will be thrown away in order to make place for the newer entries.

Index-level filter cache configuration

The second type of filter cache that Elasticsearch allows us to use is the index-level filter cache. We can configure its behavior by configuring the following properties:

· index.cache.filter.type: This property sets the type of the cache, which can take the values of resident, soft, weak, and node (the default one). By using this property, Elasticsearch allows us to choose the implementation of the cache. The entries in the residentcache can't be removed by JVM unless we want them to be removed (either by using the API or by setting the maximum size or expiration time) and is basically recommended because of this (filling up the filter cache can be expensive). The soft and weak filter cache types can be cleared by JVM when it lacks memory, with the difference that when clearing up memory, JVM will choose the weaker reference objects first and then choose the one that uses the soft reference. The node value tells Elasticsearch to use the node-level filter cache.

· index.cache.filter.max_size: This property specifies the maximum number of cache entries that can be stored in the filter cache (the default is -1, which means unbounded). You need to remember that this setting is not applied for the whole index but for a single segment of a shard for the index, so the memory usage will differ depending on how many shards (and replicas) there are (for the given index) and how many segments the index contains. Generally, the default, unbounded filter cache is fine with thesoft type and the proper queries that are paying attention in order to make the caches reusable.

· index.cache.filter.expire: This property specifies the expiration time of an entry in the filter cache, which is unbounded (set to -1) by default. If we want our filter cache to expire if not accessed, we can set the maximum time of inactivity. For example, if we would like our cache to expire after 60 minutes, we should set this property to 60m.

Note

If you want to read more about the soft and weak references in Java, please refer to the Java documentation, especially the Javadocs, for these two types: http://docs.oracle.com/javase/8/docs/api/java/lang/ref/SoftReference.html andhttp://docs.oracle.com/javase/8/docs/api/java/lang/ref/WeakReference.html.

The field data cache

The field data cache is used when we want to send queries that involve operations that work on uninverted data. What Elasticsearch needs to do is load all the values for a given field and store that in the memory—you can call this field data cache. This cache is used by Elasticsearch when we use faceting, aggregations, scripting, or sorting on the field value. When first executing an operation that requires data uninverting, Elasticsearch loads all the data for that field into the memory. Yes, that's right; all the data from a given field is loaded into the memory by default and is never removed from it. Elasticsearch does this to be able to provide fast document-based access to values in a field. Remember that the field data cache is usually expensive to build from the hardware resource's point of view, because the data for the whole field needs to be loaded into the memory, and this requires both I/O operations and CPU resources.

Note

One should remember that for every field that we sort on or use faceting on, the data needs to be loaded into the memory each and every term. This can be expensive, especially for the fields that are high cardinality ones: the ones with numerous different terms in them.

Field data or doc values

Lucene doc values and their implementation in Elasticsearch is getting better and better with each release. With the release of Elasticsearch 1.4.0, they are almost, or as fast as, the field data cache. The thing is that doc values are calculated during indexing time and are stored on the disk along with the index, and they don't require as much memory as the field data cache. In fact, they require very little heap space and are almost as fast as the field data cache. If you are using operations that require large amounts of field data cache, you should consider using doc values for such fields. You only need to add the doc_values property and set it to true for such fields, and Elasticsearch will do the rest.

Note

At the time of writing this, Elasticsearch does not allow using doc values on analyzed string fields. You can use doc values with all the other field types.

For example, if we would like to set our year field to use doc values, we would change its configuration to the following one:

"year" : {

"type" : "long",

"ignore_malformed" : false,

"index" : "analyzed",

"doc_values" : true

}

If you reindex your data, Elasticsearch would use the doc values (instead of the field data cache) for the operations that require uninverted data in the year field, for example, aggregations.

Node-level field data cache configuration

Since Elasticsearch 0.90.0, we are allowed to use the following properties to configure the node-level field data cache, which is the default field data cache if we don't alter the configuration:

· indices.fielddata.cache.size: This specifies the maximum size of the field data cache either as a percentage value such as 20%, or an absolute memory size such as 10gb. If we use the percentage value, Elasticsearch will calculate it as a percentage of the maximum heap memory given to a node. By default, the field data cache size is unbounded and should be monitored, as it can consume a vast amount of memory given to the JVM.

· indices.fielddata.cache.expire: This property specifies the expiration time of an entry in the field data cache, which is set to -1 by default, which means that the entries in the cache won't be expired. If we want our field data cache to expire if not accessed, we can set the maximum time of inactivity. For example, if we like our cache to expire after 60 minutes, we should set this property to 60m. Please remember that the field data cache is very expensive to rebuild, and the expiration should be considered with caution.

Note

If we want to be sure that Elasticsearch will use the node-level field data cache, we should set the index.fielddata.cache.type property to the node value or not set that property at all.

Index-level field data cache configuration

Similar to index-level filter cache, we can also use the index-level field data cache, but again, it is not recommended that you do because of the same reasons: it is hard to predict which shards or which indices will be allocated to which nodes. Because of this, we can't predict the amount of memory that will be used for the field data cache for each index, and we can run into memory-related issues when Elasticsearch does the rebalancing, for example.

However, if you know what you are doing and what you want to use—resident or soft field data cache—you can use the index.fielddata.cache.type property and set it to resident or soft. As we already discussed during the filter cache's description, the entries in the resident cache can't be removed by JVM unless we want them to be, and it is basically recommended that you use this cache type when we want to use the index-level field data cache. Rebuilding the field data cache is expensive and will affect the Elasticsearch query's performance. The soft field data cache types can be cleared by JVM when it lacks memory.

The field data cache filtering

In addition to the previously mentioned configuration options, Elasticsearch allows us to choose which field values are loaded into the field data cache. This can be useful in some cases, especially if you remember that sorting, faceting, and aggregations use the field data cache to calculate the results. Elasticsearch allows us to use three types of field data loading filtering: by term frequency, by using regex, or a combination of both methods.

Let's talk about one of the examples where field data filtering can be useful and where you may want to exclude the terms with lower frequency from the results of faceting. For example, we may need to do this because we know that we have some terms in the index that have spelling mistakes, and these are lower cardinality terms for sure. We don't want to bother calculating aggregations for them, so we can remove them from the data, correct them in our data source, or remove them from the field data cache by filtering. This will not only exclude them from the results returned by Elasticsearch, but it will also make the field data memory footprint lower, because less data will be stored in the memory. Now let's look at the filtering possibilities.

Adding field data filtering information

In order to introduce the field data cache filtering information, we need to add an additional object to our mappings field definition: the fielddata object with its child object—filter. So our extended field definition for some abstract tag field would look as follows:

"tag" : {

"type" : "string",

"index" : "not_analyzed",

"fielddata" : {

"filter" : {

...

}

}

}

We will see what to put in the filter object in the upcoming sections.

Filtering by term frequency

Filtering by term frequency allows us to only load the terms that have a frequency higher than the specified minimum (the min parameter) and lower than the specified maximum (the max parameter). The term frequency bounded by the min and max parameters is not specified for the whole index but per segment, which is very important, because these frequencies will differ. The min and max parameters can be specified either as a percentage (for example, 1 percent is 0.01 and 50 percent is 0.5), or as an absolute number.

In addition to this, we can include the min_segment_size property that specifies the minimum number of documents a segment should contain in order to be taken into consideration while building the field data cache.

For example, if we would like to store only the terms that come from segments with at least 100 documents and the terms that have a segment term frequency between 1 percent to 20 percent in the field data cache, we should have mappings similar to the following ones:

{

"book" : {

"properties" : {

"tag" : {

"type" : "string",

"index" : "not_analyzed",

"fielddata" : {

"filter" : {

"frequency" : {

"min" : 0.01,

"max" : 0.2,

"min_segment_size" : 100

}

}

}

}

}

}

}

Filtering by regex

In addition to filtering by the term frequency, we can also filter by the regex expression. In such a case, only the terms that match the specified regex will be loaded into the field data cache. For example, if we only want to load the data from the tag field, which probably has Twitter tags (starting with the # character), we should have the following mappings:

{

"book" : {

"properties" : {

"tag" : {

"type" : "string",

"index" : "not_analyzed",

"fielddata" : {

"filter" : {

"regex" : "^#.*"

}

}

}

}

}

}

Filtering by regex and term frequency

Of course, we can combine the previously discussed filtering methods. So, if we want to have the field data cache responsible for holding the tag field data of only those terms that start with the # character, this comes from a segment with at least 100 documents and has a segment term frequency between 1 to 20 percent; we should have the following mappings:

{

"book" : {

"properties" : {

"tag" : {

"type" : "string",

"index" : "not_analyzed",

"fielddata" : {

"filter" : {

"frequency" : {

"min" : 0.1,

"max" : 0.2,

"min_segment_size" : 100

},

"regex" : "^#.*"

}

}

}

}

}

}

Note

Remember that the field data cache is not built during indexing but can be rebuilt while querying and, because of that, we can change filtering during runtime by updating the fieldata section using the Mappings API. However, one has to remember that after changing the field data loading filtering settings, the cache should be cleared using the clear cache API described in the Clearing the caches section in this chapter.

The filtering example

So now, let's go back to the example from the beginning of the filtering section. What we want to do is exclude the terms with the lowest frequency from faceting results. In our case, the lowest ones are the ones that have the frequency lower than 50 percent. Of course, this frequency is very high, but in our example, we only use four documents. In production, you'd like to have different values: lower ones. In order to do this, we will create a books index with the following commands:

curl -XPOST 'localhost:9200/books' -d '{

"settings" : {

"number_of_shards" : 1,

"number_of_replicas" : 0

},

"mappings" : {

"book" : {

"properties" : {

"tag" : {

"type" : "string",

"index" : "not_analyzed",

"fielddata" : {

"filter" : {

"frequency" : {

"min" : 0.5,

"max" : 0.99

}

}

}

}

}

}

}

}'

Now, let's index some sample documents using the bulk API (the code is stored in the regex.json file provided with the book):

curl -s -XPOST 'localhost:9200/_bulk' --data-binary '

{ "index": {"_index": "books", "_type": "book", "_id": "1"}}

{"tag":["one"]}

{ "index": {"_index": "books", "_type": "book", "_id": "2"}}

{"tag":["one"]}

{ "index": {"_index": "books", "_type": "book", "_id": "3"}}

{"tag":["one"]}

{ "index": {"_index": "books", "_type": "book", "_id": "4"}}

{"tag":["four"]}

'

Now, let's check a simple term's faceting by running the following query (because as we already discussed, faceting and aggregations use the field data cache to operate):

curl -XGET 'localhost:9200/books/_search?pretty' -d ' {

"query" : {

"match_all" : {}

},

"aggregations" : {

"tag" : {

"terms" : {

"field" : "tag"

}

}

}

}'

The response for the preceding query would be as follows:

{

"took" : 1,

"timed_out" : false,

"_shards" : {

"total" : 1,

"successful" : 1,

"failed" : 0

},

.

.

.

"aggregations" : {

"tag" : {

"doc_count_error_upper_bound" : 0,

"sum_other_doc_count" : 0,

"buckets" : [ {

"key" : "one",

"doc_count" : 3 }]

}

}

As you can see, the terms aggregation was only calculated for the one term, and the four term was omitted. If we assume that the four term was misspelled, then we have achieved what we wanted.

Field data formats

Field data cache is not a simple functionality and is implemented to save as much memory as possible. Because of this, Elasticsearch exposes a few formats for the field data cache depending on the data type. We can set the format of the internal data stored in the field data cache by specifying the format property inside a fielddata object for a field, for example:

"tag" : {

"type" : "string",

"fielddata" : {

"format" : "paged_bytes"

}

}

Let's now look at the possible formats.

String-based fields

For string-based fields, Elasticsearch exposes three formats of the field data cache. The default format is paged_bytes, which stores unique occurrences of the terms sequentially and maps documents to these terms. This data is stored in the memory. The second format is fst, which stores the field data cache in a structure called Finite State Transducer (FSThttp://en.wikipedia.org/wiki/Finite_state_transducer), which results in lower memory usage compared to the default format, but it is also slower compared to it. Finally, the third format is doc_values, which results in computing the field data cache entries during indexing and storing them on the disk along with the index files. This format is almost as fast as the default one, but its memory footprint is very low. However, it can't be used with analyzed string fields. Field data filtering is not supported for the doc_values format.

Numeric fields

For numeric-based fields, we have two options when it comes to the format of the field data cache. The default array format stores the data in an in-memory array. The second type of format is doc_values, which uses doc values to store the field data, which means that the field data cache entries will be computed during indexing and stored on the disk along with the index files. Field data filtering is not supported for the doc_values format.

Geographical-based fields

For geo-point based fields, we have options similar to the numeric fields: the default array format, which stores longitudes and latitudes in an array, or doc_values, which uses doc values to store the field data. Of course, field data filtering is not supported for thedoc_values format.

Field data loading

In addition to what we wrote already, Elasticsearch allows us to configure how the field data cache is loaded. As we already mentioned, the field data cache is loaded by default when the cache is needed for the first time—during the first query execution that needs uninverted data. We can change this behavior by including the loading property and setting it to eager. This will make Elasticsearch load the field data cache eagerly whenever new data appears to be loaded into the cache. Therefore, to make the field data cache for the tag field to be loaded eagerly, we would configure it the following way:

"tag" : {

"type" : "string",

"fielddata" : {

"loading" : "eager"

}

}

We can also completely disable the field data cache loading by setting the format property to disabled. For example, to disable loading the field data cache for our tag field, we can change its configuration to the following one:

"tag" : {

"type" : "string",

"fielddata" : {

"format" : "disabled"

}

}

Please note that functionalities that require uninverted data (such as aggregations) won't work on such defined fields.

The shard query cache

A new cache introduced in Elasticsearch 1.4.0 can help with query performance. The shard query cache is responsible for caching local results for each shard. As you remember, when Elasticsearch executes a query, it is sent to all the relevant shards and is executed on them. The results are returned to the node that requested them and are combined. The shard query cache is about caching these partial results on the shard level.

Note

At the time of writing this, the only cached search_type query was count. Therefore, the documents returned by the query will not be cached, but the total number of hits, aggregations, and suggestions returned by each shard will be cached, speeding up proceeding queries. Note that this is likely to be changed in future versions of Elasticsearch.

The shard query cache is not enabled by default. However, we have two options that show you how to enable it. We can do this by adding the index.cache.query.enable property and setting it to true in the settings of our index or by updating the indices settings in real-time with a command like this:

curl -XPUT 'localhost:9200/mastering/_settings' -d '{

"index.cache.query.enable" : true

}'

The second option is to enable the shard query cache per request. We can do this by using the query_cache URI parameter set to true on a per-query basis. The thing to remember is that passing this parameter overwrites the index-level settings. An example request could look as follows:

curl -XGET 'localhost:9200/books/_search?search_type=count&query_cache=true' -d '{

"query" : {

"match_all" : {}

},

"aggregations" : {

"tags" : {

"terms" : {

"field" : "tag"

}

}

}

}'

The good thing about shard query cache is that it is invalidated and updated automatically. Whenever a shard's contents changes, Elasticsearch will update the contents of the cache automatically, so the results of the cached and not cached query will always be the same.

Setting up the shard query cache

By default, Elasticsearch will use up to 1 percent of the heap size given to a node for the shard query cache. This means that all indices present on a node can use up to 1 percent of the total heap memory for the query cache. We can change this by setting theindices.cache.query.size property in the elasticsearch.yml file.

In addition to this, we can control the expiration time of the cache by setting the indices.cache.query.expire property. For example, if we would like the cache to be automatically expired after 60 minutes, we should set the property to 60m.

Using circuit breakers

Because queries can put a lot of pressure on Elasticsearch resources, they allow us to use so-called circuit breakers that prevent Elasticsearch from using too much memory in certain functionalities. Elasticsearch estimates the memory usage and rejects the query execution if certain thresholds are met. Let's look at the available circuit breakers and what they can help us with.

The field data circuit breaker

The field data circuit breaker will prevent request execution if the estimated memory usage for the request is higher than the configured values. By default, Elasticsearch sets indices.breaker.fielddata.limit to 60%, which means that no more than 60 percent of the JVM heap is allowed to be used for the field data cache.

We can also configure the multiplier that Elasticsearch uses for estimates (the estimated values are multiplied by this property value) by using the indices.breaker.fielddata.overhead property. By default, it is set to 1.03.

Note

Please note than before Elasticsearch 1.4.0, indices.breaker.fielddata.limit was called indices.fielddata.breaker.limit and indices.breaker.fielddata.overhead was called indices.fielddatabreaker.overhead.

The request circuit breaker

Introduced in Elasticsearch 1.4.0, the request circuit breaker allows us to configure Elasticsearch to reject the execution of the request if the total estimated memory used by it will be higher than the indices.breaker.request.limit property (set to 40% of the total heap memory assigned to the JVM by default).

Similar to the field data circuit breaker, we can set the overhead by using the indices.breaker.request.overhead property, which defaults to 1.

The total circuit breaker

In addition to the previously described circuit breakers, Elasticsearch 1.4.0 introduced a notion of the total circuit breaker, which defines the total amount of memory that can be used along all the other circuit breakers. We can configure it usingindices.breaker.total.limit, and it defaults to 70% of the JVM heap.

Note

Please remember that all the circuit breakers can be dynamically changed on a working cluster using the Cluster Update Settings API.

Clearing the caches

As we've mentioned earlier, sometimes it is necessary to clear the caches. Elasticsearch allows us to clear the caches using the _cache REST endpoint. Let's look at the usage possibilities.

Index, indices, and all caches clearing

The simplest thing we can do is just clear all the caches by running the following command:

curl -XPOST 'localhost:9200/_cache/clear'

Of course, as we are used to, we can choose a single index or multiple indices to clear the caches for them. For example, if we want to clear the cache for the mastering index, we should run the following command:

curl -XPOST 'localhost:9200/mastering/_cache/clear'

If we want to clear caches for the mastering and books indices, we should run the following command:

curl -XPOST 'localhost:9200/mastering,books/_cache/clear'

Clearing specific caches

By default, Elasticsearch clears all the caches when running the cache clear request. However, we are allowed to choose which caches should be cleared and which ones should be left alone. Elasticsearch allows us to choose the following 'margin-left:18.0pt;text-indent:-18.0pt;line-height: normal'>· Filter caches can be cleared by setting the filter parameter to true. In order to exclude this cache type from the clearing one, we should set this parameter to false. Note that the filter cache is not cleared immediately, but it is scheduled by Elasticsearch to be cleared in the next 60 seconds.

· The field data cache can be cleared by setting the field_data parameter to true. In order to exclude this cache type from the clearing one, we should set this parameter to false.

· To clear the caches of identifiers used for parent-child relationships, we can set the id_cache parameter to true. Setting this property to false will exclude that cache from being cleared.

· The shard query cache can be cleared by setting the query_cache parameter to true. Setting this parameter to false will exclude the shard query cache from being cleared.

For example, if we want all caches apart from the filter and shard query caches for the mastering index, we could run the following command:

curl -XPOST 'localhost:9200/mastering/_cache/clear?field_data=true&filter=false&query_cache=false'

Summary

In this chapter, we started by discussing how to alter the Apache Lucene scoring by using different similarity methods. We altered our index postings format writing by using codecs. We indexed and searched our data in a near real-time manner, and we also learned how to flush and refresh our data. We configured the transaction log and throttled the I/O subsystem. We talked about segment merging and how to visualize it. Finally, we discussed federated search and the usage of tribe nodes in Elasticsearch.

In the next chapter, we will focus on the Elasticsearch administration. We will configure discovery and recovery, and we will use the human-friendly Cat API. In addition to this, we will back up and restore our indices, finalizing what federated search is, and how to search and index data to multiple clusters while still using all the functionalities of Elasticsearch.