Chapter 7. Elasticsearch Cluster in Detail

In the previous chapter, we learned more about Elasticsearch's data analysis capabilities. We used aggregations and faceting to add meaning to the data we indexed. We also introduced the spellcheck and autocomplete functionalities to our application by using Elasticsearch suggesters. We've created the alerting functionality by using a percolator, and we've indexed binary files by using the attachment capability. We've indexed and searched geospatial data, and we've used the scroll API to efficiently fetch a large number of results. Finally, we've used the terms lookup to speed up the queries that fetch a list of terms and use them.

By the end of this chapter, you will have learned about the following topics:

· Understanding a node's discovery mechanism, configuration, and tuning

· Controlling recovery and gateway modules

· Preparing Elasticsearch for high query and indexing use cases

· Using index templates and dynamic mappings

Node discovery

When you start your Elasticsearch node, one of the first things to occur is that the node starts looking for a master node that has the same cluster name and is visible. If a master is found, the node joins a cluster that is already formed. If no master is found, the node itself is selected as a master (of course, if the configuration allows such a behavior). The process of forming a cluster and finding nodes is called discovery. The module that is responsible for discovery has two main purposes—to elect a master and to discover new nodes within a cluster. In this section, we will discuss how we can configure and tune the discovery module.

Discovery types

By default, without installing additional plugins, Elasticsearch allows us to use the zen discovery, which provides us with multicast and unicast discovery. In computer networking terminology, multicast (http://en.wikipedia.org/wiki/Multicast) is the delivery of a message to a group of computers in a single transmission. On the other hand, we have unicast (http://en.wikipedia.org/wiki/Unicast), which is the transmission of a message over the network to a single host.

Note

When using the multicast discovery, Elasticsearch will try to find all the nodes that are able to receive and respond to the multicast message. If you use the unicast method, you'll need to provide at least some of the hosts that form your cluster and the node will try to connect to them.

When choosing between multicast and unicast, you should consider whether your network can handle multicast messages. If it can, using multicast will be easier. If your network can't handle multicast, use the unicast type of discovery. Another reason for using unicast discovery is security: you don't want any nodes to join your cluster by mistake. So, unicast may be a good choice if you are going to run multiple clusters, or if your developer machines are in the same network.

Note

If you are using the Linux operating system and want to check if your network supports multicast, please use the ifconfig command for your network interface (usually it will be eth0). If your network supports multicast, you'll see the MULTICAST property in the response from the preceding command.
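For example, assuming your interface is named eth0 (the name may differ on your machine), a quick check could look like this:

ifconfig eth0 | grep MULTICAST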

The master node

As we have already seen, one of the main purposes of discovery is to choose a master node that will be used as a node that will look over the cluster. The master node is the one that checks all the other nodes to see if they are responsive (other nodes ping the master too). The master node will also accept the new nodes that want to join the cluster. If the master is somehow disconnected from the cluster, the remaining nodes will select a new master from among themselves. All these processes are done automatically on the basis of the configuration values we provide.

Configuring the master and data nodes

By default, Elasticsearch allows every node to be a master node and a data node. However, in certain situations, you may want to have worker nodes that will only hold the data and master nodes that will only be used to process requests and manage the cluster. One of these situations is when you have to handle massive amounts of data, where data nodes should be as performant as possible. To set a node to only hold data, we need to instruct Elasticsearch that we don't want such a node to be a master node. In order to do this, we add the following properties to the elasticsearch.yml configuration file:

node.master: false

node.data: true

To set the node to not hold data and only be a master node, we need to instruct Elasticsearch that we don't want such a node to hold data. In order to do this, we add the following properties to the elasticsearch.yml configuration file:

node.master: true

node.data: false

Please note that the node.master and node.data properties are set to true by default, but we tend to include them for the clarity of the configuration.

The master-election configuration

Imagine that you have a cluster built of 10 nodes. Everything is working fine until one day when your network fails and three of your nodes are disconnected from the cluster, but they still see each other. Because of the zen discovery and master-election process, the nodes that got disconnected elect a new master and you end up with two clusters with the same name and two master nodes. Such a situation is called a split-brain, and you must avoid it as much as possible. When a split-brain happens, you end up with two (or more) clusters that won't join each other until the network (or any other) problems are fixed.

In order to prevent split-brain situations, Elasticsearch provides the discovery.zen.minimum_master_nodes property. This property defines the minimum number of master-eligible nodes that should be connected to each other in order to form a cluster. So now, let's get back to our cluster; if we set the discovery.zen.minimum_master_nodes property to 50 percent of the total nodes available plus one (which is six in our case), we would end up with a single cluster. Why is that? Before the network failure, we would have 10 nodes, which is more than six, and these nodes would form a cluster. After the disconnection of the three nodes, we would still have the first cluster up and running. However, because only three nodes got disconnected and three is less than six, the three disconnected nodes wouldn't be allowed to elect a new master and would have to wait for reconnection with the original cluster.
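For our example cluster of 10 nodes, the relevant line in the elasticsearch.yml configuration file would look as follows:

discovery.zen.minimum_master_nodes: 6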

Setting the cluster name

If we don't set the cluster.name property in our elasticsearch.yml file, Elasticsearch will use the default value, elasticsearch. This is not always a good thing, and because of this, we suggest that you set the cluster.name property to some other value of your choice. Setting a different cluster.name property is also required if you want to run multiple clusters in a single network; otherwise, you would end up with nodes that belong to different clusters joining together.
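For example, to name the cluster, we could put a line like the following in elasticsearch.yml (the name itself is, of course, just an example):

cluster.name: books-cluster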

Configuring multicast

Multicast is the default zen discovery method. Apart from the common settings, which we will discuss in a moment, there are four properties that we can control and they are as follows:

· discovery.zen.ping.multicast.group: The group address to be used for the multicast requests; it defaults to 224.2.2.4.

· discovery.zen.ping.multicast.port: The port that is used for multicast communication; it defaults to 54328.

· discovery.zen.ping.multicast.ttl: The time for which the multicast request will be considered valid; it defaults to 3 seconds.

· discovery.zen.ping.multicast.address: The address to which Elasticsearch should bind. It defaults to the null value, which means that Elasticsearch will try to bind to all the network interfaces visible by the operating system.

In order to disable multicast, one should add the discovery.zen.ping.multicast.enabled property to the elasticsearch.yml file and set its value to false.
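So, the relevant line in the elasticsearch.yml file would look as follows:

discovery.zen.ping.multicast.enabled: false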

Configuring unicast

Because of the way unicast works, we need to specify at least one host that the unicast message should be sent to. To do this, we should add the discovery.zen.ping.unicast.hosts property to our elasticsearch.yml configuration file. Basically, we should specify all the hosts that form the cluster in the discovery.zen.ping.unicast.hosts property. We don't have to specify all the hosts; we just need to provide enough of them to be sure that at least one will work. For example, if we would like to use the 192.168.2.1, 192.168.2.2, and 192.168.2.3 hosts for our cluster, we should specify the preceding property in the following way:

discovery.zen.ping.unicast.hosts: 192.168.2.1:9300, 192.168.2.2:9300, 192.168.2.3:9300

One can also define a range of ports that Elasticsearch can use; for example, to say that the ports from 9300 to 9399 can be used, we would specify the following:

discovery.zen.ping.unicast.hosts: 192.168.2.1:[9300-9399], 192.168.2.2:[9300-9399], 192.168.2.3:[9300-9399]

Please note that the hosts are separated with the comma character and we've specified the port on which we expect the unicast messages.

Note

Always set the discovery.zen.ping.multicast.enabled property to false when using unicast.

Ping settings for nodes

In addition to the settings discussed previously, we can control or alter the default ping configuration. Ping is a signal sent between nodes to check whether they are running and responsive. The master node pings all the other nodes in the cluster, and each of the other nodes in the cluster pings the master node. The following properties can be set:

· discovery.zen.fd.ping_interval: This property defaults to 1s (one second) and specifies how often nodes ping each other

· discovery.zen.fd.ping_timeout: This property defaults to 30s (30 seconds) and defines how long a node will wait for a response to its ping message before considering the node as unresponsive

· discovery.zen.fd.ping_retries: This property defaults to 3 and specifies how many retries should be taken before considering a node as not working

If you experience some problems with your network or know that your nodes need more time to see the ping response, you can adjust the preceding values to the ones that are good for your deployment.
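For example, a sketch of a configuration that is more tolerant of a slow or flaky network could look as follows (the values are illustrative, not a recommendation):

discovery.zen.fd.ping_interval: 5s
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 5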

The gateway and recovery modules

Apart from our indices and the data indexed inside them, Elasticsearch needs to hold the metadata such as the type mappings and index-level settings. This information needs to be persisted somewhere so that it can be read during the cluster recovery. This is why Elasticsearch introduced the gateway module. You can think about it as a safe haven for your cluster data and metadata. Each time you start your cluster, all the required data is read from the gateway and when you make a change to your cluster, it is persisted using the gateway module.

The gateway

In order to set the type of gateway we want to use, we need to add the gateway.type property to the elasticsearch.yml configuration file. Currently, Elasticsearch recommends using the local gateway type (gateway.type set to local), which is also the default. There were additional gateway types in the past (such as fs, hdfs, and s3), but they are deprecated and will be removed in future versions, so we will skip discussing them.

The default local gateway type stores the indices and their metadata in the local file system. Unlike the other gateways, writes to this gateway are not performed in an asynchronous way. So, whenever a write succeeds, you can be sure that the data was written into the gateway (that is, it is indexed or stored in the transaction log).

Recovery control

In addition to choosing the gateway type, Elasticsearch allows us to configure when to start the initial recovery process. Recovery is a process of initializing all the shards and replicas, reading all the data from the transaction log, and applying the data on the shards—basically, it's a process needed to start Elasticsearch.

For example, let's imagine that we have a cluster that consists of 10 Elasticsearch nodes. We should inform Elasticsearch about the number of nodes by setting the gateway.expected_nodes property to this value, which is 10 in our case. This property specifies the number of expected nodes that are eligible to hold data and be selected as a master. Elasticsearch will start the recovery process immediately once the number of nodes in the cluster equals the gateway.expected_nodes property.

We would also like the recovery to start once eight nodes have formed the cluster. In order to do this, we should set the gateway.recover_after_nodes property to 8. We could set this property to any value we like; however, we should use a value that ensures that the newest version of the cluster state snapshot is available, which usually means that you should start recovery when most of your nodes are available.

However, there is one more thing—we would like the gateway recovery process to start 10 minutes after the cluster was formed, so we set the gateway.recover_after_time property to 10m. This property tells the gateway module how long it should wait with the recovery after the number of nodes specified by the gateway.recover_after_nodes property has formed the cluster. We may want to do this because we know that our network is quite slow and we want the communication between nodes to be stable.

The preceding property values should be set in the elasticsearch.yml configuration file. Putting them together, we would end up with the following section in the file:

gateway.recover_after_nodes: 8

gateway.recover_after_time: 10m

gateway.expected_nodes: 10

Additional gateway recovery options

In addition to the mentioned options, Elasticsearch allows us some additional degree of control. The additional options are as follows:

· gateway.recover_after_master_nodes: This property is similar to the gateway.recover_after_nodes property. However, instead of taking all the nodes into consideration, it allows us to specify how many master-eligible nodes should be present in the cluster before the recovery starts.

· gateway.recover_after_data_nodes: This property is also similar to the gateway.recover_after_nodes property, but it allows you to specify how many data nodes should be present in the cluster before the recovery starts.

· gateway.expected_master_nodes: This property is similar to the gateway.expected_nodes property, but instead of specifying the total number of nodes that we expect in the cluster, it allows you to specify how many master-eligible nodes you expect to be present.

· gateway.expected_data_nodes: This property is also similar to the gateway.expected_nodes property, but it allows you to specify how many data nodes you expect to be present in the cluster.

Preparing Elasticsearch cluster for high query and indexing throughput

Until now, we mostly talked about the different functionalities of Elasticsearch, both in terms of handling queries and indexing data. However, we would now briefly like to talk about preparing Elasticsearch for high query and indexing throughput. We start this section by mentioning a few Elasticsearch features that we haven't discussed until now, but that are quite important when it comes to tuning your cluster. We know that this is very condensed knowledge, but we will try to limit it to the things we think are important. After discussing each feature, we will give you general advice on how to tune it and what to pay attention to. We hope that after reading this section, you will be able to see which things to look for when tuning your cluster.

The filter cache

The filter cache is responsible for caching the filters used in a query. Information can be retrieved from the cache very quickly and, when properly configured, the cache will efficiently speed up querying, especially queries that include filters that were already executed.

Elasticsearch includes two types of filter caches: the node filter cache (the default one) and the index filter cache. The node filter cache is shared across all the indices allocated on a single node and can be configured to use a specific amount of memory or a percentage of the total memory given to Elasticsearch. To specify this value, we should include the node property named indices.cache.filter.size and set it to the desired size or percentage.
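For example, to let the node filter cache use at most 20 percent of the heap given to Elasticsearch (an illustrative value), we could add the following line to elasticsearch.yml:

indices.cache.filter.size: 20%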

The second type of the filter cache is the per index one. In general, one should use the node-level filter cache because it is hard to predict the final size of the per-index filter cache. This is because you usually don't know how many indices will end up on a given node. We will omit further explanation of the per-index filter cache; more information about it can be found in the official documentation and the book, Mastering ElasticSearch, Rafał Kuć and Marek Rogoziński, Packt Publishing.

The field data cache and circuit breaker

The field data cache is a part of Elasticsearch that is used mainly when a query performs sorting or faceting on a field. Elasticsearch loads the data used for such fields into memory, which allows fast access to the values on a per-document basis. Building the field data cache is expensive, so it is advisable to have enough memory so that the data in this cache is kept in memory once loaded.

Note

Instead of the field data cache, one can configure a field to use the doc values. Doc values were discussed in the Mappings configuration section in Chapter 2, Indexing Your Data.

The amount of memory the field data cache is allowed to use can be controlled using the indices.fielddata.cache.size property. We can set it to an absolute value (for example, 2GB) or to a percentage of the memory given to an Elasticsearch instance (for example, 40%). Please note that these are per-node properties and not per-index ones. Discarding parts of the cache to make room for other entries will result in poor query performance, so it is advisable to have enough physical memory. Also, remember that by default, the field data cache size is not bounded, so if we are not careful, we can blow up our cluster.

We can also control the expiry time for the field data cache; again, by default, the field data cache does not expire. We can control this by using the indices.fielddata.cache.expire property, setting its value to a maximum inactivity time. For example, setting this property to 10m will result in the cache being invalidated after 10 minutes of inactivity. Remember that rebuilding the field data cache is very expensive and in general, you shouldn't set the expiration time.
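As a minimal sketch, bounding the cache size while leaving the expiration time unset (as advised) could look as follows in elasticsearch.yml:

indices.fielddata.cache.size: 40%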

The circuit breaker

The field data circuit breaker allows us to estimate the amount of memory a field will require when loaded into memory and to prevent the loading of such a field by raising an exception. Elasticsearch has two properties to control the behavior of the circuit breaker. First, we have the indices.fielddata.breaker.limit property, which defaults to 80% and can be updated dynamically by using the cluster update settings API. This means that an exception will be raised as soon as a query results in the loading of values for a field that is estimated to take 80 percent or more of the heap memory available to the Elasticsearch process. The second property is indices.fielddata.breaker.overhead, which defaults to 1.03. It defines a constant that the original estimate for a field will be multiplied by.
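For example, assuming we wanted to lower the limit to 70 percent (an illustrative value), we could update it dynamically in the following way:

curl -XPUT 'localhost:9200/_cluster/settings' -d '{
 "transient" : {
  "indices.fielddata.breaker.limit" : "70%"
 }
}'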

The store

The store module in Elasticsearch is responsible for controlling how the index data is written. Our index can be stored completely in the memory or in a persistent disk storage. The pure RAM-based index will be blazingly fast but volatile, while the disk-based index will be slower but tolerant to failure.

By using the index.store.type property, we can specify which store type we want to use for the index. The available options are as follows:

· simplefs: This is a disk-based storage that accesses the index files by using random access files. It doesn't offer good performance for concurrent access, and thus it is not advised for production use.

· niofs: This is the second disk-based index storage; it uses Java NIO classes to access the index files. It offers very good performance in highly concurrent environments, but it is not advised on Windows-based deployments because of Java implementation bugs.

· mmapfs: This is another disk-based storage that maps index files into memory (please have a look at what mmap is at http://en.wikipedia.org/wiki/Mmap). This is the default storage for 64-bit systems, and it allows a more efficient reading of the index because the operating system cache is used for index file access. You need to be sure to have a good amount of virtual address space, but on 64-bit systems, you shouldn't have problems with this.

· memory: This stores the index in RAM. Please remember that you need to have enough physical memory to store all the documents, or Elasticsearch will fail.
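For example, to create a hypothetical index called memory_index that is held entirely in RAM, we could send the following command:

curl -XPUT 'localhost:9200/memory_index' -d '{
 "settings" : {
  "index.store.type" : "memory"
 }
}'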

Index buffers and the refresh rate

When it comes to indices, Elasticsearch allows you to set the amount of memory that can be consumed for indexing purposes. The indices.memory.index_buffer_size property (defaults to 10%) allows us to control the total amount of memory (or a percentage of the maximum heap memory) that can be divided between the shards of all the indices on a given node. For example, setting this property to 20% will tell Elasticsearch to give 20 percent of the maximum heap size to index buffers.

In addition to this, we have indices.memory.min_shard_index_buffer_size, which defaults to 4mb and allows us to set the value of minimum indexing buffer per shard.
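For example, an indexing-heavy node could use a section like the following in its elasticsearch.yml file (the values are illustrative):

indices.memory.index_buffer_size: 20%
indices.memory.min_shard_index_buffer_size: 12mb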

The index refresh rate

One last thing about the indices is the index.refresh_interval property, specified per index. It defaults to 1s (one second) and specifies how often the index searcher object is refreshed, which basically means how often the data view is refreshed. The lower the refresh interval, the sooner the documents will be visible for search operations. However, it also means that Elasticsearch will need to put more resources into refreshing the index view, which means that the indexing and searching operations will be slower.

Note

For massive bulk indexing, for example, when reindexing your data, it is advisable to set the index.refresh_interval property to -1 at the time of indexing.
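For example, assuming an index named books, we could disable refreshing before the bulk indexing and re-enable it afterwards in the following way:

curl -XPUT 'localhost:9200/books/_settings' -d '{
 "index" : { "refresh_interval" : "-1" }
}'
curl -XPUT 'localhost:9200/books/_settings' -d '{
 "index" : { "refresh_interval" : "1s" }
}'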

The thread pool configuration

Elasticsearch uses several thread pools to allow control over how threads are handled and how much memory user requests are allowed to consume.

Note

The Java virtual machine allows an application to have multiple threads, concurrently running forks of application execution. For more information about Java threads, please refer to http://docs.oracle.com/javase/7/docs/api/java/lang/Thread.html.

We are especially interested in the following types of thread pools exposed by Elasticsearch:

· cache: This is an unbounded thread pool that will create a thread for each incoming request.

· fixed: This is a thread pool that has a fixed size (specified by the size property) and allows you to specify a queue (specified by the queue_size property) that will be used to hold requests until there is a free thread that can execute a queue request. If Elasticsearch isn't able to put a new request in the queue (if the queue is full), the request will be rejected.

There are many thread pools (we can choose the type a given pool uses by specifying the type property); however, when it comes to performance, the most important ones are as follows:

· index: This thread pool is used for indexing and delete operations. Its type defaults to fixed, its size to the number of available processors, and the size of the queue to 300.

· search: This thread pool is used for search and count requests. Its type defaults to fixed, its size to the number of available processors multiplied by 3, and the size of the queue to 1000.

· suggest: This thread pool is used for suggest requests. Its type defaults to fixed, its size to the number of available processors, and the size of the queue to 1000.

· get: This thread pool is used for real-time GET requests. Its type defaults to fixed, its size to the number of available processors, and the size of the queue to 1000.

· bulk: As you can guess, this thread pool is used for bulk operations. Its type defaults to fixed, its size to the number of available processors, and the size of the queue to 50.

· percolate: This thread pool is used for percolation requests. Its type defaults to fixed, its size to the number of available processors, and the size of the queue to 1000.

For example, if we would like to configure the thread pool for indexing operations to be of the fixed type, have a size of 100, and a queue of 500, we would set the following in the elasticsearch.yml configuration file:

threadpool.index.type: fixed

threadpool.index.size: 100

threadpool.index.queue_size: 500

Remember that the thread-pool configuration can be updated using the cluster update API, as follows:

curl -XPUT 'localhost:9200/_cluster/settings' -d '{

"transient" : {

"threadpool.index.type" : "fixed",

"threadpool.index.size" : 100,

"threadpool.index.queue_size" : 500

}

}'

Combining it all together – some general advice

Now that we know about the caches and buffers exposed by Elasticsearch, we can try combining this knowledge to configure the cluster for a high indexing and query throughput. In the next two sections, we will discuss what can be changed in the default configuration and what you should pay attention to when setting up your cluster.

Before we discuss all the things related to Elasticsearch specific configuration, we should remember that we have to give enough memory to Elasticsearch—physical memory. In general, we shouldn't give more than 50 to 60 percent of the total available memory to the JVM process running Elasticsearch. We do this because we want to leave some memory free for the operating system and for the operating system I/O cache.

However, we need to remember that the 50 to 60 percent rule is not always true. You can imagine having nodes with 256 GB of RAM and an index with a total size of 30 GB on such a node. In such circumstances, even assigning more than 60 percent of the physical RAM to Elasticsearch would leave plenty of RAM for the operating system. It is also a good idea to set the Xmx and Xms JVM arguments to the same value in order to avoid heap resizing.
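For example, on a machine with 16 GB of RAM, one could give half of it to Elasticsearch by using the ES_HEAP_SIZE environment variable, which sets both Xms and Xmx to the same value:

ES_HEAP_SIZE=8g bin/elasticsearch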

One thing to remember when tuning your system is to run performance tests that can be repeated under the same circumstances. Once you make a change, you need to be able to see how it affects the overall performance. In addition, Elasticsearch scales, and because of this, it is sometimes a good idea to do a simple performance test on a single machine to see how it performs and what we can get out of it. Such observations may be a good starting point for further tuning.

Before we continue, note that we can't give you exact recipes for high indexing and querying throughput because each deployment is different. Because of this, we will only discuss what you should pay attention to when tuning. However, if you are interested in such use cases, you can visit http://blog.sematext.com, where one of the authors writes about performance.

Choosing the right store

Of course, apart from the physical memory, about which we've already talked, we should choose the right store implementation. In general, if you are running a 64-bit operating system, you should go for mmapfs. If you are not running a 64-bit operating system, choose the niofs store for UNIX-based systems and simplefs for Windows-based ones. If you can allow yourself to have a volatile store, but want it to be very fast, you can look at the memory store; it will give you the best index access performance, but it requires enough memory to handle not only all the index files, but also indexing and querying.

The index refresh rate

The second thing we should pay attention to is the index refresh rate. We know that the refresh rate specifies how fast documents will be visible for search operations. The equation is quite simple; the faster the refresh rate, the slower the queries will be and the lower the indexing throughput. If we can allow ourselves to have a slower refresh rate, such as 10s or 30s, it may be a good thing to set it. This puts less pressure on Elasticsearch as the internal objects will have to be reopened at a slower pace and thus, more resources will be available both for indexing and querying.

Tuning the thread pools

We really suggest tuning the default thread pools, especially the ones for querying operations. After performance tests, you will usually see at which point your cluster becomes overwhelmed with queries; this is when you should start rejecting requests. We think that in most cases, it is better to reject a request right away rather than put it in the queue and force the application to wait for a very long time for that request to be processed. We would really like to give you a precise number, but that again depends highly on the deployment, and general advice is rarely possible.

Tuning your merge process

The merge process is another thing that is highly dependent on your use case and also depends on several factors such as whether you are indexing, how much data you add, and how often you do that. In general, remember that queries against an index with multiple segments are slower than the ones with a smaller number of segments. But again, to have a smaller number of segments, you need to pay the price of merging more often.

We discussed segment merging in the Introduction to segment merging section of Chapter 2, Indexing Your Data. We also mentioned throttling, which allows us to limit the I/O operations.

Generally, if you want your queries to be faster, aim for fewer segments in your indices. If you want indexing to be faster, go for more segments. If you want both of these things, you need to find a sweet spot between the two so that merging is not too frequent, but also doesn't result in an excessive number of segments. Use the concurrent merge scheduler and tune the default throttling value so that your I/O subsystem is not overwhelmed by merging.

The field data cache and breaking the circuit

By default, the field data cache in Elasticsearch is unbounded. This can be very dangerous, especially when you are faceting and sorting on many fields. If those fields have a high cardinality, you can run into even more trouble; by trouble, we mean running out of memory.

We have two different factors that we can tune to be sure that we don't run into out-of-memory errors. First, we can limit the size of the field data cache. The second is the circuit breaker, which we can easily configure to just throw an exception instead of loading too much data. Combining these two things will ensure that we don't run into memory issues.

However, we should also remember that Elasticsearch will evict data from the field data cache if its size is not enough to handle the faceting request or sorting. This will affect the query performance because loading the field data information is not very efficient. However, we think that it is better to have our queries slower than to have our cluster blown up because of the out-of-memory errors.

RAM buffer for indexing

Remember, the more RAM that is available for the indexing buffer (the indices.memory.index_buffer_size property), the more documents Elasticsearch can hold in memory. But of course, we don't want Elasticsearch to occupy 100 percent of the available memory. By default, this property is set to 10 percent, but if you really need a high indexing rate, you can increase the percentage. We've seen this property set to 30 percent on clusters focused on data indexing, and it really helped.

Tuning transaction logging

We haven't discussed this, but Elasticsearch has an internal module called translog (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-translog.html). It is a per-shard structure that serves the purpose of write-ahead logging (http://en.wikipedia.org/wiki/Write-ahead_logging). Basically, it allows Elasticsearch to expose the newest updates for GET operations, ensure data durability, and optimize the writing to Lucene indices.

By default, Elasticsearch keeps a maximum of 5000 operations in the transaction log with a maximum size of 200mb. However, if we can pay the price of the data not being available for search operations for longer periods of time but we want more indexing throughput, we can increase the defaults. By specifying the index.translog.flush_threshold_ops and index.translog.flush_threshold_size properties (both are set as per index and can be updated in real time using the Elasticsearch API), we can set the maximum number of operations allowed to be stored in the transaction log and its maximum size. We've seen deployments having this property value set to ten times the default value.
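For example, a sketch of the mentioned ten-times-the-default configuration would look as follows in the elasticsearch.yml file:

index.translog.flush_threshold_ops: 50000
index.translog.flush_threshold_size: 2gb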

One thing to remember is that in the case of a failure, shard initialization will be slower for the shards that had large transaction logs. This is because Elasticsearch needs to process all the information from the transaction log before the shard is ready for use.

Things to keep in mind

Of course, the previously mentioned factors are not everything that matters. You should monitor your Elasticsearch cluster and react according to what you see. For example, if you see that the number of segments in your indices starts to grow and you don't want this, tune your merge policy. When you see merging take up too many I/O resources and affect the overall performance, tune throttling. Just keep in mind that tuning won't be a one-time thing; your data will grow and so will your query count, and you'll have to adapt to that.

Templates and dynamic templates

In the Mappings configuration section of Chapter 2, Indexing Your Data, we read about mappings, how they are created, and how the type-determining mechanism works. Now we will get into more advanced topics; we will show you how to dynamically create mappings for new indices and how to apply some logic to the templates.

Templates

As we have seen earlier in the book, the index configuration and mappings in particular can be complicated beasts. It would be very nice if there was a possibility of defining one or more mappings once and using them in every newly created index without the need of sending them every time an index is created. Elasticsearch creators predicted this and implemented a feature called index templates. Each template defines a pattern, which is compared to a newly created index name. When both of them match, values defined in the template are copied to the index structure definition. When multiple templates match the name of the newly created index, all of them are applied and values from the later applied templates override the values defined in the previously applied templates. This is very convenient because we can define a few common settings in the more general templates and change them in the more specialized ones. In addition, there is an order parameter that lets us force the desired template ordering. You can think of templates as dynamic mappings that can be applied not to the types in documents, but to the indices.

An example of a template

Let's see a real example of a template. Imagine that we want to create many indices in which we don't want to store the source of the documents so that our indices will be smaller. We also don't need any replicas. We can create a template that matches our need by using the Elasticsearch REST API by sending the following command:

curl -XPUT http://localhost:9200/_template/main_template?pretty -d '{

"template" : "*",

"order" : 1,

"settings" : {

"index.number_of_replicas" : 0

},

"mappings" : {

"_default_" : {

"_source" : {

"enabled" : false

}

}

}

}'

From now on, all the created indices will have no replicas and no source stored. That's because the template parameter value is set to *, which matches all the names of the indices. Note the _default_ type name in our example. This is a special type name, which indicates that the current rule should be applied to every document type. The second interesting thing is the order parameter. Let's define a second template by using the following command:

curl -XPUT http://localhost:9200/_template/ha_template?pretty -d '{

"template" : "ha_*",

"order" : 10,

"settings" : {

"index.number_of_replicas" : 5

}

}'

After running the preceding command, all the new indices will behave as before, except the ones with names beginning with ha_. In the case of these indices, both templates are applied: first, the template with the lower order value is used, and then the next template overwrites the replica setting. So, indices whose names start with ha_ will have five replicas and source storing disabled.
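If we want to check what a given template looks like, we can fetch it by its name; for example, the following command should return our ha_template definition:

curl -XGET 'localhost:9200/_template/ha_template?pretty'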

Storing templates in files

Templates can also be stored in files. By default, files should be placed in the config/templates directory. For example, our ha_template template should be placed in the config/templates/ha_template.json file and have the following content:

{

"ha_template" : {

"template" : "ha_*",

"order" : 10,

"settings" : {

"index.number_of_replicas" : 5

}

}

}

Note that the structure of the JSON is a little bit different, with the template name as the main object key. The second important thing is that the template files must be placed on every instance of Elasticsearch. Also, the templates defined in files are not available through the REST API calls.

Dynamic templates

Sometimes, we want the possibility of defining a field's mapping depending on the field name and its detected type. This is where dynamic templates can help. Dynamic templates are similar to the usual mappings, but each template has a pattern defined, which is applied to a document's field name. If a field name matches the pattern, the template is used. Let's have a look at the following example:

{

"mappings" : {

"article" : {

"dynamic_templates" : [

{

"template_test": {

"match" : "*",

"mapping" : {

"index" : "analyzed",

"fields" : {

"str": {"type": "{dynamic_type}","index": "not_analyzed" }

}

}

}

}]

}

}

}

In the preceding example, we defined a mapping for the article type. In this mapping, we have only one dynamic template named template_test. This template is applied to every field in the input document because of the single asterisk pattern in the match property. Each field will be treated as a multifield, consisting of a field named after the original field (for example, title) and a second field with a name suffixed with .str (for example, title.str). The first field will have its type determined by Elasticsearch (through the {dynamic_type} variable), and the second field will have the same type, but it won't be analyzed (because of the not_analyzed setting).

The matching pattern

We have two ways of defining the matching pattern; they are as follows:

· match: This template is used if the name of the field matches the pattern (this pattern type was used in our example)

· unmatch: This template is used if the name of the field doesn't match the pattern

By default, the pattern matching is very simple and uses glob patterns. This can be changed by setting the match_pattern property to regex. After adding this property, we can use all the magic provided by regular expressions in the match and unmatch patterns.

There are variations such as path_match and path_unmatch that can be used to match the names in nested documents.
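For example, a sketch of a dynamic template entry that only matches fields nested under a hypothetical address object could use path_match as follows (this fragment would go inside the dynamic_templates array):

{
 "address_template" : {
  "path_match" : "address.*",
  "mapping" : {
   "type" : "string",
   "index" : "not_analyzed"
  }
 }
}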

Field definitions

When writing a target field definition, the following variables can be used:

· {name}: The name of the original field found in the input document

· {dynamic_type}: The type determined from the original document

Note

Please note that Elasticsearch checks templates in the order of their definitions and the first matching template is applied. This means that the most generic templates (for example, with "match": "*") must be defined at the end.

Summary

In this chapter, we learned a few things about Elasticsearch such as the node discovery, what this module is responsible for, and how to tune it. We also looked at the recovery and gateway modules, how to set them up to match our cluster, and what configuration options they provide. We also discussed some of the Elasticsearch internals, and we used these to tune our cluster for high indexing and high querying use cases. And finally, we've used templates and dynamic mappings to help us manage our dynamic indices better.

In the next chapter, we'll focus on some of the Elasticsearch administration capabilities. We will learn how to back up our cluster data and how to monitor our cluster using the available API calls. We will discuss how to control shard allocation and how to move shards around the cluster, again using the Elasticsearch API. We'll learn what index warmers are and how they can help us, and we will use aliases. Finally, we'll learn how to install and manage Elasticsearch plugins and what we can do using the update settings API.