Elasticsearch Administration - Mastering Elasticsearch, Second Edition (2015)

Mastering Elasticsearch, Second Edition (2015)

Chapter 7. Elasticsearch Administration

In the previous chapter, we discussed how to alter the Apache Lucene scoring by using different similarity methods. We indexed and searched our data in a near real-time manner, and we also learned how to flush and refresh our data. We configured the transaction log and the throttled I/O subsystem. We talked about segment merging and how to visualize it. Finally, we discussed federated search and the usage of tribe nodes in Elasticsearch.

In this chapter, we will talk more about the Elasticsearch configuration and new features introduced in Elasticsearch 1.0 and higher. By the end of this chapter, you will have learned:

· Configuring the discovery and recovery modules

· Using the Cat API, which provides human-readable insight into the cluster status

· The backup and restore functionality

· Federated search

Discovery and recovery modules

When starting your Elasticsearch node, one of the first things that Elasticsearch does is look for a master node that has the same cluster name and is visible in the network. If a master node is found, the starting node gets joined into an already formed cluster. If no master is found, then the node itself is selected as a master (of course, if the configuration allows such behavior). The process of forming a cluster and finding nodes is called discovery. The module responsible for discovery has two main purposes—electing a master and discovering new nodes within a cluster.

After the cluster is formed, a process called recovery is started. During the recovery process, Elasticsearch reads the metadata and the indices from the gateway, and prepares the shards that are stored there to be used. After the recovery of the primary shards is done, Elasticsearch should be ready for work and should continue with the recovery of all the replicas (if they are present).

In this section, we will take a deeper look at these two modules and discuss the configuration possibilities that Elasticsearch gives us and what the consequences of changing them are.

Note

Note that the information provided in the Discovery and recovery modules section is an extension of what we already wrote in Elasticsearch Server Second Edition, published by Packt Publishing.

Discovery configuration

As we have already mentioned multiple times, Elasticsearch was designed to work in a distributed environment. This is the main difference when comparing Elasticsearch to other open source search and analytics solutions available. With such assumptions, Elasticsearch is very easy to set up in a distributed environment, and we are not forced to set up additional software to make it work like this. By default, Elasticsearch assumes that the cluster is automatically formed by the nodes that declare the same cluster.name setting and can communicate with each other using multicast requests. This allows us to have several independent clusters in the same network.

There are a few implementations of the discovery module that we can use, so let's see what the options are.

Zen discovery

Zen discovery is the default mechanism responsible for discovery in Elasticsearch and is available out of the box. The default Zen discovery configuration uses multicast to find other nodes. This is a very convenient solution: just start a new Elasticsearch node and everything works—the node will join the cluster if it has the same cluster name and is visible to the other nodes in that cluster. This discovery method is perfectly suited for development, because you don't need to care about the configuration; however, it is not advised for production environments. Relying only on the cluster name is handy but can also lead to potential problems and mistakes, such as nodes joining the cluster accidentally. Sometimes, multicast is not available for various reasons, or you may not want to use it for the reasons just mentioned. In the case of bigger clusters, multicast discovery may also generate too much unnecessary traffic, which is another valid reason why it shouldn't be used in production.

For these cases, Zen discovery allows us to use the unicast mode. When using the unicast Zen discovery, a node that is not a part of the cluster will send a ping request to all the addresses specified in the configuration. By doing this, it informs all the specified nodes that it is ready to be a part of the cluster and can be either joined to an existing cluster or can form a new one. Of course, after the node joins the cluster, it gets the cluster topology information, but the initial connection is only done to the specified list of hosts. Remember that even when using unicast Zen discovery, the Elasticsearch node still needs to have the same cluster name as the other nodes.

Note

If you want to know more about the differences between multicast and unicast ping methods, refer to these URLs: http://en.wikipedia.org/wiki/Multicast and http://en.wikipedia.org/wiki/Unicast.

If you still want to learn about the configuration properties of multicast Zen discovery, let's look at them.

Multicast Zen discovery configuration

The multicast part of the Zen discovery module exposes the following settings:

· discovery.zen.ping.multicast.address (the default: all available interfaces): This is the interface used for the communication given as the address or interface name.

· discovery.zen.ping.multicast.port (the default: 54328): This port is used for communication.

· discovery.zen.ping.multicast.group (the default: 224.2.2.4): This is the multicast address to send messages to.

· discovery.zen.ping.multicast.buffer_size (the default: 2048): This is the size of the buffer used for multicast messages.

· discovery.zen.ping.multicast.ttl (the default: 3): This is the time for which a multicast message lives. Every time a packet crosses a router, the TTL is decreased by one, which limits the area in which the transmission can be received. Note that routers can compare the TTL against their own configured threshold values, so the TTL value may not match exactly the number of routers that a packet can cross.

· discovery.zen.ping.multicast.enabled (the default: true): Setting this property to false turns off the multicast. You should disable multicast if you are planning to use the unicast discovery method.

The unicast Zen discovery configuration

The unicast part of Zen discovery provides the following configuration options:

· discovery.zen.ping.unicast.hosts: This is the initial list of nodes in the cluster. The list can be defined as a comma-separated string or as an array of hosts. Every host can be given as a name (or an IP address) and can have a port or port range appended. For example, the value of this property can look like this: ["master1", "master2:9300", "master3[9300-9400]"]. Note that the hosts' list for the unicast discovery doesn't need to be a complete list of the Elasticsearch nodes in your cluster, because once the node is connected to one of the mentioned nodes, it will be informed about all the others that form the cluster.

· discovery.zen.ping.unicast.concurrent_connects (the default: 10): This is the maximum number of concurrent connections the unicast discovery will use. If you have a lot of nodes that the initial connection should be made to, it is advised that you increase the default value.
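Putting these options together, a minimal unicast setup in elasticsearch.yml could look like the following sketch (the host names are, of course, placeholders for your own master eligible nodes):

```yaml
# elasticsearch.yml - unicast Zen discovery (example host names)
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["master1:9300", "master2:9300"]
discovery.zen.ping.unicast.concurrent_connects: 10
```

Remember that the listed hosts only bootstrap the initial connection; the joining node learns about the rest of the cluster afterwards.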

Master node

One of the main purposes of discovery apart from connecting to other nodes is to choose a master node—a node that will take care of and manage all the other nodes. This process is called master election and is a part of the discovery module. No matter how many master eligible nodes there are, each cluster will only have a single master node active at a given time. If there is more than one master eligible node present in the cluster, they can be elected as the master when the original master fails and is removed from the cluster.

Configuring master and data nodes

By default, Elasticsearch allows every node to be both a master node and a data node. However, in certain situations, you may want to have worker nodes, which will only hold the data or process the queries, and master nodes that will only be used to manage the cluster. One such situation is handling a massive amount of data, where data nodes should be as performant as possible and there shouldn't be any delay in the master nodes' responses.

Configuring data-only nodes

To set the node to only hold data, we need to instruct Elasticsearch that we don't want such a node to be a master node. In order to do this, we add the following properties to the elasticsearch.yml configuration file:

node.master: false

node.data: true

Configuring master-only nodes

To set the node not to hold data and only to be a master node, we need to instruct Elasticsearch that we don't want such a node to hold data. In order to do that, we add the following properties to the elasticsearch.yml configuration file:

node.master: true

node.data: false

Configuring the query processing-only nodes

For large enough deployments, it is also wise to have nodes that are only responsible for aggregating query results from other nodes. Such nodes should be configured as nonmaster and nondata, so they should have the following properties in the elasticsearch.yml configuration file:

node.master: false

node.data: false

Note

Please note that the node.master and the node.data properties are set to true by default, but we tend to include them for configuration clarity.

The master election configuration

We already wrote about the master election configuration in Elasticsearch Server Second Edition, but this topic is very important, so we decided to refresh our knowledge about it.

Imagine that you have a cluster that is built of 10 nodes. Everything is working fine until, one day, your network fails and three of your nodes are disconnected from the cluster, but they still see each other. Because of the Zen discovery and the master election process, the nodes that got disconnected elect a new master and you end up with two clusters with the same name with two master nodes. Such a situation is called a split-brain and you must avoid it as much as possible. When a split-brain happens, you end up with two (or more) clusters that won't join each other until the network (or any other) problems are fixed. If you index your data during this time, you may end up with data loss and unrecoverable situations when the nodes get joined together after the network split.

In order to prevent split-brain situations or at least minimize the possibility of their occurrence, Elasticsearch provides the discovery.zen.minimum_master_nodes property. This property defines the minimum number of master eligible nodes that should be connected to each other in order to form a cluster. So now, let's get back to our cluster; if we set the discovery.zen.minimum_master_nodes property to 50 percent of the total nodes available plus one (which is six, in our case), we would end up with a single cluster. Why is that? Before the network failure, we would have 10 nodes, which is more than six nodes, and these nodes would form a cluster. After the disconnection of the three nodes, we would still have the first cluster up and running. However, because only three nodes disconnected and three is less than six, these three nodes wouldn't be allowed to elect a new master and they would wait for reconnection with the original cluster.
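For our example 10-node cluster, the relevant line in elasticsearch.yml would simply be:

```yaml
# elasticsearch.yml - require a majority (50% + 1) of the 10 nodes
discovery.zen.minimum_master_nodes: 6
```

If you change the number of master eligible nodes in the cluster, remember to adjust this value accordingly so that it always represents a majority.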

Zen discovery fault detection and configuration

Elasticsearch runs two detection processes while it is working. The first process is to send ping requests from the current master node to all the other nodes in the cluster to check whether they are operational. The second process is a reverse of that—each of the nodes sends ping requests to the master in order to verify that it is still up and running and performing its duties. However, if we have a slow network or our nodes are in different hosting locations, the default configuration may not be sufficient. Because of this, the Elasticsearch discovery module exposes three properties that we can change:

· discovery.zen.fd.ping_interval: This defaults to 1s and specifies how often the node will send ping requests to the target node.

· discovery.zen.fd.ping_timeout: This defaults to 30s and specifies how long the node will wait for a response to the sent ping request. If your nodes are 100 percent utilized or your network is slow, you may consider increasing this property's value.

· discovery.zen.fd.ping_retries: This defaults to 3 and specifies the number of ping request retries before the target node is considered not operational. You can increase this value if your network has a high number of lost packets (or you can fix your network).

There is one more thing that we would like to mention. The master node is the only node that can change the state of the cluster. To achieve a proper sequence of cluster state updates, the Elasticsearch master node processes cluster state update requests one at a time, makes the changes locally, and publishes the request to all the other nodes so that they can synchronize their state. The master node waits a given time for the other nodes to respond, and once that time passes or all the nodes have responded with their acknowledgment information, it proceeds with the next cluster state update request. To change the time the master node waits for the other nodes to respond, you should modify the default 30 seconds by setting the discovery.zen.publish_timeout property. Increasing this value may be needed for huge clusters working in an overloaded network.
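As an illustration, the fault detection and publish timeout defaults could be relaxed for a slow or congested network like this (the values shown are examples, not recommendations):

```yaml
# elasticsearch.yml - fault detection tuned for a slow network
discovery.zen.fd.ping_interval: 5s
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 5
discovery.zen.publish_timeout: 60s
```

Keep in mind that larger timeouts also mean that a genuinely dead node will be detected more slowly, so these values are always a trade-off.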

The Amazon EC2 discovery

Amazon, in addition to selling goods, has a few popular services such as selling storage or computing power in a pay-as-you-go model. So-called Amazon Elastic Compute Cloud (EC2) provides server instances and, of course, they can be used to install and run Elasticsearch clusters (among many other things, as these are normal Linux machines). This is convenient—you pay for instances that are needed in order to handle the current traffic or to speed up calculations, and you shut down unnecessary instances when the traffic is lower. Elasticsearch works well on EC2, but due to the nature of the environment, some features may work slightly differently. One of these features that works differently is discovery, because Amazon EC2 doesn't support multicast discovery. Of course, we can switch to unicast discovery, but sometimes, we want to be able to automatically discover nodes and, with unicast, we need to at least provide the initial list of hosts. However, there is an alternative—we can use the Amazon EC2 plugin, a plugin that combines the multicast and unicast discovery methods using the Amazon EC2 API.

Note

Make sure that during the setup of the EC2 instances, you set up communication between them (on ports 9200 and 9300 by default). This is crucial for the Elasticsearch nodes to communicate with each other and is thus required for the cluster to function. Of course, this communication depends on the network.bind_host and network.publish_host (or network.host) settings.

The EC2 plugin installation

The installation of a plugin is as simple as with most of the plugins. In order to install it, we should run the following command:

bin/plugin install elasticsearch/elasticsearch-cloud-aws/2.4.0

The EC2 plugin's generic configuration

This plugin provides several configuration settings that we need to provide in order for the EC2 discovery to work:

· cloud.aws.access_key: Amazon access key—one of the credential values you can find in the Amazon configuration panel

· cloud.aws.secret_key: Amazon secret key—similar to the previously mentioned access_key setting, it can be found in the EC2 configuration panel

The last thing is to inform Elasticsearch that we want to use the new discovery type by setting the discovery.type property to the ec2 value and turning off multicast.
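A minimal EC2 discovery configuration in elasticsearch.yml could therefore look as follows (the key values are placeholders for your own credentials):

```yaml
# elasticsearch.yml - EC2 discovery (placeholder credentials)
cloud.aws.access_key: YOUR_ACCESS_KEY
cloud.aws.secret_key: YOUR_SECRET_KEY
discovery.type: ec2
discovery.zen.ping.multicast.enabled: false
```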

Optional EC2 discovery configuration options

The previously mentioned settings are sufficient to run the EC2 discovery, but in order to control the EC2 discovery plugin behavior, Elasticsearch exposes additional settings:

· cloud.aws.region: This is the region that will be used to connect with the Amazon EC2 web services. You should choose the region that's adequate for where your instances reside, for example, eu-west-1 for Ireland. The possible values at the time of writing were eu-west, sa-east, us-east, us-west-1, us-west-2, ap-southeast-1, and ap-southeast-2.

· cloud.aws.ec2.endpoint: If you are using EC2 API services, instead of defining a region, you can provide an address of the AWS endpoint, for example, ec2.eu-west-1.amazonaws.com.

· cloud.aws.protocol: This is the protocol that should be used by the plugin to connect to Amazon Web Services endpoints. By default, Elasticsearch will use the HTTPS protocol (which means setting the value of the property to https). We can also change this behavior and set the property to http for the plugin to use HTTP without encryption. We are also allowed to overwrite the cloud.aws.protocol settings for each service by using the cloud.aws.ec2.protocol and cloud.aws.s3.protocol properties (the possible values are the same—https and http).

· cloud.aws.proxy_host: Elasticsearch allows us to define a proxy that will be used to connect to AWS endpoints. The cloud.aws.proxy_host property should be set to the address to the proxy that should be used.

· cloud.aws.proxy_port: The second property related to the AWS endpoints proxy allows us to specify the port on which the proxy is listening. The cloud.aws.proxy_port property should be set to the port on which the proxy listens.

· discovery.ec2.ping_timeout (the default: 3s): This is the time to wait for the response to the ping message sent to the other node. After this time, the nonresponsive node will be considered dead and removed from the cluster. Increasing this value makes sense when dealing with network issues or when we have a lot of EC2 nodes.
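For example, a cluster running in the Ireland region behind a proxy could combine these options like this (the proxy address is, of course, hypothetical):

```yaml
# elasticsearch.yml - optional EC2 discovery settings (example values)
cloud.aws.region: eu-west-1
cloud.aws.protocol: https
cloud.aws.proxy_host: proxy.example.com
cloud.aws.proxy_port: 8080
discovery.ec2.ping_timeout: 10s
```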

The EC2 nodes scanning configuration

The last group of settings we want to mention allows us to configure a very important thing when building a cluster working inside the EC2 environment—the ability to filter the available Elasticsearch nodes in our Amazon Elastic Compute Cloud network. The Elasticsearch EC2 plugin exposes the following properties that can help us configure its behavior:

· discovery.ec2.host_type: This allows us to choose the host type that will be used to communicate with the other nodes in the cluster. The values we can use are private_ip (the default one; the private IP address will be used for communication), public_ip (the public IP address will be used for communication), private_dns (the private hostname will be used for communication), and public_dns (the public hostname will be used for communication).

· discovery.ec2.groups: This is a comma-separated list of security groups. Only nodes that fall within these groups can be discovered and included in the cluster.

· discovery.ec2.availability_zones: This is an array or comma-separated list of availability zones. Only nodes within the specified availability zones will be discovered and included in the cluster.

· discovery.ec2.any_group (this defaults to true): Setting this property to false will force the EC2 discovery plugin to discover only those nodes that reside in an Amazon instance that falls into all of the defined security groups. The default value requires only a single group to be matched.

· discovery.ec2.tag: This is a prefix for a group of EC2-related settings. When you launch your Amazon EC2 instances, you can define tags, which can describe the purpose of the instance, such as the customer name or environment type. Then, you use these defined settings to limit discovery nodes. Let's say you define a tag named environment with a value of qa. In the configuration, you can now specify the following:

discovery.ec2.tag.environment: qa

Only nodes running on instances with this tag will be considered for discovery.

· cloud.node.auto_attributes: When this is set to true, Elasticsearch will add EC2-related node attributes (such as the availability zone or group) to the node properties and will allow us to use them, adjusting the Elasticsearch shard allocation and configuring the shard placement. You can find more about shard placement in the Altering the default shard allocation behavior section of Chapter 5, The Index Distribution Architecture.
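To show how these filtering options fit together, the following sketch limits discovery to instances in a hypothetical elasticsearch security group that are tagged with environment: qa:

```yaml
# elasticsearch.yml - EC2 node filtering (example group and tag)
discovery.ec2.host_type: private_ip
discovery.ec2.groups: elasticsearch
discovery.ec2.tag.environment: qa
cloud.node.auto_attributes: true
```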

Other discovery implementations

The Zen discovery and EC2 discovery are not the only discovery types that are available. There are two more discovery types that are developed and maintained by the Elasticsearch team, and these are:

· Azure discovery: https://github.com/elasticsearch/elasticsearch-cloud-azure

· Google Compute Engine discovery: https://github.com/elasticsearch/elasticsearch-cloud-gce

In addition to these, there are a few discovery implementations provided by the community, such as the ZooKeeper discovery for older versions of Elasticsearch (https://github.com/sonian/elasticsearch-zookeeper).

The gateway and recovery configuration

The gateway module allows us to store all the data that is needed for Elasticsearch to work properly. This means that not only is the data in Apache Lucene indices stored, but also all the metadata (for example, index allocation settings), along with the mappings configuration for each index. Whenever the cluster state is changed, for example, when the allocation properties are changed, the cluster state will be persisted by using the gateway module. When the cluster is started up, its state will be loaded using the gateway module and applied.

Note

One should remember that when configuring different nodes and different gateway types, indices will use the gateway type configuration present on the given node. If an index state should not be stored using the gateway module, one should explicitly set the index gateway type to none.

The gateway recovery process

Let's say explicitly that the recovery process is used by Elasticsearch to load the data stored with the use of the gateway module in order for Elasticsearch to work. Whenever a full cluster restart occurs, the gateway process kicks in to load all the relevant information we've mentioned—the metadata, the mappings, and of course, all the indices. When the recovery process starts, the primary shards are initialized first, and then, depending on the replica state, they are initialized using the gateway data, or the data is copied from the primary shards if the replicas are out of sync.

Elasticsearch allows us to configure when the cluster data should be recovered using the gateway module. We can tell Elasticsearch to wait for a certain number of master eligible or data nodes to be present in the cluster before starting the recovery process. However, one should remember that when the cluster is not recovered, all the operations performed on it will not be allowed. This is done in order to avoid modification conflicts.

Configuration properties

Before we continue with the configuration, we would like to say one more thing. As you know, Elasticsearch nodes can play different roles—they can have the role of data nodes (the ones that hold data), they can have a master role, or they can be used only for request handling, which means not holding data and not being master eligible. Remembering all this, let's now look at the gateway configuration properties that we are allowed to modify:

· gateway.recover_after_nodes: This is an integer number that specifies how many nodes should be present in the cluster for the recovery to happen. For example, when set to 5, at least 5 nodes (doesn't matter whether they are data or master eligible nodes) must be present for the recovery process to start.

· gateway.recover_after_data_nodes: This is an integer number that allows us to set how many data nodes should be present in the cluster for the recovery process to start.

· gateway.recover_after_master_nodes: This is another gateway configuration option that allows us to set how many master eligible nodes should be present in the cluster for the recovery to start.

· gateway.recover_after_time: This allows us to set how much time to wait before the recovery process starts after the conditions defined by the preceding properties are met. If we set this property to 5m, we tell Elasticsearch to start the recovery process 5 minutes after all the defined conditions are met. The default value for this property is 5m, starting from Elasticsearch 1.3.0.

Let's imagine that we have six nodes in our cluster, out of which four are data eligible. We also have an index that is built of three shards, which are spread across the cluster. The last two nodes are master eligible and they don't hold the data. What we would like to configure is the recovery process to be delayed for 3 minutes after the four data nodes are present. Our gateway configuration could look like this:

gateway.recover_after_data_nodes: 4

gateway.recover_after_time: 3m

Expectations on nodes

In addition to the already mentioned properties, we can also specify properties that will force the recovery process of Elasticsearch. These properties are:

· gateway.expected_nodes: This is the number of nodes expected to be present in the cluster for the recovery to start immediately. If you don't need the recovery to be delayed, it is advised that you set this property to the number of nodes (or at least most of them) with which the cluster will be formed, because that will guarantee that the latest cluster state will be recovered.

· gateway.expected_data_nodes: This is the number of expected data eligible nodes to be present in the cluster for the recovery process to start immediately.

· gateway.expected_master_nodes: This is the number of expected master eligible nodes to be present in the cluster for the recovery process to start immediately.

Now, let's get back to our previous example. We know that when all six nodes are connected and are in the cluster, we want the recovery to start. So, in addition to the preceding configuration, we would add the following property:

gateway.expected_nodes: 6

So the whole configuration would look like this:

gateway.recover_after_data_nodes: 4

gateway.recover_after_time: 3m

gateway.expected_nodes: 6

The preceding configuration says that the recovery process will be delayed for 3 minutes once four data nodes join the cluster and will begin immediately after six nodes are in the cluster (doesn't matter whether they are data nodes or master eligible nodes).

The local gateway

With the release of Elasticsearch 0.20 (and some of the 0.19 releases), all the gateway types apart from the default local gateway type were deprecated. It is advised that you do not use them, because they will be removed in future versions of Elasticsearch. They haven't been removed yet, but if you want to avoid a full data reindexation, you should use only the local gateway type, and this is why we won't discuss all the other types.

The local gateway type uses a local storage available on a node to store the metadata, mappings, and indices. In order to use this gateway type and the local storage available on the node, there needs to be enough disk space to hold the data with no memory caching.

The persistence to the local gateway is different from the other gateways that are currently present (but deprecated). The writes to this gateway are done in a synchronous manner in order to ensure that no data will be lost during the write process.

Note

In order to set the type of gateway that should be used, one should use the gateway.type property, which is set to local by default.

There is one additional thing regarding the local gateway of Elasticsearch that we didn't talk about—dangling indices. When a node joins a cluster, all the shards and indices that are present on the node, but are not present in the cluster, will be included in the cluster state. Such indices are called dangling indices, and we are allowed to choose how Elasticsearch should treat them.

Elasticsearch exposes the gateway.local.auto_import_dangling property, which can take the value of yes (the default value, which results in importing all dangling indices into the cluster), close (results in importing the dangling indices into the cluster state but keeps them closed by default), and no (results in removing the dangling indices). When setting the gateway.local.auto_import_dangling property to no, we can also set the gateway.local.dangling_timeout property (defaults to 2h) to specify how long Elasticsearch will wait before deleting the dangling indices. The dangling indices feature can be nice when we restart old Elasticsearch nodes and we don't want old indices to be included in the cluster.
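For example, to import dangling indices but keep them closed until we decide what to do with them, we could use the following settings:

```yaml
# elasticsearch.yml - import dangling indices, but keep them closed
gateway.type: local
gateway.local.auto_import_dangling: close
```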

Low-level recovery configuration

We discussed that we can use the gateway to configure the behavior of the Elasticsearch recovery process, but in addition to that, Elasticsearch allows us to configure the recovery process itself. We mentioned some of the recovery configuration options already when talking about shard allocation in the Altering the default shard allocation behavior section of Chapter 5, The Index Distribution Architecture; however, we decided that it would be good to mention the properties we can use in the section dedicated to gateway and recovery.

Cluster-level recovery configuration

The recovery configuration is specified mostly on the cluster level and allows us to set general rules for the recovery module to work with. These settings are:

· indices.recovery.concurrent_streams: This defaults to 3 and specifies the number of concurrent streams that are allowed to be opened in order to recover a shard from its source. The higher the value of this property, the more pressure will be put on the networking layer; however, the recovery may be faster, depending on your network usage and throughput.

· indices.recovery.max_bytes_per_sec: By default, this is set to 20MB and specifies the maximum amount of data that can be transferred during shard recovery per second. In order to disable data transfer limiting, one should set this property to 0. Similar to the number of concurrent streams, this property allows us to control the network usage of the recovery process. Setting this property to higher values may result in higher network utilization and a faster recovery process.

· indices.recovery.compress: This is set to true by default and allows us to define whether Elasticsearch should compress the data that is transferred during the recovery process. Setting this to false may lower the pressure on the CPU, but it will also result in more data being transferred over the network.

· indices.recovery.file_chunk_size: This is the chunk size used to copy the shard data from the source shard. By default, it is set to 512KB and is compressed if the indices.recovery.compress property is set to true.

· indices.recovery.translog_ops: This defaults to 1000 and specifies how many transaction log lines should be transferred between shards in a single request during the recovery process.

· indices.recovery.translog_size: This is the chunk size used to copy the shard transaction log data from the source shard. By default, it is set to 512KB and is compressed if the indices.recovery.compress property is set to true.

Note

In the versions prior to Elasticsearch 0.90.0, there was the indices.recovery.max_size_per_sec property that could be used, but it was deprecated, and it is suggested that you use the indices.recovery.max_bytes_per_sec property instead. However, if you are using an Elasticsearch version older than 0.90.0, it may be worth remembering this.

All the previously mentioned settings can be updated using the Cluster Update API, or they can be set in the elasticsearch.yml file.
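For instance, a dynamic update of some of these settings could be expressed as the following request body sent to the Cluster Update Settings API; the values and the choice of the transient scope here are only illustrative, not recommendations:

```python
import json

# Hypothetical example: a body for the Cluster Update Settings API
# (sent with PUT to /_cluster/settings). "transient" settings last
# until a full cluster restart; use "persistent" for settings that
# should survive restarts.
settings_update = {
    "transient": {
        "indices.recovery.max_bytes_per_sec": "50mb",
        "indices.recovery.concurrent_streams": 5
    }
}

body = json.dumps(settings_update)
print(body)
```

Such a body would then be sent to the localhost:9200/_cluster/settings endpoint, for example with curl.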

Index-level recovery settings

In addition to the values mentioned previously, there is a single property that can be set on a per-index basis. The property can be set both in the elasticsearch.yml file and using the indices Update Settings API, and it is called index.recovery.initial_shards. In general, Elasticsearch will only recover a particular shard when there is a quorum of shards present and that quorum can be allocated. A quorum is 50 percent of the shards for the given index plus one. The index.recovery.initial_shards property lets us change what Elasticsearch will take as a quorum. It can be set to one of the following values:

· quorum: 50 percent, plus one shard needs to be present and be allocable. This is the default value.

· quorum-1: 50 percent of the shards for a given index need to be present and be allocable.

· full: All of the shards for the given index need to be present and be allocable.

· full-1: All but one of the shards for the given index need to be present and be allocable.

· integer value: Any integer, such as 1, 2, or 5, specifies the number of shards that need to be present and allocable. For example, setting this value to 2 means that at least two shards need to be present and allocable.

It is good to know about this property, but the default value will be sufficient for most deployments.
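To make these rules concrete, the following sketch computes how many shards each setting requires to be present and allocable; the integer rounding is our assumption about how the quorum is computed, not something the text states explicitly:

```python
def required_shards(setting, total_shards):
    """Return how many shards of an index must be present and allocable
    before recovery starts, for a given index.recovery.initial_shards
    value. The integer arithmetic is an assumption about rounding."""
    if isinstance(setting, int):
        return setting                    # explicit integer value
    quorum = total_shards // 2 + 1        # 50 percent plus one
    return {
        "quorum": quorum,
        "quorum-1": quorum - 1,
        "full": total_shards,
        "full-1": total_shards - 1,
    }[setting]

# For an index with 4 shards:
print(required_shards("quorum", 4))    # 3
print(required_shards("quorum-1", 4))  # 2
print(required_shards("full-1", 4))    # 3
print(required_shards(2, 4))           # 2
```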

The indices recovery API

With the introduction of the indices recovery API, we are no longer limited to only looking at the cluster state and the output similar to the following one:

curl 'localhost:9200/_cluster/health?pretty'

{
  "cluster_name" : "mastering_elasticsearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 10,
  "number_of_data_nodes" : 10,
  "active_primary_shards" : 9,
  "active_shards" : 9,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 1
}

By running an HTTP GET request to the _recovery endpoint (for all the indices or for a particular one), we can get the information about the state of the indices' recovery. For example, let's look at the following request:

curl -XGET 'localhost:9200/_recovery?pretty'

The preceding request will return information about ongoing and finished recoveries of all the shards in the cluster. In our case, the response was as follows (we had to cut it):

{
  "test_index" : {
    "shards" : [ {
      "id" : 3,
      "type" : "GATEWAY",
      "stage" : "START",
      "primary" : true,
      "start_time_in_millis" : 1414362635212,
      "stop_time_in_millis" : 0,
      "total_time_in_millis" : 175,
      "source" : {
        "id" : "3M_ErmCNTR-huTqOTv5smw",
        "host" : "192.168.1.10",
        "transport_address" : "inet[/192.168.1.10:9300]",
        "ip" : "192.168.1.10",
        "name" : "node1"
      },
      "target" : {
        "id" : "3M_ErmCNTR-huTqOTv5smw",
        "host" : "192.168.1.10",
        "transport_address" : "inet[/192.168.1.10:9300]",
        "ip" : "192.168.1.10",
        "name" : "node1"
      },
      "index" : {
        "files" : {
          "total" : 400,
          "reused" : 400,
          "recovered" : 400,
          "percent" : "100.0%"
        },
        "bytes" : {
          "total" : 2455604486,
          "reused" : 2455604486,
          "recovered" : 2455604486,
          "percent" : "100.0%"
        },
        "total_time_in_millis" : 28
      },
      "translog" : {
        "recovered" : 0,
        "total_time_in_millis" : 0
      },
      "start" : {
        "check_index_time_in_millis" : 0,
        "total_time_in_millis" : 0
      }
    }, {
      "id" : 9,
      "type" : "GATEWAY",
      "stage" : "DONE",
      "primary" : true,
      "start_time_in_millis" : 1414085189696,
      "stop_time_in_millis" : 1414085189729,
      "total_time_in_millis" : 33,
      "source" : {
        "id" : "nNw_k7_XSOivvPCJLHVE5A",
        "host" : "192.168.1.11",
        "transport_address" : "inet[/192.168.1.11:9300]",
        "ip" : "192.168.1.11",
        "name" : "node3"
      },
      "target" : {
        "id" : "nNw_k7_XSOivvPCJLHVE5A",
        "host" : "192.168.1.11",
        "transport_address" : "inet[/192.168.1.11:9300]",
        "ip" : "192.168.1.11",
        "name" : "node3"
      },
      "index" : {
        "files" : {
          "total" : 0,
          "reused" : 0,
          "recovered" : 0,
          "percent" : "0.0%"
        },
        "bytes" : {
          "total" : 0,
          "reused" : 0,
          "recovered" : 0,
          "percent" : "0.0%"
        },
        "total_time_in_millis" : 0
      },
      "translog" : {
        "recovered" : 0,
        "total_time_in_millis" : 0
      },
      "start" : {
        "check_index_time_in_millis" : 0,
        "total_time_in_millis" : 33
      },
      .
      .
      .
    ]
  }
}

The preceding response contains information about two shards of test_index (the information for the rest of the shards was removed for clarity). We can see that one of the shards is still being recovered ("stage" : "START") and that the second one has already finished the recovery process ("stage" : "DONE"). The recovery information is provided at the index shard level, which allows us to clearly see what stage our Elasticsearch cluster is at. We can also limit the information to only the shards that are currently being recovered by adding the active_only=true parameter to our request, so it would look as follows:

curl -XGET 'localhost:9200/_recovery?active_only=true&pretty'

If we want to get even more detailed information, we can add the detailed=true parameter to our request, so it would look like this:

curl -XGET 'localhost:9200/_recovery?detailed=true&pretty'
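Because the _recovery output is plain JSON, it is easy to post-process. The following sketch summarizes the stage and byte-level progress of each shard from a trimmed-down, hypothetical sample of such a response:

```python
# A trimmed, hypothetical sample of a _recovery response
recovery = {
    "test_index": {
        "shards": [
            {"id": 3, "stage": "START", "primary": True,
             "index": {"bytes": {"total": 2455604486,
                                 "recovered": 1227802243}}},
            {"id": 9, "stage": "DONE", "primary": True,
             "index": {"bytes": {"total": 0, "recovered": 0}}},
        ]
    }
}

summary = []
for index, data in recovery.items():
    for shard in data["shards"]:
        b = shard["index"]["bytes"]
        # A shard with nothing to copy counts as fully recovered
        pct = 100.0 * b["recovered"] / b["total"] if b["total"] else 100.0
        summary.append(
            f"{index} shard {shard['id']}: {shard['stage']} {pct:.1f}%")

print("\n".join(summary))
```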

The human-friendly status API – using the Cat API

The Elasticsearch Admin API is quite extensive and covers almost every part of its architecture—from low-level information about Lucene to high-level information about the cluster nodes and their health. All this information is available both using the Elasticsearch Java API as well as using the REST API; however, the data is returned in the JSON format. What's more—the returned data can sometimes be hard to analyze without further parsing. For example, try to run the following request on your Elasticsearch cluster:

curl -XGET 'localhost:9200/_stats?pretty'

On our local, single node cluster, Elasticsearch returns the following information (we cut it down drastically; the full response can be found in the stats.json file provided with the book):

{
  "_shards" : {
    "total" : 60,
    "successful" : 30,
    "failed" : 0
  },
  "_all" : {
    "primaries" : {
      .
      .
      .
    },
    "total" : {
      .
      .
      .
    }
  },
  "indices" : {
    .
    .
    .
  }
}

If you look at the provided stats.json file, you will see that the response is about 1,350 lines long. This isn't convenient for a human to analyze without additional parsing. Because of this, Elasticsearch provides us with a more human-friendly API: the Cat API. The Cat API returns data in a simple, tabular text format and, what's more, it provides aggregated data that is usually usable without any further processing.

Note

Remember that we've told you that Elasticsearch allows you to get information in formats other than JSON? If you don't remember this, try adding the format=yaml request parameter to your request.

The basics

The base endpoint for the Cat API is quite obvious—it is /_cat. Without any parameters, it shows us all the available endpoints for that API. We can check this by running the following command:

curl -XGET 'localhost:9200/_cat'

The response returned by Elasticsearch should be similar or identical (depending on your Elasticsearch version) to the following one:

=^.^=
/_cat/allocation
/_cat/shards
/_cat/shards/{index}
/_cat/master
/_cat/nodes
/_cat/indices
/_cat/indices/{index}
/_cat/segments
/_cat/segments/{index}
/_cat/count
/_cat/count/{index}
/_cat/recovery
/_cat/recovery/{index}
/_cat/health
/_cat/pending_tasks
/_cat/aliases
/_cat/aliases/{alias}
/_cat/thread_pool
/_cat/plugins
/_cat/fielddata
/_cat/fielddata/{fields}

Looking at the preceding list, Elasticsearch allows us to get the following information using the Cat API:

· Shard allocation-related information

· All shard-related information (limited to a given index)

· Nodes information, including elected master indication

· Indices' statistics (limited to a given index)

· Segments' statistics (limited to a given index)

· Documents' count (limited to a given index)

· Recovery information (limited to a given index)

· Cluster health

· Tasks pending execution

· Index aliases and indices for a given alias

· The thread pool configuration

· Plugins installed on each node

· The field data cache size and field data cache sizes for individual fields

Using the Cat API

Let's start using the Cat API through an example. We can start with checking the cluster health of our Elasticsearch cluster. To do this, we just run the following command:

curl -XGET 'localhost:9200/_cat/health'

The response returned by Elasticsearch to the preceding command should be similar to the following one:

1414347090 19:11:30 elasticsearch yellow 1 1 47 47 0 0 47

It is clean and nice. Because the output is in a tabular format, it is also easy to process with tools such as grep, awk, or sed, a standard set of tools for every administrator. It is also more readable once you know what it is all about. To add a header describing the purpose of each column, we just need to add an additional v parameter, just like this:

curl -XGET 'localhost:9200/_cat/health?v'

The response is very similar to what we've seen previously, but it now contains a header describing each column:

epoch timestamp cluster status node.total node.data shards pri relo init unassign
1414347107 19:11:47 elasticsearch yellow 1 1 47 47 0 0 47
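Since the columns are whitespace-delimited and match the header names one-to-one, a script can turn a cat health line into named fields with almost no work. A small sketch using the header and row shown above:

```python
# The header row (from the ?v parameter) and one data row, as strings
header = ("epoch timestamp cluster status node.total node.data "
          "shards pri relo init unassign")
line = "1414347107 19:11:47 elasticsearch yellow 1 1 47 47 0 0 47"

# Whitespace-delimited columns map one-to-one onto the header names
health = dict(zip(header.split(), line.split()))
print(health["status"], health["unassign"])  # yellow 47
```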

Common arguments

Every Cat API endpoint has its own arguments, but there are a few common options that are shared among all of them:

· v: This adds a header line to the response with the names of the presented items.

· h: This allows us to show only chosen columns (refer to the next section).

· help: This lists all possible columns that this particular endpoint is able to show. The command shows the name of the parameter, its abbreviation, and the description.

· bytes: This is the format for information representing values in bytes. As we said, the Cat API is designed to be used by humans and, because of that, these values are represented in a human-readable form by default, for example, 3.5kB or 40GB. The bytes option allows us to set the same base for all numbers, so sorting or numerical comparison will be easier. For example, bytes=b presents all values in bytes, bytes=k in kilobytes, and so on.

Note

For the full list of arguments for each Cat API endpoint, refer to the official Elasticsearch documentation available at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cat.html.

The examples

When we wrote this book, the Cat API had 21 endpoints. We don't want to describe them all—it would be a repetition of information contained in the documentation or chapters about the administration API. However, we didn't want to leave this section without any example regarding the usage of the Cat API. Because of this, we decided to show you how easily you can get information using the Cat API compared to the standard JSON API exposed by Elasticsearch.

Getting information about the master node

The first example shows you how easy it is to get information about which node in our cluster is the master node. By calling the /_cat/master REST endpoint, we can get information about the nodes and which one of them is currently being elected as a master. For example, let's run the following command:

curl -XGET 'localhost:9200/_cat/master?v'

The response returned by Elasticsearch for our local two-node cluster looks as follows:

id host ip node
8gfdQlV-SxKB0uUxkjbxSg Banshee.local 10.0.1.3 Siege

As you can see in the response, we've got the information about which node is currently elected as the master—we can see its identifier, IP address, and name.

Getting information about the nodes

The /_cat/nodes REST endpoint provides information about all the nodes in the cluster. Let's see what Elasticsearch will return after running the following command:

curl -XGET 'localhost:9200/_cat/nodes?v&h=name,node.role,load,uptime'

In the preceding example, we have used the possibility of choosing what information we want to get from the approximately 70 options available for this endpoint. We have chosen to get only the node name, its role (whether a node is a data or client node), the node load, and its uptime.

The response returned by Elasticsearch looks as follows:

name node.role load uptime
Alicia Masters d 6.09 6.7m
Siege d 6.09 1h

As you can see, the /_cat/nodes REST endpoint provides all the requested information about the nodes in the cluster.

Backing up

One of the most important tasks for the administrator is to make sure that no data will be lost in the case of a system failure. Elasticsearch was designed to be resilient, and a well-configured cluster of nodes can survive even a few simultaneous disasters. However, even the most properly configured cluster is vulnerable to network partitions, which in some very rare cases can result in data corruption or loss. In such cases, being able to restore data from a backup is the only solution that can save us from recreating our indices. You probably already know what we want to talk about: the snapshot / restore functionality provided by Elasticsearch. However, as we said earlier, we don't want to repeat ourselves. This is a book for more advanced Elasticsearch users, and the basics of the snapshot and restore API were already described in Elasticsearch Server Second Edition by Packt Publishing and in the official documentation. Instead, we want to focus on the functionalities that were added after the release of Elasticsearch 1.0 and were thus omitted in the previous book: the cloud capabilities of the Elasticsearch backup functionality.

Saving backups in the cloud

The central concept of the snapshot / restore functionality is a repository. It is a place where the data—our indices and the related meta information—is safely stored (assuming that the storage is reliable and highly available). The assumption is that every node that is a part of the cluster has access to the repository and can both write to it and read from it. Because of the need for high availability and reliability, Elasticsearch, with the help of additional plugins, allows us to push our data outside of the cluster—to the cloud. There are three possibilities where our repository can be located, at least using officially supported plugins:

· The S3 repository: Amazon Web Services

· The HDFS repository: Hadoop clusters

· The Azure repository: Microsoft's cloud platform

Because we didn't discuss any of the plugins related to the snapshot / restore functionality, let's get through them to see where we can push our backup data.

The S3 repository

The S3 repository is a part of the Elasticsearch AWS plugin, so to use S3 as the repository for snapshotting, we need to install the plugin first:

bin/plugin -install elasticsearch/elasticsearch-cloud-aws/2.4.0

After installing the plugin on every Elasticsearch node in the cluster, we need to alter their configuration (the elasticsearch.yml file) so that the AWS access information is available. The example configuration can look like this:

cloud:
  aws:
    access_key: YOUR_ACCESS_KEY
    secret_key: YOUR_SECRET_KEY

To create the S3 repository that Elasticsearch will use for snapshotting, we need to run a command similar to the following one:

curl -XPUT 'http://localhost:9200/_snapshot/s3_repository' -d '{
  "type": "s3",
  "settings": {
    "bucket": "bucket_name"
  }
}'

The following settings are supported when defining an S3-based repository:

· bucket: This is the required parameter describing the Amazon S3 bucket to which the Elasticsearch data will be written and from which Elasticsearch will read the data.

· region: This is the name of the AWS region where the bucket resides. By default, the US Standard region is used.

· base_path: By default, Elasticsearch puts the data in the root directory. This parameter allows you to change it and alter the place where the data is placed in the repository.

· server_side_encryption: By default, encryption is turned off. You can set this parameter to true in order to use the AES256 algorithm to store data.

· chunk_size: By default, this is set to 100m and specifies the size of the data chunk that will be sent. If the snapshot size is larger than chunk_size, Elasticsearch will split the data into smaller chunks that are not larger than the size specified in chunk_size.

· buffer_size: The size of this buffer is set to 5m (which is the lowest possible value) by default. When the chunk size is greater than the value of buffer_size, Elasticsearch will split it into buffer_size fragments and use the AWS multipart API to send it.

· max_retries: This specifies the number of retries Elasticsearch will take before giving up on storing or retrieving the snapshot. By default, it is set to 3.

In addition to the preceding properties, we are allowed to set two additional properties that can overwrite the credentials stored in elasticsearch.yml, which will be used to connect to S3. This is especially handy when you want to use several S3 repositories, each with its own security settings:

· access_key: This overwrites cloud.aws.access_key from elasticsearch.yml

· secret_key: This overwrites cloud.aws.secret_key from elasticsearch.yml
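Putting these settings together, a repository definition with its own credentials might look like the following body for a PUT request to the _snapshot endpoint; the region, bucket, and key values here are placeholders, not working credentials:

```python
import json

# Hypothetical body for PUT /_snapshot/s3_secured: an S3 repository
# that carries its own credentials, overriding the cloud.aws.* values
# from elasticsearch.yml
repository = {
    "type": "s3",
    "settings": {
        "bucket": "bucket_name",
        "region": "eu-west-1",            # example region
        "server_side_encryption": True,   # store data with AES256
        "access_key": "REPOSITORY_ACCESS_KEY",
        "secret_key": "REPOSITORY_SECRET_KEY"
    }
}

print(json.dumps(repository, indent=2))
```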

The HDFS repository

If you use Hadoop and its HDFS (http://wiki.apache.org/hadoop/HDFS) filesystem, a good alternative to back up the Elasticsearch data is to store it in your Hadoop cluster. As with the case of S3, there is a dedicated plugin for this. To install it, we can use the following command:

bin/plugin -i elasticsearch/elasticsearch-repository-hdfs/2.0.2

Note that there is an additional plugin version that supports Version 2 of Hadoop. In this case, we should append hadoop2 to the plugin name in order to be able to install the plugin. So for Hadoop 2, our command that installs the plugin would look as follows:

bin/plugin -i elasticsearch/elasticsearch-repository-hdfs/2.0.2-hadoop2

There is also a light version that can be used when Hadoop is installed on the same system as Elasticsearch. In this case, the plugin does not contain the Hadoop libraries, as they are already available to Elasticsearch. To install the light version of the plugin, the following command can be used:

bin/plugin -i elasticsearch/elasticsearch-repository-hdfs/2.0.2-light

After installing the plugin on each Elasticsearch node (no matter which version of the plugin was used) and restarting the cluster, we can use the following command to create a repository in our Hadoop cluster:

curl -XPUT 'http://localhost:9200/_snapshot/hdfs_repository' -d '{
  "type": "hdfs",
  "settings": {
    "path": "snapshots"
  }
}'

The available settings that we can use are as follows:

· uri: This is the optional parameter that tells Elasticsearch where HDFS resides. It should have a format like hdfs://HOST:PORT/.

· path: This is the information about the path where snapshot files should be stored. It is a required parameter.

· load_default: This specifies whether the default parameters from the Hadoop configuration should be loaded. Set it to false if reading these settings should be disabled.

· conf_location: This is the name of the Hadoop configuration file to be loaded. By default, it is set to extra-cfg.xml.

· chunk_size: This specifies the size of the chunk that Elasticsearch will use to split the snapshot data; by default, it is set to 10m. If you want the snapshotting to be faster, you can use smaller chunks and more streams to push the data to HDFS.

· conf.<key>: Here, key is any Hadoop argument. The value provided using this property will be merged with the Hadoop configuration.

· concurrent_streams: By default, this is set to 5 and specifies the number of concurrent streams used by a single node to write and read to HDFS.

The Azure repository

The last of the repositories we wanted to mention is Microsoft's Azure cloud. Just like Amazon S3, we are able to use a dedicated plugin to push our indices and metadata to Microsoft cloud services. To do this, we need to install a plugin, which we can do by running the following command:

bin/plugin -install elasticsearch/elasticsearch-cloud-azure/2.4.0

The configuration is also similar to the Amazon S3 plugin configuration. Our elasticsearch.yml file should contain the following section:

cloud:
  azure:
    storage_account: YOUR_ACCOUNT
    storage_key: YOUR_SECRET_KEY

After Elasticsearch is configured, we need to create the actual repository, which we do by running the following command:

curl -XPUT 'http://localhost:9200/_snapshot/azure_repository' -d '{
  "type": "azure"
}'

The following settings are supported by the Elasticsearch Azure plugin:

· container: As with the bucket in Amazon S3, every piece of information must reside in a container. This setting defines the name of the container in the Microsoft Azure space. The default value is elasticsearch-snapshots.

· base_path: This allows us to change the place where Elasticsearch will put the data. By default, Elasticsearch puts the data in the root directory.

· chunk_size: This is the maximum chunk size used by Elasticsearch (set to 64m by default, which is also the maximum value allowed). Change it if the data should be split into smaller chunks.

Federated search

Sometimes, having data in a single cluster is not enough. Imagine a situation where you have multiple locations where you need to index and search your data—for example, local company divisions that have their own clusters for their own data. The main center of your company would also like to search the data—not in each location but all at once. Of course, in your search application, you can connect to all these clusters and merge the results manually, but from Elasticsearch 1.0, it is also possible to use the so-called tribe node that works as a federated Elasticsearch client and can provide access to more than a single Elasticsearch cluster. What the tribe node does is fetch all the cluster states from the connected clusters and merge these states into one global cluster state available on the tribe node. In this section, we will take a look at tribe nodes and how to configure and use them.

Note

Remember that the described functionality was introduced in Elasticsearch 1.0 and is still marked as experimental. It can be changed or even removed in future versions of Elasticsearch.

The test clusters

For the purpose of showing you how tribe nodes work, we will create two clusters that hold data. The first cluster is named mastering_one (as you remember, to set the cluster name, you need to specify the cluster.name property in the elasticsearch.yml file) and the second cluster is named mastering_two. To keep things as simple as possible, each of the clusters contains only a single Elasticsearch node. The node in the cluster named mastering_one is available at the 192.168.56.10 IP address and the node in the cluster named mastering_two is available at the 192.168.56.40 IP address.

Cluster one was indexed with the following documents:

curl -XPOST '192.168.56.10:9200/index_one/doc/1' -d '{"name" : "Test document 1 cluster 1"}'

curl -XPOST '192.168.56.10:9200/index_one/doc/2' -d '{"name" : "Test document 2 cluster 1"}'

For the second cluster the following data was indexed:

curl -XPOST '192.168.56.40:9200/index_two/doc/1' -d '{"name" : "Test document 1 cluster 2"}'

curl -XPOST '192.168.56.40:9200/index_two/doc/2' -d '{"name" : "Test document 2 cluster 2"}'

Creating the tribe node

Now, let's try to create a simple tribe node that will use the multicast discovery by default. To do this, we need a new Elasticsearch node. We also need to provide a configuration for this node that will specify which clusters our tribe node should connect together—in our case, these are our two clusters that we created earlier. To configure our tribe node, we need the following configuration in the elasticsearch.yml file:

tribe.mastering_one.cluster.name: mastering_one
tribe.mastering_two.cluster.name: mastering_two

All the configurations for the tribe node are prefixed with the tribe prefix. In the preceding configuration, we told Elasticsearch that we will have two tribes: one named mastering_one and the second one named mastering_two. These are arbitrary names that are used to distinguish the clusters that are a part of the tribe cluster.

We can now start our tribe node, which we will run on a server with the 192.168.56.50 IP address. After starting, the tribe node will use the default multicast discovery to find the mastering_one and mastering_two clusters and connect to them. You should see lines like the following in the logs of the tribe node:

[2014-10-30 17:28:04,377][INFO ][cluster.service ] [Feron] added {[mastering_one_node_1][mGF6HHoORQGYkVTzuPd4Jw][ragnar][inet[/192.168.56.10:9300]]{tribe.name=mastering_one},}, reason: cluster event from mastering_one, zen-disco-receive(from master [[mastering_one_node_1][mGF6HHoORQGYkVTzuPd4Jw][ragnar][inet[/192.168.56.10:9300]]])

[2014-10-30 17:28:08,288][INFO ][cluster.service ] [Feron] added {[mastering_two_node_1][ZqvDAsY1RmylH46hqCTEnw][ragnar][inet[/192.168.56.40:9300]]{tribe.name=mastering_two},}, reason: cluster event from mastering_two, zen-disco-receive(from master [[mastering_two_node_1][ZqvDAsY1RmylH46hqCTEnw][ragnar][inet[/192.168.56.40:9300]]])

As we can see, our tribe node joins two clusters together.

Using the unicast discovery for tribes

Of course, multicast discovery is not the only way to connect multiple clusters together using the tribe node; we can also use the unicast discovery if needed. For example, to change our tribe node configuration to use unicast, we would change the elasticsearch.yml file to look as follows:

tribe.mastering_one.cluster.name: mastering_one
tribe.mastering_one.discovery.zen.ping.multicast.enabled: false
tribe.mastering_one.discovery.zen.ping.unicast.hosts: ["192.168.56.10:9300"]
tribe.mastering_two.cluster.name: mastering_two
tribe.mastering_two.discovery.zen.ping.multicast.enabled: false
tribe.mastering_two.discovery.zen.ping.unicast.hosts: ["192.168.56.40:9300"]

As you can see, for each tribe cluster, we disabled the multicast and we specified the unicast hosts. Also note the thing we already wrote about—each property for the tribe node is prefixed with the tribe prefix.

Reading data with the tribe node

We said in the beginning that the tribe node fetches the cluster state from all the connected clusters and merges it into a single cluster state. This is done in order to enable read and write operations on all the clusters when using the tribe node. Because the cluster state is merged, almost all operations work in the same way as they would on a single cluster, for example, searching.

Let's try to run a single query against our tribe now to see what we can expect. To do this, we use the following command:

curl -XGET '192.168.56.50:9200/_search?pretty'

The results of the preceding query look as follows:

{
  "took" : 9,
  "timed_out" : false,
  "_shards" : {
    "total" : 10,
    "successful" : 10,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "index_two",
      "_type" : "doc",
      "_id" : "1",
      "_score" : 1.0,
      "_source":{"name" : "Test document 1 cluster 2"}
    }, {
      "_index" : "index_one",
      "_type" : "doc",
      "_id" : "2",
      "_score" : 1.0,
      "_source":{"name" : "Test document 2 cluster 1"}
    }, {
      "_index" : "index_two",
      "_type" : "doc",
      "_id" : "2",
      "_score" : 1.0,
      "_source":{"name" : "Test document 2 cluster 2"}
    }, {
      "_index" : "index_one",
      "_type" : "doc",
      "_id" : "1",
      "_score" : 1.0,
      "_source":{"name" : "Test document 1 cluster 1"}
    } ]
  }
}

As you can see, we have documents coming from both clusters. Yes, that's right: our tribe node was able to automatically get data from all the connected tribes and return the relevant results. We can, of course, do the same with more sophisticated queries; we can use the percolation functionality, suggesters, and so on.

Master-level read operations

Read operations that require the master to be present, such as reading the cluster state or cluster health, will be performed on the tribe cluster. For example, let's look at what cluster health returns for our tribe node. We can check this by running the following command:

curl -XGET '192.168.56.50:9200/_cluster/health?pretty'

The results of the preceding command will be similar to the following one:

{
  "cluster_name" : "elasticsearch",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 5,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 10,
  "active_shards" : 10,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 10
}

As you can see, our tribe node reported five nodes to be present: one data node in each of the connected clusters, the tribe node itself, and two internal nodes that the tribe node uses to connect to the clusters. This is why there are five nodes and not three.

Writing data with the tribe node

We talked about querying and master-level read operations, so it is time to write some data to Elasticsearch using the tribe node. We won't say much; instead of talking about indexing, let's just try to index additional documents to one of our indices that are present on the connected clusters. We can do this by running the following command:

curl -XPOST '192.168.56.50:9200/index_one/doc/3' -d '{"name" : "Test document 3 cluster 1"}'

The execution of the preceding command will result in the following response:

{"_index":"index_one","_type":"doc","_id":"3","_version":1,"created":true}

As we can see, the document has been created and, what's more, it was indexed in the proper cluster. The tribe node just did its work by forwarding the request internally to the proper cluster. All the write operations that don't require the cluster state to change, such as indexing, will be properly executed using the tribe node.

Master-level write operations

Master-level write operations can't be executed on the tribe node—for example, we won't be able to create a new index using the tribe node. Operations such as index creation will fail when executed on the tribe node, because there is no global master present. We can test this easily by running the following command:

curl -XPOST '192.168.56.50:9200/index_three'

The preceding command will return the following error after about 30 seconds of waiting:

{"error":"MasterNotDiscoveredException[waited for [30s]]","status":503}

As we can see, the index was not created. We should run the master-level write commands on the clusters that are a part of the tribe.

Handling indices conflicts

One of the things that the tribe node can't handle properly is multiple indices with the same name present in several connected clusters. By default, the Elasticsearch tribe node will choose one, and only one, of them. So, if all your clusters have an index with the same name, only a single one will be chosen.

Let's test this by creating the index called test_conflicts on the mastering_one cluster and the same index on the mastering_two cluster. We can do this by running the following commands:

curl -XPOST '192.168.56.10:9200/test_conflicts'

curl -XPOST '192.168.56.40:9200/test_conflicts'

In addition to this, let's index two documents—one to each cluster. We do this by running the following commands:

curl -XPOST '192.168.56.10:9200/test_conflicts/doc/11' -d '{"name" : "Test conflict cluster 1"}'

curl -XPOST '192.168.56.40:9200/test_conflicts/doc/21' -d '{"name" : "Test conflict cluster 2"}'

Now, let's run our tribe node and try to run a simple search command:

curl -XGET '192.168.56.50:9200/test_conflicts/_search?pretty'

The output of the command will be as follows:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "test_conflicts",
      "_type" : "doc",
      "_id" : "11",
      "_score" : 1.0,
      "_source":{"name" : "Test conflict cluster 1"}
    } ]
  }
}

As you can see, we only got a single document in the result. This is because the Elasticsearch tribe node can't handle indices with the same name coming from different clusters and will choose only one of them. This is quite dangerous, because we can't easily predict which of the conflicting indices will be chosen.

The good thing is that we can control this behavior by specifying the tribe.on_conflict property in elasticsearch.yml (introduced in Elasticsearch 1.2.0). We can set it to one of the following values:

· any: This is the default value that results in Elasticsearch choosing one of the indices from the connected tribe clusters.

· drop: Elasticsearch will ignore the index and won't include it in the global cluster state. This means that the index won't be visible when using the tribe node (both for write and read operations) but will still be present on the connected clusters themselves.

· prefer_TRIBE_NAME: Elasticsearch allows us to choose the tribe cluster from which the indices should be taken. For example, if we set our property to prefer_mastering_one, it would mean that Elasticsearch will load the conflicting indices from the cluster named mastering_one.
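For example, to make the tribe node always take conflicting indices from the mastering_one cluster, we could add the following line to the tribe node's elasticsearch.yml file (a sketch assuming the tribe is configured under the name mastering_one, as in this chapter's examples):

```yaml
# elasticsearch.yml of the tribe node:
# on an index name conflict, take the index from the mastering_one tribe cluster
tribe.on_conflict: prefer_mastering_one
```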

Blocking write operations

The tribe node can also be configured to block all write operations and all the metadata change requests. To block all the write operations, we need to set the tribe.blocks.write property to true. To disallow metadata change requests, we need to set the tribe.blocks.metadata property to true. By default, these properties are set to false, which means that write and metadata altering operations are allowed. Disallowing these operations can be useful when our tribe node should only be used for searching and nothing else.
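For example, a search-only tribe node could disallow both kinds of operations with the following two lines in its elasticsearch.yml file:

```yaml
# elasticsearch.yml of a search-only tribe node
tribe.blocks.write: true      # reject all write operations
tribe.blocks.metadata: true   # reject all metadata change requests
```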

In addition to this, Elasticsearch 1.2.0 introduced the ability to block write operations on defined indices. We do this by using the tribe.blocks.indices.write property and setting its value to the names (or wildcard patterns) of the indices. For example, if we want our tribe node to block write operations on all the indices starting with test or production, we set the following property in the elasticsearch.yml file of the tribe node:

tribe.blocks.indices.write: test*, production*
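The values are simple wildcard patterns. As a quick illustration of which index names the preceding setting would block (the is_blocked function below is just a hypothetical helper, not part of Elasticsearch), the matching behaves like shell globbing:

```shell
# Hypothetical helper illustrating which index names the patterns
# test* and production* would block on the tribe node.
is_blocked() {
  case "$1" in
    test*|production*) echo "blocked" ;;
    *) echo "writable" ;;
  esac
}

is_blocked "test_conflicts"    # prints: blocked
is_blocked "production_2015"   # prints: blocked
is_blocked "logs"              # prints: writable
```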

Summary

In this chapter, we focused more on the Elasticsearch configuration and new features that were introduced in Elasticsearch 1.0. We configured discovery and recovery, and we used the human-friendly Cat API. In addition to that, we used the backup and restore functionality, which allowed easy backup and recovery of our indices. Finally, we looked at what federated search is and how to search and index data to multiple clusters, while still using all the functionalities of Elasticsearch and being connected to a single node.

In the next chapter, we will focus on the performance side of Elasticsearch. We will start by optimizing our queries with filters. We will discuss the garbage collector work, and we will benchmark our queries with the new benchmarking capabilities of Elasticsearch. We will use warming queries to speed up the query execution time, and we will use the Hot Threads API to see what is happening inside Elasticsearch. Finally, we will discuss Elasticsearch scaling and prepare Elasticsearch for high indexing and querying use cases.