Chapter 8. Administrating Your Cluster

In the previous chapter, we learned how node discovery works in Elasticsearch and how to tune it. We also learned about the recovery and gateway modules. We saw what some of the Elasticsearch internals are and how to tune them to prepare our cluster for high indexing and querying use cases. Finally, we used index templates and dynamic mappings to easily control the structure of dynamically created indices. By the end of this chapter, you will have learned about the following topics:

· Using Elasticsearch snapshotting functionality

· Monitoring our cluster using the Elasticsearch API

· Adjusting cluster rebalancing to match our needs

· Moving shards around by using the Elasticsearch API

· Warming up

· Using aliasing to ease the everyday work

· Installing the Elasticsearch plugins

· Using the Elasticsearch update settings API

The Elasticsearch time machine

A good piece of software is one that can handle exceptional situations such as hardware failure or human error. Even though a cluster of a few servers is less exposed to hardware problems, bad things can still happen. For example, let's imagine that you need to restore your indices. One possible solution is to reindex all your data from a primary data store such as a SQL database. But what will you do if it takes too long or, even worse, the only data store is Elasticsearch? Before Elasticsearch 1.0, creating backups of indices was not easy; the procedure required shutting down the cluster before copying its data files. Fortunately, now we can take snapshots. Let's see how this works.

Creating a snapshot repository

A snapshot keeps all the data related to the cluster from the time the snapshot creation starts, and it includes information about the cluster state and the indices. Before we create snapshots, at least the first one, a snapshot repository must be created. Each repository is recognized by its name and should define the following aspects:

· name: This is a unique name of the repository; we will need it later.

· type: This is the type of the repository. The possible values are fs (repository on a shared filesystem) and url (read-only repository available via URL).

· settings: This is the additional information needed depending on the repository type.

Now, let's create a filesystem repository. Please note that every node in the cluster should be able to access this directory. To create a new filesystem repository, we can run a command shown as follows:

curl -XPUT localhost:9200/_snapshot/backup -d '{

"type": "fs",

"settings": {

"location": "/tmp/es_backup_folder/cluster1"

}

}'

The preceding command creates a repository named backup, which stores the backup files in the directory given by the location attribute. Elasticsearch responds with the following information:

{"acknowledged":true}

At the same time, the /tmp/es_backup_folder/cluster1 directory on the local filesystem is created, without any content yet.

Note

As we said, the second repository type is url. It requires a url parameter instead of location, which points to the address where the repository resides, for example, the HTTP address. You can also store snapshots in Amazon S3 or HDFS using the additional plugins available (see https://github.com/elasticsearch/elasticsearch-cloud-aws#s3-repository and https://github.com/elasticsearch/elasticsearch-hadoop/tree/master/repository-hdfs).

Now that we have our first repository, we can see its definition using the following command:

curl -XGET localhost:9200/_snapshot/backup?pretty

We can also check all the repositories by running a command like the following:

curl -XGET localhost:9200/_snapshot/_all?pretty

If you want to delete a snapshot repository, the standard DELETE command helps:

curl -XDELETE localhost:9200/_snapshot/backup?pretty

Creating snapshots

By default, Elasticsearch takes all the indices and cluster settings (except the transient ones) when creating snapshots. You can create any number of snapshots, and each will hold information available right from the time when the snapshot was created. The snapshots are created in a smart way; only new information is copied. It means that Elasticsearch knows which segments are already stored in the repository and doesn't save them again.

To create a new snapshot, we need to choose a unique name and use the following command:

curl -XPUT 'localhost:9200/_snapshot/backup/bckp1'

The preceding command defines a new snapshot named bckp1 (you can only have one snapshot with a given name; Elasticsearch will check its uniqueness), and data is stored in the previously defined backup repository. The command returns an immediate response, which looks as follows:

{"accepted":true}

The preceding response means that the process of snapshotting has started and continues in the background. If we would like the response to be returned only when the actual snapshot is created, we can add the wait_for_completion parameter as shown in the following example:

curl -XPUT 'localhost:9200/_snapshot/backup/bckp2?wait_for_completion=true&pretty'

The response to the preceding command shows the status of a created snapshot:

{

"snapshot" : {

"snapshot" : "bckp2",

"indices" : [ "art" ],

"state" : "SUCCESS",

"start_time" : "2014-02-22T13:04:40.770Z",

"start_time_in_millis" : 1393074280770,

"end_time" : "2014-02-22T13:04:40.781Z",

"end_time_in_millis" : 1393074280781,

"duration_in_millis" : 11,

"failures" : [ ],

"shards" : {

"total" : 5,

"failed" : 0,

"successful" : 5

}

}

}

As we can see, Elasticsearch presents information about the time taken by the snapshotting process, its status, and the indices affected.

Additional parameters

The snapshot command also accepts the following additional parameters:

· indices: These are the names of the indices of which we want to take snapshots.

· ignore_unavailable: When this is set to false (the default), the command will fail if the indices parameter points to a nonexistent index. Setting it to true causes Elasticsearch to ignore the missing indices.

· include_global_state: When this is set to true (the default), the cluster state is also written in the snapshot (except for the transient settings).

· partial: The snapshot success depends on the availability of all the shards. If any of the shards are not available, the snapshot fails. Setting partial to true causes Elasticsearch to save only the available shards and omit the lost ones.

An example of using additional parameters can look as follows:

curl -XPUT 'localhost:9200/_snapshot/backup/bckp?wait_for_completion=true&pretty' -d '{ "indices": "b*", "include_global_state": "false" }'

Restoring a snapshot

Now that we have our snapshots done, we will also learn how to restore data from a given snapshot. As we said earlier, a snapshot can be addressed by its name. We can list all the snapshots by using the following command:

curl -XGET 'localhost:9200/_snapshot/backup/_all?pretty'

The repository we created earlier is called backup. To restore a snapshot named bckp1 from our snapshot repository, run the following command:

curl -XPOST 'localhost:9200/_snapshot/backup/bckp1/_restore'

During the execution of this command, Elasticsearch takes indices defined in the snapshot and creates them with the data from the snapshot. However, if the index already exists and is not closed, the command will fail. In this case, you may find it convenient to only restore certain indices, for example:

curl -XPOST 'localhost:9200/_snapshot/backup/bckp1/_restore?pretty' -d '{ "indices": "c*"}'

The preceding command restores only the indices that begin with the letter c. The other available parameters are as follows:

· ignore_unavailable: This is the same as in the snapshot creation.

· include_global_state: This is the same as in the snapshot creation.

· rename_pattern: This allows you to change the name of the index stored in the snapshot. Thanks to this, the restored index will have a different name. The value of this parameter is a regular expression that defines the source index name. If the pattern matches the name of an index, the name substitution will occur. In the pattern, you should use groups delimited by parentheses, which can then be referenced in the rename_replacement parameter.

· rename_replacement: This, along with rename_pattern, defines the target index name. Using the dollar sign and a number, you can refer to the appropriate group from rename_pattern.

For example, with rename_pattern=products_(.*), only the indices whose names begin with products_ will be renamed, and the rest of the index name will be captured by the first group. Combined with rename_replacement=items_$1, this causes the products_cars index to be restored to an index called items_cars.
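For example, a restore command combining these parameters could look as follows (the products_* index names are only an assumption made for illustration; use the names actually stored in your snapshot):

curl -XPOST 'localhost:9200/_snapshot/backup/bckp1/_restore?pretty' -d '{ "indices": "products_*", "rename_pattern": "products_(.*)", "rename_replacement": "items_$1" }'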

Cleaning up – deleting old snapshots

Elasticsearch leaves snapshot repository management up to you. Currently, there is no automatic cleanup process. But don't worry; this is simple. For example, let's remove our previously taken snapshot:

curl -XDELETE 'localhost:9200/_snapshot/backup/bckp1?pretty'

And that's all. The command causes the snapshot named bckp1 from the backup repository to be deleted.

Monitoring your cluster's state and health

During the normal life of an application, monitoring is a very important aspect. It allows the administrators of the system to detect possible problems, prevent them before they occur, or at least know what happened during a failure.

Elasticsearch provides very detailed information that allows you to check and monitor a node or the cluster as a whole. This includes statistics and information about servers, nodes, indices, and shards. Of course, we are also able to get information about the whole cluster state. Before we get into the details of the mentioned APIs, please remember that they are complex and we are only describing the basics. We will try to show you where to start, so you'll know what to look for when you need very detailed information.

The cluster health API

One of the most basic APIs is the cluster health API that allows us to get information about the whole cluster state with a single HTTP command. For example, let's run the following command:

curl 'localhost:9200/_cluster/health?pretty'

A sample response returned by Elasticsearch for the preceding command looks as follows:

{

"cluster_name" : "es-book",

"status" : "green",

"timed_out" : false,

"number_of_nodes" : 1,

"number_of_data_nodes" : 1,

"active_primary_shards" : 4,

"active_shards" : 4,

"relocating_shards" : 0,

"initializing_shards" : 0,

"unassigned_shards" : 0

}

The most important information is the one about the status of the cluster. In our example, we see that the cluster is in the green status. This means that all the shards have been allocated properly and there were no errors.

Let's stop here and talk about the cluster and when, as a whole, it will be fully operational. The cluster is fully operational when Elasticsearch is able to allocate all the shards and replicas according to the configuration. When this happens, the cluster is in the green state. The yellow state means that we are ready to handle requests because the primary shards are allocated but some (or all) replicas are not. The last state, the red one, means that at least one primary shard was not allocated and because of this, the cluster is not ready yet. This means that the queries may return errors or incomplete results.

The preceding command can also be executed to check the health state of a certain index. For example, if we want to check the health of the library and map indices, we will run the following command:

curl 'localhost:9200/_cluster/health/library,map/?pretty'

Controlling information details

Elasticsearch allows us to specify a special level parameter that can take the value of cluster (the default), indices, or shards. This allows us to control the level of detail of the information returned by the health API. We've already seen the default behavior. When setting the level parameter to indices, apart from the cluster information, we will also get the per index health. Setting the mentioned parameter to shards tells Elasticsearch to return per shard information in addition to what we've seen in the example.
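For example, to also get the per index health information, we could run a command similar to the following one:

curl 'localhost:9200/_cluster/health?level=indices&pretty'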

Additional parameters

In addition to the level parameter, we have a few additional parameters that can control the behavior of the health API.

The first of the mentioned parameters is timeout. It allows us to control the maximum time the command execution will wait. By default, it is set to 30s, which means that the health command will wait for a maximum of 30 seconds before returning the response.

The wait_for_status parameter allows us to tell Elasticsearch which health status the cluster should be at before returning the command. It can take the values green, yellow, and red. For example, when it is set to green, the health API call will not return until the green status or the timeout is reached.

The wait_for_nodes parameter allows us to set the required number of nodes that need to be available before the health command response is returned (or until the defined timeout is reached). It can be set to an integer number such as 3 or to a simple equation such as >=3 (meaning three or more nodes) or <=3 (meaning three or fewer nodes).

The last parameter is wait_for_relocating_shards, which is not specified by default. It allows us to tell Elasticsearch how many relocating shards it should wait for (or until the timeout is reached). Setting this parameter to 0 means that Elasticsearch should wait until there are no relocating shards, that is, until all the relocations have finished.

An example usage of the health command with some of the mentioned parameters is as follows:

curl 'localhost:9200/_cluster/health?wait_for_status=green&wait_for_nodes=>=3&timeout=100s'

The indices stats API

The Elasticsearch index is the place where our data lives, and it is a crucial part of most deployments. With the use of the indices stats API, available using the _stats endpoint, we can get various information about the indices living inside our cluster. Of course, as with most of the APIs in Elasticsearch, we can send a command to get information about all the indices (using the pure _stats endpoint), about one particular index (for example, library/_stats), or about several indices at the same time (for example, library,map/_stats). For example, to check the statistics for the map and library indices we've used in the book, we could run the following command:

curl localhost:9200/library,map/_stats?pretty

The response to the preceding command has more than 500 lines, so we will only describe its structure and omit the response itself. Apart from the information about the response status and the response time, we can see three objects named primaries, total, and indices. The indices object contains information about the library and map indices. The primaries object contains information about the primary shards allocated on the current node, and the total object contains information about all the shards, including replicas. All these objects can contain objects that describe a particular statistic, such as docs, store, indexing, get, search, merges, refresh, flush, warmer, filter_cache, id_cache, fielddata, percolate, completion, segments, and translog. Let's discuss the information stored in these objects.

Docs

The docs section of the response shows information about the indexed documents. For example, it could look as follows:

"docs" : {

"count" : 4,

"deleted" : 0

}

The main information is count, indicating the number of documents. When we delete documents from the index, Elasticsearch doesn't remove them immediately but only marks them as deleted. The documents are physically deleted during the segment merge process. The number of documents marked as deleted is presented as the deleted attribute and should be 0 right after a merge.

Store

The next set of statistics, store, provides information regarding storage. For example, such a section could look as follows:

"store" : {

"size_in_bytes" : 6003,

"throttle_time_in_millis" : 0

}

The main information is about the index (or indices) size. We can also look at the throttling statistics. This information is useful when the system has problems with I/O performance and limits have been configured for internal operations such as segment merging.

Indexing, get, and search

The indexing, get, and search sections of the response provide information about data manipulation: indexing (along with delete operations), real-time get, and searching. Let's look at the following example returned by Elasticsearch:

"indexing" : {

"index_total" : 11501,

"index_time_in_millis" : 4574,

"index_current" : 0,

"delete_total" : 0,

"delete_time_in_millis" : 0,

"delete_current" : 0

},

"get" : {

"total" : 3,

"time_in_millis" : 0,

"exists_total" : 2,

"exists_time_in_millis" : 0,

"missing_total" : 1,

"missing_time_in_millis" : 0,

"current" : 0

},

"search" : {

"query_total" : 0,

"query_time_in_millis" : 0,

"query_current" : 0,

"fetch_total" : 0,

"fetch_time_in_millis" : 0,

"fetch_current" : 0

}

As we can see, all of these statistics have a similar structure. We can read the total time spent on the various request types (in milliseconds) and the number of requests, which together allow us to calculate the average time of a single operation. In the case of real-time get requests, the valuable information is the number of fetches that were unsuccessful (missing documents).

Additional information

In addition, Elasticsearch provides the following information:

· merges: This section contains information about Lucene segment merges

· refresh: This section contains information about refresh operations

· flush: This section contains information about flushes

· warmer: This section contains information about warmers and how long they were executed

· filter_cache: These are the filter cache statistics

· id_cache: These are the identifiers cache statistics

· fielddata: These are the field data cache statistics

· percolate: This section contains information about the percolator usage

· completion: This section contains information about the completion suggester

· segments: This section contains information about Lucene segments

· translog: This section contains information about transaction logs count and size

The status API

Another way to obtain information about indices is the status API, available by using the _status endpoint. The returned information describes the available shards and includes which shard is currently considered the primary one, which node it is assigned to, which node it is being relocated to (if it is), the status of the shard (whether it is active or not), information about the transaction log and the merge process, and the refresh and flush statistics.
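For example, to get this information for the library index, we could run a command similar to the following one:

curl -XGET 'localhost:9200/library/_status?pretty'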

The nodes info API

The nodes info API provides us with the information about the nodes in the cluster. To get information from this API, we need to send the request to the _nodes REST endpoints.

This API can be used to fetch information about particular nodes or a single node using the following aspects:

· Node name: If we want to get information about the node named Pulse, we can run a command to the _nodes/Pulse REST endpoint

· Node identifier: If we want to get information about the node with an identifier equal to ny4hftjNQtuKMyEvpUdQWg, we can run a command to the _nodes/ny4hftjNQtuKMyEvpUdQWg REST endpoint

· IP address: If we want to get information about the node with an IP address equal to 192.168.1.103, we can run a command to the _nodes/192.168.1.103 REST endpoint

· Parameters from the Elasticsearch configuration: If we want to get information about all the nodes with the node.rack property set to 2, we can run a command to the /_nodes/rack:2 REST endpoint

This API also allows you to get information about several nodes at once by using the following:

· Patterns, for example, _nodes/192.168.1.* or _nodes/P*

· Nodes enumeration, for example, _nodes/Pulse,Slab

· Both patterns and enumerations, for example, /_nodes/P*,S*

By default, the request to the nodes API will return the basic information about a node, such as name, identifier, and its address. But by adding additional parameters, we can obtain other information. The available parameters are as follows:

· settings: This parameter is used to get Elasticsearch configuration

· os: This parameter is used to get information about the server, such as processor, RAM, and swap space

· process: This parameter is used to get the process identifier and the available file descriptors

· jvm: This parameter is used to get information about the Java virtual machine (JVM), such as the memory limit

· thread_pool: This parameter is used to get the configuration of the thread pools for various operations

· network: This parameter is used to get name and addresses of the network interface

· transport: This parameter is used to get the listen addresses for transport

· http: This parameter is used to get the listen addresses for HTTP

· plugins: This parameter is used to get information about the installed plugins

An example usage of the earlier described API can be illustrated by the following command:

curl 'localhost:9200/_nodes/Pulse/os,jvm,plugins?pretty'

The preceding command will return information related to the operating system, the Java virtual machine, and plugins in addition to the basic information. Of course, all the information will be about the node named Pulse.

The nodes stats API

The nodes stats API is similar to the nodes info API described previously. The main difference is that the previous API provides information about the environment, and the one we are currently discussing tells us what happened with the cluster during its work. To use the nodes stats API, one needs to send a command to the /_nodes/stats REST endpoint. However, similar to the nodes info API, we can also retrieve information about specific nodes (for example, _nodes/Pulse/stats).

By default, Elasticsearch returns all the available statistics, but we can limit it to the ones we are interested in. The available options are as follows:

· indices: This provides information about the indices, including size, document count, indexing-related statistics, search and get time, caches, segment merges, and so on

· os: This provides operating system related information, such as free disk space, memory, and swap usage

· process: This provides information about the memory, CPU, and file handler usage related to the Elasticsearch process

· jvm: This provides information about the Java virtual machine memory and garbage collector statistics

· network: This provides TCP-level network information

· transport: This provides information about data sent and received by the transport module

· http: This provides information about HTTP connections

· fs: This provides information about the available disk space and I/O operations statistics

· thread_pool: This provides information about the state of the threads assigned to various operations

· breaker: This provides information about the field data cache circuit breaker

An example usage of the previously described API can be illustrated by the following command:

curl 'localhost:9200/_nodes/Pulse/stats/os,jvm,breaker?pretty'

The cluster state API

Another API provided by Elasticsearch is the cluster state API. As the name suggests, it allows us to get information about the whole cluster. (We can also limit the returned information to a local node by adding the local=true parameter to the request.) The basic command used to get all the information returned by the discussed API looks as follows:

curl 'localhost:9200/_cluster/state?pretty'

However, we can also limit the provided information to the given metrics (separated by commas and specified after the _cluster/state part of the REST call) and to the given indices (again separated by commas and specified after the _cluster/state/metrics part of the REST call). An example call that would only return node-related information limited to the map and library indices could look as follows:

curl 'localhost:9200/_cluster/state/nodes/map,library?pretty'

The following metrics can be used:

· version: This returns information about the cluster state version.

· master_node: This returns information about the elected master node.

· nodes: This returns information on nodes.

· routing_table: This returns routing-related information.

· metadata: This returns metadata-related information. When specifying the retrieval of the metadata metric, we can also include an additional parameter, index_templates=true, which will result in the defined index templates being included.

· blocks: This returns the blocks part of the response.
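For example, following the metrics described in the preceding list, a call limited to the metadata metric that also includes the defined index templates (using the index_templates parameter mentioned earlier) could look as follows:

curl -XGET 'localhost:9200/_cluster/state/metadata?index_templates=true&pretty'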

The pending tasks API

One of the APIs introduced in Elasticsearch 1.0 is the pending tasks API, which allows us to check which tasks are waiting to be executed. To retrieve this information, we need to send a request to the /_cluster/pending_tasks REST endpoint. In the response, we will see an array of tasks with information about them, such as task priority and time in queue.
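Such a call is as simple as the following one:

curl -XGET 'localhost:9200/_cluster/pending_tasks?pretty'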

The indices segments API

The last API we wanted to mention is the Lucene segments API available by using the /_segments endpoint. We can run it for the whole cluster and for individual indices too. This API provides information about shards, their placement, and the segments connected with the physical index managed by the Lucene library.
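For example, to check the segments of the library index, we could run the following command:

curl -XGET 'localhost:9200/library/_segments?pretty'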

The cat API

Of course, we may say that all the information we need to diagnose and observe the cluster can be retrieved by using the APIs described so far. However, the responses returned by those APIs are in JSON, which is great for programs, but not especially convenient for a human being. That's why Elasticsearch allows us to use a friendlier API: the cat API.

To use the cat API, one needs to send a request to the _cat REST endpoint followed by one of the options, which are as follows:

· aliases: This returns information about aliases (we'll learn about aliases in the Index aliasing and using it to simplify your everyday work section of this chapter)

· allocation: This returns information about the allocated shards and disk usage

· count: This returns information about the document count for all the indices or an individual one

· health: This returns information about cluster health

· indices: This returns information about all the indices or an individual one

· master: This returns information about the elected master node

· nodes: This returns information about the cluster topology

· pending_tasks: This returns information about the tasks that are waiting to be executed

· recovery: This provides a view of the recovery process

· thread_pool: This provides cluster wide statistics regarding thread pools

· shards: This returns information about shards

This may be a bit confusing, so let's have a look at an example command that would return information about shards. The command that would allow us to get this information is given as follows:

curl -XGET 'localhost:9200/_cat/shards?v'

Note

Note that we've included the v parameter in the request. This means that we want the information to be more verbose, for example, including the header. In addition to the v parameter, we can also use the help parameter, which will return the description of the headers for a given command, and the h parameter, which accepts a comma-separated list of the columns we want to include in the response.

And the response to the preceding command will look as follows:

index shard prirep state docs store ip node

map 0 p STARTED 4 5.9kb 192.168.1.40 es_node_1

library 0 p STARTED 9 11.8kb 192.168.56.1 es_node_2

We can see that we have two indices, each with a single shard. We can also see the ID of the shard, that is, whether it is a primary shard, its state, number of documents, its size, node IP address, and the node name.
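If we only cared about shard placement, we could use the h parameter mentioned in the preceding note to limit the output to the chosen columns, for example:

curl -XGET 'localhost:9200/_cat/shards?v&h=index,shard,prirep,state,node'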

Limiting returned information

Some of the cat API commands allow us to limit the information they return. For example, the aliases call allows us to get information about a specific alias by appending the alias name just like in the following command:

curl -XGET 'localhost:9200/_cat/aliases/current_index'

Let's summarize the commands that allow information limiting:

· aliases: This limits the information to a specific alias by appending the alias in the request

· count: This limits the information to a specific index by appending the index name we are interested in to the request

· indices: This limits the information to a specific index just like in the count command

· shards: This limits the information to a specific index by appending the index name we are interested in

Controlling cluster rebalancing

By default, Elasticsearch tries to keep the shards and their replicas evenly balanced across the cluster. Such behavior is good in most cases, but there are times when we want to control it. In this section, we will look at how to avoid cluster rebalancing and how to control the behavior of this process in depth.

Imagine a situation where you know that your network can handle a very high amount of traffic, or the opposite, your network is used extensively and you want to avoid putting too much stress on it. Another example is that you may want to decrease the pressure put on your I/O subsystem after a full-cluster restart, and you want fewer shards and replicas to be initialized at the same time. These are only two examples where rebalance control may be handy.

Rebalancing

Rebalancing is the process of moving shards between the different nodes in your cluster. As we have already mentioned, it is fine in most situations, but sometimes you may want to avoid it completely. For example, if we have carefully defined how our shards are placed and we want to keep it that way, we want to avoid rebalancing. However, by default, Elasticsearch will try to rebalance the cluster whenever the cluster state changes and Elasticsearch thinks rebalancing is needed.

Cluster being ready

We already know that our indices can be built of shards and replicas. Primary shards, or just shards, are the ones used when new documents are indexed, updated, or deleted, or in the case of any other index change. We also have replicas, which get their data from the primary shards.

You can think of the cluster as being ready to be used when all the primary shards are assigned to their nodes in your cluster, that is, as soon as the yellow health state is achieved. Elasticsearch may still be initializing other shards, the replicas. However, you can already use your cluster and be sure that you can search your whole data set and send index change commands; those will be processed properly.

The cluster rebalance settings

Elasticsearch lets us control the rebalance process with the use of a few properties that can be set in the elasticsearch.yml file or by using the Elasticsearch REST API (described in the The update settings API section of this chapter).

Controlling when rebalancing will start

The cluster.routing.allocation.allow_rebalance property allows us to specify when rebalancing will be started. This property can take the following values:

· always: This value indicates that rebalancing will start as soon as it's needed

· indices_primaries_active: This value indicates that rebalancing will start when all the primary shards are initialized

· indices_all_active: This is the default value, which means that rebalancing will start when all the shards and replicas are initialized

Controlling the number of shards being moved between nodes concurrently

The cluster.routing.allocation.cluster_concurrent_rebalance property allows us to specify how many shards can be moved between nodes at once in the whole cluster. If you have a cluster that is built of many nodes, you can increase this value. This value defaults to 2.

Controlling the number of shards initialized concurrently on a single node

The cluster.routing.allocation.node_concurrent_recoveries property lets us set the number of shards that Elasticsearch is allowed to initialize on a single node at once. Please note that the shard recovery process is very I/O intensive, so you'll probably want to avoid too many shards being recovered concurrently. This value defaults to the same value as the previous one: 2.

Controlling the number of primary shards initialized concurrently on a single node

The cluster.routing.allocation.node_initial_primaries_recoveries property lets us control how many primary shards are allowed to be concurrently initialized on a node.

Controlling types of shards allocation

By using the cluster.routing.allocation.enable property, we can control what kind of shards are allowed to be allocated. The mentioned property can take the following values:

· all: This is the default value, which tells Elasticsearch that all types of shards are allowed to be allocated

· primaries: This tells Elasticsearch that it should only allocate primary shards and leave the replicas that are not allocated

· new_primaries: This tells Elasticsearch that only the newly created primary shards can be allocated

· none: This disables shard allocation completely
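For example, a common use of this property is to temporarily disable shard allocation (for instance, before restarting a node) by using the update settings API described later in this chapter. A sketch of such a call could look as follows:

curl -XPUT 'localhost:9200/_cluster/settings' -d '{ "transient" : { "cluster.routing.allocation.enable" : "none" } }'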

Controlling the number of concurrent streams on a single node

The indices.recovery.concurrent_streams property allows us to control how many streams are allowed to be opened at once on a node in order to recover a shard from a peer shard. It defaults to 3. If you know that your network and nodes can handle more, you can increase this value.
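To sum up, a minimal elasticsearch.yml fragment combining the properties described in this section could look as follows (the values shown are only an illustration and should be adjusted to your hardware and network):

cluster.routing.allocation.allow_rebalance: indices_all_active

cluster.routing.allocation.cluster_concurrent_rebalance: 4

cluster.routing.allocation.node_concurrent_recoveries: 4

cluster.routing.allocation.node_initial_primaries_recoveries: 4

indices.recovery.concurrent_streams: 5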

Controlling the shard and replica allocation

Indices that live inside your Elasticsearch cluster can be built of many shards, and each shard can have many replicas. With the ability to split a single index into multiple shards, we can deal with indices that are too large to fit on a single machine. The reasons may differ, from memory and CPU constraints to storage-related ones. With the ability to have multiple replicas of each shard, we can handle a higher query load by spreading the replicas over multiple servers. We can say that by using shards and replicas, we can scale out Elasticsearch. However, Elasticsearch has to figure out where in the cluster it should place the shards and replicas; it needs to decide on which server/node each shard or replica should be placed.

Explicitly controlling allocation

Imagine that we want our indices to be placed on different cluster nodes. For example, we want one index named shop to be placed on some nodes and a second index called users to be placed on other nodes. Finally, we want the last index, called promotions, to be placed on all the nodes on which the users and shop indices were placed. We may want to do this for performance reasons; we know that some of the servers on which we installed Elasticsearch are more powerful than the others. With the default Elasticsearch behavior, we can't be sure where the shards and replicas will be placed, but luckily, Elasticsearch allows us to control this.

Specifying node parameters

So let's divide our cluster into two zones. We say zones, but it can be any name you want; we just like zone. We will assume that we have four nodes. We want our more powerful nodes numbered 1 and 2 to be placed in a zone called zone_one, and the nodes numbered 3 and 4, which are smaller in terms of resources, to be placed in a zone called zone_two.

Configuration

To achieve the described indices distribution, we add the following property to the elasticsearch.yml configuration file on nodes 1 and 2 (the nodes that are more powerful):

node.zone: zone_one

Of course, we will add a similar property to the elasticsearch.yml configuration file on nodes 3 and 4 (the less powerful nodes):

node.zone: zone_two

Index creation

Now let's create our indices. First, let's create the shop index and place it on the more powerful nodes. We can do this by running the following command:

curl -XPUT 'http://localhost:9200/shop' -d '{

"settings" : {

"index" : {

"routing.allocation.include.zone" : "zone_one"

}

}

}'

The preceding command will result in the shop index being created and the index.routing.allocation.include.zone property being set for it. We set this property to the zone_one value, which means that we want to place the shop index on the nodes that have the node.zone property set to zone_one.

We perform similar steps for the users index:

curl -XPUT 'http://localhost:9200/users' -d '{

"settings" : {

"index" : {

"routing.allocation.include.zone" : "zone_two"

}

}

}'

However, this time we've specified that we want the users index to be placed on the nodes with the node.zone property set to zone_two.

Finally, the promotions index should be placed in all the preceding nodes, so we will use the following command to create and configure this index:

curl -XPOST 'http://localhost:9200/promotions'

curl -XPUT 'http://localhost:9200/promotions/_settings' -d '{

"index.routing.allocation.include.zone" : "zone_one,zone_two"

}'

This time we've used a different set of commands. The first one creates the index, and the second one updates the index.routing.allocation.include.zone property. We did this just to illustrate that it can be done in such a way.

Excluding nodes from allocation

In the same manner as we specified the nodes on which an index should be placed, we can also exclude nodes from index allocation. Referring to the previously shown example, if we would like the index called pictures to not be placed on the nodes with the node.zone property set to zone_one, we would run the following command:

curl -XPUT 'localhost:9200/pictures/_settings' -d '{

"index.routing.allocation.exclude.zone" : "zone_one"

}'

Notice that instead of the index.routing.allocation.include.zone property, we've used the index.routing.allocation.exclude.zone property.

Requiring node attributes

In addition to inclusion and exclusion rules, we can also specify the rules that must match for a shard to be allocated to a given node. The difference is that when using the index.routing.allocation.include property, the index will be placed on any node that matches at least one of the provided property values. By using the index.routing.allocation.require property, Elasticsearch will place the index on a node that has all the defined values. For example, let's assume that we've set the following settings for the pictures index:

curl -XPUT 'localhost:9200/pictures/_settings' -d '{

"index.routing.allocation.require.size" : "big_node",

"index.routing.allocation.require.zone" : "zone_one"

}'

After running the preceding command, Elasticsearch would only place the shards of the pictures index on a node with the node.size property set to big_node and the node.zone property set to zone_one.

Using IP addresses for shard allocation

Instead of adding a special parameter to the nodes' configuration, we can use IP addresses to specify which nodes we want to include in or exclude from the shards and replicas allocation. To do this, instead of using the zone part of the index.routing.allocation.include.zone or index.routing.allocation.exclude.zone properties, we use _ip. For example, if we would like our shop index to be placed only on the nodes with the IP addresses 10.1.2.10 and 10.1.2.11, we will run the following command:

curl -XPUT 'localhost:9200/shop/_settings' -d '{

"index.routing.allocation.include._ip" : "10.1.2.10,10.1.2.11"

}'

Disk-based shard allocation

In addition to the already described allocation filtering methods, Elasticsearch 1.0 brings one additional method: the disk-based one. It allows us to set allocation rules based on the nodes' disk usage, so that we don't run out of disk space.

Enabling disk-based shard allocation

The disk-based shard allocation is disabled by default. We can enable it by specifying the cluster.routing.allocation.disk.threshold_enabled property and setting it to true. We can do this in the elasticsearch.yml file or by dynamically using the cluster settings API (you can read about it in the The update settings API section of this chapter):

curl -XPUT localhost:9200/_cluster/settings -d '{

"transient" : {

"cluster.routing.allocation.disk.threshold_enabled" : true

}

}'

Configuring disk-based shard allocation

There are three properties that control the behavior of disk-based shard allocation. All of them can be updated dynamically or set in the elasticsearch.yml configuration file.

The first of these is cluster.info.update.interval, which is by default set to 30 seconds and defines how often Elasticsearch updates information about disk usage on the nodes.

The second property is cluster.routing.allocation.disk.watermark.low, which is by default set to 0.70. This means that Elasticsearch will not allocate new shards to a node that uses more than 70 percent of its disk space.

The third property is cluster.routing.allocation.disk.watermark.high, which controls when Elasticsearch will start relocating shards from a given node. It defaults to 0.85 and means that Elasticsearch will start reallocating shards when the disk usage on a given node is equal to or more than 85 percent.

Both the cluster.routing.allocation.disk.watermark.low and cluster.routing.allocation.disk.watermark.high properties can be set to a percentage value (such as 0.60, meaning 60 percent) and to an absolute value (such as 600mb, meaning 600 megabytes).
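For example, a sketch of adjusting both watermarks dynamically using the cluster settings API (the values shown are only an example) could look as follows:

curl -XPUT 'localhost:9200/_cluster/settings' -d '{ "transient" : { "cluster.routing.allocation.disk.watermark.low" : "0.80", "cluster.routing.allocation.disk.watermark.high" : "0.90" } }'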

Cluster wide allocation

Instead of specifying allocation inclusion and exclusion at the index level (which we did until now), we can do that for all the indices in our cluster. For example, if we would like to place all new indices on the nodes with the IP addresses 10.1.2.10 and 10.1.2.11, we will run the following command:

curl -XPUT 'localhost:9200/_cluster/settings' -d '{

"transient" : {

"cluster.routing.allocation.include._ip" : "10.1.2.10,10.1.2.11"

}

}'

Notice that the command was sent to the _cluster/settings REST endpoint instead of the INDEX_NAME/_settings endpoint. Of course, we can use the include, exclude, and require rules just as we did at the index level.

Please note that the transient and persistent cluster properties were discussed in the Controlling cluster rebalancing section earlier in this chapter.

Number of shards and replicas per node

In addition to specifying shards and replica allocation, we are also allowed to specify the maximum number of shards that can be placed on a single node for a single index. For example, if we would like our shop index to have only a single shard per node, we will run the following command:

curl -XPUT 'localhost:9200/shop/_settings' -d '{

"index.routing.allocation.total_shards_per_node" : 1

}'

This property can be placed in the elasticsearch.yml file or can be updated on live indices using the preceding command. Please remember that your cluster can stay in the red state if Elasticsearch is not able to allocate all the primary shards.

Moving shards and replicas manually

The last thing we want to discuss is the ability to manually move shards between nodes. This may be useful, for example, if you want to bring a single node down, but before doing this, you want to move all the shards from that node. Elasticsearch exposes the _cluster/reroute REST endpoint, which allows us to control this. The following operations are available:

· Moving a shard from node to node

· Canceling shard allocation

· Forcing shard allocation

Now let's look closer at all of the preceding operations.

Moving shards

Let's say we have two nodes called es_node_one and es_node_two. In addition to this, we have two shards of the shop index placed by Elasticsearch on the first node. Now, we would like to move the second shard to the second node. In order to do this, we can run the following command:

curl -XPOST 'localhost:9200/_cluster/reroute' -d '{

"commands" : [ {

"move" : {

"index" : "shop",

"shard" : 1,

"from_node" : "es_node_one",

"to_node" : "es_node_two"

}

} ]

}'

We've specified the move command, which allows us to move shards (and replicas) of the index specified by the index property. The shard property is the number of the shard we want to move. Finally, the from_node property specifies the name of the node we want to move the shard from, and the to_node property specifies the name of the node we want the shard to be placed on.

Canceling shard allocation

If we would like to cancel an ongoing allocation process, we can run the cancel command and specify the index, node, and shard we want to cancel the allocation for. For example, consider the following command:

curl -XPOST 'localhost:9200/_cluster/reroute' -d '{

"commands" : [ {

"cancel" : {

"index" : "shop",

"shard" : 0,

"node" : "es_node_one"

}

} ]

}'

The preceding command will cancel the allocation of the shard 0 of the shop index on the es_node_one node.

Forcing shard allocation

In addition to canceling and moving shards and replicas, we can also allocate an unallocated shard to a specific node. For example, if we have an unallocated shard numbered 0 for the users index and we would like Elasticsearch to allocate it to es_node_two, we will run the following command:

curl -XPOST 'localhost:9200/_cluster/reroute' -d '{

"commands" : [ {

"allocate" : {

"index" : "users",

"shard" : 0,

"node" : "es_node_two"

}

} ]

}'

Multiple commands per HTTP request

We can, of course, include multiple commands in a single HTTP request. For example, consider the following command:

curl -XPOST 'localhost:9200/_cluster/reroute' -d '{

"commands" : [

{"move" : {"index" : "shop", "shard" : 1, "from_node" : "es_node_one", "to_node" : "es_node_two"}},

{"cancel" : {"index" : "shop", "shard" : 0, "node" : "es_node_one"}}

]

}'

Warming up

Sometimes, there may be a need to prepare Elasticsearch in order to handle your queries. Maybe it's because you heavily rely on the field data cache and you want it to be loaded before your production queries arrive or maybe you want to warm up your operating system's I/O cache. Whatever the reason, Elasticsearch allows us to define the warming queries for our types and indices.

Defining a new warming query

A warming query is nothing more than a usual query registered in Elasticsearch under the special _warmer endpoint. Let's assume that we have the following query that we want to use for warming up:

{

"query" : {

"match_all" : {}

},

"facets" : {

"warming_facet" : {

"terms" : {

"field" : "tags"

}

}

}

}

To store the preceding query as a warming query for our library index, we will run the following command:

curl -XPUT 'localhost:9200/library/_warmer/tags_warming_query' -d '{

"query" : {

"match_all" : {}

},

"facets" : {

"warming_facet" : {

"terms" : {

"field" : "tags"

}

}

}

}'

The preceding command will register our query as a warming query with the name tags_warming_query. You can have multiple warming queries for your index, but each of these queries needs to have a unique name.

We can define warming queries not only for the whole index, but also for a specific type in it. For example, to store our previously shown query as the warming query only for the book type in the library index, we should run the preceding command not against the /library/_warmer URI but against /library/book/_warmer. So, the entire command will be as follows:

curl -XPUT 'localhost:9200/library/book/_warmer/tags_warming_query' -d '{

"query" : {

"match_all" : {}

},

"facets" : {

"warming_facet" : {

"terms" : {

"field" : "tags"

}

}

}

}'

After adding a warming query, before Elasticsearch allows a new segment to be searched, the segment will be warmed up by running the defined warming queries on it. This allows Elasticsearch and the operating system to cache data and thus speed up searching.

Note

Just as we read in the Full-text searching section of Chapter 1, Getting Started with the Elasticsearch Cluster, Lucene divides the index into parts called segments, which once written can't be changed. Every new commit operation creates a new segment (which is eventually merged if the number of segments is too high), which Lucene uses for searching.

Retrieving the defined warming queries

In order to get a specific warming query for our index, we just need to know its name. For example, if we want to get the warming query named tags_warming_query for our library index, we will run the following command:

curl -XGET 'localhost:9200/library/_warmer/tags_warming_query?pretty=true'

The result returned by Elasticsearch will be as follows (note that we've used the pretty=true parameter to make the response easier to read):

{

"library" : {

"warmers" : {

"tags_warming_query" : {

"types" : [ ],

"source" : {

"query" : {

"match_all" : { }

},

"facets" : {

"warming_facet" : {

"terms" : {

"field" : "tags"

}

}

}

}

}

}

}

}

We can also get all the warming queries for the index and type by using the following command:

curl -XGET 'localhost:9200/library/_warmer'

And finally, we can also get all the warming queries that start with the given prefix. For example, if we want to get all the warming queries for the library index that start with the tags prefix, we will run the following command:

curl -XGET 'localhost:9200/library/_warmer/tags*'

Deleting a warming query

Deleting a warming query is very similar to getting one; we just need to use the DELETE HTTP method. To delete a specific warming query from our index, we just need to know its name. For example, if we want to delete the warming query named tags_warming_query for our library index, we will run the following command:

curl -XDELETE 'localhost:9200/library/_warmer/tags_warming_query'

We can also delete all the warming queries for the index by using the following command:

curl -XDELETE 'localhost:9200/library/_warmer/_all'

And finally, we can also remove all the warming queries that start with the given prefix. For example, if we want to remove all the warming queries for the library index that start with the tags prefix, we will run the following command:

curl -XDELETE 'localhost:9200/library/_warmer/tags*'

Disabling the warming up functionality

To disable the warming queries totally, while still keeping them stored, you should set the index.warmer.enabled configuration property to false (setting this property to true will result in enabling the warming up functionality). This setting can either be put into the elasticsearch.yml file or set using the REST API on a live cluster.

For example, if we want to disable the warming up functionality for the library index, we will run the following command:

curl -XPUT 'http://localhost:9200/library/_settings' -d '{

"index.warmer.enabled" : false

}'

Choosing queries

You may ask which queries should be used as the warming queries; typically, you'll want to choose the ones that are expensive to execute and the ones that require caches to be populated. So you'll probably want to choose the queries that include faceting and sorting based on the fields in your index. In addition to this, parent-child queries and the ones that include common filters may also be the ones to consider. You may also choose other queries by looking at the logs, finding where your performance is not as great as you want it to be. Such queries may also be perfect candidates for warming up.

For example, let's say that we have the following logging configuration set in the elasticsearch.yml file:

index.search.slowlog.threshold.query.warn: 10s

index.search.slowlog.threshold.query.info: 5s

index.search.slowlog.threshold.query.debug: 2s

index.search.slowlog.threshold.query.trace: 1s

And, we have the following logging level set in the logging.yml configuration file:

logger:

index.search.slowlog: TRACE, index_search_slow_log_file

Notice that the index.search.slowlog.threshold.query.trace property is set to 1s, and the index.search.slowlog logging level is set to TRACE. This means that whenever a query is executed for longer than one second (on a shard, not in total), it will be logged into the slow logfile (the name of which is specified by the index_search_slow_log_file configuration section of the logging.yml configuration file). For example, the following can be found in a slow logfile:

[2013-01-24 13:33:05,518][TRACE][index.search.slowlog.query] [Local test] [library][1] took[1400.7ms], took_millis[1400], search_type[QUERY_THEN_FETCH], total_shards[32], source[{"query":{"match_all":{}}}], extra_source[]

As you can see, in the preceding log line, we have the query time, search type, and the query source itself, which shows us the executed query.

Of course, the values can be different in your configuration, but the slow log can be a valuable source of queries that run for too long and may need some warming up defined. Maybe these are parent-child queries that need some identifiers to be fetched to perform better, or maybe you are using a filter that is expensive the first time it is executed.

Note

There is one thing you should remember: don't overload your Elasticsearch cluster with too many warming queries because you may end up spending too much time warming up instead of processing your production queries.

Index aliasing and using it to simplify your everyday work

When working with multiple indices in Elasticsearch, you can sometimes lose track of them. Imagine a situation where you store logs in your indices. Usually, the amount of log messages is quite large, and therefore, it is a good solution to have the data divided somehow. A logical division of such data is to create a single index for a single day of logs (if you are interested in an open source solution for managing logs, look at Logstash at http://logstash.net). However, after some time, if we keep all the indices, we will start to have problems taking care of all of them. An application needs to handle all the details, such as which index to send data to, which indices to query, and so on. With the help of aliases, we can work with a single name, just as we would use a single index, while actually working with multiple indices.

An alias

What is an index alias? It's an additional name for one or more indices that allows us to refer to those indices using that name. A single alias can point to multiple indices, and the other way around: a single index can be a part of multiple aliases.

However, please remember that you can't use an alias that points to multiple indices for indexing or for real-time GET operations; Elasticsearch will throw an exception if you do that. This is because Elasticsearch doesn't know into which index the data should be indexed or from which index the document should be fetched. We can still use an alias that links to only a single index for indexing, though.

Creating an alias

To create an index alias, we need to run the HTTP POST method to the _aliases REST endpoint with a defined action. For example, the following request will create a new alias called week12 that will include the indices named day10, day11, and day12:

curl -XPOST 'localhost:9200/_aliases' -d '{

"actions" : [

{ "add" : { "index" : "day10", "alias" : "week12" } },

{ "add" : { "index" : "day11", "alias" : "week12" } },

{ "add" : { "index" : "day12", "alias" : "week12" } }

]

}'

If the alias week12 isn't present in our Elasticsearch cluster, the preceding command will create it. If it is present, the command will just add the specified indices to it.

Without the alias, we would run a search across the three indices as follows:

curl -XGET 'localhost:9200/day10,day11,day12/_search?q=test'

With the alias in place, we can instead run it as follows:

curl -XGET 'localhost:9200/week12/_search?q=test'

Isn't this better?

Modifying aliases

Of course, you can also remove indices from an alias. We can do this in a similar manner to how we add indices to an alias, but instead of the add command, we use the remove one. For example, to remove the index named day9 from the week12 alias, we will run the following command:

curl -XPOST 'localhost:9200/_aliases' -d '{

"actions" : [

{ "remove" : { "index" : "day9", "alias" : "week12" } }

]

}'

Combining commands

The add and remove commands can be sent as a single request. For example, if you would like to combine all the previously sent commands into a single request, we will have to send the following command:

curl -XPOST 'localhost:9200/_aliases' -d '{

"actions" : [

{ "add" : { "index" : "day10", "alias" : "week12" } },

{ "add" : { "index" : "day11", "alias" : "week12" } },

{ "add" : { "index" : "day12", "alias" : "week12" } },

{ "remove" : { "index" : "day9", "alias" : "week12" } }

]

}'

Retrieving all aliases

In addition to adding or removing indices to or from aliases, we and the applications that use Elasticsearch may need to retrieve all the aliases available in the cluster or all the aliases that an index is connected to. To retrieve these aliases, we send a request using the HTTP GET command. For example, the first of the following commands gets all the aliases for the day10 index, and the second one gets all the available aliases:

curl -XGET 'localhost:9200/day10/_aliases'

curl -XGET 'localhost:9200/_aliases'

The response from the second command is as follows:

{

"day10" : {

"aliases" : {

"week12" : { }

}

},

"day11" : {

"aliases" : {

"week12" : { }

}

},

"day12" : {

"aliases" : {

"week12" : { }

}

}

}
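If we are only interested in which indices a particular alias points to, we can also query the _alias endpoint directly. The following sketch should return all the indices that the week12 alias is attached to:

curl -XGET 'localhost:9200/_alias/week12'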

Removing aliases

You can also remove an alias using the _alias endpoint. For example, sending the following command will remove the client alias from the data index:

curl -XDELETE localhost:9200/data/_alias/client

Filtering aliases

Aliases can be used in a way similar to how views are used in SQL databases. You can define a filter using the full Query DSL (discussed in detail in Chapter 2, Indexing Your Data), and it will be applied automatically to every count, search, delete by query request, and so on, that uses the alias.

Let's look at an example. Imagine that we want to have aliases that return data for a certain client, so we can use it in our application. Let's say that the client identifier we are interested in is stored in the clientId field, and we are interested in client 12345. So, let's create the alias named client with our data index, which will apply a filter for clientId automatically:

curl -XPOST 'localhost:9200/_aliases' -d '{

"actions" : [

{

"add" : {

"index" : "data",

"alias" : "client",

"filter" : { "term" : { "clientId" : "12345" } }

}

} ]

}'

So, when using the defined alias, your requests will always be filtered by a term filter that ensures that all the returned documents have the 12345 value in the clientId field.
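For example, the following simple query sent through the alias (the test query string is only an illustration) will be run against only those documents that match the alias filter:

curl -XGET 'localhost:9200/client/_search?q=test'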

Aliases and routing

Similar to aliases that use filtering, we can add routing values to aliases. Imagine that we are using routing based on the client identifier and we want to use the same routing values with our aliases. For the alias named client, we will use the routing values 12345, 12346, and 12347 for querying, and only 12345 for indexing. To do this, we create the alias using the following command:

curl -XPOST 'localhost:9200/_aliases' -d '{

"actions" : [

{

"add" : {

"index" : "data",

"alias" : "client",

"search_routing" : "12345,12346,12347",

"index_routing" : "12345"

}

} ]

}'

This way, when we index our data by using the client alias, the values specified by the index_routing property will be used. At the time of query, the ones specified by the search_routing property will be used.
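For illustration, indexing a document through the client alias could look like the following command (the user type and the document fields are only assumptions made for this example); Elasticsearch will route such a document using the 12345 value taken from index_routing:

curl -XPUT 'localhost:9200/client/user/1' -d '{

"clientId" : "12345",

"name" : "Example client"

}'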

There is one more thing. Please look at the following query sent to the previously defined alias:

curl -XGET 'localhost:9200/client/_search?q=test&routing=99999,12345'

The routing value actually used will be 12345. This is because Elasticsearch takes the intersection of the values in the search_routing attribute and in the query's routing parameter, which in our case is just 12345.

Elasticsearch plugins

At various places in this book, we have used different Elasticsearch plugins. You probably remember the additional programming languages used in scripts and support for the attachments described in the Handling files section of Chapter 6, Beyond Full-text Searching. In this section, we will look at how plugins work and how to install them.

The basics

Elasticsearch plugins are located in their own subdirectories in the plugins directory. If you have downloaded a new plugin, you can just create a new directory named after the plugin and unpack the plugin archive to this directory. There is also a more convenient way to install plugins: by using the plugin script. We have used it several times in this book, so now is the time to describe this tool.

Elasticsearch has two main types of plugins, which can be categorized based on their content: Java plugins and site plugins. Elasticsearch treats a site plugin as a set of files that should be served by the built-in HTTP server under the /_plugin/plugin_name/ URL (for example, /_plugin/bigdesk/). In addition, every plugin without Java content is automatically treated as a site plugin. From Elasticsearch's point of view, a site plugin doesn't change anything in Elasticsearch's behavior.

Java plugins usually contain .jar files that are scanned for the es-plugin.properties file. This file contains information about the main class that should be used by Elasticsearch as an entry point to configure the plugin and allow it to extend the Elasticsearch functionality. A Java plugin can also contain a site part that will be served by the built-in HTTP server (just like with the site plugins). This part of the plugin needs to be placed in the _site directory.
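For example, a minimal es-plugin.properties file packaged inside the plugin .jar file could look like the following line (the com.example.CustomPlugin class name is, of course, just a made-up example of a plugin's main class):

plugin=com.example.CustomPlugin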

Installing plugins

By default, plugins are downloaded from the download.elasticsearch.org website. If the plugin is not available in this location, the Maven Central (http://search.maven.org/), Maven Sonatype (https://repository.sonatype.org/), and GitHub (https://github.com/) repositories are checked. The plugin tool assumes that the given plugin address contains the organization name followed by the plugin name and the version number. Let's look at the following command example:

bin/plugin -install elasticsearch/elasticsearch-lang-javascript/2.0.0.RC1

The preceding command results in the installation of a plugin that allows us to use an additional scripting language: JavaScript. We chose version 2.0.0.RC1 of this plugin. We can also omit the version number; in such a case, Elasticsearch will try to find a version equal to the Elasticsearch version or the latest master version of the plugin.

Just so we know what to expect, this is an example result of running the preceding command:

-> Installing elasticsearch/elasticsearch-lang-javascript/2.0.0.RC1...

Trying http://download.elasticsearch.org/elasticsearch/elasticsearch-lang-javascript/elasticsearch-lang-javascript-2.0.0.RC1.zip...

Downloading .......................................................DONE

Installed elasticsearch/elasticsearch-lang-javascript/2.0.0.RC1 into /opt/elasticsearch-1.0.0/plugins/lang-javascript

If you write your own plugin and have no access to the earlier-mentioned sites, there is no problem. The plugin tool also provides the -url option, which allows us to set any location for the plugin, including the local filesystem (using the file:// prefix). For example, the following command will result in the installation of a plugin archived on the local filesystem at /tmp/elasticsearch-lang-javascript-2.0.0.RC1.zip:

bin/plugin -install lang-javascript -url file:///tmp/elasticsearch-lang-javascript-2.0.0.RC1.zip

Removing plugins

Removing a plugin is as simple as removing its directory. You can also do this by using the plugin tool. For example, to remove the river-mongodb plugin, we can run a command as follows:

bin/plugin -remove river-mongodb

Note

You need to restart the Elasticsearch node for the plugin installation or removal to take effect.

The update settings API

Elasticsearch lets us tune it by specifying various parameters in the elasticsearch.yml file. However, you should treat this file as a set of default values that can be changed at runtime using the Elasticsearch REST API.

In order to set one of the properties, we need to use the HTTP PUT method and send a proper request to the _cluster/settings URI. However, we have two options: transient and persistent property settings.

The first one, transient, will set the property only temporarily; it will not survive a full cluster restart. In order to set such a property, we will send the following command:

curl -XPUT 'localhost:9200/_cluster/settings' -d '{

"transient" : {

"PROPERTY_NAME" : "PROPERTY_VALUE"

}

}'

As you can see, in the preceding command, we used the object named transient and we added our property definition there. This means that the property will be valid only until the restart. If we want our property settings to persist between restarts, instead of using the object named transient, we will use the one named persistent.
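For example, to permanently limit the recovery throttling, we could set one of the dynamically updatable properties, such as indices.recovery.max_bytes_per_sec, in the persistent object (the 50mb value here is only an illustration):

curl -XPUT 'localhost:9200/_cluster/settings' -d '{

"persistent" : {

"indices.recovery.max_bytes_per_sec" : "50mb"

}

}'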

At any moment, you can fetch the current settings using the following command:

curl -XGET localhost:9200/_cluster/settings
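For example, after running the preceding persistent update, the response could look similar to the following one (the exact content depends on the settings that were previously changed):

{

"persistent" : {

"indices.recovery.max_bytes_per_sec" : "50mb"

},

"transient" : { }

}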

Summary

In this chapter, we learned how to create backups of our cluster; we created a backup repository, we created backups, and we managed them. In addition to this, we learned how to monitor our cluster using the Elasticsearch API, what the cat API is, and why it is more convenient to use from a human perspective. We also controlled shard allocation, learned how to move a shard around the cluster, and controlled cluster rebalancing. We used the warmers functionality to prepare the cluster for production queries, and we saw how aliasing can help manage the data in our cluster. Finally, we looked at what Elasticsearch plugins are and how to use the update settings API that Elasticsearch provides.

We have reached the end of the book. We hope that it was a nice reading experience and that you found the book interesting. We really hope that you have learned something from this book and that you will now find it easier to use Elasticsearch every day. As the authors of this book and as Elasticsearch users, we tried to bring you, our readers, the best reading experience we could. Of course, Elasticsearch is more than what we have described in the book, especially when it comes to monitoring and administration capabilities and the API. However, the number of pages is limited, and if we described everything in great detail, we would end up with a book that is one thousand pages long. Also, we are afraid we wouldn't be able to write about everything in enough detail. We need to remember that Elasticsearch is not only user friendly, but it also provides a large number of configuration options, querying possibilities, and so on. Because of this, we had to choose which functionality had to be described in greater detail, which had to be only mentioned, and which had to be skipped altogether. We hope that our choices regarding the topics were right.

We would also like to say that it is worth remembering that Elasticsearch is constantly evolving. While writing this book, we went through a few stable versions, finally making it to the 1.0.0 and 1.0.1 releases. Even back then, we knew that new features and improvements would keep coming. Be sure to check www.elasticsearch.org periodically for the release notes of new versions of Elasticsearch if you want to stay up to date with the new features being added. We will also be writing about new features that we think are worth mentioning on www.elasticsearchserverbook.com. So, if you are interested, do visit this site from time to time.