Elasticsearch: The Definitive Guide (2015)

Part VII. Administration, Monitoring, and Deployment

Chapter 46. Post-Deployment

Once you have deployed your cluster in production, there are some tools and best practices to keep your cluster running in top shape. In this short chapter, we talk about configuring settings dynamically, tweaking logging levels, improving indexing performance, and backing up your cluster.

Changing Settings Dynamically

Many settings in Elasticsearch are dynamic and can be modified through the API. Configuration changes that force a node (or cluster) restart are strenuously avoided. And while it’s possible to make the changes through the static configs, we recommend that you use the API instead.

The cluster-update API operates in two modes:

Transient

These changes are in effect until the cluster restarts. Once a full cluster restart takes place, these settings are erased.

Persistent

These changes are permanently in place unless explicitly changed. They will survive full cluster restarts and override the static configuration files.

Transient versus persistent settings are supplied in the JSON body:

PUT /_cluster/settings

{

"persistent" : {

"discovery.zen.minimum_master_nodes" : 2

"transient" : {

"indices.store.throttle.max_bytes_per_sec" : "50mb"

}

This persistent setting will survive full cluster restarts.

This transient setting will be removed after the first full cluster restart.

A complete list of settings that can be updated dynamically can be found in the online reference docs.

Logging

Elasticsearch emits a number of logs, which are placed in ES_HOME/logs. The default logging level is INFO. It provides a moderate amount of information, but is designed to be rather light so that your logs are not enormous.

When debugging problems, particularly problems with node discovery (since this often depends on finicky network configurations), it can be helpful to bump up the logging level to DEBUG.

You could modify the logging.yml file and restart your nodes—but that is both tedious and leads to unnecessary downtime. Instead, you can update logging levels through the cluster-settings API that we just learned about.

To do so, take the logger you are interested in and prepend logger. to it. Let’s turn up the discovery logging:

PUT /_cluster/settings

{

"transient" : {

"logger.discovery" : "DEBUG"

}

While this setting is in effect, Elasticsearch will begin to emit DEBUG-level logs for the discovery module.

TIP

Avoid TRACE. It is extremely verbose, to the point where the logs are no longer useful.

Slowlog

There is another log called the slowlog. The purpose of this log is to catch queries and indexing requests that take over a certain threshold of time. It is useful for hunting down user-generated queries that are particularly slow.

By default, the slowlog is not enabled. It can be enabled by defining the action (query, fetch, or index), the level that you want the event logged at (WARN, DEBUG, and so forth) and a time threshold.

This is an index-level setting, which means it is applied to individual indices:

PUT /my_index/_settings

{

"index.search.slowlog.threshold.query.warn" : "10s",

"index.search.slowlog.threshold.fetch.debug": "500ms",

"index.indexing.slowlog.threshold.index.info": "5s"

}

Emit a WARN log when queries are slower than 10s.

Emit a DEBUG log when fetches are slower than 500ms.

Emit an INFO log when indexing takes longer than 5s.

You can also define these thresholds in your elasticsearch.yml file. Indices that do not have a threshold set will inherit whatever is configured in the static config.

Once the thresholds are set, you can toggle the logging level like any other logger:

PUT /_cluster/settings

{

"transient" : {

"logger.index.search.slowlog" : "DEBUG",

"logger.index.indexing.slowlog" : "WARN"

}

Set the search slowlog to DEBUG level.

Set the indexing slowlog to WARN level.

Indexing Performance Tips

If you are in an indexing-heavy environment, such as indexing infrastructure logs, you may be willing to sacrifice some search performance for faster indexing rates. In these scenarios, searches tend to be relatively rare and performed by people internal to your organization. They are willing to wait several seconds for a search, as opposed to a consumer facing a search that must return in milliseconds.

Because of this unique position, certain trade-offs can be made that will increase your indexing performance.

THESE TIPS APPLY ONLY TO ELASTICSEARCH 1.3+

This book is written for the most recent versions of Elasticsearch, although much of the content works on older versions.

The tips presented in this section, however, are explicitly for version 1.3+. There have been multiple performance improvements and bugs fixed that directly impact indexing. In fact, some of these recommendations will reduce performance on older versions because of the presence of bugs or performance defects.

Test Performance Scientifically

Performance testing is always difficult, so try to be as scientific as possible in your approach. Randomly fiddling with knobs and turning on ingestion is not a good way to tune performance. If there are too many causes, it is impossible to determine which one had the best effect. A reasonable approach to testing is as follows:

1. Test performance on a single node, with a single shard and no replicas.

2. Record performance under 100% default settings so that you have a baseline to measure against.

3. Make sure performance tests run for a long time (30+ minutes) so you can evaluate long-term performance, not short-term spikes or latencies. Some events (such as segment merging, and GCs) won’t happen right away, so the performance profile can change over time.

4. Begin making single changes to the baseline defaults. Test these rigorously, and if performance improvement is acceptable, keep the setting and move on to the next one.

Using and Sizing Bulk Requests

This should be fairly obvious, but use bulk indexing requests for optimal performance. Bulk sizing is dependent on your data, analysis, and cluster configuration, but a good starting point is 5–15 MB per bulk. Note that this is physical size. Document count is not a good metric for bulk size. For example, if you are indexing 1,000 documents per bulk, keep the following in mind:

§ 1,000 documents at 1 KB each is 1 MB.

§ 1,000 documents at 100 KB each is 100 MB.

Those are drastically different bulk sizes. Bulks need to be loaded into memory at the coordinating node, so it is the physical size of the bulk that is more important than the document count.

Start with a bulk size around 5–15 MB and slowly increase it until you do not see performance gains anymore. Then start increasing the concurrency of your bulk ingestion (multiple threads, and so forth).

Monitor your nodes with Marvel and/or tools such as iostat, top, and ps to see when resources start to bottleneck. If you start to receive EsRejectedExecutionException, your cluster can no longer keep up: at least one resource has reached capacity. Either reduce concurrency, provide more of the limited resource (such as switching from spinning disks to SSDs), or add more nodes.

NOTE

When ingesting data, make sure bulk requests are round-robined across all your data nodes. Do not send all requests to a single node, since that single node will need to store all the bulks in memory while processing.

Storage

Disks are usually the bottleneck of any modern server. Elasticsearch heavily uses disks, and the more throughput your disks can handle, the more stable your nodes will be. Here are some tips for optimizing disk I/O:

§ Use SSDs. As mentioned elsewhere, they are superior to spinning media.

§ Use RAID 0. Striped RAID will increase disk I/O, at the obvious expense of potential failure if a drive dies. Don’t use mirrored or parity RAIDS since replicas provide that functionality.

§ Alternatively, use multiple drives and allow Elasticsearch to stripe data across them via multiple path.data directories.

§ Do not use remote-mounted storage, such as NFS or SMB/CIFS. The latency introduced here is antithetical to performance.

§ If you are on EC2, beware of EBS. Even the SSD-backed EBS options are often slower than local instance storage.

Segments and Merging

Segment merging is computationally expensive, and can eat up a lot of disk I/O. Merges are scheduled to operate in the background because they can take a long time to finish, especially large segments. This is normally fine, because the rate of large segment merges is relatively rare.

But sometimes merging falls behind the ingestion rate. If this happens, Elasticsearch will automatically throttle indexing requests to a single thread. This prevents a segment explosion problem, in which hundreds of segments are generated before they can be merged. Elasticsearch will logINFO-level messages stating now throttling indexing when it detects merging falling behind indexing.

Elasticsearch defaults here are conservative: you don’t want search performance to be impacted by background merging. But sometimes (especially on SSD, or logging scenarios), the throttle limit is too low.

The default is 20 MB/s, which is a good setting for spinning disks. If you have SSDs, you might consider increasing this to 100–200 MB/s. Test to see what works for your system:

PUT /_cluster/settings

{

"persistent" : {

"indices.store.throttle.max_bytes_per_sec" : "100mb"

}

If you are doing a bulk import and don’t care about search at all, you can disable merge throttling entirely. This will allow indexing to run as fast as your disks will allow:

PUT /_cluster/settings

{

"transient" : {

"indices.store.throttle.type" : "none"

}

Setting the throttle type to none disables merge throttling entirely. When you are done importing, set it back to merge to reenable throttling.

If you are using spinning media instead of SSD, you need to add this to your elasticsearch.yml:

index.merge.scheduler.max_thread_count: 1

Spinning media has a harder time with concurrent I/O, so we need to decrease the number of threads that can concurrently access the disk per index. This setting will allow max_thread_count + 2 threads to operate on the disk at one time, so a setting of 1 will allow three threads.

For SSDs, you can ignore this setting. The default is Math.min(3, Runtime.getRuntime().availableProcessors() / 2), which works well for SSD.

Finally, you can increase index.translog.flush_threshold_size from the default 200 MB to something larger, such as 1 GB. This allows larger segments to accumulate in the translog before a flush occurs. By letting larger segments build, you flush less often, and the larger segments merge less often. All of this adds up to less disk I/O overhead and better indexing rates.

Other

Finally, there are some other considerations to keep in mind:

§ If you don’t need near real-time accuracy on your search results, consider dropping the index.refresh_interval of each index to 30s. If you are doing a large import, you can disable refreshes by setting this value to -1 for the duration of the import. Don’t forget to reenable it when you are finished!

§ If you are doing a large bulk import, consider disabling replicas by setting index.number_of_replicas: 0. When documents are replicated, the entire document is sent to the replica node and the indexing process is repeated verbatim. This means each replica will perform the analysis, indexing, and potentially merging process.

In contrast, if you index with zero replicas and then enable replicas when ingestion is finished, the recovery process is essentially a byte-for-byte network transfer. This is much more efficient than duplicating the indexing process.

§ If you don’t have a natural ID for each document, use Elasticsearch’s auto-ID functionality. It is optimized to avoid version lookups, since the autogenerated ID is unique.

§ If you are using your own ID, try to pick an ID that is friendly to Lucene. Examples include zero-padded sequential IDs, UUID-1, and nanotime; these IDs have consistent, sequential patterns that compress well. In contrast, IDs such as UUID-4 are essentially random, which offer poor compression and slow down Lucene.

Rolling Restarts

There will come a time when you need to perform a rolling restart of your cluster—keeping the cluster online and operational, but taking nodes offline one at a time.

The common reason is either an Elasticsearch version upgrade, or some kind of maintenance on the server itself (such as an OS update, or hardware). Whatever the case, there is a particular method to perform a rolling restart.

By nature, Elasticsearch wants your data to be fully replicated and evenly balanced. If you shut down a single node for maintenance, the cluster will immediately recognize the loss of a node and begin rebalancing. This can be irritating if you know the node maintenance is short term, since the rebalancing of very large shards can take some time (think of trying to replicate 1TB—even on fast networks this is nontrivial).

What we want to do is tell Elasticsearch to hold off on rebalancing, because we have more knowledge about the state of the cluster due to external factors. The procedure is as follows:

1. If possible, stop indexing new data. This is not always possible, but will help speed up recovery time.

2. Disable shard allocation. This prevents Elasticsearch from rebalancing missing shards until you tell it otherwise. If you know the maintenance window will be short, this is a good idea. You can disable allocation as follows:

3. PUT /_cluster/settings

4. {

5. "transient" : {

6. "cluster.routing.allocation.enable" : "none"

7. }

}

8. Shut down a single node, preferably using the shutdown API on that particular machine:

POST /_cluster/nodes/_local/_shutdown

9. Perform a maintenance/upgrade.

10.Restart the node, and confirm that it joins the cluster.

11.Reenable shard allocation as follows:

12.PUT /_cluster/settings

13.{

14. "transient" : {

15. "cluster.routing.allocation.enable" : "all"

16. }

}

Shard rebalancing may take some time. Wait until the cluster has returned to status green before continuing.

17.Repeat steps 2 through 6 for the rest of your nodes.

18.At this point you are safe to resume indexing (if you had previously stopped), but waiting until the cluster is fully balanced before resuming indexing will help to speed up the process.

Backing Up Your Cluster

As with any software that stores data, it is important to routinely back up your data. Elasticsearch replicas provide high availability during runtime; they allow you to tolerate sporadic node loss without an interruption of service.

Replicas do not provide protection from catastrophic failure, however. For that, you need a real backup of your cluster—a complete copy in case something goes wrong.

To back up your cluster, you can use the snapshot API. This will take the current state and data in your cluster and save it to a shared repository. This backup process is “smart.” Your first snapshot will be a complete copy of data, but all subsequent snapshots will save the delta between the existing snapshots and the new data. Data is incrementally added and deleted as you snapshot data over time. This means subsequent backups will be substantially faster since they are transmitting far less data.

To use this functionality, you must first create a repository to save data. There are several repository types that you may choose from:

§ Shared filesystem, such as a NAS

§ Amazon S3

§ HDFS (Hadoop Distributed File System)

§ Azure Cloud

Creating the Repository

Let’s set up a shared filesystem repository:

PUT _snapshot/my_backup

{

"type": "fs",

"settings": {

"location": "/mount/backups/my_backup"

}

We provide a name for our repository, in this case it is called my_backup.

We specify that the type of the repository should be a shared filesystem.

And finally, we provide a mounted drive as the destination.

NOTE

The shared filesystem path must be accessible from all nodes in your cluster!

This will create the repository and required metadata at the mount point. There are also some other options that you may want to configure, depending on the performance profile of your nodes, network, and repository location:

max_snapshot_bytes_per_sec

When snapshotting data into the repo, this controls the throttling of that process. The default is 20mb per second.

max_restore_bytes_per_sec

When restoring data from the repo, this controls how much the restore is throttled so that your network is not saturated. The default is 20mb per second.

Let’s assume we have a very fast network and are OK with extra traffic, so we can increase the defaults:

POST _snapshot/my_backup/

{

"type": "fs",

"settings": {

"location": "/mount/backups/my_backup",

"max_snapshot_bytes_per_sec" : "50mb",

"max_restore_bytes_per_sec" : "50mb"

}

Note that we are using a POST instead of PUT. This will update the settings of the existing repository.

Then add our new settings.

Snapshotting All Open Indices

A repository can contain multiple snapshots. Each snapshot is associated with a certain set of indices (for example, all indices, some subset, or a single index). When creating a snapshot, you specify which indices you are interested in and give the snapshot a unique name.

Let’s start with the most basic snapshot command:

PUT _snapshot/my_backup/snapshot_1

This will back up all open indices into a snapshot named snapshot_1, under the my_backup repository. This call will return immediately, and the snapshot will proceed in the background.

TIP

Usually you’ll want your snapshots to proceed as a background process, but occasionally you may want to wait for completion in your script. This can be accomplished by adding a wait_for_completion flag:

PUT _snapshot/my_backup/snapshot_1?wait_for_completion=true

This will block the call until the snapshot has completed. Note that large snapshots may take a long time to return!

Snapshotting Particular Indices

The default behavior is to back up all open indices. But say you are using Marvel, and don’t really want to back up all the diagnostic .marvel indices. You just don’t have enough space to back up everything.

In that case, you can specify which indices to back up when snapshotting your cluster:

PUT _snapshot/my_backup/snapshot_2

{

"indices": "index_1,index_2"

}

This snapshot command will now back up only index1 and index2.

Listing Information About Snapshots

Once you start accumulating snapshots in your repository, you may forget the details relating to each—particularly when the snapshots are named based on time demarcations (for example, backup_2014_10_28).

To obtain information about a single snapshot, simply issue a GET reguest against the repo and snapshot name:

GET _snapshot/my_backup/snapshot_2

This will return a small response with various pieces of information regarding the snapshot:

{

"snapshots": [

{

"snapshot": "snapshot_1",

"indices": [

".marvel_2014_28_10",

"index1",

"index2"

"state": "SUCCESS",

"start_time": "2014-09-02T13:01:43.115Z",

"start_time_in_millis": 1409662903115,

"end_time": "2014-09-02T13:01:43.439Z",

"end_time_in_millis": 1409662903439,

"duration_in_millis": 324,

"failures": [],

"shards": {

"total": 10,

"failed": 0,

"successful": 10

}

]

}

For a complete listing of all snapshots in a repository, use the _all placeholder instead of a snapshot name:

GET _snapshot/my_backup/_all

Deleting Snapshots

Finally, we need a command to delete old snapshots that are no longer useful. This is simply a DELETE HTTP call to the repo/snapshot name:

DELETE _snapshot/my_backup/snapshot_2

It is important to use the API to delete snapshots, and not some other mechanism (such as deleting by hand, or using automated cleanup tools on S3). Because snapshots are incremental, it is possible that many snapshots are relying on old segments. The delete API understands what data is still in use by more recent snapshots, and will delete only unused segments.

If you do a manual file delete, however, you are at risk of seriously corrupting your backups because you are deleting data that is still in use.

Monitoring Snapshot Progress

The wait_for_completion flag provides a rudimentary form of monitoring, but really isn’t sufficient when snapshotting or restoring even moderately sized clusters.

Two other APIs will give you more-detailed status about the state of the snapshotting. First you can execute a GET to the snapshot ID, just as we did earlier get information about a particular snapshot:

GET _snapshot/my_backup/snapshot_3

If the snapshot is still in progress when you call this, you’ll see information about when it was started, how long it has been running, and so forth. Note, however, that this API uses the same threadpool as the snapshot mechanism. If you are snapshotting very large shards, the time between status updates can be quite large, since the API is competing for the same threadpool resources.

A better option is to poll the _status API:

GET _snapshot/my_backup/snapshot_3/_status

The _status API returns immediately and gives a much more verbose output of statistics:

{

"snapshots": [

{

"snapshot": "snapshot_3",

"repository": "my_backup",

"state": "IN_PROGRESS",

"shards_stats": {

"initializing": 0,

"started": 1,

"finalizing": 0,

"done": 4,

"failed": 0,

"total": 5

"stats": {

"number_of_files": 5,

"processed_files": 5,

"total_size_in_bytes": 1792,

"processed_size_in_bytes": 1792,

"start_time_in_millis": 1409663054859,

"time_in_millis": 64

"indices": {

"index_3": {

"shards_stats": {

"initializing": 0,

"started": 0,

"finalizing": 0,

"done": 5,

"failed": 0,

"total": 5

"stats": {

"number_of_files": 5,

"processed_files": 5,

"total_size_in_bytes": 1792,

"processed_size_in_bytes": 1792,

"start_time_in_millis": 1409663054859,

"time_in_millis": 64

"shards": {

"0": {

"stage": "DONE",

"stats": {

"number_of_files": 1,

"processed_files": 1,

"total_size_in_bytes": 514,

"processed_size_in_bytes": 514,

"start_time_in_millis": 1409663054862,

"time_in_millis": 22

}

...

A snapshot that is currently running will show IN_PROGRESS as its status.

This particular snapshot has one shard still transferring (the other four have already completed).

The response includes the overall status of the snapshot, but also drills down into per-index and per-shard statistics. This gives you an incredibly detailed view of how the snapshot is progressing. Shards can be in various states of completion:

INITIALIZING

The shard is checking with the cluster state to see whether it can be snapshotted. This is usually very fast.

STARTED

Data is being transferred to the repository.

FINALIZING

Data transfer is complete; the shard is now sending snapshot metadata.

DONE

Snapshot complete!

FAILED

An error was encountered during the snapshot process, and this shard/index/snapshot could not be completed. Check your logs for more information.

Canceling a Snapshot

Finally, you may want to cancel a snapshot or restore. Since these are long-running processes, a typo or mistake when executing the operation could take a long time to resolve—and use up valuable resources at the same time.

To cancel a snapshot, simply delete the snapshot while it is in progress:

DELETE _snapshot/my_backup/snapshot_3

This will halt the snapshot process. Then proceed to delete the half-completed snapshot from the repository.

Restoring from a Snapshot

Once you’ve backed up some data, restoring it is easy: simply add _restore to the ID of the snapshot you wish to restore into your cluster:

POST _snapshot/my_backup/snapshot_1/_restore

The default behavior is to restore all indices that exist in that snapshot. If snapshot_1 contains five indices, all five will be restored into our cluster. As with the snapshot API, it is possible to select which indices we want to restore.

There are also additional options for renaming indices. This allows you to match index names with a pattern, and then provide a new name during the restore process. This is useful if you want to restore old data to verify its contents, or perform some other processing, without replacing existing data. Let’s restore a single index from the snapshot and provide a replacement name:

POST /_snapshot/my_backup/snapshot_1/_restore

{

"indices": "index_1",

"rename_pattern": "index_(.+)",

"rename_replacement": "restored_index_$1"

}

Restore only the index_1 index, ignoring the rest that are present in the snapshot.

Find any indices being restored that match the provided pattern.

Then rename them with the replacement pattern.

This will restore index_1 into your cluster, but rename it to restored_index_1.

TIP

Similar to snapshotting, the restore command will return immediately, and the restoration process will happen in the background. If you would prefer your HTTP call to block until the restore is finished, simply add the wait_for_completion flag:

POST _snapshot/my_backup/snapshot_1/_restore?wait_for_completion=true

Monitoring Restore Operations

The restoration of data from a repository piggybacks on the existing recovery mechanisms already in place in Elasticsearch. Internally, recovering shards from a repository is identical to recovering from another node.

If you wish to monitor the progress of a restore, you can use the recovery API. This is a general-purpose API that shows the status of shards moving around your cluster.

The API can be invoked for the specific indices that you are recovering:

GET /_recovery/restored_index_3

Or for all indices in your cluster, which may include other shards moving around, unrelated to your restore process:

GET /_recovery/

The output will look similar to this (and note, it can become very verbose depending on the activity of your clsuter!):

{

"restored_index_3" : {

"shards" : [ {

"id" : 0,

"type" : "snapshot",

"stage" : "index",

"primary" : true,

"start_time" : "2014-02-24T12:15:59.716",

"stop_time" : 0,

"total_time_in_millis" : 175576,

"source" : {

"repository" : "my_backup",

"snapshot" : "snapshot_3",

"index" : "restored_index_3"

"target" : {

"id" : "ryqJ5lO5S4-lSFbGntkEkg",

"hostname" : "my.fqdn",

"ip" : "10.0.1.7",

"name" : "my_es_node"

"index" : {

"files" : {

"total" : 73,

"reused" : 0,

"recovered" : 69,

"percent" : "94.5%"

"bytes" : {

"total" : 79063092,

"reused" : 0,

"recovered" : 68891939,

"percent" : "87.1%"

"total_time_in_millis" : 0

"translog" : {

"recovered" : 0,

"total_time_in_millis" : 0

"start" : {

"check_index_time" : 0,

"total_time_in_millis" : 0

}

} ]

}

The type field tells you the nature of the recovery; this shard is being recovered from a snapshot.

The source hash describes the particular snapshot and repository that is being recovered from.

The percent field gives you an idea about the status of the recovery. This particular shard has recovered 94% of the files so far; it is almost complete.

The output will list all indices currently undergoing a recovery, and then list all shards in each of those indices. Each shard will have stats about start/stop time, duration, recover percentage, bytes transferred, and more.

Canceling a Restore

To cancel a restore, you need to delete the indices being restored. Because a restore process is really just shard recovery, issuing a delete-index API alters the cluster state, which will in turn halt recovery. For example:

DELETE /restored_index_3

If restored_index_3 was actively being restored, this delete command would halt the restoration as well as deleting any data that had already been restored into the cluster.

Clusters Are Living, Breathing Creatures

Once you get a cluster into production, you’ll find that it takes on a life of its own. Elasticsearch works hard to make clusters self-sufficient and just work. But a cluster still requires routine care and feeding, such as routine backups and upgrades.

Elasticsearch releases new versions with bug fixes and performance enhancements at a very fast pace, and it is always a good idea to keep your cluster current. Similarly, Lucene continues to find new and exciting bugs in the JVM itself, which means you should always try to keep your JVM up-to-date.

This means it is a good idea to have a standardized, routine way to perform rolling restarts and upgrades in your cluster. Upgrading should be a routine process, rather than a once-yearly fiasco that requires countless hours of precise planning.

Similarly, it is important to have disaster recovery plans in place. Take frequent snapshots of your cluster—and periodically test those snapshots by performing a real recovery! It is all too common for organizations to make routine backups but never test their recovery strategy. Often you’ll find a glaring deficiency the first time you perform a real recovery (such as users being unaware of which drive to mount). It’s better to work these bugs out of your process with routine testing, rather than at 3 a.m. when there is a crisis.

All materials on the site are licensed Creative Commons Attribution-Sharealike 3.0 Unported CC BY-SA 3.0 & GNU Free Documentation License (GFDL)

If you are the copyright holder of any material contained on our site and intend to remove it, please contact our site administrator for approval.