Chapter 4. Basic Operations

In this chapter, we will cover:

· Creating an index

· Deleting an index

· Opening/closing an index

· Putting a mapping in an index

· Getting a mapping

· Deleting a mapping

· Refreshing an index

· Flushing an index

· Optimizing an index

· Checking if an index or type exists

· Managing index settings

· Using index aliases

· Indexing a document

· Getting a document

· Deleting a document

· Updating a document

· Speeding up atomic operations (bulk operations)

· Speeding up GET operations (multi GET)

Introduction

Before starting with indexing and searching in ElasticSearch, we need to learn how to manage indices and perform operations on documents. In this chapter, we'll discuss different operations on indices such as Create, Delete, Update, Read, and Open/Close. These operations are very important because they allow us to better define the container (the index) that will store our documents. The index create/delete actions are similar to SQL's CREATE DATABASE and DROP DATABASE commands.

After the index management part, we'll learn how to manage mappings, which completes the discussion started in the previous chapter and lays the foundation for the next chapter, which is mainly centered on search.

A large portion of this chapter is dedicated to CRUD (Create, Read, Update, Delete) operations on records, which are the core of record storage and management in ElasticSearch.

To improve indexing performance, it's also important to understand bulk operations and avoid their common pitfalls.

This chapter doesn't cover operations involving queries; these will be the main theme of Chapter 5, Search, Queries, and Filters. Likewise, cluster operations will be discussed in Chapter 9, Cluster and Node Monitoring, because they are mainly related to controlling and monitoring the cluster.

Creating an index

The first operation before starting to index data in ElasticSearch is to create an index, the main container of our data.

An index is similar to the database concept in SQL: it is a container for types (similar to SQL tables) and documents (similar to SQL records).

Getting ready

You will need a working ElasticSearch cluster.

How to do it...

The HTTP method to create an index is PUT (POST also works); the REST URL contains the index name:

http://<server>/<index_name>

To create an index, we will perform the following steps:

1. Using the command line, we can execute a PUT call:

2. curl -XPUT http://127.0.0.1:9200/myindex -d '{

3. "settings" : {

4. "index" : {

5. "number_of_shards" : 2,

6. "number_of_replicas" : 1

7. }

8. }

9. }'

10. The result returned by ElasticSearch, if everything goes well, should be:

{"acknowledged":true}

11. If the index already exists, then a 400 error is returned:

{"error":"IndexAlreadyExistsException[[myindex] Already exists]","status":400}

How it works...

There are some limitations to the index name; the accepted characters are:

· ASCII letters (a-z)

· Numbers (0-9)

· Period (.), minus (-), ampersand (&), and underscore (_)

Tip

The index name will be mapped to a directory on your storage.

During index creation, the replication can be set with two parameters in the settings/index object:

· number_of_shards: This controls the number of shards that compose the index (every shard can store up to about 2^31 documents, the Lucene per-shard limit).

· number_of_replicas: This controls the number of replicas (how many times your data is replicated in the cluster for high availability). A good practice is to set this value to at least 1.

The API call initializes a new index, which means:

· The index is created in a primary node first and then its status is propagated to the cluster level

· A default mapping (empty) is created

· All the shards required by the index are initialized and ready to accept data

The index creation API allows us to define the mapping at creation time. The parameter used to define the mappings is mappings, and it accepts multiple mappings. So, in a single call, it is possible to create an index and insert the required mappings.

There's more…

The index creation command also allows us to pass the mappings section, which contains the mapping definitions. It is a shortcut to create an index with mappings, without executing an extra PUT mapping call.

A common example of this call, using the mapping from the Putting a mapping in an index recipe, is:

curl -XPOST localhost:9200/myindex -d '{

"settings" : {

"number_of_shards" : 2,

"number_of_replicas" : 1

},

"mappings" : {

"order" : {

"properties" : {

"id" : {"type" : "string", "store" : "yes" , "index":"not_analyzed"},

"date" : {"type" : "date", "store" : "no" , "index":"not_analyzed"},

"customer_id" : {"type" : "string", "store" : "yes" , "index":"not_analyzed"},

"sent" : {"type" : "boolean", "index":"not_analyzed"},

"name" : {"type" : "string", "index":"analyzed"},

"quantity" : {"type" : "integer", "index":"not_analyzed"},

"vat" : {"type" : "double", "index":"no"}

}

}

}

}'

See also

· The Understanding clusters, replication, and sharding recipe in Chapter 1, Getting Started

· The Putting a mapping in an index recipe in this chapter

Deleting an index

The counterpart of creating an index is deleting one.

Deleting an index means deleting its shards, mappings, and data. There are many common scenarios where we need to delete an index, such as the following:

· Removing the index to clean unwanted/obsolete data (for example, old logstash indices)

· Resetting an index to restart from scratch

· Deleting an index that has some missing shards, mainly due to failures, to bring the cluster back to a valid state

Getting ready

You will need a working ElasticSearch cluster and the index created in the previous recipe.

How to do it...

The HTTP method used to delete an index is DELETE.

The URL contains only the index name:

http://<server>/<index_name>

To delete an index, we will perform the following steps:

1. From a command line, we can execute a DELETE call:

2. curl -XDELETE http://127.0.0.1:9200/myindex

3. The result returned by ElasticSearch, if everything goes well, should be:

{"acknowledged":true}

4. If the index doesn't exist, then a 404 error is returned:

{"error":"IndexMissingException[[myindex] missing]","status":404}

How it works...

When an index is deleted, all the data related to the index is removed from the disk and is lost.

During the deletion process, the cluster state is updated first, and then the shards are deleted from storage. This operation is fast; in a traditional filesystem, it is implemented as a recursive delete.

It's not possible to restore a deleted index if there is no backup.

Also, calling the DELETE API with the special _all index name removes all the indices at once. In production, it is good practice to disable all-indices deletion by adding the following line to elasticsearch.yml:

action.destructive_requires_name: true
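For illustration, a sketch of the delete-all call that this parameter guards against (never run this against a cluster whose data you care about):

curl -XDELETE 'http://127.0.0.1:9200/_all'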

See also

· The Creating an index recipe in this chapter

Opening/closing an index

If you want to keep your data but save resources (memory/CPU), a good alternative to deleting an index is to close it.

ElasticSearch allows you to open or close an index to put it in online/offline mode.

Getting ready

You will need a working ElasticSearch cluster and the index created in the Creating an index recipe in this chapter.

How to do it...

For opening/closing an index, we will perform the following steps:

1. From the command line, we can execute a POST call to close an index:

2. curl -XPOST http://127.0.0.1:9200/myindex/_close

3. If the call is successful, the result returned by ElasticSearch should be:

{,"acknowledged":true}

4. To open an index from the command line, enter:

5. curl -XPOST http://127.0.0.1:9200/myindex/_open

6. If the call is successful, the result returned by ElasticSearch should be:

{"acknowledged":true}

How it works...

When an index is closed, there is no overhead on the cluster (except for the metadata state); the index shards are turned off and don't use file descriptors, memory, or threads.

There are many use cases for closing an index, such as the following:

· Disabling date-based indices: for example, when you keep an index per day, week, or month and want to keep only some indices online (such as the last 2 months) and the others offline (such as those from 2 to 6 months ago).

· When you search on all the active indices of a cluster but don't want to search some of them (in this case, using an alias is the best solution, but you can achieve a similar effect with closed indices).

When an index is closed, calling the open API restores its state.

See also

· The Using index aliases recipe in this chapter

Putting a mapping in an index

In the previous chapter, we saw how to build a mapping by indexing documents. This recipe shows how to put a type mapping in an index. This kind of operation can be considered the ElasticSearch version of SQL's CREATE TABLE.

Getting ready

You will need a working ElasticSearch cluster and the index created in the Creating an index recipe in this chapter.

How to do it...

The HTTP method for putting a mapping is PUT (POST also works).

The URL format for putting a mapping is:

http://<server>/<index_name>/<type_name>/_mapping

To put a mapping in an Index, we will perform the following steps:

1. If we consider the type order of the previous chapter, the call will be:

2. curl -XPUT 'http://localhost:9200/myindex/order/_mapping' -d '{

3. "order" : {

4. "properties" : {

5. "id" : {"type" : "string", "store" : "yes" , "index":"not_analyzed"},

6. "date" : {"type" : "date", "store" : "no" , "index":"not_analyzed"},

7. "customer_id" : {"type" : "string", "store" : "yes" , "index":"not_analyzed"},

8. "sent" : {"type" : "boolean", "index":"not_analyzed"},

9. "name" : {"type" : "string", "index":"analyzed"},

10. "quantity" : {"type" : "integer", "index":"not_analyzed"},

11. "vat" : {"type" : "double", "index":"no"}

12. }

13. }

}'

14. If successful, the result returned by ElasticSearch should be:

{"acknowledged":true}

How it works...

This call checks that the index exists and then creates one or more type mappings as described in the definition. For the mapping description, see the previous chapter.

During mapping insertion, if there is an existing mapping for the type, it is merged with the new one. If there is a field with a different type and the type cannot be updated by expanding the fields property, an exception is raised. To prevent an exception during the mapping merge phase, it's possible to set the ignore_conflicts parameter to true (the default is false).

The PUT mapping call allows us to set the type for several indices in one shot, by listing the indices separated by commas or by applying the mapping to all indices using the _all alias.
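As a sketch, the following call puts the same mapping on two indices at once and tolerates merge conflicts. It assumes a second index, myindex2, has already been created; only one field is shown for brevity:

curl -XPUT 'http://localhost:9200/myindex,myindex2/order/_mapping?ignore_conflicts=true' -d '{
"order" : {
"properties" : {
"id" : {"type" : "string", "store" : "yes", "index" : "not_analyzed"}
}
}
}'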

See also

· The Getting a mapping recipe in this chapter (the next recipe)

Getting a mapping

After having set our type mappings, we sometimes need to check or analyze a mapping to prevent issues. Getting the mapping for a type helps us to understand the record structure, or its evolution due to merging and implicit type guessing.

Getting ready

You will need a working ElasticSearch cluster and the mapping created in the previous recipe.

How to do it...

The HTTP method to get a mapping is GET.

The URL formats for getting a mapping are:

http://<server>/_mapping

http://<server>/<index_name>/_mapping

http://<server>/<index_name>/<type_name>/_mapping

To get a mapping from the type of an index, we will perform the following steps:

1. If we consider the type order of the previous chapter, the call will be:

2. curl -XGET 'http://localhost:9200/myindex/order/_mapping?pretty=true'

The pretty argument in the URL will pretty print the response output.

3. The result returned by ElasticSearch should be:

4. {

5. "myindex" : {

6. "mappings" : {

7. "order" : {

8. "properties" : {

9. "customer_id" : {

10. "type" : "string",

11. "index" : "not_analyzed",

12. "store" : true

13. },

14.… truncated

15. }

16. }

17. }

18. }

}

How it works...

The mapping is stored at the cluster level in ElasticSearch. The call checks both index and type existence, and then returns the stored mapping.

Note

The returned mapping is in a reduced form, which means the default values for a field are not returned.

ElasticSearch stores only non-default field values to reduce network and memory consumption.

Querying the mapping is useful for several purposes:

· Debugging template level mapping

· Checking if the implicit mapping was derived correctly by guessing fields

· Retrieving the mapping metadata, which can be used to store type-related information

· Simply checking if the mapping is correct

If you need to fetch several mappings, it is better to do so at the index or cluster level in order to reduce the number of API calls, as shown in the following example.
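For example, both of the following calls return several mappings in a single request, the first for every type in myindex and the second for every index in the cluster:

curl -XGET 'http://localhost:9200/myindex/_mapping?pretty=true'

curl -XGET 'http://localhost:9200/_mapping?pretty=true'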

See also

· The Putting a mapping in an index recipe in this chapter

· The Using dynamic templates in document mapping recipe in Chapter 3, Managing Mapping

Deleting a mapping

The last CRUD (Create, Read, Update, Delete) operation related to mapping is the delete one.

Deleting a mapping is a destructive operation and must be done with care to prevent losing your data.

There are some use cases in which it's required to delete a mapping:

· Unused type: You delete it to clean the data.

· Wrong mapping: You might need to change the mapping, but you cannot upgrade it or remove some fields. You need to back up your data, create a new mapping, and reimport the data.

· Fast cleanup of a type: You can delete the mapping and recreate it (or you can execute a delete by query, as explained in the Deleting by query recipe in Chapter 5, Search, Queries, and Filters).

Getting ready

You will need a working ElasticSearch cluster and the mapping created in the Putting a mapping in an index recipe in this chapter.

How to do it...

The HTTP method to delete a mapping is DELETE.

The URL formats for deleting a mapping are:

http://<server>/<index_name>/<type_name>

http://<server>/<index_name>/<type_name>/_mapping

To delete a mapping from an index, we will perform the following steps:

1. If we consider the type order explained in the previous chapter, the call will be:

curl -XDELETE 'http://localhost:9200/myindex/order/'

2. If the call is successful, the result returned by ElasticSearch should be an HTTP 200 status code with a similar message as the following:

{"acknowledged":true}

3. If the mapping/type is missing, an exception is raised:

{"error":"TypeMissingException[[myindex] type[order] missing]","status":404}

How it works...

ElasticSearch tries to find the mapping for the given index/type pair. If it's found, the mapping and all its related data are removed. If it is not found, an exception is raised.

Note

Deleting a mapping removes all the data associated with that mapping, so it's not possible to go back if there is no backup.

See also

· The Putting a mapping in an index recipe in this chapter

Refreshing an index

ElasticSearch allows the user to control the state of the searcher by forcing a refresh on an index. If not forced, a newly indexed document will only be searchable after a fixed time interval (usually 1 second).

Getting ready

You will need a working ElasticSearch cluster and the index created in the Creating an index recipe in this chapter.

How to do it...

The URL format for refreshing an index is:

http://<server>/<index_name(s)>/_refresh

The URL format for refreshing all the indices in a cluster is:

http://<server>/_refresh

The HTTP method used for both URLs is POST.

To refresh an index, we will perform the following steps:

1. If we consider the type order of the previous chapter, the call will be:

2. curl -XPOST 'http://localhost:9200/myindex/_refresh'

3. The result returned by ElasticSearch should be:

{"_shards":{"total":4,"successful":2,"failed":0}}

How it works...

Near Real-Time (NRT) capabilities are automatically managed by ElasticSearch, which automatically refreshes the indices every second if data is changed in them.

You can call the refresh on one or more indices (multiple indices are comma-separated) or on all the indices, as shown in the following example.
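For example, to refresh two indices in one call (assuming a second index, myindex2, exists) or every index in the cluster:

curl -XPOST 'http://localhost:9200/myindex,myindex2/_refresh'

curl -XPOST 'http://localhost:9200/_refresh'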

ElasticSearch doesn't refresh the state of the index at every inserted document, to prevent poor performance due to the excessive I/O required in closing and reopening file descriptors.

Tip

You must force the refresh to have your latest indexed data available for searching.

Generally, the best time to call the refresh is after having indexed a lot of data, to be sure that your records are searchable instantly.

See also

· The Flushing an index recipe in this chapter

· The Optimizing an index recipe in this chapter

Flushing an index

ElasticSearch, for performance reasons, stores some data in memory and on a transaction log. If we want to free memory, empty the transaction log, and be sure that our data is safely written on disk, we need to flush an index.

ElasticSearch automatically provides a periodic disk flush, but forcing a flush can be useful, for example:

· When we have to shut down a node, to prevent stale data

· To have all the data in a safe state (for example, after a big indexing operation to have all the data flushed and refreshed)

Getting ready

You will need a working ElasticSearch cluster and the index created in the Creating an index recipe in this chapter.

How to do it...

The HTTP method used for the URL operations is POST.

The URL format for flushing an index is:

http://<server>/<index_name(s)>/_flush[?refresh=True]

The URL format for flushing all the indices in a cluster is:

http://<server>/_flush[?refresh=True]

To flush an index, we will perform the following steps:

1. If we consider the type order of the previous chapter, the call will be:

2. curl -XPOST 'http://localhost:9200/myindex/_flush?refresh=True'

3. The result returned by ElasticSearch, if everything goes well, should be:

{"_shards":{"total":4,"successful":2,"failed":0}}

The result contains the shard operation status.

How it works...

ElasticSearch tries to reduce I/O overhead by caching some data in memory and reducing the number of writes; in this way, it is able to improve performance.

To clean up memory and force this data on disk, the flush operation is required.

The flush call accepts an extra request parameter, refresh, to also force an index refresh.

Tip

Flushing too often affects indexing performance. Use it wisely!

See also

· The Refreshing an index recipe in this chapter

· The Optimizing an index recipe in this chapter

Optimizing an index

The core of ElasticSearch is based on Lucene, which stores the data in segments on disk. During the life of an index, a lot of segments are created and changed. As the number of segments increases, search speed decreases, due to the time required to read all of them. The optimize operation allows us to consolidate the index for faster search performance by reducing the number of segments.

Getting ready

You will need a working ElasticSearch cluster and the index created in the Creating an index recipe in this chapter.

How to do it...

The URL format to optimize one or more indices is:

http://<server>/<index_name(s)>/_optimize

The URL format to optimize all the indices in a cluster is:

http://<server>/_optimize

The HTTP method used is POST.

To optimize an index, we will perform the following steps:

1. If we consider the Index created in the Creating an index recipe, the call will be:

2. curl -XPOST 'http://localhost:9200/myindex/_optimize'

3. The result returned by ElasticSearch should be:

{"_shards":{"total":4,"successful":2,"failed":0}}

The result contains the shard operation status.

How it works...

Lucene stores your data in several segments on disk. These segments are created when you index a new document/record or when you delete a document. Their number can be large (for this reason, during setup, we increased the file descriptor limit for the ElasticSearch process).

Internally, ElasticSearch has a merger, which tries to reduce the number of segments, but it's designed to improve indexing performance rather than search performance. The optimize operation in Lucene tries to reduce the segments in an I/O-intensive way, by removing unused ones, purging deleted documents, and rebuilding the index with the minimum number of segments.

The main advantages are:

· Reducing the file descriptors

· Freeing memory used by the segment readers

· Improving the search performance due to less segment management

Note

Optimization is a very I/O-intensive operation. The index can be unresponsive during optimization. It is generally executed on indices that are rarely modified, such as old time-based logstash indices.

There's more…

You can pass several additional parameters to the optimize call (a combined example follows this list), such as:

· max_num_segments (by default autodetect): For full optimization, set this value to 1.

· only_expunge_deletes (by default false): Lucene does not physically delete documents from segments; it only marks them as deleted. This flag merges only the segments that contain deleted documents.

· flush (by default true): Using this parameter, ElasticSearch performs a flush after optimization.

· wait_for_merge (by default true): This parameter is used if the request needs to wait until the merge ends.

· force (by default false): Using this parameter, ElasticSearch executes the optimization even if the index is already optimized.
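For illustration, a full optimization call combining the parameters above; max_num_segments=1 forces the index down to a single segment, and wait_for_merge=true (the default, spelled out here for clarity) makes the call return only when the merge completes:

curl -XPOST 'http://localhost:9200/myindex/_optimize?max_num_segments=1&wait_for_merge=true'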

See also

· The Refreshing an index recipe in this chapter

· The Flushing an index recipe in this chapter

Checking if an index or type exists

A common pitfall is to query indices and types that don't exist. To prevent this issue, ElasticSearch gives the user the ability to check index and type existence.

This check is often used during an application startup to create indices and types that are required for it to work correctly.

Getting ready

You will need a working ElasticSearch cluster and the mapping available in the index, as described in the previous recipes.

How to do it...

The HTTP method to check the existence is HEAD.

The URL format for checking an index is:

http://<server>/<index_name>/

The URL format for checking a type is:

http://<server>/<index_name>/<type>/

To check if an index exists, we will perform the following steps:

1. If we consider the index created in the Creating an index recipe in this chapter, the call will be:

2. curl -i -XHEAD 'http://localhost:9200/myindex/'

The -i curl option allows us to dump the server headers.

3. If the index exists, an HTTP status code 200 is returned. If missing, then a 404 error is returned.

To check if a type exists, we will perform the following steps:

1. If we consider the mapping created in the Putting a mapping in an index recipe in this chapter, the call will be:

2. curl -i -XHEAD 'http://localhost:9200/myindex/order/'

3. If the type exists, an HTTP status code 200 is returned. If missing, then a 404 error is returned.

How it works...

This is a typical HEAD REST call to check existence. It doesn't return a body response, only the status code.

Tip

Before executing every action involved in indexing, generally upon application startup, it's good practice to check if an index or type exists to prevent future failures.
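As an illustration, a minimal shell sketch of this pattern, assuming a local node on port 9200 (curl's --head option sends a HEAD request and -w prints the response status code):

# check index existence and act on the HTTP status code
if [ "$(curl -s -o /dev/null -w '%{http_code}' --head 'http://localhost:9200/myindex/')" = "200" ]; then
echo "index exists"
else
echo "index missing: create it before indexing"
fi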

Managing index settings

Index settings are very important because they allow us to control several important ElasticSearch functionalities such as sharding/replicas, caching, term management, routing, and analysis.

Getting ready

You will need a working ElasticSearch cluster and the index created in the Creating an index recipe in this chapter.

How to do it...

To manage the index settings, we will perform the steps given as follows:

1. To retrieve the settings of your current Index, the URL format is the following:

http://<server>/<index_name>/_settings

2. We are reading information via REST API, so the method will be GET, and an example of a call using the index created in the Creating an index recipe, is:

3. curl -XGET 'http://localhost:9200/myindex/_settings'

4. The response will be something similar to:

5. {

6. "myindex" : {

7. "settings" : {

8. "index" : {

9. "uuid" : "pT65_cn_RHKmg1wPX7BGjw",

10. "number_of_replicas" : "1",

11. "number_of_shards" : "2",

12. "version" : {

13. "created" : "1020099"

14. }

15. }

16. }

17. }

}

The response attributes depend on the index settings. In this case, the response shows the number of replicas (1), the number of shards (2), and the index creation version (1020099). The UUID represents the unique ID of the index.

18. To modify the index settings, we need to use the PUT method. A typical settings change is to increase the replica number:

19. curl -XPUT 'http://localhost:9200/myindex/_settings' -d '

20.{"index":{ "number_of_replicas": "2"}}'

How it works...

ElasticSearch provides a lot of options to tune the index behavior, such as:

· Replica management:

· index.number_of_replicas: This is the number of replicas each shard has

· index.auto_expand_replicas: This parameter allows us to define a dynamic number of replicas related to the number of shards

Tip

Setting index.auto_expand_replicas to 0-all allows us to create an index that is replicated to every node (very useful for settings or cluster-propagated data such as language options/stopwords).

· Refresh interval (by default 1s): In the previous recipe, Refreshing an index, we saw how to manually refresh an index. The index settings (index.refresh_interval) control the rate of automatic refresh.

· Cache management: These settings (index.cache.*) control the cache size and its life. It is not common to change them (refer to the ElasticSearch documentation for all the available options at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-cache.html).

· Write management: ElasticSearch provides several settings to block read/write operations in an index and to change metadata. They live in the index.blocks settings (a sketch follows this list).

· Shard allocation management: These settings control how the shards must be allocated. They live in the index.routing.allocation.* namespace.
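As a sketch of the write management settings, the following calls put the index in read-only mode and then make it writable again (index.blocks.read_only is a standard index-level setting; the other block settings follow the same pattern):

curl -XPUT 'http://localhost:9200/myindex/_settings' -d '{"index.blocks.read_only": true}'

curl -XPUT 'http://localhost:9200/myindex/_settings' -d '{"index.blocks.read_only": false}'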

There are other index settings that can be configured for very specific needs. In every new version of ElasticSearch, the community extends these settings to cover new scenarios and requirements.

There's more…

The refresh_interval parameter provides several tricks to optimize indexing speed. It controls the rate of refresh, and refreshing reduces indexing performance due to the opening and closing of files. A good practice is to disable the refresh interval (set it to -1) during a big bulk indexing operation and to restore the default behavior after it. This can be done with the following steps:

1. Disabling the refresh:

2. curl -XPOST 'http://localhost:9200/myindex/_settings' -d '

3. {"index":{"refresh_interval": "-1"}}'

4. Bulk indexing some millions of documents

5. Restoring the refresh:

6. curl -XPOST 'http://localhost:9200/myindex/_settings' -d '

7. {"index":{"refresh_interval": "1s"}}'

8. Optionally, optimizing the index for search performances:

9. curl -XPOST 'http://localhost:9200/myindex/_optimize'

See also

· The Refreshing an index recipe in this chapter

· The Optimizing an index recipe in this chapter

Using index aliases

Real-world applications have a lot of indices and queries that span multiple indices. This scenario requires defining all the index names on which we need to perform queries; aliases allow us to group them under a common name.

Some common scenarios of this usage are:

· Log indices divided by date (such as log_YYMMDD) for which we want to create an alias for the last week, the last month, today, yesterday, and so on. This pattern is commonly used in log applications such as logstash (http://logstash.net/).

· Collecting website contents in several indices (New York Times, The Guardian, and so on) that we want to refer to with a single index alias called sites.

Getting ready

You will need a working ElasticSearch cluster.

How to do it...

The URL formats for controlling aliases are:

http://<server>/_aliases

http://<server>/<index>/_alias/<alias_name>

To manage the index aliases, we will perform the following steps:

1. We need to read the status of the aliases for all indices via the REST API, so the method will be GET, and an example of a call is:

curl -XGET 'http://localhost:9200/_aliases'

2. It should give a response similar to this:

3. {

4. "myindex": {

5. "aliases": {}

6. },

7. "test": {

8. "aliases": {}

9. }

}

Aliases can be changed with add and delete commands.

10. To read an alias for a single index, we use the _alias endpoint:

11.curl -XGET 'http://localhost:9200/myindex/_alias'

The result should be:

{

"myindex" : {

"aliases" : {

"myalias1" : { }

}

}

}

12. To add an alias:

13.curl -XPUT 'http://localhost:9200/myindex/_alias/myalias1'

The result should be:

{"acknowledged":true}

This action adds the myindex index to the myalias1 alias.

14. To delete an alias:

15.curl -XDELETE 'http://localhost:9200/myindex/_alias/myalias1'

The result should be:

{"acknowledged":true}

The delete action has now removed myindex from the alias myalias1.

How it works...

During search operations, ElasticSearch automatically expands the alias, so the required indices are selected. The alias metadata is kept in the cluster state. When an alias is added/deleted, all the changes are propagated to all the cluster nodes. Aliases are mainly functional structures that simplify managing indices when data is stored in multiple indices.
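Besides the per-index _alias endpoint shown above, aliases can also be changed atomically in a single call with the _aliases endpoint, which accepts a list of add/remove actions. A sketch that moves myalias1 from myindex to a hypothetical second index, myindex2:

curl -XPOST 'http://localhost:9200/_aliases' -d '{
"actions" : [
{ "remove" : { "index" : "myindex", "alias" : "myalias1" } },
{ "add" : { "index" : "myindex2", "alias" : "myalias1" } }
]
}'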

There's more…

An alias can also be used to define a filter and routing parameters.

Filters are automatically added to the query to filter out data. Routing via aliases allows us to control which shards to hit during searching and indexing.

An example of this call is:

curl -XPUT 'http://localhost:9200/myindex/_alias/user1alias' -d '

{

"filter" : {

"term" : { "user" : "user_1" }

},

"search_routing" : "1,2",

"index_routing" : "2"

}'

In this case, we add a new alias, user1alias, to an Index, myindex, adding:

· A filter to select only documents that match a field user with term user_1.

· A list of routing keys to select the shards to be used during the search.

· A routing key to be used during indexing. The routing value is used to modify the destination shard of the document.

Note

search_routing allows multi-value routing keys. index_routing is single value only.

Indexing a document

In ElasticSearch, there are two vital operations: indexing and searching.

Indexing means inserting one or more documents into an index; this is similar to the insert command of a relational database.

In Lucene, the core engine of ElasticSearch, inserting or updating a document has the same cost. In Lucene and ElasticSearch, update means replace.

Getting ready

You will need a working ElasticSearch cluster and the mapping that was created in the Putting a mapping in an index recipe in this chapter.

How to do it...

To index a document, several REST entry points can be used:

Method     URL

POST       http://<server>/<index_name>/<type>

PUT/POST   http://<server>/<index_name>/<type>/<id>

PUT/POST   http://<server>/<index_name>/<type>/<id>/_create

We will perform the following steps:

1. If we consider the type order mentioned in earlier chapters, the call to index a document will be:

2. curl -XPOST 'http://localhost:9200/myindex/order/2qLrAfPVQvCRMe7Ku8r0Tw' -d '{

3. "id" : "1234",

4. "date" : "2013-06-07T12:14:54",

5. "customer_id" : "customer1",

6. "sent" : true,

7. "in_stock_items" : 0,

8. "items":[

9. {"name":"item1", "quantity":3, "vat":20.0},

10. {"name":"item2", "quantity":2, "vat":20.0},

11. {"name":"item3", "quantity":1, "vat":10.0}

12. ]

}'

13. If the index operation is successful, the result returned by ElasticSearch should be:

14.{

15. "_index":"myindex",

16. "_type":"order",

17. "_id":"2qLrAfPVQvCRMe7Ku8r0Tw",

18. "_version":1,

19. "created":true

}

Some additional information is returned from an indexing operation such as:

· An auto-generated ID, if not specified

· The version of the indexed document as per the Optimistic Concurrency Control

· Information if the record has been created

How it works...

One of the most used APIs in ElasticSearch is the index. Basically, indexing a JSON document consists of these steps:

· Routing the call to the correct shard based on the ID or routing/parent metadata. If the ID is not supplied by the client, a new one is created. (See Chapter 1, Getting Started, for more details).

· Validating the JSON which has been sent.

· Processing the JSON according to the mapping. If new fields are present in the document (the mapping can be updated), new fields are added in the mapping.

· Indexing the document in the shard. If the ID already exists, it is then updated.

· If it contains nested documents, it extracts them and processes them separately.

· Returning information about the saved document (ID and versioning).

It's important to choose the correct ID for indexing your data. If you don't provide an ID to ElasticSearch during the indexing phase, it automatically associates a new ID with your document. To improve performance, IDs should generally be of the same character length to improve the balancing of the data tree that holds them.

Due to the REST call nature, it's better to pay attention when using non-ASCII characters, because of URL encoding and decoding (or be sure that the client framework you use escapes them correctly).

Depending on the mappings, other actions take place during the indexing phase, such as the propagation on replica, nested processing, and the percolator.

The document will be available for standard search calls after a refresh (forced with an API call or after the near-real-time interval of 1 second). A GET API call on the document doesn't require a refresh; the document is available instantly.

The refresh can also be forced by specifying the refresh parameter during indexing.

There's more…

ElasticSearch allows the passing of several query parameters in the index API URL for controlling how the document is indexed. The most commonly used ones are:

· routing: This controls the shard to be used for indexing, for example:

· curl -XPOST 'http://localhost:9200/myindex/order?routing=1'

· parent: This defines the parent of a child document and uses this value to apply routing. The parent object must be specified in the mappings, such as:

· curl -XPOST 'http://localhost:9200/myindex/order?parent=12'

· timestamp: This is the timestamp to be used in indexing the document. It must be activated in the mappings, such as in the following:

· curl -XPOST 'http://localhost:9200/myindex/order?timestamp=2013-01-25T19%3A22%3A22'

· consistency (one/quorum/all): By default, an index operation succeeds only if a quorum (>replicas/2+1) of active shards is available. The write consistency value can be changed for indexing:

· curl -XPOST 'http://localhost:9200/myindex/order?consistency=one'

· replication (sync/async): By default, ElasticSearch returns from an index operation when all the shards of the current replication group have executed the operation. Setting replication to async allows us to execute the index operation synchronously only on the primary shard and asynchronously on the other shards, returning from the call faster.

· curl -XPOST 'http://localhost:9200/myindex/order?replication=async'

· version: This allows us to use Optimistic Concurrency Control (http://en.wikipedia.org/wiki/Optimistic_concurrency_control). On the first indexing of a document, version is set to 1 by default. At every update, this value is incremented. Optimistic Concurrency Control is a way to manage concurrency in every insert or update operation. The version value passed is the last seen one (usually returned by a GET or a search); the indexing happens only if the current version value of the document is equal to the passed one:

· curl -XPOST 'http://localhost:9200/myindex/order?version=2'

· op_type: This can be used to force a create on a document. If a document with the same ID exists, the index operation fails.

· curl -XPOST 'http://localhost:9200/myindex/order?op_type=create'…

· refresh: This forces a refresh after having the document indexed. It allows us to have the documents ready for search after indexing them:

· curl -XPOST 'http://localhost:9200/myindex/order?refresh=true'…

· ttl: This allows us to define a time to live for a document. All documents whose ttl has expired are deleted and purged from the index. This feature is very useful for defining records with a fixed life. It only works if ttl is explicitly enabled in the mapping. The value can be a date-time or a time value (a numeric value ending with s, m, h, or d). The following is the command:

· curl -XPOST 'http://localhost:9200/myindex/order?ttl=1d'

· timeout: This defines the time to wait for the primary shard to be available. Sometimes, the primary shard can be in an unwritable state (relocating or recovering from a gateway), and by default a timeout for the write operation is raised after 1 minute.

· curl -XPOST 'http://localhost:9200/myindex/order?timeout=5m' …

See also

· The Getting a document recipe in this chapter

· The Deleting a document recipe in this chapter

· The Updating a document recipe in this chapter

· Optimistic Concurrency Control at http://en.wikipedia.org/wiki/Optimistic_concurrency_control

Getting a document

After having indexed a document, your application will most likely need to retrieve it at some point during its life.

The GET REST call allows us to get a document in real time without the need of a refresh.

Getting ready

You will need a working ElasticSearch cluster and the indexed document of the Indexing a document recipe.

How to do it...

The GET method allows us to return a document given its index, type, and ID.

The REST API URL is:

http://<server>/<index_name>/<type_name>/<id>

To get a document, we will perform the following steps:

1. If we consider the document we indexed in the previous recipe, the call will be:

2. curl -XGET 'http://localhost:9200/myindex/order/2qLrAfPVQvCRMe7Ku8r0Tw?pretty=true'

3. The result returned by ElasticSearch should be the indexed document:

4. {

5. "_index":"myindex","_type":"order","_id":"2qLrAfPVQvCRMe7Ku8r0Tw","_version":1,"found":true, "_source" : {

6. "id" : "1234",

7. "date" : "2013-06-07T12:14:54",

8. "customer_id" : "customer1",

9. "sent" : true,

10. "items":[

11. {"name":"item1", "quantity":3, "vat":20.0},

12. {"name":"item2", "quantity":2, "vat":20.0},

13. {"name":"item3", "quantity":1, "vat":10.0}

14. ]

}}

Our indexed data is contained in the _source parameter, but other information is returned as well:

· _index: This is the index that stores the document

· _type: This denotes the type of the document

· _id: This denotes the ID of the document

· _version: This denotes the version of the document

· found: This denotes if the document has been found

15. If the record is missing, a 404 error is returned as the status code and the return JSON will be:

16.{

17. "_id": "2qLrAfPVQvCRMe7Ku8r0Tw",

18. "_index": "myindex",

19. "_type": "order",

20. "found": false

}

How it works...

The ElasticSearch GET API doesn't require a refresh on the document. All GET calls are in real time. This call is fast because ElasticSearch searches only the shard that contains the record, without any other overhead, and the document IDs are often cached in memory for faster lookup.

The source of the document is only available if the _source field is stored (the default setting in ElasticSearch).

There are several additional parameters that can be used to control the GET call:

· fields: This allows us to retrieve only a subset of fields. This is very useful to reduce bandwidth or to retrieve calculated fields such as the attachment mapping ones:

· curl 'http://localhost:9200/myindex/order/2qLrAfPVQvCRMe7Ku8r0Tw?fields=date,sent'

· routing: This allows us to specify the shard to be used for the GET operation. To retrieve a document indexed with a routing value, the same routing value must also be provided at fetch time:

· curl 'http://localhost:9200/myindex/order/2qLrAfPVQvCRMe7Ku8r0Tw?routing=customer_id'

· refresh: This allows us to refresh the current shard before doing the GET operation. (It must be used with care because it slows down indexing and introduces some overhead):

· curl http://localhost:9200/myindex/order/2qLrAfPVQvCRMe7Ku8r0Tw?refresh=true

· preference: This allows us to control which shard replica is chosen to execute the GET operation. Generally, ElasticSearch chooses a random shard for the GET call. Possible values are:

· _primary: This is used for the primary shard.

· _local: This is used for trying the local shard first and then falling back to a random choice. Using the local shard reduces the bandwidth usage and should generally be used with auto-replicating shards (with the replica set to 0).

· custom value: This is used for selecting shard-related values such as the customer_id, username, and so on.

There's more…

The GET API is fast, so a good practice for developing applications is to try to use it as much as possible. Choosing the correct ID during application development can lead to a big boost in performance.

If the ID of the document is not known, fetching it requires a query with an ID filter (we will learn about it in the Using an ID query/filter recipe in Chapter 5, Search, Queries, and Filters).

If you don't need to fetch the record but only need to check its existence, you can replace GET with HEAD; the response will be status code 200 if the document exists, or a 404 error if it is missing.
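For example, using curl's --head option (which sends a HEAD request), the status line alone is enough to verify existence:

curl --head 'http://localhost:9200/myindex/order/2qLrAfPVQvCRMe7Ku8r0Tw'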

The GET call also has a special endpoint, _source, that allows fetching only the source of the document.

The GET Source REST API URL is:

http://<server>/<index_name>/<type_name>/<id>/_source

To fetch the source of the previous order, we will call:

curl -XGET http://localhost:9200/myindex/order/2qLrAfPVQvCRMe7Ku8r0Tw/_source

See also

· The Speeding up GET operations (multi GET) recipe in this chapter

Deleting a document

Deleting documents in ElasticSearch is possible in two ways: by using the DELETE call or the DELETE BY QUERY call; we will learn about the latter in the next chapter.

Getting ready

You will need a working ElasticSearch cluster and the indexed document of the Indexing a document recipe in this chapter.

How to do it...

The REST API URL is the same as the GET calls, but the HTTP method is DELETE:

http://<server>/<index_name>/<type_name>/<id>

To delete a document, we will perform the following steps:

1. If we consider the order index in the Indexing a document recipe, the call to delete a document will be:

2. curl -XDELETE 'http://localhost:9200/myindex/order/2qLrAfPVQvCRMe7Ku8r0Tw'

3. The result returned by ElasticSearch should be:

4. {

5. "_id": "2qLrAfPVQvCRMe7Ku8r0Tw",

6. "_index": "myindex",

7. "_type": "order",

8. "_version": 2,

9. "found": true

}

10. If the record is missing, a 404 error is returned as the status code and the return JSON will be:

11.{

12. "_id": "2qLrAfPVQvCRMe7Ku8r0Tw",

13. "_index": "myindex",

14. "_type": "order",

15. "_version": 2,

16. "found": false

}

How it works...

Deleting a record only hits the shards that contain the document, so there is no overhead.

If the document is a child, the parent must be set to look for the correct shard.

There are several additional parameters that can be used to control the DELETE call. The most important ones are:

· routing: This allows us to specify the shard to be used for the DELETE operation

· version: This allows us to define the version of the document to be deleted, so the delete fails if the document has been modified in the meantime

· parent: This is similar to routing, and is required if the document is a child one

Tip

The DELETE operation doesn't have restore functionality. Every document that is deleted is lost forever.

Deleting a record is a fast operation, and is easy to use if the IDs of the documents to delete are available. Otherwise, we must use the DELETE BY QUERY call, which we will explore in the next chapter.

See also

· The Deleting by query recipe in Chapter 5, Search, Queries, and Filters

Updating a document

Documents stored in ElasticSearch can be updated at any time during their lives. There are two available solutions to perform this operation in ElasticSearch: adding the new document, or using the update call.

The update call works in two ways:

1. By providing a script (based on supported ElasticSearch scripting languages) which contains the code that must be executed to update the record

2. By providing a document that must be merged with the original one

The main advantage of an update over an index operation is the reduction in network traffic.

Getting ready

You will need a working ElasticSearch cluster and the indexed document of the Indexing a document recipe in this chapter. To use the examples, dynamic scripting languages must be enabled (see Chapter 7, Scripting, to learn more).

How to do it...

As we are changing the state of the data, the HTTP method is POST and the REST URL is:

http://<server>/<index_name>/<type_name>/<id>/_update

To update a document, we will perform the following steps:

1. If we consider the type order of the previous recipe, the call to update a document will be:

2. curl -XPOST 'http://localhost:9200/myindex/order/2qLrAfPVQvCRMe7Ku8r0Tw/_update' -d '{

3. "script" : "ctx._source.in_stock_items += count",

4. "params" : {

5. "count" : 4

}}'

6. If the request is successful, the result returned by ElasticSearch should be:

7. {

8. "_id": "2qLrAfPVQvCRMe7Ku8r0Tw",

9. "_index": "myindex",

10. "_type": "order",

11. "_version": 3,

12. "found": true,

13. "ok": true

}

14. The record will be:

15.{

16. "_id": "2qLrAfPVQvCRMe7Ku8r0Tw",

17. "_index": "myindex",

18. "_source": {

19. "customer_id": "customer1",

20. "date": "2013-06-07T12:14:54",

21. "id": "1234",

22. "in_stock_items": 4,

23.…

24. "sent": true

25. },

26. "_type": "order",

27. "_version": 3,

28. "exists": true

}

The visible changes are:

· The scripted field is changed

· The version is incremented

29. If you are using ElasticSearch version 1.2 or above with dynamic scripting disabled (the default configuration), an error will be raised:

30.{

31. "error":"ElasticsearchIllegalArgumentException[failed to execute script]; nested: ScriptException[dynamic scripting disabled]; ",

32. "status":400

}

How it works...

The update operation applies changes to the document required in the script or in the update document, and it will reindex the changed document. In Chapter 7, Scripting, we will explore the scripting capabilities of ElasticSearch.

The standard language for scripting in ElasticSearch is Groovy (http://groovy.codehaus.org/), which is used in the examples.

The script can operate on ctx._source, which is the source of the document (it must be stored to work), and can change the document in place.

It's possible to pass parameters to a script by passing a JSON object. These parameters are available in the execution context.

A script can control the ElasticSearch behavior after the script execution by setting the ctx.op value of the context. Available values are:

· ctx.op="delete": Using this, the document will be deleted after the script execution.

· ctx.op="none": Using this, the document will skip the indexing process. A good practice to improve performance is to set the ctx.op="none" to prevent reindexing overhead if the script doesn't update the document.

In the ctx, it is also possible to change the time to live of a document by setting the ctx._ttl parameter.

The ctx parameter also manages the timestamp of the record in ctx._timestamp.

It's also possible to pass an additional object in the upsert property to be used if the document is not available in the index:

curl -XPOST 'http://localhost:9200/myindex/order/2qLrAfPVQvCRMe7Ku8r0Tw/_update' -d '{

"script" : "ctx._source.in_stock_items += count",

"params" : {

"count" : 4

},

"upsert" : {"in_stock_items":4}}'

If you only need to replace some field values, a good solution is to not write a complex update script, but to use the special property doc, which allows us to overwrite the values of an object. The document provided in the doc parameter will be merged with the original one. This approach is easier to use, but it cannot set ctx.op, so if the update doesn't change the values of the original document, the successive reindexing phase will always be executed:

curl -XPOST 'http://localhost:9200/myindex/order/2qLrAfPVQvCRMe7Ku8r0Tw/_update' -d '{"doc" : {"in_stock_items":10}}'

If the original document is missing, it is possible to use the provided doc for an upsert by providing the doc_as_upsert parameter:

curl -XPOST 'http://localhost:9200/myindex/order/2qLrAfPVQvCRMe7Ku8r0Tw/_update' -d '{"doc" : {"in_stock_items":10}, "doc_as_upsert":true}'

Using scripting, it is possible to apply advanced operations on fields, such as:

· Removing a field:

"script" : {"ctx._source.remove("myfield"}}

· Adding a new field:

"script" : {"ctx._source.myfield=myvalue"}}

The update REST call is useful because it has some advantages:

· It reduces bandwidth usage, as the update operation doesn't require a round trip of the document data to the client.

· It's safer, because it automatically manages Optimistic Concurrency Control. If a change happens during the script execution, the script is re-executed with the updated data.

· It can be bulk executed.

See also

· The Speeding up atomic operations recipe in this chapter (the next recipe)

Speeding up atomic operations (bulk operations)

When we are inserting, deleting, or updating a large number of documents, the HTTP overhead is significant. To speed up the process, ElasticSearch allows the execution of bulk CRUD (Create, Read, Update, Delete) calls.

Getting ready

You will need a working ElasticSearch cluster.

How to do it...

As we are changing the state of the data, the HTTP method is POST and the REST URL is:

http://<server>/_bulk

http://<server>/<index_name>/_bulk

To execute a bulk action, we will perform the steps given as follows:

1. We need to collect the create, index, delete, and update commands in a structure made of bulk JSON lines, composed of an action line with metadata and an optional data line related to the action. Every line must end with a newline, \n. The bulk data file should look like the following:

2. { "index":{ "_index":"myindex", "_type":"order", "_id":"1" } }

3. { "field1" : "value1", "field2" : "value2" }

4. { "delete":{ "_index":"myindex", "_type":"order", "_id":"2" } }

5. { "create":{ "_index":"myindex", "_type":"order", "_id":"3" } }

6. { "field1" : "value1", "field2" : "value2" }

7. { "update":{ "_index":"myindex", "_type":"order", "_id":"3" } }

{ "doc":{"field1" : "value1", "field2" : "value2" }}

8. This file can be sent with POST:

9. curl -s -XPOST localhost:9200/_bulk --data-binary @bulkdata

10. The result returned by ElasticSearch should collect all the responses of the actions.
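An illustrative, truncated response follows; it reports one result item per action, in the same order as the request, plus a global errors flag:

{
"took" : 3,
"errors" : false,
"items" : [
{ "index" : { "_index" : "myindex", "_type" : "order", "_id" : "1", "_version" : 1, "status" : 201 } },
…
]
}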

How it works...

The bulk operation allows the aggregation of different calls into a single call: each command has a header part with the action to be performed and, for operations such as index, create, and update, a body with the data.

The header is composed of the action name and an object of parameters. Looking at the previous example, for the index action we have:

{ "index":{ "_index":"myindex", "_type":"order", "_id":"1" } }

For indexing and creating, an extra body is required with the data:

{ "field1" : "value1", "field2" : "value2" }

The delete action doesn't require optional data, so it is composed of the header only:

{ "delete":{ "_index":"myindex", "_type":"order", "_id":"1" } }

From version 0.90 onward, ElasticSearch also allows the execution of bulk updates:

{ "update":{ "_index":"myindex", "_type":"order", "_id":"3" } }

The header accepts all the common parameters of the update action, such as doc, upsert, doc_as_upsert, lang, script, and params. To control the number of retries in the case of concurrency, the bulk update defines the parameter _retry_on_conflict, set to the number of retries to be performed before raising an exception.

A possible body for the update is:

{ "doc":{"field1" : "value1", "field2" : "value2" }}

The bulk item can accept several parameters, such as:

· routing: To control the routing shard

· parent: To select a parent item shard, it is required if you are indexing some child documents

· timestamp: To set the index item timestamp

· ttl: To control the time to live of a document

Global bulk parameters that can be passed through query arguments are:

· consistency (one, quorum, all) (by default, quorum): This controls the number of active shards required before executing write operations.

· refresh (by default, false): This forces a refresh in the shards involved in the bulk operations. The newly indexed documents will be available immediately, without waiting for the standard refresh interval of 1s.

Usually, ElasticSearch client libraries that use the ElasticSearch REST API automatically implement the serialization of bulk commands.

The correct number of commands to serialize in bulk is a user choice, but there are some hints to consider:

· In the standard configuration, ElasticSearch limits the HTTP call size to 100 megabytes. If the size exceeds the limit, the call is rejected.

· Multiple complex commands take a lot of time to be processed, so pay attention to client timeout.

· Bulks composed of very few commands don't improve performance.

If the documents aren't big, 500 commands per bulk can be a good number to start with, and it can be tuned depending on the data structure (number of fields, number of nested objects, complexity of fields, and so on).
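As a sketch, a shell loop that sends a large bulk file in chunks of 500 commands; it assumes every action spans exactly two lines (true for index/create/update with a body, but not for delete, so adapt it to your file):

# split into 1,000-line chunks (500 two-line commands) and post each one
split -l 1000 bulkdata bulkchunk_
for f in bulkchunk_*; do
curl -s -XPOST 'localhost:9200/_bulk' --data-binary @"$f" > /dev/null
done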

See also

· Bulk API can also be used via UDP. See ElasticSearch documentation for more details at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-bulk-udp.html.

Speeding up GET operations (multi GET)

The standard GET operation is very fast, but if you need to fetch a lot of documents by ID, ElasticSearch provides the multi GET operation.

Getting ready

You will need a working ElasticSearch Cluster and the document indexed from the Indexing a document recipe in this chapter.

How to do it...

The multi GET REST URLs are:

http://<server>/_mget

http://<server>/<index_name>/_mget

http://<server>/<index_name>/<type_name>/_mget

To execute a multi GET action, we will perform the following steps:

1. The method is POST with a body that contains a list of document IDs and the Index/type if they are missing. As an example, using the first URL, we need to provide the Index, type, and ID:

2. curl -XPOST 'localhost:9200/_mget' -d '{

3. "docs" : [

4. {

5. "_index" : "myindex",

6. "_type" : "order",

7. "_id" : "2qLrAfPVQvCRMe7Ku8r0Tw"

8. },

9. {

10. "_index" : "myindex",

11. "_type" : "order",

12. "_id" : "2"

13. }

14. ]

15.}'

This kind of call allows fetching documents in several different indices and types.

16. If the index and type are fixed, the call can also be in the form of:

17.curl 'localhost:9200/test/type/_mget' -d '{

18. "ids" : ["1", "2"]

19.}'

The multi GET result is an array of documents.
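An illustrative, truncated response follows; note that missing documents are returned with found set to false rather than failing the whole call:

{
"docs" : [
{ "_index" : "myindex", "_type" : "order", "_id" : "2qLrAfPVQvCRMe7Ku8r0Tw", "_version" : 1, "found" : true, "_source" : { … } },
{ "_index" : "myindex", "_type" : "order", "_id" : "2", "found" : false }
]
}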

How it works...

The multi GET call is a shortcut for executing many GET commands in one shot.

ElasticSearch internally spreads the GET in parallel on several shards and collects the results to return to the user.

The GET object can contain the following parameters:

· _index: The index that contains the document; it can be omitted if passed in the URL

· _type: The type of the document; it can be omitted if passed in the URL

· _id: The document ID

· fields (optional): A list of fields to retrieve

· routing (optional): The shard routing parameter

The advantages of the multi GET operation are:

· Reduced networking traffic both internally and externally in ElasticSearch

· Increased application performance, because the time needed to process a multi GET is quite similar to that of a standard GET

See also

· The Getting a document recipe in this chapter