Java Integration - ElasticSearch Cookbook, Second Edition (2015)

ElasticSearch Cookbook, Second Edition (2015)

Chapter 10. Java Integration

In this chapter we will cover the following recipes:

· Creating an HTTP client

· Creating a native client

· Managing indices with the native client

· Managing mappings

· Managing documents

· Managing bulk actions

· Building a query

· Executing a standard search

· Executing a search with aggregations

· Executing a scroll/scan search

Introduction

ElasticSearch functionalities can be easily integrated in any Java application in several ways, both via the REST API and native ones.

With the use of Java it's easy to call a REST HTTP interface with one of the many libraries available, such as Apache HttpComponents Client (http://hc.apache.org/). In this field there's no such thing as a most-used library; typically developers choose the library that suits their preferences the best or that they know very well.

Each JVM language can also use the native protocol to integrate ElasticSearch with their applications. The native protocol, discussed in Chapter 1, Getting Started, is one of the faster protocols available to communicate with ElasticSearch due to many factors, such as its binary nature, the fast native serializer/deserializer of data, the asynchronous approach for communicating, and the hop reduction (native client nodes are able to communicate directly with the node that contains the data without executing the double hop needed in REST calls).

The main disadvantage of using native protocol is that it evolves during the development life cycle of ElasticSearch and there is no guarantee of compatibility between versions. For example, if a field of a request or a response changes, their binary serialization changes, generating incompatibilities between client and server with different versions.

The ElasticSearch community tries not to change the native API often but, in every version, some parts of ElasticSearch are improved, and these changes often modify the native API call signatures, thus breaking applications. It is recommended to use the REST API when integrating with ElasticSearch, as it is much more stable between versions.

In this chapter, we will see how to initialize different clients and how to execute the commands that we have seen in the previous chapters. We will not go into every call in depth as we have already described the REST API ones.

ElasticSearch uses the native protocol and API internally, so these are the most tested ones compared to the REST calls, thanks to the unit and integration tests available in the ElasticSearch code base. The official documentation for the native Java API is available at http://www.elasticsearch.org/guide/en/elasticsearch/client/java-api/current/, but it doesn't cover all the API calls.

If you want a complete set of examples, they are available in the src/test directory.

As we have already discussed in Chapter 1, Getting Started, the ElasticSearch community recommends using the REST calls when integrating, as they are more stable between releases and well documented.

All the code presented in these recipes is available in the book code repository and can be built with Maven.

Creating an HTTP client

An HTTP client is one of the easiest clients to create. It's very handy because it allows calling not only internal methods as the native protocol does, but also third-party calls implemented in plugins that can be called only via HTTP.

Getting ready

You will need a working ElasticSearch cluster and Maven installed. The code of this recipe is in the chapter_10/http_client directory, present in the code bundle available on the Packt website and on GitHub (https://github.com/aparo/elasticsearch-cookbook-second-edition).

How to do it...

To create an HTTP client, we will perform the following steps:

1. For these examples, we have chosen Apache HttpComponents, one of the most widely used libraries for executing HTTP calls. This library is available in the main Maven repository (search.maven.org). To enable compilation in your Maven pom.xml project, just add:

<dependency>
  <groupId>org.apache.httpcomponents</groupId>
  <artifactId>httpclient</artifactId>
  <version>4.3.5</version>
</dependency>

2. If we want to instantiate a client and fetch a document with a get method, the code will look like this:

import org.apache.http.*;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.*;

public class App {
    private static String wsUrl = "http://127.0.0.1:9200";

    public static void main(String[] args) {
        CloseableHttpClient client = HttpClients.custom()
                .setRetryHandler(new MyRequestRetryHandler()).build();
        HttpGet method = new HttpGet(wsUrl + "/test-index/test-type/1");
        // Execute the method.
        try {
            CloseableHttpResponse response = client.execute(method);
            if (response.getStatusLine().getStatusCode() != HttpStatus.SC_OK) {
                System.err.println("Method failed: " + response.getStatusLine());
            } else {
                HttpEntity entity = response.getEntity();
                String responseBody = EntityUtils.toString(entity);
                System.out.println(responseBody);
            }
        } catch (IOException e) {
            System.err.println("Fatal transport error: " + e.getMessage());
            e.printStackTrace();
        } finally {
            // Release the connection.
            method.releaseConnection();
        }
    }
}

The result, if the document is available, will be:

{"_index":"test-index","_type": "test-type","_id":"1","_version":1,"exists":true, "_source" : {…}}

How it works...

We performed the preceding steps to create and use an HTTP client.

The first step is to initialize the HTTP client object. In the previous code this is done using the following:

CloseableHttpClient client = HttpClients.custom()

.setRetryHandler(new MyRequestRetryHandler()).build();

Before using the client, it is good practice to customize it. In general, the client can be extended to provide extra functionality, such as retry support. Retry support is very important for designing robust applications; the IP network protocol is never 100 percent reliable, so automatically retrying an action if something goes wrong (an HTTP connection closed, a server overload, and so on) is good practice.

In the previous code, we have defined a custom HttpRequestRetryHandler (MyRequestRetryHandler), which monitors the execution and retries the request three times before raising an error.
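The MyRequestRetryHandler class referenced above is not shown in the snippet; a minimal sketch of such a handler, assuming we simply retry up to three times on any IOException, could look like this:

import java.io.IOException;

import org.apache.http.client.HttpRequestRetryHandler;
import org.apache.http.protocol.HttpContext;

// A minimal retry handler: retry the request up to three times on any IOException.
// A production handler would usually be more selective (for example, skipping
// retries for non-idempotent requests or for SSL handshake failures).
public class MyRequestRetryHandler implements HttpRequestRetryHandler {
    @Override
    public boolean retryRequest(IOException exception, int executionCount, HttpContext context) {
        return executionCount <= 3;
    }
}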

After having set up the client, we can define the method call. In the previous example, we want to execute the GET REST call, so the method used is HttpGet and the URL is the item's index/type/id (similar to the curl example in the Getting a document recipe in Chapter 4, Basic Operations). To initialize the method, the code is as follows:

HttpGet method = new HttpGet(wsUrl+"/test-index/test-type/1");

To improve the quality of our REST call it's a good practice to add extra controls to the method such as authentication and custom headers.

The ElasticSearch server by default doesn't require authentication, so we need to provide a security layer at the top of our architecture. A typical scenario is using your HTTP client with the Jetty plugin (https://github.com/sonian/elasticsearch-jetty), which allows extending the ElasticSearch REST API with authentication and SSL. After the plugin is installed and configured on the server, the following code adds a host entry so that credentials are provided only if calls target that host. The authentication is simple basic auth, but it works very well for non-complex deployments.

HttpHost targetHost = new HttpHost("localhost", 9200, "http");

CredentialsProvider credsProvider = new BasicCredentialsProvider();
credsProvider.setCredentials(
        new AuthScope(targetHost.getHostName(), targetHost.getPort()),
        new UsernamePasswordCredentials("username", "password"));

// Create AuthCache instance
AuthCache authCache = new BasicAuthCache();
// Generate BASIC scheme object and add it to the local auth cache
BasicScheme basicAuth = new BasicScheme();
authCache.put(targetHost, basicAuth);

// Add AuthCache to the execution context
HttpClientContext context = HttpClientContext.create();
context.setCredentialsProvider(credsProvider);
context.setAuthCache(authCache);

The created context must be used when executing the call, as follows:

response = client.execute(method, context);

Custom headers allow passing extra information to the server when executing a call, such as an API key or hints about supported formats. A typical example is using gzip data compression over HTTP to reduce bandwidth usage. To do that, we can add an Accept-Encoding header to the call, informing the server that our client accepts gzip-compressed responses:

request.addHeader("Accept-Encoding", "gzip");

After having configured the call with all the parameters, we can fire up the request as follows:

response = client.execute(method, context);

Every response object must be validated on its return status. If the call is OK, the return status should be 200. In the preceding code the check is done in the if statement as follows:

if (response.getStatusLine().getStatusCode() != HttpStatus.SC_OK)

If the call is OK and the status code of the response is 200, we can read the answer:

HttpEntity entity = response.getEntity();

String responseBody = EntityUtils.toString(entity);

The response is wrapped in an HttpEntity, which is a stream. The HTTP client library provides a helper method, EntityUtils.toString, that reads all the content of the HttpEntity as a string; otherwise, we would need to write code that reads from the stream and builds the string ourselves.
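For illustration only, a minimal sketch of reading the stream by hand (inside the same try/catch used above; it assumes imports for BufferedReader, InputStreamReader, and StandardCharsets) could be:

// What EntityUtils.toString saves us from writing: read the entity stream line by line.
StringBuilder sb = new StringBuilder();
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(entity.getContent(), StandardCharsets.UTF_8))) {
    String line;
    while ((line = reader.readLine()) != null) {
        sb.append(line).append('\n');
    }
}
String responseBody = sb.toString();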

Obviously, all the reading part of the call is wrapped in a try/catch block to handle the possible errors raised by networking.

There's more

Apache HttpComponents is one of the most used libraries in the Java world to write a REST API client. It provides a lot of out-of-the-box advanced features such as cookies, authentication, and transport layers.

Tip

There isn't any recommended client for HTTP REST calls in the ElasticSearch community. One of the Java libraries written to solve this problem is Jest (https://github.com/searchbox-io/Jest) but, at the time of writing this book, it is not yet feature complete.

See also

· The Apache HttpComponents on http://hc.apache.org/

· The Jetty plugin to provide authenticated ElasticSearch access on https://github.com/sonian/elasticsearch-jetty

· Jest on https://github.com/searchbox-io/Jest

· The Using the HTTP protocol recipe in Chapter 1, Getting started

· The Getting a document recipe in Chapter 4, Basic Operations

Creating a native client

There are two ways to create a native client in order to communicate with an ElasticSearch server:

· Creating a client node (a node that doesn't contain data, but works as an arbiter) and getting the client from it. This node will appear in the cluster state nodes and it's able to use the discovery capabilities of ElasticSearch to join the cluster (so no node address is required to connect to a cluster). This client is able to reduce node routing due to its knowledge of cluster topology.

· Creating a transport client, which is a standard client that requires the address and port of nodes to connect.

In this recipe we will see how to create these clients.

Getting ready

You will need a working ElasticSearch cluster and a working copy of Maven.

The code of this recipe is in chapter_10/nativeclient in the code bundle available on Packt's website and on GitHub (https://github.com/aparo/elasticsearch-cookbook-second-edition).

How to do it...

To create a native client, we will perform the following steps:

1. Before starting, make sure that Maven loads the elasticsearch.jar by adding it to pom.xml as follows:

<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch</artifactId>
  <version>1.4.0</version>
</dependency>

Tip

I always suggest using the latest available release of ElasticSearch or, in the case of connection to a specific cluster, the same version of ElasticSearch as the cluster.

Native clients only work if the client and the server have the same ElasticSearch version.

2. Now, to create a client, we have two ways:

· Using a node:

import static org.elasticsearch.node.NodeBuilder.*;

// on startup
Node node = nodeBuilder().clusterName("elasticsearch").client(true).node();
Client client = node.client();
// on shutdown
node.close();

· Using the transport protocol:

final Settings settings = ImmutableSettings.settingsBuilder()
    .put("client.transport.sniff", true)
    .put("cluster.name", "elasticsearch").build();
Client client = new TransportClient(settings)
    .addTransportAddress(new InetSocketTransportAddress("127.0.0.1", 9300));

How it works...

The first way to create a native client is to create a node, set it as a client node, and retrieve the client from it. The steps are:

1. Import the NodeBuilder class:

import static org.elasticsearch.node.NodeBuilder.*;

2. Initialize an ElasticSearch node by passing cluster.name and indicating that it's a client node (otherwise, it would be a standard node: after joining the cluster, it would fetch data from shards to load-balance the cluster):

Node node = nodeBuilder().clusterName("elasticsearch").client(true).node();

3. We can now retrieve a client from the node using the following line of code:

Client client = node.client();

4. If the client is retrieved from an embedded node, before closing the application, we need to free the resource needed by the node. This can be done by calling the close method on the node:

node.close();

The second way to create a native client is to create a transport client.

The steps to create a transport client are:

1. Create the settings required to configure the client. Typically they hold the cluster name and some other options that we'll discuss later:

final Settings settings = ImmutableSettings.settingsBuilder()
    .put("client.transport.sniff", true)
    .put("cluster.name", "elasticsearch").build();

2. Now we can create the client by passing it the settings, addresses, and port of our cluster as follows:

Client client = new TransportClient(settings)
    .addTransportAddress(new InetSocketTransportAddress("127.0.0.1", 9300));

The addTransportAddress method can be called several times until all the required addresses and ports are set.
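For example, assuming a second node listening on 192.168.1.2:9300 (an address chosen here purely for illustration), a sketch registering both nodes would be:

Client client = new TransportClient(settings)
    .addTransportAddress(new InetSocketTransportAddress("127.0.0.1", 9300))
    .addTransportAddress(new InetSocketTransportAddress("192.168.1.2", 9300));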

Using either of these approaches, the result is the same: a working client that allows you to execute native calls on an ElasticSearch server.

In both approaches, it is important to correctly define the name of the cluster; otherwise there will be problems joining the cluster, or the transport client will warn you about an invalid cluster name.

The node created in the first approach is a complete ElasticSearch node, so pay attention to defining it as a client node (client(true)): it joins the cluster, but it does not hold data.

There's more

There are several settings that can be passed when creating a transport client. They are listed as follows:

· client.transport.sniff: This is false by default. If activated, the client retrieves the addresses of the other nodes after the first connection, reading them from the cluster state and reconstructing the cluster topology.

· client.transport.ignore_cluster_name: This is by default false. If you set it to true, cluster name validation of connected nodes is ignored. This prevents printing a warning if the client cluster name is different from the connected cluster name.

· client.transport.ping_timeout: This is by default set to 5s. Every client pings the node to check its state. This value defines how much time a client should wait before a timeout.

· client.transport.nodes_sampler_interval: This is also set to 5s by default. This interval defines how often to sample/ping the listed and connected nodes. These pings reduce failures if a node is down and allow balancing the requests across the available nodes.
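Putting these options together, a sketch of a transport client configured explicitly with all of them (the values shown are the defaults, except for sniffing) might look like this:

final Settings settings = ImmutableSettings.settingsBuilder()
    .put("cluster.name", "elasticsearch")                 // must match the cluster name
    .put("client.transport.sniff", true)                  // discover other nodes from the cluster state
    .put("client.transport.ignore_cluster_name", false)   // keep cluster name validation
    .put("client.transport.ping_timeout", "5s")           // ping timeout
    .put("client.transport.nodes_sampler_interval", "5s") // how often to sample/ping the nodes
    .build();

Client client = new TransportClient(settings)
    .addTransportAddress(new InetSocketTransportAddress("127.0.0.1", 9300));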

See also

· The Setting up for Linux systems recipe in Chapter 2, Downloading and Setting Up

· The Using the native protocol recipe in Chapter 1, Getting Started

Managing indices with the native client

In the previous recipe we saw how to initialize a client to send calls to an ElasticSearch cluster. In this recipe, we will see how to manage indices via client calls.

Getting ready

You will need a working ElasticSearch cluster and a working copy of Maven.

The code of this recipe is in chapter_10/nativeclient in the code bundle, which can be downloaded from Packt's website, and on GitHub (https://github.com/aparo/elasticsearch-cookbook-second-edition). The referred class is IndicesOperations.

How to do it...

The ElasticSearch client maps all index operations under the admin.indices object of the client. Here, all the index operations are listed, such as create, delete, exists, open, close, optimize, and so on.

The following steps show how to retrieve a client and execute the main operations on indices:

1. The first step is importing the required classes:

import org.elasticsearch.action.admin.indices.exists.indices.IndicesExistsResponse;
import org.elasticsearch.client.Client;

2. Then we define an IndicesOperations class that manages the index operations:

public class IndicesOperations {
    private final Client client;

    public IndicesOperations(Client client) {
        this.client = client;
    }

3. We define a function used to check the index's existence:

    public boolean checkIndexExists(String name){
        IndicesExistsResponse response = client.admin().indices().prepareExists(name).execute().actionGet();
        return response.isExists();
    }

4. We define a function used to create an index:

    public void createIndex(String name){
        client.admin().indices().prepareCreate(name).execute().actionGet();
    }

5. We define a function used to delete an index:

    public void deleteIndex(String name){
        client.admin().indices().prepareDelete(name).execute().actionGet();
    }

6. We define a function used to close an index:

    public void closeIndex(String name){
        client.admin().indices().prepareClose(name).execute().actionGet();
    }

7. We define a function used to open an index:

    public void openIndex(String name){
        client.admin().indices().prepareOpen(name).execute().actionGet();
    }

8. We test all the previously defined functions:

    public static void main(String[] args) throws InterruptedException {
        Client client = NativeClient.createTransportClient();
        IndicesOperations io = new IndicesOperations(client);
        String myIndex = "test";
        if (io.checkIndexExists(myIndex))
            io.deleteIndex(myIndex);
        io.createIndex(myIndex);
        Thread.sleep(1000);
        io.closeIndex(myIndex);
        io.openIndex(myIndex);
        io.deleteIndex(myIndex);
    }
}

How it works...

Before executing any index operation, a client must be available (we have seen how to create one in the previous recipe).

The client has a lot of methods grouped by functionalities as follows:

· In the root (client.*), we have record operations such as index, delete records, search, and update

· Under admin.indices.*, we have index-related methods such as creating an index, deleting an index, and so on

· Under admin.cluster.*, we have cluster-related methods such as state and health

The client methods usually follow some conventions. They are listed as follows:

· Methods starting with prepare* (for example, prepareCreate) return a request builder that can be executed with the execute method

· Methods that start with a verb (for example, create) require a built request and an optional action listener

After having built the request, it can be executed with an actionGet method that can receive an optional timeout, and a response is returned.
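For example, a sketch of passing an explicit timeout to actionGet (here 10 seconds, an arbitrary value; TimeValue must be imported from org.elasticsearch.common.unit) would be:

IndicesExistsResponse response = client.admin().indices()
    .prepareExists("test")
    .execute()
    .actionGet(TimeValue.timeValueSeconds(10)); // wait at most 10 seconds for the response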

In the previous example we have seen several index calls:

· To check the existence of an index the method call is prepareExists and it returns an IndicesExistsResponse object that tells you if the index exists or not:

IndicesExistsResponse response = client.admin().indices().prepareExists(name).execute().actionGet();

return response.isExists();

· To create an index with the prepareCreate call:

client.admin().indices().prepareCreate(name).execute().actionGet();

· To close an index with the prepareClose call:

client.admin().indices().prepareClose(name).execute().actionGet();

· To open an index with the prepareOpen call:

client.admin().indices().prepareOpen(name).execute().actionGet();

· To delete an index with the prepareDelete call:

client.admin().indices().prepareDelete(name).execute().actionGet();

Tip

In the code, we have put a delay of 1 second (Thread.sleep(1000)) to avoid executing actions on indices too quickly, because shard allocation is asynchronous and requires some milliseconds to complete. The best practice is not to use a similar hack, but to poll an index's state before performing further operations, and to perform those operations only when it goes green.
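A minimal sketch of such a check, using the cluster health API to wait until the index reaches at least the yellow status (TimeValue import assumed, as before), could be:

client.admin().cluster().prepareHealth(myIndex)
    .setWaitForYellowStatus()                      // block until the index shards are allocated
    .setTimeout(TimeValue.timeValueSeconds(30))    // give up after 30 seconds
    .execute().actionGet();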

See also

· The Creating an index, Deleting an index, and Opening/closing an index recipes in Chapter 4, Basic Operations

Managing mappings

After creating an index, the next step is to add some mapping to it. We have already seen how to apply a mapping via the REST API in Chapter 4, Basic Operations. In this recipe, we will see how to manage mappings via a native client.

Getting ready

You will need a working ElasticSearch cluster and a working copy of Maven.

The code of this recipe is in chapter_10/nativeclient in the code bundle of this book, available on Packt's website, and on GitHub (https://github.com/aparo/elasticsearch-cookbook-second-edition). The referred class is MappingOperations.

How to do it...

The following steps show how to add a mytype mapping to a mytest index via the native client:

1. We import the required classes:

import org.elasticsearch.action.admin.indices.mapping.put.PutMappingResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.xcontent.XContentBuilder;

import java.io.IOException;

import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;

2. We define a class to contain our code and to initialize the client and the index:

public class MappingOperations {

    public static void main(String[] args) {
        String index = "mytest";
        String type = "mytype";
        Client client = NativeClient.createTransportClient();
        IndicesOperations io = new IndicesOperations(client);
        if (io.checkIndexExists(index))
            io.deleteIndex(index);
        io.createIndex(index);

3. We prepare the JSON mapping to put in the index:

        XContentBuilder builder = null;
        try {
            builder = jsonBuilder().
                    startObject().
                    field("type1").
                    startObject().
                    field("properties").
                    startObject().
                    field("nested1").
                    startObject().
                    field("type").
                    value("nested").
                    endObject().endObject().endObject().
                    endObject();

4. We put the mapping in the index:

            PutMappingResponse response = client.admin().indices().preparePutMapping(index).setType(type).setSource(builder).execute().actionGet();
            if (!response.isAcknowledged()) {
                System.out.println("Something strange happens");
            }
        } catch (IOException e) {
            e.printStackTrace();
            System.out.println("Unable to create mapping");
        }

5. We delete the mapping in the index and remove the index:

        client.admin().indices().prepareDeleteMapping(index).setType(type).execute().actionGet();
        io.deleteIndex(index);
    }
}

How it works...

Before executing a mapping operation, a client must be available and the index must be created. In the previous example, if the index exists, it's deleted and recreated as a new one, so we are sure to start from scratch:

Client client = NativeClient.createTransportClient();
IndicesOperations io = new IndicesOperations(client);
if (io.checkIndexExists(index))
    io.deleteIndex(index);
io.createIndex(index);

Now that we have a fresh index to put the mapping in, we need to create the mapping. As every standard object in ElasticSearch is a JSON object, ElasticSearch provides a convenient way to create JSON programmatically: XContentFactory.jsonBuilder. To use it, you need to add the following imports to your Java file:

import org.elasticsearch.common.xcontent.XContentBuilder;

import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;

The builder returned by jsonBuilder allows building JSON programmatically. It is the Swiss Army knife of JSON generation in ElasticSearch and it has a lot of methods. These methods always return the builder, so they can be easily chained. The most important ones are:

· startObject() and startObject(name): Here name is the name of the JSON object. It defines a JSON object. The object must be closed with endObject().

· field(name) or field(name, value): Here name must always be a string, and value must be a valid value that can be converted to JSON. It's used to define a field in the JSON object.

· value(value): Here value must be a valid value that can be converted to JSON. It defines a single value in a field.

· startArray() and startArray(name): Here name is the name of the JSON array. It defines a JSON array that must be ended with an endArray().

Generally in ElasticSearch every method that accepts a JSON object as a parameter also accepts a JSON builder.
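For example, a small sketch that uses these methods to build a document with a simple field and an array (the field names and values here are purely illustrative; the builder methods throw IOException, so this must run inside a try/catch or a method declaring it) would be:

XContentBuilder doc = jsonBuilder()
    .startObject()
        .field("title", "ElasticSearch Cookbook")   // a simple field
        .startArray("tags")                         // an array of values
            .value("search")
            .value("java")
        .endArray()
    .endObject();

System.out.println(doc.string()); // {"title":"ElasticSearch Cookbook","tags":["search","java"]}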

Now that we have the mapping in the builder, we need to call the put mapping API. This API is in the client.admin().indices() namespace and you need to define the index, the type, and the mapping to execute this call, as follows:

PutMappingResponse response=client.admin().indices().preparePutMapping(index).setType(type).setSource(builder).execute().actionGet();

If everything is OK, you can check the status via response.isAcknowledged(); it must be true, otherwise an error is raised.

If you need to update a mapping, you need to execute the same call, but in the mapping put only the fields that you need to add.

To delete a mapping, you need to call the delete mapping API. It requires the index and the type to be deleted. In the previous example, the previously created mapping is deleted using the following code:

client.admin().indices().prepareDeleteMapping(index).setType(type).execute().actionGet();

There's more

There is another important call used in managing mappings: the get mapping API. The call is similar to the delete one, and it returns a GetMappingsResponse object:

GetMappingsResponse response = client.admin().indices().prepareGetMappings(index).setTypes(type).execute().actionGet();

The response contains the mapping information. The data returned is structured as a map of indices, each containing a map of mappings keyed by type name, whose values are MappingMetaData objects.

The MappingMetaData is an object that contains all the mapping information and all the sections that we discussed in Chapter 4, Basic Operations.
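A minimal sketch of navigating that structure (reusing the response from the call above; MappingMetaData is assumed to be imported from org.elasticsearch.cluster.metadata) could be:

// index -> (type -> MappingMetaData)
MappingMetaData metaData = response.getMappings().get(index).get(type);
try {
    System.out.println(metaData.sourceAsMap()); // the mapping as a Map
} catch (IOException e) {
    e.printStackTrace();
}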

See also

· The Putting a mapping in an index, Getting a mapping, and Deleting a mapping recipes in Chapter 4, Basic Operations

Managing documents

The native APIs for managing documents (index, delete, and update) are the most important after the search ones. In this recipe, we will see how to use them. In the next recipe we will proceed to bulk actions to improve performance.

Getting ready

You will need a working ElasticSearch cluster and a working copy of Maven.

The code of this recipe is in chapter_10/nativeclient in the code bundle of this chapter, present on Packt's website, and on GitHub (https://github.com/aparo/elasticsearch-cookbook-second-edition). The referred class is DocumentOperations.

How to do it...

For managing documents, we will perform the following operations:

· We'll execute all the document CRUD operations (Create, Read, Update, Delete) via the native client using the following code:

import org.elasticsearch.action.delete.DeleteResponse;
import org.elasticsearch.action.get.GetResponse;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.action.update.UpdateResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.xcontent.XContentFactory;
import org.elasticsearch.script.ScriptService;

import java.io.IOException;

public class DocumentOperations {

    public static void main(String[] args) {
        String index = "mytest";
        String type = "mytype";
        Client client = NativeClient.createTransportClient();
        IndicesOperations io = new IndicesOperations(client);
        if (io.checkIndexExists(index))
            io.deleteIndex(index);
        try {
            client.admin().indices().prepareCreate(index)
                    .addMapping(type, XContentFactory.jsonBuilder()
                            .startObject()
                            .startObject(type)
                            .startObject("_timestamp").field("enabled", true).field("store", "yes").endObject()
                            .startObject("_ttl").field("enabled", true).field("store", "yes").endObject()
                            .endObject()
                            .endObject())
                    .execute().actionGet();
        } catch (IOException e) {
            System.out.println("Unable to create mapping");
        }
        // We index a document
        IndexResponse ir = client.prepareIndex(index, type, "2").setSource("text", "unicorn").execute().actionGet();
        System.out.println("Version: " + ir.getVersion());
        // We get a document
        GetResponse gr = client.prepareGet(index, type, "2").execute().actionGet();
        System.out.println("Version: " + gr.getVersion());
        // We update a document
        UpdateResponse ur = client.prepareUpdate(index, type, "2").setScript("ctx._source.text = 'v2'", ScriptService.ScriptType.INLINE).execute().actionGet();
        System.out.println("Version: " + ur.getVersion());
        // We delete a document
        DeleteResponse dr = client.prepareDelete(index, type, "2").execute().actionGet();
        io.deleteIndex(index);
    }
}

· The result will be:

Aug 24, 2014 13:58:21 PM org.elasticsearch.plugins
INFO: [Masked Rose] loaded [], sites []
Version: 1
Version: 1
Version: 2

The document version is always incremented by 1 after an update action is performed or if the document is re-indexed with the new changes.

How it works...

Before executing a document action, a client and the index must be available and a document mapping should be created (the mapping is optional, because it can be inferred from the indexed document).

To index a document via the native client, the prepareIndex method is used. It requires the index and the type to be passed as arguments. If an ID is provided, it will be used; otherwise a new one will be created. In the previous example, we have passed the source in the form of (key, value), but many forms are available to pass as a source (a small sketch using the builder form follows this list). They are:

· A JSON string: "{field:value}"

· A string and a value (from one to four pairs): field1, value1, field2, value2, field3, value3, field4, value4

· A builder: jsonBuilder().startObject().field(field,value).endObject()

· A byte array
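For example, a sketch of the same index call using a JSON builder as the source (this assumes the static import of jsonBuilder shown earlier and a surrounding try/catch for the IOException thrown by the builder) would be:

IndexResponse ir = client.prepareIndex(index, type, "2")
    .setSource(jsonBuilder().startObject().field("text", "unicorn").endObject())
    .execute().actionGet();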

Obviously it's possible to add all the parameters that we have seen in the Indexing a document recipe in Chapter 4, Basic Operations, such as parent, routing, and so on. In the previous example, the call was:

IndexResponse ir=client.prepareIndex(index, type, "2").setSource("text", "unicorn").execute().actionGet();

The return value (IndexResponse) can be used in several ways:

· Checking whether the document was successfully indexed or not

· Getting the ID of the indexed document, if it was not provided during the index action

· Retrieving the document version

To retrieve a document, you need to know the index, type, and ID, and the client method is prepareGet. It requires the usual triplet (index, type, ID), but a lot of other methods are also available to control the routing (such as routing and parent) or the returned fields, as we have seen in the Getting a document recipe in Chapter 4, Basic Operations. In the previous example, the call was:

GetResponse gr=client.prepareGet(index, type, "2").execute().actionGet();

The return type (GetResponse) contains the request information and, if the document exists, the document information (source, version, index, type, and ID).

To update a document, it's required to know the index, type, and ID, and provide a script or a document to be used for the update. The client method is prepareUpdate. In the previous example, the code was:

UpdateResponse ur = client.prepareUpdate(index, type, "2").setScript("ctx._source.text = 'v2'" , ScriptService.ScriptType.INLINE).execute().actionGet();

The script code must be a string. If the script language is not defined, the default (Groovy) is used.

The returned response contains information about the execution and the new version value to manage concurrency.
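For example, a sketch of optimistic concurrency control using that version (setVersion makes the write fail with a version conflict if the document changed in the meantime; the field value here is illustrative) could be:

long version = ur.getVersion();
client.prepareIndex(index, type, "2")
    .setSource("text", "v3")
    .setVersion(version)   // fails if someone else modified the document after our update
    .execute().actionGet();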

To delete a document (without needing to execute a query), we must know the index, type, and ID, and we can use the client method prepareDelete to create a delete request. In the previous code, we have used:

DeleteResponse dr = client.prepareDelete(index, type, "2").execute().actionGet();

The delete request allows passing all the parameters that we have seen in the Deleting a document recipe in Chapter 4, Basic Operations to control routing and the version.

See also

· The Indexing a document, Getting a document, Deleting a document, and Updating a document recipes in Chapter 4, Basic Operations

Managing bulk actions

Executing atomic operations on items via single calls is often a bottleneck if you need to index or delete thousands or millions of records. The best practice in this case is to execute a bulk action. We have discussed bulk actions via the REST API in the Speeding up atomic operations (bulk operations) recipe in Chapter 4, Basic Operations.

Getting ready

You will need a working ElasticSearch cluster and Maven installed.

The code of this recipe is in chapter_10/nativeclient in the code bundle of this chapter, which is available on Packt's website, and on GitHub (https://github.com/aparo/elasticsearch-cookbook-second-edition). The referred class is BulkOperations.

How to do it...

To manage a bulk action, we will perform the following actions:

· We'll execute a bulk action adding 1,000 documents, updating them, and deleting them as follows:

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.xcontent.XContentFactory;
import org.elasticsearch.script.ScriptService;

import java.io.IOException;

public class BulkOperations {

    public static void main(String[] args) {
        String index = "mytest";
        String type = "mytype";
        Client client = NativeClient.createTransportClient();
        IndicesOperations io = new IndicesOperations(client);
        if (io.checkIndexExists(index))
            io.deleteIndex(index);
        try {
            client.admin().indices().prepareCreate(index)
                    .addMapping(type, XContentFactory.jsonBuilder()
                            .startObject()
                            .startObject(type)
                            .startObject("_timestamp").field("enabled", true).field("store", "yes").endObject()
                            .startObject("_ttl").field("enabled", true).field("store", "yes").endObject()
                            .endObject()
                            .endObject())
                    .execute().actionGet();
        } catch (IOException e) {
            System.out.println("Unable to create mapping");
        }
        BulkRequestBuilder bulker = client.prepareBulk();
        for (Integer i = 1; i <= 1000; i++) {
            bulker.add(client.prepareIndex(index, type, i.toString()).setSource("position", i.toString()));
        }
        System.out.println("Number of actions for index: " + bulker.numberOfActions());
        bulker.execute().actionGet();

        bulker = client.prepareBulk();
        for (Integer i = 1; i <= 1000; i++) {
            bulker.add(client.prepareUpdate(index, type, i.toString()).setScript("ctx._source.position += 2", ScriptService.ScriptType.INLINE));
        }
        System.out.println("Number of actions for update: " + bulker.numberOfActions());
        bulker.execute().actionGet();

        bulker = client.prepareBulk();
        for (Integer i = 1; i <= 1000; i++) {
            bulker.add(client.prepareDelete(index, type, i.toString()));
        }
        System.out.println("Number of actions for delete: " + bulker.numberOfActions());
        bulker.execute().actionGet();

        io.deleteIndex(index);
    }
}

· The result will be:

Number of actions for index: 1000
Number of actions for update: 1000
Number of actions for delete: 1000

How it works...

Before executing these bulk actions, a client must be available and the index and document mapping must be created (the mapping is optional).

We can consider bulkBuilder as a collector of different actions:

· IndexRequest or IndexRequestBuilder

· UpdateRequest or UpdateRequestBuilder

· DeleteRequest or DeleteRequestBuilder

· A bulk formatted array of bytes.

Generally, when used in code, we can consider it as a list to which we add actions of the supported types.

To initialize bulkBuilder we use:

BulkRequestBuilder bulker=client.prepareBulk();

In the previous example we have added 1,000 index actions (the IndexBuilder is similar to the previous recipe):

for (Integer i=1; i<=1000; i++){

bulker.add(client.prepareIndex(index, type, i.toString()).setSource("position", i.toString()));

}

After having added all the actions, we can print the number of actions and then execute them:

System.out.println("Number of action: " + bulker.numberOfActions());

bulker.execute().actionGet();

After having executed the bulk, we create a new bulk builder and populate it with 1,000 update actions:

for (Integer i=1; i<=1000; i++){

bulker.add(client.prepareUpdate(index, type, i.toString()).setScript("ctx._source.position += 2" , ScriptService.ScriptType.INLINE));

}

After having added all the update actions, we can execute them in a bulk as follows:

bulker.execute().actionGet();

Then, the same steps are repeated for the delete actions:

for (Integer i=1; i<=1000; i++){

bulker.add(client.prepareDelete(index, type, i.toString()));

}

To commit the delete operation, we need to execute the bulk.

Note

In this example, to keep the bulk operations simple, I have created each bulk with the same type of actions but, as described previously, you can mix any of the supported action types in the same bulk operation.
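For example, a sketch of a single bulk mixing an index, an update, and a delete action (reusing the index, type, client, and script style from the recipe; the document IDs are illustrative) could be:

BulkRequestBuilder mixed = client.prepareBulk();
mixed.add(client.prepareIndex(index, type, "1001").setSource("position", "1001"));
mixed.add(client.prepareUpdate(index, type, "1").setScript("ctx._source.position += 2", ScriptService.ScriptType.INLINE));
mixed.add(client.prepareDelete(index, type, "2"));
mixed.execute().actionGet();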

See also

· The Speeding up atomic operations (bulk operations) recipe in Chapter 4, Basic Operations

Building a query

Before a search, a query must be built and ElasticSearch provides several ways to build these queries. In this recipe, we will see how to create a query object via QueryBuilder and via simple strings.

Getting ready

You will need a working ElasticSearch cluster and a working copy of Maven. The code of this recipe is in chapter_10/nativeclient in the code bundle available on Packt's website, and on GitHub (https://github.com/aparo/elasticsearch-cookbook-second-edition). The referred class is QueryCreation.

How to do it...

To create a query, we will perform the following steps:

1. There are several ways to define a query in ElasticSearch and they are interchangeable. Generally a query can be defined as a combination of the following components:

· QueryBuilder: This is a helper to build a query.

· XContentBuilder: This is a helper to create JSON code. We have discussed it in the Managing mappings recipe in this chapter. The JSON to be generated is the same as that used in the REST calls, but built programmatically.

· Array of bytes or string: In this case, it's usually the JSON to be executed as we have seen in REST calls.

· Map: It contains the query and the value of the query.

2. We'll create a query using QueryBuilder and execute a search (searching via the native API will be discussed in the next recipe):

… truncated …
import org.elasticsearch.common.xcontent.XContentFactory;
import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.index.query.RangeQueryBuilder;
import org.elasticsearch.index.query.TermFilterBuilder;

import java.io.IOException;

import static org.elasticsearch.index.query.QueryBuilders.*;
import static org.elasticsearch.index.query.FilterBuilders.*;

public class QueryCreation {

    public static void main(String[] args) {
        String index = "mytest";
        … truncated …
        BulkRequestBuilder bulker = client.prepareBulk();
        for (Integer i = 1; i < 1000; i++) {
            bulker.add(client.prepareIndex(index, type, i.toString()).setSource("text", i.toString(), "number1", i + 1, "number2", i % 2));
        }
        bulker.execute().actionGet();
        client.admin().indices().prepareRefresh(index).execute().actionGet();

        TermFilterBuilder filter = termFilter("number2", 1);
        RangeQueryBuilder range = rangeQuery("number1").gt(500);
        BoolQueryBuilder bool = boolQuery().must(range);
        QueryBuilder query = filteredQuery(bool, filter);

        SearchResponse response = client.prepareSearch(index).setTypes(type).setQuery(query).execute().actionGet();
        System.out.println("Matched records of elements: " + response.getHits().getTotalHits());
        io.deleteIndex(index);
    }
}

I have removed the redundant parts that are similar to the example in the previous recipe.

3. The result will be:

Matched records of elements: 250

How it works...

In the preceding example, we created a query via QueryBuilder. The first step is to import the query builder from the namespace:

import static org.elasticsearch.index.query.QueryBuilders.*;

But we also need the filter builders and, to import them, use the following line of code:

import static org.elasticsearch.index.query.FilterBuilders.*;

The query used in the example is a filtered query composed of a Boolean query and a term filter. The goal of the example is to show how to mix several query/filter types to create a complex query.

The Boolean query contains a must clause with a range query. Use the following code to create the range query:

RangeQueryBuilder range = rangeQuery("number1").gt(500);

This range query matches all the values of the number1 field that are greater than 500.

After having created the range query, we can add it to a Boolean query in the must block:

BoolQueryBuilder bool = boolQuery().must(range);

In real-world complex queries, you can have a lot of nested queries in a Boolean query or filter.

To build our filtered query, we need to define a filter. In this case we have used a term filter, which is one of the most used filters:

TermFilterBuilder filter = termFilter("number2", 1);

The termFilter method accepts a field name and a value, which must be a valid ElasticSearch type. The preceding code is equivalent to the JSON/REST fragment {"term": {"number2": 1}}.

Now, we can build the final filtered query that we can execute in the search:

QueryBuilder query = filteredQuery(bool, filter);

Tip

Before executing a query and to be sure not to miss any results, the index must be refreshed. In the example, it's done with the help of the following code: client.admin().indices().prepareRefresh(index).execute().actionGet();

There's more

The possible native queries/filters are the same as the REST ones and have the same parameters but the only difference is that they are accessible via builder methods.

The most common query builders are:

· matchAllQuery: This allows matching of all the documents.

· matchQuery and matchPhraseQuery: These are used to match against text strings.

· termQuery and termsQuery: These are used to match term value(s) against a specific field.

· boolQuery: This is used to aggregate other queries with Boolean logic.

· idsQuery: This is used to match a list of ids.

· fieldQuery: This is used to match a field with a text.

· wildcardQuery: This is used to match terms with wildcards (*,?).

· regexpQuery: This is used to match terms via a regular expression.

· Span query family (spanTermQuery, spanOrQuery, spanNotQuery, spanFirstQuery, spanNearQuery, and so on): These are a few examples of the span query family. They are used in building a span query.

· filteredQuery: In this, the query is combined with a filter where the filter applies first.

· constantScoreQuery: This accepts a query or a filter and all the matched documents are set with the same score.

· moreLikeThisQuery and fuzzyLikeThisQuery: These are used to retrieve similar documents.

· hasChildQuery, hasParentQuery, and nestedQuery: These are used in managing related documents.

The preceding list is not complete, because it is evolving during the life of ElasticSearch. New query types are added to cover new search cases or they are occasionally renamed such as Text Query to Match Query.

Similar to the query builders, there are a lot of query filters, explained as follows:

· matchAllFilter: This matches all the documents

· termFilter and termsFilter: These are used to filter given value(s)

· idsFilter: This is used to filter a list of ids

· typeFilter: This is used to filter all the documents of the same type

· andFilter, orFilter, and notFilter: These are used to build Boolean filters

· wildcardFilter: This is used to filter terms with wildcards (*,?)

· regexpFilter: This is used to filter terms via a regular expression

· rangeFilter: This is used to filter using a range

· scriptFilter: This is used to filter documents using the scripting engine

· geoDistanceFilter, geoBoundingBoxFilter, and other geo filters: These provide geo filtering of documents

· boolFilter: This is used to create a Boolean filter that aggregates other filters

See also

· The Querying/filtering for a single term recipe in Chapter 5, Search, Queries, and Filters

Executing a standard search

In the previous recipe, we saw how to build a query. In this recipe we can execute this query to retrieve some documents.

Getting ready

You will need a working ElasticSearch cluster and a working copy of Maven.

The code of this recipe is in chapter_10/nativeclient, in the code bundle placed on Packt's website, and on GitHub (https://github.com/aparo/elasticsearch-cookbook-second-edition). The referred class is QueryExample.

How to do it...

To execute a standard query, we will perform the following steps:

1. After having created a query, it is enough to pass it to the prepareSearch call in order to execute it. Here is a complete example:

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.search.SearchHit;

import static org.elasticsearch.index.query.FilterBuilders.*;
import static org.elasticsearch.index.query.QueryBuilders.*;

public class QueryExample {

    public static void main(String[] args) {
        String index = "mytest";
        String type = "mytype";
        QueryHelper qh = new QueryHelper();
        qh.populateData(index, type);
        Client client = qh.getClient();
        QueryBuilder query = filteredQuery(boolQuery().must(rangeQuery("number1").gte(500)), termFilter("number2", 1));
        SearchResponse response = client.prepareSearch(index).setTypes(type)
                .setQuery(query).addHighlightedField("name")
                .execute().actionGet();
        if (response.status().getStatus() == 200) {
            System.out.println("Matched number of documents: " + response.getHits().totalHits());
            System.out.println("Maximum score: " + response.getHits().maxScore());
            for (SearchHit hit : response.getHits().getHits()) {
                System.out.println("hit: " + hit.getIndex() + ":" + hit.getType() + ":" + hit.getId());
            }
        }
        qh.dropIndex(index);
    }
}

2. The result should be similar to this one:

Matched number of documents: 251
Maximum score: 1.0
hit: mytest:mytype:505
hit: mytest:mytype:517
hit: mytest:mytype:529
hit: mytest:mytype:531
hit: mytest:mytype:543
hit: mytest:mytype:555
hit: mytest:mytype:567
hit: mytest:mytype:579
hit: mytest:mytype:581
hit: mytest:mytype:593

How it works...

The call to execute a search is prepareSearch and it returns SearchResponse:

import org.elasticsearch.action.search.SearchResponse;

….

SearchResponse response = client.prepareSearch(index).setTypes(type).setQuery(query).execute().actionGet();

The search call has a lot of methods to allow setting all the parameters that we have already seen in the Executing a search recipe in Chapter 5, Search, Queries, and Filters. The most used methods are:

· setIndices: This allows defining the indices to be used.

· setTypes: This allows defining the document types to be used.

· setQuery: This allows setting the query to be executed.

· addField(s): This allows setting fields to be returned (used to reduce the bandwidth by returning only the needed fields).

· addAggregation: This allows adding aggregations to be computed.

· addFacet (Deprecated): This allows adding facets to be computed.

· addHighlightedField: This allows defining fields whose highlighted fragments are returned with the results. The simplest case is to highlight a field name as follows:

.addHighlightedField("name")

· addScriptField: This allows returning a scripted field. A scripted field is a field computed by server-side scripting using one of the available scripting languages. For example :

· Map<String, Object> params = MapBuilder.<String, Object>newMapBuilder().put("factor", 2.0).map();

.addScriptField("sNum1", "doc['num1'].value * factor", params)

After having executed a search, a response object is returned.

It's good practice to check if the search is successful or not, by checking the returned status and, optionally, the number of hits. If the search was executed correctly, then the return status is 200.

if(response.status().getStatus()==200)

The response object contains a lot of sections that we have analyzed in the Executing a Search recipe in Chapter 5, Search, Queries, and Filters. The most important one is the hits section that contains our results. The main methods that access this section are:

· totalHits: This allows obtaining the total number of results:

System.out.println("Matched number of documents: " + response.getHits().totalHits());

· maxScore: This gives the maximum score of the documents. It is the same score value of the first SearchHit method:

System.out.println("Maximum score: " + response.getHits().maxScore());

· hits: This is SearchHit array that contains the results, if available.

The SearchHit is the result object. It has a lot of methods, of which the most important ones are:

· index(): This is the index that contains the document.

· type(): This is the type of the document.

· id(): This is the ID of the document.

· score(): This is the query score of this document, if available.

· version(): This is the version of the document, if available.

· source(), sourceAsString(), sourceAsMap(), and so on: These return the source of the document in different forms, if available.

· explanation(): If available (required in the search), it contains the query explanation.

· fields, field(String name): These return the fields requested if fields are passed to search the object.

· sortValues(): This is the value/values used to sort the record. It's only available if sort is specified during the search phase.

· shard(): This is the shard of the search hit. This value is very important in the case of custom routing.

In the following example, we have printed only the index, type, and ID of each hit.

for(SearchHit hit: response.getHits().getHits()){

System.out.println("hit: "+hit.getIndex()+":"+hit.getType()+":"+hit.getId());

}

Tip

The number of returned hits, if not defined, is limited to 10. To retrieve more hits you need to define a larger value in the size method or paginate using the from method.
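For example, a sketch of retrieving the second page of 50 results (reusing the index, type, and query from the recipe) would be:

SearchResponse page2 = client.prepareSearch(index).setTypes(type)
    .setQuery(query)
    .setFrom(50)   // skip the first 50 hits
    .setSize(50)   // return at most 50 hits
    .execute().actionGet();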

See also

· The Executing a search recipe in Chapter 5, Search, Queries, and Filters

Executing a search with aggregations

The previous recipe can be extended to support aggregations and to retrieve analytics on indexed data.

Getting ready

You will need a working ElasticSearch cluster and a working copy of Maven.

The code of this recipe is in chapter_10/nativeclient folder in the code bundle of this chapter available on Packt's website, and on GitHub (https://github.com/aparo/elasticsearch-cookbook-second-edition). The referred class is AggregationExample.

How to do it...

To execute a search with aggregations, we will perform the following steps:

1. We'll calculate two different aggregations (terms and extended statistics) as follows:

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.search.aggregations.AggregationBuilder;
import org.elasticsearch.search.aggregations.bucket.terms.Terms;
import org.elasticsearch.search.aggregations.metrics.stats.extended.ExtendedStats;
import org.elasticsearch.search.aggregations.metrics.stats.extended.ExtendedStatsBuilder;

import static org.elasticsearch.index.query.QueryBuilders.matchAllQuery;
import static org.elasticsearch.search.aggregations.AggregationBuilders.*;

public class AggregationExample {

    public static void main(String[] args) {
        String index = "mytest";
        String type = "mytype";
        QueryHelper qh = new QueryHelper();
        qh.populateData(index, type);
        Client client = qh.getClient();
        AggregationBuilder aggsBuilder = terms("tag").field("tag");
        ExtendedStatsBuilder aggsBuilder2 = extendedStats("number1").field("number1");
        SearchResponse response = client.prepareSearch(index).setTypes(type)
                .setQuery(matchAllQuery())
                .addAggregation(aggsBuilder)
                .addAggregation(aggsBuilder2)
                .execute().actionGet();
        if (response.status().getStatus() == 200) {
            System.out.println("Matched number of documents: " + response.getHits().totalHits());
            Terms termsAggs = response.getAggregations().get("tag");
            System.out.println("Aggregation name: " + termsAggs.getName());
            System.out.println("Aggregation total: " + termsAggs.getBuckets().size());
            for (Terms.Bucket entry : termsAggs.getBuckets()) {
                System.out.println(" - " + entry.getKey() + " " + entry.getDocCount());
            }
            ExtendedStats extStats = response.getAggregations().get("number1");
            System.out.println("Aggregation name: " + extStats.getName());
            System.out.println("Count: " + extStats.getCount());
            System.out.println("Min: " + extStats.getMin());
            System.out.println("Max: " + extStats.getMax());
            System.out.println("Standard Deviation: " + extStats.getStdDeviation());
            System.out.println("Sum of Squares: " + extStats.getSumOfSquares());
            System.out.println("Variance: " + extStats.getVariance());
        }
        qh.dropIndex(index);
    }
}

2. The result should be similar to this:

Aug 24, 2014 4:07:43 PM org.elasticsearch.plugins
INFO: [Legion] loaded [], sites []
Matched number of documents: 1000
Aggregation name: tag
Aggregation total: 4
 - nice 264
 - bad 257
 - amazing 247
 - cool 232
Aggregation name: number1
Count: 1000
Min: 2.0
Max: 1001.0
Standard Deviation: 288.6749902572095
Sum of Squares: 3.348355E8
Variance: 83333.25

How it works...

The search part is similar to the previous example. In this case we have used a matchAllQuery, which matches all the documents. To execute an aggregation, first you need to create it. There are three ways to do so:

· Using a string that maps a JSON object

· Using XContentBuilder that will be used to produce a JSON

· Using AggregationBuilder

The first two ways are trivial; the third one requires the builders to be imported:

import static org.elasticsearch.search.aggregations.AggregationBuilders.*;

There are several types of aggregation, as we have already seen in Chapter 6, Aggregations. The first one, which we have created with AggregationBuilder, is a Terms one that collects and counts all terms occurrences in buckets:

AggregationBuilder aggsBuilder = terms("tag").field("tag");

The required value for every aggregation is the name passed in the builder constructor. In the case of a terms aggregation, the field is also required to be able to process the request. (There are a lot of other parameters; see the Executing the terms aggregation recipe in Chapter 6, Aggregations for full details.)

The second aggregationBuilder that we have created is an extended statistical one based on the number1 numeric field:

ExtendedStatsBuilder aggsBuilder2 = extendedStats("number1").field("number1");

Now that we have created aggregationBuilders, we can add them on a search method via the addAggregation method:

SearchResponse response = client.prepareSearch(index).setTypes(type)
    .setQuery(matchAllQuery())
    .addAggregation(aggsBuilder)
    .addAggregation(aggsBuilder2)
    .execute().actionGet();

Now the response holds information about our aggregations. To access them we need to use the getAggregations method of the response.

The aggregations results are contained in a hash-like structure and you can retrieve them with the names that you have previously defined in the request.

To retrieve the first aggregation results we need to execute the following code:

Terms termsAggs = response.getAggregations().get("tag");

Now that we have an aggregation result of type Terms (see the Executing the terms aggregations recipe in Chapter 6, Aggregations), we can get the aggregation properties and iterate over the buckets:

System.out.println("Aggregation name: " + termsAggs.getName());

System.out.println("Aggregation total: " + termsAggs.getBuckets().size());

for (Terms.Bucket entry : termsAggs.getBuckets()) {

System.out.println(" - " + entry.getKey() + " " + entry.getDocCount());

}
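If you are interested in a single bucket only, you can fetch it directly by key instead of iterating; a small sketch, assuming the getBucketByKey helper of the 1.x Terms interface ("nice" is just an example key from our test data):

// Sketch: direct lookup of a single bucket by key
Terms.Bucket niceBucket = termsAggs.getBucketByKey("nice");
if (niceBucket != null) {
    System.out.println("nice -> " + niceBucket.getDocCount());
}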

To retrieve the second aggregation result, because the result is of type ExtendedStats, you need to cast to it as follows:

ExtendedStats extStats = response.getAggregations().get("number1");

Now you can access the result properties of this kind of aggregation:

System.out.println("Aggregation name: " + extStats.getName());

System.out.println("Count: " + extStats.getCount());

System.out.println("Min: " + extStats.getMin());

System.out.println("Max: " + extStats.getMax());

System.out.println("Standard Deviation: " + extStats.getStdDeviation());

System.out.println("Sum of Squares: " + extStats.getSumOfSquares());

System.out.println("Variance: " + extStats.getVariance());

Using aggregations with a native client is quite easy; you only need to pay attention to the returned aggregation type to execute the correct type cast to access your results.
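If the concrete type is not known at compile time (for example, when the aggregation definition is loaded from configuration), a defensive instanceof check before the cast avoids a ClassCastException at runtime. A minimal sketch:

import org.elasticsearch.search.aggregations.Aggregation;

// Sketch: verify the runtime type before casting the aggregation result
Aggregation agg = response.getAggregations().get("number1");
if (agg instanceof ExtendedStats) {
    ExtendedStats stats = (ExtendedStats) agg;
    System.out.println("Avg: " + stats.getAvg());
} else {
    System.out.println("Unexpected aggregation type: " + agg.getClass().getName());
}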

See also

· The Executing the terms aggregations and Executing the stats aggregations recipes in Chapter 6, Aggregations

Executing a scroll/scan search

Pagination with a standard query works very well if you are matching documents that do not change too often; otherwise, doing pagination with live data returns unpredictable results. To bypass this problem, ElasticSearch provides an extra parameter in the query called scroll.

Getting ready

You will need a working ElasticSearch cluster and a working copy of Maven.

The code of this recipe is in chapter_10/nativeclient in the code bundle, present on Packt's website and on GitHub (https://github.com/aparo/elasticsearch-cookbook-second-edition). The referred class is ScrollScanQueryExample.

How to do it...

The search is done in the same way as in the previous recipe. The main difference is the setScroll timeout, which tells the server to keep the IDs of the query results in memory for the given period.

We can change the code of the previous recipe to use scroll in the following way:

import org.elasticsearch.action.search.SearchResponse;

import org.elasticsearch.action.search.SearchType;

import org.elasticsearch.client.Client;

import org.elasticsearch.common.unit.TimeValue;

import org.elasticsearch.index.query.QueryBuilder;

import static org.elasticsearch.index.query.FilterBuilders.termFilter;

import static org.elasticsearch.index.query.QueryBuilders.*;

public class ScrollScanQueryExample {

public static void main(String[] args) {

String index = "mytest";

String type = "mytype";

QueryHelper qh = new QueryHelper();

qh.populateData(index, type);

Client client=qh.getClient();

QueryBuilder query = filteredQuery(boolQuery().must(rangeQuery("number1").gte(500)), termFilter("number2", 1));

SearchResponse response = client.prepareSearch(index).setTypes(type)

.setQuery(query).setScroll(TimeValue.timeValueMinutes(2))

.execute().actionGet();

// do something with response.getHits()

while(response.getHits().hits().length!=0){

// do something with response.getHits()

//your code here

//next scroll

response = client.prepareSearchScroll(response.getScrollId()).setScroll(TimeValue.timeValueMinutes(2)).execute().actionGet();

}

SearchResponse searchResponse = client.prepareSearch()

.setSearchType(SearchType.SCAN)

.setQuery(matchAllQuery())

.setSize(100)

.setScroll(TimeValue.timeValueMinutes(2))

.execute().actionGet();

while (true) {

searchResponse = client.prepareSearchScroll(searchResponse.getScrollId()).setScroll(TimeValue.timeValueMinutes(2)).execute().actionGet();

// do something with searchResponse.getHits() if any

if (searchResponse.getHits().hits().length == 0) {

break;

}

}

qh.dropIndex(index);

}

}

How it works...

To use scrolling, it's enough to call the setScroll method with a timeout on the search request. When using scrolling, some behaviors must be considered:

· The timeout defines the time slice for which the ElasticSearch server stores the results. Asking for a scroll after the timeout has expired results in the server returning an error, so be careful not to set a timeout that is too short.

· A scroll consumes memory until it ends or its timeout expires. Setting a large timeout without consuming the data results in unnecessary memory usage, and keeping many scrolls open consumes memory proportional to the number of IDs and their related data (score, order, and so on) in the results.

· When scrolling, it's not possible to paginate the documents, as there is no start offset; scrolling is designed to fetch consecutive results.

So a standard search is changed to a scroll in this way:

SearchResponse response = client.prepareSearch(index).setTypes(type).setQuery(query).setScroll(TimeValue.timeValueMinutes(2)).execute().actionGet();

The response contains the same results as a standard search plus a scroll ID, which is required to fetch the next batch of results.

To execute the scroll you need to call the prepareSearchScroll client method with a scroll ID and a new timeout. In the following example, we process all the result documents:

while(response.getHits().hits().length!=0){

// do something with response.getHits()

//your code here

//next scroll

response = client.prepareSearchScroll(response.getScrollId()).setScroll(TimeValue.timeValueMinutes(2)).execute().actionGet();

}

To make sure that we are at the end of the scroll, we can check that no results are returned.
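When you reach the end of a scroll (or abandon it early), it is good practice to release the server-side context instead of waiting for the timeout to expire. A minimal sketch, assuming the clear scroll API exposed by the 1.x client:

// Sketch: release the scroll context as soon as you are done with it
client.prepareClearScroll()
    .addScrollId(response.getScrollId())
    .execute().actionGet();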

There are many scenarios in which scroll is very important. For example, in big data solutions where the number of results is huge, it's very easy to hit the timeout. In these scenarios it's important to design your architecture so that results are fetched as fast as possible and not processed inside the scroll loop itself, deferring the result manipulation so that it can be done in a distributed way.
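One simple way to keep the scroll loop fast is to hand each batch of hits to a worker pool and let the loop go straight back to fetching the next page. The following is only a sketch: processHit is a hypothetical callback standing in for your own document processing logic:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.elasticsearch.search.SearchHit;

ExecutorService workers = Executors.newFixedThreadPool(4);
while (response.getHits().hits().length != 0) {
    for (final SearchHit hit : response.getHits().hits()) {
        // defer the per-document work so the loop only fetches the next page
        workers.submit(new Runnable() {
            public void run() {
                processHit(hit); // hypothetical: your processing logic
            }
        });
    }
    response = client.prepareSearchScroll(response.getScrollId())
        .setScroll(TimeValue.timeValueMinutes(2)).execute().actionGet();
}
workers.shutdown();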

There's more...

The scroll call is often used in conjunction with scan queries (see the Executing a scan query recipe in Chapter 5, Search, Queries, and Filters). Scan queries allow you to execute a query and return its results via a scroll with very good performance.

A scan query consumes less memory than a standard scroll query for the following reasons:

· It doesn't compute score and doesn't return it

· It doesn't allow sorting, so it is not necessary to store the order value(s) in memory

· It doesn't allow computing facets or aggregations

· It doesn't allow execution of a child query or nested query, which in turn reduces memory usage

The scan method collects the results and iterates over them. It stores only the document IDs, so it is very useful when you need to return all the documents that match a query, even when the result set is huge.

To execute a scan query, the search type value must be passed to the search call as follows:

SearchResponse searchResponse = client.prepareSearch()

.setSearchType(SearchType.SCAN)

.setQuery(matchAllQuery())

.setSize(100)

.setScroll(TimeValue.timeValueMinutes(2))

.execute().actionGet();

A big difference between using scan and a plain scroll is that the first call doesn't return any hits, only the scroll ID; thus, to get the first results you have to execute a scroll request.

In the preceding code, the loop iterates until no results are available:

while (true) {

searchResponse = client.prepareSearchScroll(searchResponse.getScrollId()).setScroll(TimeValue.timeValueMinutes(2)).execute().actionGet();

// do something with searchResponse.getHits() if any

if (searchResponse.getHits().hits().length == 0) {

break;

}

}
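Although the first scan response carries no hits, it does report the total number of matching documents, which is useful, for example, for progress reporting. A small sketch:

// Sketch: the first scan response has an empty hits array but a valid total count
long totalToFetch = searchResponse.getHits().getTotalHits();
System.out.println("Documents to scroll through: " + totalToFetch);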

See also

· The Executing a scan query recipe in Chapter 5, Search, Queries, and Filters