
Chapter 11. Python Integration

In this chapter, we will cover the following recipes:

· Creating a client

· Managing indices

· Managing mappings

· Managing documents

· Executing a standard search

· Executing a search with aggregations

Introduction

In the previous chapter, we saw how we can use a native client to access the ElasticSearch server with a Java implementation. This chapter is dedicated to the Python language and managing common tasks via its clients.

Apart from Java, the ElasticSearch team supports official clients for Perl, PHP, Python, .NET, and Ruby (see the announcement post on the ElasticSearch blog at http://www.elasticsearch.org/blog/unleash-the-clients-ruby-python-php-perl/). This support is quite recent, as the initial public release of these clients was in September 2013. These clients have a lot of advantages over other implementations. A few of them are mentioned here:

· The clients are strongly tied to the ElasticSearch API, as defined here:

All of the ElasticSearch APIs provided by these clients are direct translations of the native ElasticSearch REST interface. There should be no guessing required.

--The ElasticSearch team

· They handle dynamic node detection and failover: they are built with a strong networking base to communicate with the cluster.

· They provide full coverage of the REST API. They share the same application approach in every language in which they are available, so switching from one language to another is quick.

· They provide transport abstraction so that a user can plug in to different backends.

· They are easily extensible.

The Python client works well with other Python frameworks such as Django, web2py, and Pyramid. It allows very fast access to documents, indices, and clusters.

In this chapter, besides the standard ElasticSearch client, we will discuss the PyES client developed by me and other contributors since 2010. PyES extends the standard client with a lot of functionalities and helpers, as follows:

· Automatic management of common conversions between types.

· An object-oriented approach to common ElasticSearch elements (the standard client works only with plain Python dictionaries).

· Helpers for searching, such as advanced result iterators and Django-like querysets.

In this chapter, I'll try to describe the most important functionalities of ElasticSearch's official Python client and PyES (https://github.com/aparo/pyes). For additional examples and in-depth references, I suggest that you take a look at the online GitHub repository at https://github.com/elasticsearch/elasticsearch-py and the documentation.

Creating a client

The official ElasticSearch clients are designed to support several transport layers. They allow you to use HTTP, Thrift, or the Memcached protocol without changing your application code.

The Thrift and Memcached protocols are binary and, due to their structure, they are generally a bit faster than HTTP. They wrap the same REST API and share the same behavior, so switching between these protocols is easy.

In this recipe, we'll see how to instantiate a client with the different protocols.

Getting ready

You need a working ElasticSearch cluster and plugins for extra protocols. The full code of this recipe is in the chapter_11/client_creation.py file, available in the code bundle of this book and on GitHub (https://github.com/aparo/elasticsearch-cookbook-second-edition).

How to do it...

In order to create a client, perform the following steps:

1. Before using the Python client, you need to install it (possibly in a Python virtual environment). The client is officially hosted on PyPI (http://pypi.python.org/) and it's easy to install with the pip command:

pip install elasticsearch

This standard installation only provides the HTTP protocol.

2. To install the Thrift protocol, you need to install the plugin on the ElasticSearch server:

bin/plugin -install elasticsearch/elasticsearch-transport-thrift/2.3.0

On the client side, you need to install the Thrift support for Python, available in the Thrift package (https://pypi.python.org/pypi/thrift/), installable using the pip command:

pip install thrift

3. To install the Memcached protocol, you need to install the plugin on the ElasticSearch server:

bin/plugin -install elasticsearch/elasticsearch-transport-memcached/2.3.0

After having installed a plugin, remember to restart your server to load it.

On the client side, we need to install Memcached support for Python provided by the pylibmc package, which is installable via the pip command:

pip install pylibmc

Tip

To compile this library, the libmemcached development files must be installed. On Mac OS X, you can install them via brew install libmemcached; on Debian-based Linux, via the libmemcached-dev package.

4. After having installed the server plugins and the client libraries required by the chosen protocol, you can instantiate the client. It resides in Python's elasticsearch package, which must be imported as follows:

import elasticsearch

If you don't pass arguments to the Elasticsearch class, it instantiates a client that connects to localhost on port 9200 (the default ElasticSearch HTTP port):

es = elasticsearch.Elasticsearch()

5. If your cluster is composed of more than one node, you can pass the list of nodes; the client will connect to them in a round-robin fashion, distributing the HTTP load:

es = elasticsearch.Elasticsearch(["search1:9200", "search2:9200"])

6. Often, the complete topology of the cluster is unknown. If you know the address of at least one node, you can use the sniff_on_start=True option, which activates the client's ability to detect the other nodes in the cluster:

es = elasticsearch.Elasticsearch(["search1:9200"], sniff_on_start=True)

7. The default transport is Urllib3HttpConnection but, if you want to use the requests-based HTTP transport, you need to override the connection_class parameter by passing the RequestsHttpConnection class:

from elasticsearch.connection import RequestsHttpConnection

es = elasticsearch.Elasticsearch(sniff_on_start=True, connection_class=RequestsHttpConnection)

8. If you want to use Thrift as the transport layer, you should import the ThriftConnection class and pass it to the client:

from elasticsearch.connection import ThriftConnection

es = elasticsearch.Elasticsearch(["search1:9500"], sniff_on_start=True, connection_class=ThriftConnection)

9. If you want to use Memcached as the transport layer, you should import the MemcachedConnection class and pass it to the client:

from elasticsearch import Elasticsearch, MemcachedConnection

es = elasticsearch.Elasticsearch(["search1:11211"], sniff_on_start=True, connection_class=MemcachedConnection)

How it works...

In order to communicate with an ElasticSearch cluster, a client is required.

The client manages all the communication layers from your application to an ElasticSearch server, using the specified protocol. The standard protocol for REST calls is the HTTP protocol.

The ElasticSearch Python client allows you to use one of the following protocols:

· HTTP: This provides two implementations: one based on urllib3 (https://pypi.python.org/pypi/urllib3) and one based on requests (https://pypi.python.org/pypi/requests).

· Thrift: This is one of the fastest protocols available. To use it, the Thrift libraries must be installed on both the server and the client sides.

· Memcached: This allows you to communicate with ElasticSearch as if it were a Memcached server. To use it, the Memcached libraries must be installed on both the server and the client.

For general usage, the HTTP protocol is very good and it's the de facto standard. The other protocols also work well because they often reuse the same client object, so you don't have to reinstantiate the connections too often. (For more information, see Chapter 1, Getting Started, which contains a comparison of the available protocols.)

The ElasticSearch Python client requires a server to connect to. If it is not defined, it tries to use one on the local machine (localhost). If you have more than one node, you can pass a list of servers to connect to.

Note

The client automatically tries to balance the operations on all the cluster nodes. This is a very powerful functionality provided by the ElasticSearch client.

To improve the list of available nodes, it is possible to set the client to autodiscover new nodes. I suggest that you use this feature, because it is common to have a cluster with many nodes, some of which you might need to shut down for maintenance. The following options can be passed to the client in order to control the discovery (a combined example follows the list):

· sniff_on_start (by default, False): This allows you to obtain the list of nodes from the cluster at startup time

· sniffer_timeout (by default, None): This is the number of seconds between the automatic sniffing of the cluster nodes

· sniff_on_connection_fail (by default, False): This controls whether a connection failure triggers a sniff of the cluster nodes
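
Here is a minimal sketch that combines the three discovery options (the node addresses are placeholders for your own cluster):

import elasticsearch

es = elasticsearch.Elasticsearch(
    ["search1:9200", "search2:9200"],
    sniff_on_start=True,           # fetch the node list at startup
    sniffer_timeout=60,            # re-sniff the cluster every 60 seconds
    sniff_on_connection_fail=True  # re-sniff when a connection fails
)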

The default client configuration uses the HTTP protocol via the urllib3 library. If you want to use another transport protocol, you need to pass the transport class type in the connection_class parameter. These are the currently implemented classes:

· Urllib3HttpConnection (default): This class uses HTTP (usually on port 9200)

· RequestsHttpConnection: This is an alternative to the Urllib3HttpConnection class, based on the requests library

· ThriftConnection: This uses the Thrift protocol (usually on port 9500)

· MemcachedConnection: This uses the Memcached protocol (usually on port 11211)

There's more…

If you need more high-level functionality than the official client provides, PyES gives you a more Pythonic, object-oriented approach to working with ElasticSearch. PyES is easily installable via the pip command (the most recent version is available on GitHub):

pip install pyes

To initialize a client, you need to import the ES object and instantiate it:

from pyes import ES

es = ES()

The protocol is detected from the URLs of the server list passed to the constructor. If no server parameter is passed to the constructor, localhost on port 9200 is used.
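
For example, assuming PyES's scheme-based server syntax, the transport can be selected via the URL (the host names are placeholders):

from pyes import ES

es_http = ES("http://search1:9200")      # HTTP transport
es_thrift = ES("thrift://search1:9500")  # Thrift transport, if the server plugin is installed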

The PyES client offers the same connection functionalities as the official client, as described in the previous paragraphs, because it internally uses the official ElasticSearch client.

See also

· PyES on GitHub at https://github.com/aparo/pyes and on PyPI at https://pypi.python.org/pypi/pyes

· The PyES online documentation at http://pythonhosted.org/pyes/

· The Python Thrift library at https://pypi.python.org/pypi/thrift/

· The ElasticSearch Thrift plugin at https://github.com/elasticsearch/elasticsearch-transport-thrift

· ElasticSearch Transport Memcached at https://github.com/elasticsearch/elasticsearch-transport-memcached

· The Python Memcached library at http://pypi.python.org/pypi/pylibmc/1.2.3

Managing indices

In the previous recipe, we saw how to initialize a client in order to send calls to an ElasticSearch cluster. In this recipe, we will see how to manage indices via client calls.

Getting ready

You need a working ElasticSearch cluster and the packages in the Creating a client recipe of this chapter.

The full code of this recipe is in the chapter_11/indices_management.py file, available in the code bundle of this book and on GitHub (https://github.com/aparo/elasticsearch-cookbook-second-edition).

How to do it...

In Python, managing the life cycle of your indices is easy. Perform the following steps:

1. First, initialize a client, as follows:

import elasticsearch

es = elasticsearch.Elasticsearch()

index_name = "my_index"

2. All the indices' methods are available in the client.indices namespace. You can create an index and wait for its creation to complete:

es.indices.create(index_name)

es.cluster.health(wait_for_status="yellow")

3. You can close/open an index, as follows:

es.indices.close(index_name)

es.indices.open(index_name)

es.cluster.health(wait_for_status="yellow")

4. You can optimize an index, as shown here:

es.indices.optimize(index_name)

5. You can delete an index:

es.indices.delete(index_name)

How it works...

The ElasticSearch Python client has two special managers: one for indices (<client>.indices) and one for clusters (<client>.cluster).

For every operation that works with indices, the first value is generally the name of the index. If you need to execute an action on several indices in one go, the index names must be concatenated with a comma (for example, index1,index2,indexN). It's also possible to use glob patterns to address multiple indices, such as index*.
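
For example, a minimal sketch that refreshes two indices in one call and then deletes all the indices matching a glob pattern (the index names are placeholders):

es.indices.refresh(index="index1,index2")  # one call, two indices
es.indices.delete(index="index*")          # glob pattern matching several indices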

In PyES, the concatenation is automatically managed.

To create an index, the call requires the index name (index_name); use the following code:

es.indices.create(index_name)

Other optional parameters, such as index settings and mappings, can also be passed; you will see this advanced feature in the next recipe.
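
For example, a sketch that passes custom settings at creation time (the shard and replica counts are arbitrary):

es.indices.create(index=index_name, body={
    "settings": {"number_of_shards": 2, "number_of_replicas": 1}})
es.cluster.health(wait_for_status="yellow")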

Index creation can take some time (from a few milliseconds to seconds); it is an asynchronous operation and it depends on the complexity of the cluster, the speed of the disk, the network congestion, and so on. To be sure that this action has completed, you need to check whether the cluster's health turns to yellow or green, as follows:

es.cluster.health(wait_for_status="yellow")

Tip

It's good practice to wait until the cluster status is at least yellow after operations that involve the creation and opening of indices, because these actions are asynchronous.

To close an index, the method is <client>.indices.close, which takes the name of the index to be closed:

es.indices.close(index_name)

To open an index, the method is <client>.indices.open, which takes the name of the index to be opened:

es.indices.open(index_name)

es.cluster.health(wait_for_status="yellow")

Similar to index creation, after an index is opened, it is good practice to wait until it is fully open before executing operations on it. This is done by checking the cluster's health.

To improve the performance of an index, ElasticSearch allows you to optimize it by removing deleted documents (documents are marked as deleted, but not purged from the index segments, for performance reasons) and reducing the number of segments. To optimize an index, the <client>.indices.optimize method must be called on the index to be optimized:

es.indices.optimize(index_name)

Finally, if you want to delete the index, call the <client>.indices.delete function with the name of the index to remove. Remember that deleting an index removes everything related to it, including all its data, and this action cannot be reverted.
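
If the index might not exist, the official client lets you suppress the resulting error via the ignore parameter; a minimal sketch:

# Do not raise an exception if the index is missing (HTTP 404)
es.indices.delete(index=index_name, ignore=[400, 404])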

The PyES indices management code is the same as the official client code.

See also

· The Creating an index recipe in Chapter 4, Basic Operations

· The Deleting an index recipe in Chapter 4, Basic Operations

· The Opening/closing an index recipe in Chapter 4, Basic Operations

Managing mappings

After creating an index, the next step is to add some type mappings to it. We saw how to put a mapping via the REST API in Chapter 4, Basic Operations. In this recipe, we will see how to manage mappings via the official Python client and PyES.

Getting ready

You need a working ElasticSearch cluster and the required packages that are used in the Creating a client recipe in this chapter.

The code of this recipe is in the chapter_11/mapping_management.py and chapter_11/mapping_management_pyes.py files, available in the code bundle of this book and on GitHub (https://github.com/aparo/elasticsearch-cookbook-second-edition).

How to do it...

After you have initialized a client and created an index, the following actions are available in order to manage mappings:

· Creating a mapping

· Retrieving a mapping

· Deleting a mapping

These steps can be easily managed with the following code:

1. Use the following code to initialize the client:

import elasticsearch

es = elasticsearch.Elasticsearch()

2. You can create an index, as follows:

index_name = "my_index"
type_name = "my_type"
es.indices.create(index_name)

es.cluster.health(wait_for_status="yellow")

3. In order to put a mapping, use the following code:

es.indices.put_mapping(index=index_name, doc_type=type_name, body={type_name: {"_type": {"store": "yes"}, "properties": {
    "uuid": {"index": "not_analyzed", "type": "string", "store": "yes"},
    "title": {"index": "analyzed", "type": "string", "store": "yes", "term_vector": "with_positions_offsets"},
    "parsedtext": {"index": "analyzed", "type": "string", "store": "yes", "term_vector": "with_positions_offsets"},
    … truncated…}}})

4. You can retrieve the mapping, as shown here:

mappings = es.indices.get_mapping(index_name, type_name)

5. The mapping can be deleted, as follows:

es.indices.delete_mapping(index_name, type_name)

6. To delete the index, use the following code:

es.indices.delete(index_name)

How it works...

We saw the initialization of the client and index creation in the previous recipe. In order to create a mapping, the method to call is <client>.indices.put_mapping, passing the index name, the type name, and the mapping. Creating a mapping is fully covered in Chapter 3, Managing Mapping. Since standard Python types convert easily to and from JSON, the mapping is passed as a plain dictionary:

es.indices.put_mapping(index_name, type_name, {…})

If an error occurs during the mapping process, an exception is raised. The put_mapping API has two behaviors: create and update.
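
For example, a sketch that traps a failing put mapping call, assuming the RequestError exception exposed by the official client (my_mapping is a placeholder for the mapping dictionary):

from elasticsearch.exceptions import RequestError

try:
    es.indices.put_mapping(index=index_name, doc_type=type_name, body=my_mapping)
except RequestError as ex:
    print("Unable to put the mapping: %s" % ex)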

Note

In ElasticSearch, you cannot remove a property from a mapping. Schema manipulation only allows you to add new properties via the put mapping call.

To retrieve a mapping with the GET mapping API, use the <client>.indices.get_mapping method by providing the index name and type name:

mappings = es.indices.get_mapping(index_name, type_name)

The returned object is obviously the dictionary that describes the mapping.
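
For instance, on ElasticSearch 1.x the response nests the mapping under the index and type names, so the field definitions can be walked as in this sketch:

mappings = es.indices.get_mapping(index_name, type_name)
properties = mappings[index_name]["mappings"][type_name]["properties"]
for field, definition in properties.items():
    print("%s -> %s" % (field, definition.get("type", "object")))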

To remove a mapping, the method is <client>.indices.delete_mapping; it requires the index name and the type name, as shown here:

es.indices.delete_mapping(index_name, type_name)

Note

Deleting a mapping is a destructive operation: it removes the mapping and all the documents of this type.

There's more…

Creating a mapping using the official ElasticSearch client requires a lot of attention when building the dictionary that defines the mapping.

PyES also provides an object-oriented approach to creating a mapping, reducing the probability of errors in defining the mapping and adding typed fields with useful presets. The previous mapping can be written in PyES in this way:

from pyes.mappings import *

docmapping = DocumentObjectField(name=mapping_name)

docmapping.add_property(

StringField(name="parsedtext", store=True, term_vector="with_positions_offsets", index="analyzed"))

docmapping.add_property(

StringField(name="name", store=True, term_vector="with_positions_offsets", index="analyzed"))

docmapping.add_property(

StringField(name="title", store=True, term_vector="with_positions_offsets", index="analyzed"))

docmapping.add_property(IntegerField(name="position", store=True))

docmapping.add_property(DateField(name="date", store=True))

docmapping.add_property(StringField(name="uuid", store=True, index="not_analyzed"))

nested_object = NestedObject(name="nested")

nested_object.add_property(StringField(name="name", store=True))

nested_object.add_property(StringField(name="value", store=True))

nested_object.add_property(IntegerField(name="num", store=True))

docmapping.add_property(nested_object)

The following is a list of the main fields:

· DocumentObjectField: This is a document mapping that contains the object properties

· StringField, DateField, IntegerField, LongField, BooleanField: These are the fields that map the respective field type

· ObjectField: This field allows you to map an embedded object field

· NestedObject: This field allows you to map a nested object

· AttachmentField: This field allows you to map the attachment field

· IPField: This field maps the IP field

The object definition of the mapping ensures that, if the types are correctly defined, all the mapping properties are valid.

The PyES GET mapping API does not return a plain Python dictionary; instead, it returns a DocumentObjectField object for the specified mapping, which automatically manages the transformation from dictionary to objects for easy parsing and editing.

See also

· The Putting a mapping in an index recipe in Chapter 4, Basic Operations

· The Getting a mapping recipe in Chapter 4, Basic Operations

· The Deleting a mapping recipe in Chapter 4, Basic Operations

Managing documents

The APIs for managing documents (indexing, updating, and deleting) are the most important APIs after the search ones. In this recipe, we will see how to use them in a standard way and in bulk actions to improve performance.

Getting ready

You need a working ElasticSearch cluster and the packages used in the Creating a client recipe of this chapter.

The full code of this recipe is in the chapter_11/document_management.py and chapter_11/document_management_pyes.py files, available in the code bundle of this book and on GitHub (https://github.com/aparo/elasticsearch-cookbook-second-edition).

How to do it...

There are three main operations to manage documents, as follows:

· index: This stores a document in ElasticSearch. It is mapped on the Index API call.

· update: This allows you to update some values in a document. Internally (at the Lucene level), this operation deletes the previous document and reindexes it with the new values. It is mapped to the Update API call.

· delete: This deletes a document from the index. It is mapped to the Delete API call.

With the ElasticSearch Python client, the index, update, and delete operations can be performed using the following steps:

1. First, initialize a client and create an index with the mapping:

import elasticsearch
from datetime import datetime

es = elasticsearch.Elasticsearch()

index_name = "my_index"
type_name = "my_type"

from utils import create_and_add_mapping
create_and_add_mapping(es, index_name, type_name)

2. Then, index a document, as follows:

es.index(index=index_name, doc_type=type_name, id=1,
    body={"name": "Joe Tester", "parsedtext": "Joe Testere nice guy", "uuid": "11111", "position": 1,
    "date": datetime(2013, 12, 8)})

… truncated…

3. Next, update a document, as shown here:

es.update(index=index_name, doc_type=type_name, id=1, body={"script": 'ctx._source.position += 1', "lang": "groovy"})

4. Use the following code to delete a document:

es.delete(index=index_name, doc_type=type_name, id=1)

5. You can insert some documents in bulk, as follows:

from elasticsearch.helpers import bulk_index

bulk_index(es, [{"name": "Joe Tester", "parsedtext": "Joe Testere nice guy", "uuid": "11111", "position": 1,
    "date": datetime(2013, 12, 8), "_index": index_name, "_type": type_name, "_id": "1"},
    {"name": "Bill Baloney", "parsedtext": "Bill Testere nice guy", "uuid": "22222", "position": 2,
    "date": datetime(2013, 12, 8)}
])

6. Lastly, remove the index:

es.indices.delete(index_name)

How it works...

In order to simplify the example, after having instantiated the client, a function of the utils package, which sets up the index and puts the mapping, is called:

from utils import create_and_add_mapping

create_and_add_mapping(es, index_name, type_name)

This function contains the code used to create the mapping explained in the previous recipe.

To index a document, the method is <client>.index; it needs the name of the index, the type of the document, and the body of the document (if the ID is not provided, it will be autogenerated):

es.index(index=index_name, doc_type=type_name, id=1,

body={"name": "Joe Tester", "parsedtext": "Joe Testere nice guy", "uuid": "11111", "position": 1,

"date": datetime(2013, 12, 8)})

It also accepts all the parameters that we saw in the REST Index API call in the Indexing a document recipe in Chapter 4, Basic Operations. These are the most common parameters passed to this function (a combined sketch follows the list):

· id: This provides an ID to be used in order to index the document

· routing: This provides a shard routing to index the document in the specified shard

· parent: This provides a parent ID to be used in order to put the child document in the correct shard
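
Here is a sketch of these parameters in use; the child type name and the parent ID are hypothetical and assume a parent/child mapping:

# Index a child document in the same shard as its parent
es.index(index=index_name, doc_type="my_child_type",
    body={"name": "data1", "value": "value1"},
    parent="1", routing="1")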

To update a document, the method used is <client>.update, and it requires the following parameters:

· The index name

· The type name

· The ID of the document

· The script or document that is to be updated

· The language to be used (usually, groovy)

The following is the code to update a document:

es.update(index=index_name, doc_type=type_name, id=2, body={"script": 'ctx._source.position += 1', "lang": "groovy"})

Here, the call accepts all the parameters that we have discussed in the Updating a document recipe in Chapter 4, Basic Operations.

To delete a document, the method used is <client>.delete, and it requires the following parameters:

· Index name

· Type name

· ID of the document

You can use the following code to delete a document:

es.delete(index=index_name, doc_type=type_name, id=3)

Tip

Remember that the effects of ElasticSearch actions that work on a document are never visible instantly in a search. If you want to search without waiting for the automatic refresh (which happens every second by default), you need to manually call the Refresh API on the index, as sketched below.
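
A minimal sketch of forcing a refresh before searching:

es.indices.refresh(index=index_name)
results = es.search(index=index_name, doc_type=type_name,
    body={"query": {"match_all": {}}})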

To execute bulk indexing, the ElasticSearch client provides a helper function that accepts a connection, an iterable of documents, and the bulk size. The bulk size (500 by default) defines the number of actions to send in a single bulk call. The parameters required to correctly control the indexing of each document are put in the document itself with the _ prefix. Generally, these are the special fields:

· _index: This is the name of the index that must be used to store the document

· _type: This is the document type

· _id: This is the ID of the document

The following is the code used to index a document in bulk:

from elasticsearch.helpers import bulk_index

bulk_index(es, [{"name": "Joe Tester", "parsedtext": "Joe Testere nice guy", "uuid": "11111", "position": 1,

"date": datetime(2013, 12, 8), "_index":index_name, "_type":type_name, "_id":"1"},

{"name": "Bill Baloney", "parsedtext": "Bill Testere nice guy", "uuid": "22222", "position": 2,

"date": datetime(2013, 12, 8)}])

There's more…

The previous code can be executed in PyES using the following code:

from pyes import ES

es = ES()

index_name = "my_index"

type_name = "my_type"

from utils_pyes import create_and_add_mapping

create_and_add_mapping(es, index_name, type_name)

es.index(doc={"name": "Joe Tester", "parsedtext": "Joe Testere nice guy", "uuid": "11111", "position": 1},

index=index_name, doc_type=type_name, id=1)

es.index(doc={"name": "data1", "value": "value1"}, index=index_name, doc_type=type_name + "2", id=1, parent=1)

es.index(doc={"name": "Bill Baloney", "parsedtext": "Bill Testere nice guy", "uuid": "22222", "position": 2},

index=index_name, doc_type=type_name, id=2, bulk=True)

… truncated…

es.force_bulk()

es.update(index=index_name, doc_type=type_name, id=2, script='ctx._source.position += 1')

es.update(index=index_name, doc_type=type_name, id=2, script='ctx._source.position += 1', bulk=True)

es.delete(index=index_name, doc_type=type_name, id=1, bulk=True)

es.delete(index=index_name, doc_type=type_name, id=3)

es.force_bulk()

es.indices.refresh(index_name)

es.indices.delete_index(index_name)

The PyES index/update/delete methods are similar to those of the official ElasticSearch client, with the exception that the document must be passed in the doc parameter.

In PyES, to execute an action in bulk, the bulk=True parameter must be passed to the index/update/delete method. With the bulk parameter set, the body of the action is stored in a ListBulker object that collects all the bulk actions until it is full. When the bulk basket is full (its size is defined during ES client initialization), the actions are sent to the server and the basket is emptied, ready to accept new documents.

To force a bulk flush (even if the basket is not full), you can call the <client>.force_bulk method, or execute a refresh or flush on an index.
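
For example, a sketch assuming the bulk_size parameter of the PyES client constructor:

from pyes import ES

es = ES(bulk_size=400)  # flush the bulk basket automatically every 400 actions
es.index(doc={"name": "data"}, index=index_name, doc_type=type_name, bulk=True)
es.force_bulk()  # send any actions still left in the basket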

See also

· The Indexing a document recipe in Chapter 4, Basic Operations

· The Getting a document recipe in Chapter 4, Basic Operations

· The Deleting a document recipe in Chapter 4, Basic Operations

· The Updating a document recipe in Chapter 4, Basic Operations

· The Speeding up atomic operations (bulk operations) recipe in Chapter 4, Basic Operations

Executing a standard search

After you have inserted documents, the most commonly executed action in ElasticSearch is the search. The official ElasticSearch client APIs that are used to search are similar to the REST API.

Getting ready

You need a working ElasticSearch cluster and the packages used in the Creating a client recipe in this chapter.

The code of this recipe is present in the chapter_11/searching.py and chapter_11/searching_pyes.py files, available in the code bundle of this book and on GitHub (https://github.com/aparo/elasticsearch-cookbook-second-edition).

How to do it...

To execute a standard query, the search client method must be called by passing the query parameters, as shown in Chapter 5, Search, Queries, and Filters. The required parameters are the index name, type name, and query DSL. In this example, you will see how to call a match_all query, a term query, and a filter query. To do this, perform the following steps:

1. First, initialize the client and populate the index:

import elasticsearch
from pprint import pprint

es = elasticsearch.Elasticsearch()
index_name = "my_index"
type_name = "my_type"

from utils import create_and_add_mapping, populate

create_and_add_mapping(es, index_name, type_name)
populate(es, index_name, type_name)

2. Then, execute a search with a match_all query and print the results:

results = es.search(index_name, type_name, {"query": {"match_all": {}}})

pprint(results)

3. Next, execute a search with a term query and print the results:

results = es.search(index_name, type_name, {
    "query": {
        "term": {"name": {"boost": 3.0, "value": "joe"}}
    }})

pprint(results)

4. You then need to execute a search with a filtered query and print the results:

results = es.search(index_name, type_name, {"query": {
    "filtered": {
        "filter": {
            "or": [
                {"term": {"position": 1}},
                {"term": {"position": 2}}]
        },
        "query": {"match_all": {}}}}})

pprint(results)

5. Lastly, remove the index, as follows:

es.indices.delete(index_name)

How it works...

The idea behind the official ElasticSearch clients is that they should offer a common API that closely mirrors the REST calls. In Python, the query DSL is easy to use, because Python dictionaries map directly to JSON objects and vice versa.

In the earlier example, before calling the search, we need to initialize the index and put some data in it. This is done using the two helpers available in the utils package, available in the chapter_11 directory.

The two helpers are as follows:

· create_and_add_mapping(es, index_name, type_name): This initializes the index and puts the correct mapping to perform the search. The code of this function is taken from the Managing mappings recipe in this chapter.

· populate(es, index_name, type_name): This populates the index with data. The code of this function is taken from the previous recipe.

After having initialized some data, we can execute queries against it. To execute a search, the method to call is search on the client. This method accepts all the parameters described for the REST calls in the Executing a search recipe in Chapter 5, Search, Queries, and Filters.

This is the actual method signature for the search method:

@query_params('analyze_wildcard', 'analyzer', 'default_operator', 'df', 'explain', 'fields', 'ignore_indices', 'indices_boost', 'lenient', 'lowercase_expanded_terms', 'offset', 'preference', 'q', 'routing', 'scroll', 'search_type', 'size', 'sort', 'source', 'stats', 'suggest_field', 'suggest_mode', 'suggest_size', 'suggest_text', 'timeout', 'version')

def search(self, index=None, doc_type=None, body=None, params=None):
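
These decorated parameters can be passed as keyword arguments instead of being embedded in the body; for example, a sketch using the q (Lucene query-string syntax) and size parameters:

results = es.search(index=index_name, doc_type=type_name, q="name:joe", size=5)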

The following can be the index values:

· An index name or an alias name

· A list of index (or alias) names as a string, separated by a comma (for example, index1,index2,indexN)

· The _all special keyword, which indicates all the indices

The type value can be the following:

· A type name

· A list of type names as a string, separated by a comma (for example, type1,type2,typeN)

· None, which indicates all the types

The body is the search DSL, as we have seen in Chapter 5, Search, Queries, and Filters. In the preceding example, we have the following queries:

· A match_all query (see the Matching all the documents recipe of Chapter 5, Search, Queries, and Filters) to match all the documents of the index/type:

results = es.search(index_name, type_name, {"query":{"match_all": {}}})

· A term query that matches the term joe in the name field with a boost of 3.0:

results = es.search(index_name, type_name, {
    "query": {
        "term": {"name": {"boost": 3.0, "value": "joe"}}
    }})

· A filtered query with a match_all query and an or filter with two term filters matching positions 1 and 2, as shown here:

results = es.search(index_name, type_name, {"query": {
    "filtered": {
        "filter": {
            "or": [
                {"term": {"position": 1}},
                {"term": {"position": 2}}]
        },
        "query": {"match_all": {}}}}})

The returned result is a JSON dictionary that we have discussed in Chapter 5, Search, Queries, and Filters.

If some hits match, they are returned in the hits field. The standard number of results returned is 10. To get more results, you need to paginate through them with the from and size parameters.
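
For example, a sketch that fetches the second page of ten results:

results = es.search(index_name, type_name, {
    "query": {"match_all": {}},
    "from": 10,
    "size": 10})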

In Chapter 5, Search, Queries, and Filters, there is a definition of all the parameters used in the search.

There's more…

If you are using PyES, you can execute the previous code in a more object-oriented way using queries and filter objects. These objects wrap the low-level code that is normally used to process a query, generating the JSON and validating it during generation. The previous example can be rewritten in PyES with the following code:

… truncated…

from pyes.query import *

from pyes.filters import *

results = es.search(MatchAllQuery(), indices=index_name, doc_types=type_name)

print "total:", results.total

for r in results:

print r

print "first element: ", results[0]

print "slice elements: ", results[1:4]

results = es.search(TermQuery("name", "joe", 3), indices=index_name, doc_types=type_name)

… truncated…

For access to query objects, you need to import the query and filters namespaces:

from pyes.query import *

from pyes.filters import *

To execute a match_all query, use the search client method with the same parameters as the official ElasticSearch client. The main difference is that the body parameter is mapped to a query object in PyES. The following code is used to execute such a match_all query:

results = es.search(MatchAllQuery(), indices=index_name, doc_types=type_name)

The PyES search method accepts several types of values as a query, as follows:

· A dictionary as the official client

· A query object or a derived class

· A search object that wraps a query and adds search-related functionalities, such as highlighting, suggesting, aggregating, and explaining

The main difference from the official ElasticSearch client is that the returned result is a special ResultSet object that can be iterated. The ResultSet object is a useful helper because of the following reasons:

· It's lazy, so the query is fired only when the results need to be evaluated/iterated.

· It is iterable, so you can traverse all the records automatically, with new ones fetched as required; otherwise, you would need to manage the pagination manually. If the size is not defined, you can traverse all the results; if you define the size, you can traverse only that number of objects.

· It automatically manages scrolling and scanning queries using a special ResultSet iterator.

· It tries to cache a result range, in order to reduce server usage.

· It can process other extra result manipulations, such as automatic conversion from string to datetime.

For further details on query/filter objects, I suggest that you take a look at the online documentation at http://pythonhosted.org/pyes/.

See also

· The Executing a search recipe in Chapter 5, Search, Queries, and Filters

· The Matching all the documents recipe in Chapter 5, Search, Queries, and Filters

· The PyES online documentation at http://pythonhosted.org/pyes/

Executing a search with aggregations

Searching for results is obviously the main activity of a search engine; thus, aggregations are very important because they often help to augment the results.

Aggregations are executed along with the search by doing an analysis on the searched results.

Getting ready

You need a working ElasticSearch cluster and the packages used in the Creating a client recipe in this chapter.

The code of this recipe is in the chapter_11/aggregation.py and chapter_11/aggregation_pyes.py files, available in the code bundle of this book and on GitHub (https://github.com/aparo/elasticsearch-cookbook-second-edition).

How to do it...

In order to extend a query with aggregations, you need to define an aggregation section similar to what you saw in Chapter 6, Aggregations. In the case of the official ElasticSearch client, you can add the aggregation DSL to the search dictionary in order to provide aggregations results. To do this, perform the following steps:

1. Initialize the client and populate the index, as follows:

import elasticsearch
from pprint import pprint

es = elasticsearch.Elasticsearch()
index_name = "my_index"
type_name = "my_type"

from utils import create_and_add_mapping, populate

create_and_add_mapping(es, index_name, type_name)
populate(es, index_name, type_name)

2. Execute a search with a terms aggregation:

results = es.search(index_name, type_name,
    {
        "query": {"match_all": {}},
        "aggs": {
            "pterms": {"terms": {"field": "parsedtext", "size": 10}}
        }
    })

pprint(results)

3. Execute a search with a date_histogram aggregation, as shown here:

results = es.search(index_name, type_name,
    {
        "query": {"match_all": {}},
        "aggs": {
            "date_histo": {"date_histogram": {"field": "date", "interval": "month"}}
        }
    })
pprint(results)

4. Lastly, remove the index:

es.indices.delete(index_name)

How it works...

As described in Chapter 6, Aggregations, you can calculate aggregations during the search in a distributed way. When you send a query to ElasticSearch with defined aggregations, it adds an additional step in the query processing, allowing aggregation computation.

In the earlier example, there are two kinds of aggregations: the term aggregation and the date histogram aggregation.

The first one is used to count terms, and it is often seen on sites that provide faceted filtering on the terms of the results, such as producers, geographic locations, and so on, as shown here:

results = es.search(index_name, type_name,

{

"query": {"match_all": {}},

"aggs": {

"pterms": {"terms": {"field": "parsedtext", "size": 10}}

}

})

The terms aggregation requires a field to count on. The default number of buckets returned for a field is 10; this value can be changed by defining the size parameter.

The second kind of aggregation that is calculated is the date histogram, which provides hits based on a datetime field. This aggregation requires at least two parameters: the datetime field to be used as the source and the interval to be used for the computation, as shown here:

results = es.search(index_name, type_name,

{

"query": {"match_all": {}},

"aggs": {

"date_histo": {"date_histogram": {"field": "date", "interval": "month"}}

}

})

The search results are standard search responses that we have already seen in Chapter 6, Aggregations.
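
With the official client, the computed values are returned in the aggregations field of the response; here is a sketch that prints the buckets of the terms aggregation defined earlier:

for bucket in results["aggregations"]["pterms"]["buckets"]:
    print("%s: %d" % (bucket["key"], bucket["doc_count"]))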

There's more…

This is how the preceding code can be rewritten in PyES:

from pyes.query import *

from pyes.aggs import *

q = MatchAllQuery()

search = q.search()

search.get_agg_factory().add(TermsAgg('pterms', field="parsedtext"))

results = es.search(search, indices=index_name, doc_types=type_name)

q = MatchAllQuery()

search = q.search()

search.get_agg_factory().add(DateHistogramAgg('date_add',

field='date',

interval='month'))

results = es.search(search, indices=index_name, doc_types=type_name) …

In this case, the code is much more readable. Similar to queries and filters classes, PyES provides aggregation objects that are available in the pyes.aggs namespace.

Because aggregation is a search property and not a query (remember that queries can also be used for delete and count calls), we need to define the aggregation in a search object.

Every query can be converted to a Search object using the .search() method:

q = MatchAllQuery()

search = q.search()

The search object provides a lot of helpers to improve the search experience, as follows:

· AggregationFactory: This is accessible via the agg property to easily build aggregations

· Highlighter: This is accessible via the highlight property to easily build highlight fields

· Sorted: This is accessible via the sort property to add sort fields to a search

· ScriptFields: This is accessible via the script_fields property to add script fields

The AggregationFactory helper easily defines several types of aggregations, as follows:

· add_term: This defines a terms aggregation, as shown here:

search.agg.add_term('tag')

· add_date: This defines a date histogram aggregation

· add_geo: This defines a geo distance aggregation

· add: This allows you to add any aggregation object to the aggregation definition:

search.agg.add(DateHistogramAgg('date_agg',
    field='date',
    interval='month'))

After you have executed the query, the computed aggregations are available in the aggs field of the ResultSet response (that is, results.aggs).

See also

· The Executing the terms aggregation recipe in Chapter 6, Aggregations

· The Executing the stats aggregation recipe in Chapter 6, Aggregations