Chapter 3. Managing Mapping

In this chapter, we will cover the following topics:

· Using explicit mapping creation

· Mapping base types

· Mapping arrays

· Mapping an object

· Mapping a document

· Using dynamic templates in document mapping

· Managing nested objects

· Managing a child document

· Adding a field with multiple mappings

· Mapping a geo point field

· Mapping a geo shape field

· Mapping an IP field

· Mapping an attachment field

· Adding metadata to a mapping

· Specifying a different analyzer

· Mapping a completion suggester

Introduction

Mapping is an important concept in ElasticSearch, as it defines how the search engine should process a document.

Search engines perform two main operations:

· Indexing: This is the action of receiving a document, processing it, and storing it in an index

· Searching: This is the action of retrieving data from the index

These two operations are closely connected; an error in the indexing step can lead to unwanted or missing search results.

ElasticSearch has explicit mapping on an index/type level. When indexing, if a mapping is not provided, a default mapping is created by guessing the structure from the data fields that compose the document. Then, this new mapping is automatically propagated to all the cluster nodes.

The default type mapping has sensible default values, but when you want to change their behavior or customize several other aspects of indexing (storing, ignoring, completion, and so on), you need to provide a new mapping definition.

In this chapter, we'll see all the possible types that compose the mappings.

Using explicit mapping creation

If you consider an index as a database in the SQL world, a mapping is similar to the table definition.

ElasticSearch is able to understand the structure of the document that you are indexing (reflection) and creates the mapping definition automatically (explicit mapping creation).

Getting ready

You will need a working ElasticSearch cluster, an index named test (see the Creating an index recipe in Chapter 4, Basic Operations), and basic knowledge of JSON.

How to do it...

To create an explicit mapping, perform the following steps:

1. You can explicitly create a mapping by adding a new document in ElasticSearch:

· On a Linux shell:

#create an index
curl -XPUT http://127.0.0.1:9200/test
#{"acknowledged":true}

#put a document
curl -XPUT http://127.0.0.1:9200/test/mytype/1 -d '{"name":"Paul", "age":35}'
# {"ok":true,"_index":"test","_type":"mytype","_id":"1","_version":1}

#get the mapping and pretty print it
curl -XGET http://127.0.0.1:9200/test/mytype/_mapping?pretty=true

2. This is how the resulting mapping, autocreated by ElasticSearch, should look:

{
  "mytype" : {
    "properties" : {
      "age" : {
        "type" : "long"
      },
      "name" : {
        "type" : "string"
      }
    }
  }
}

How it works...

The first command line creates an index named test, where you can configure the type/mapping and insert documents.

The second command line inserts a document into the index. (We'll take a look at index creation and record indexing in Chapter 4, Basic Operations.)

During the document's indexing phase, ElasticSearch checks whether the mytype type exists; if not, it creates the type dynamically.

ElasticSearch reads all the default properties for the field of the mapping and starts processing them:

· If the field is already present in the mapping, and the value of the field is valid (that is, if it matches the correct type), then ElasticSearch does not need to change the current mapping.

· If the field is already present in the mapping but the value of the field is of a different type, the type inference engine tries to upgrade the field type (such as from an integer to a long value). If the types are not compatible, then it throws an exception and the index process will fail.

· If the field is not present, ElasticSearch tries to autodetect the field type and updates the mapping with the new field definition.

There's more...

In ElasticSearch, the separation of documents into types is logical: the ElasticSearch core engine manages it transparently. Physically, all the document types go into the same Lucene index, so they are not fully separated. The concept of types is purely logical and is enforced by ElasticSearch. The user doesn't need to worry about this internal management, but with a huge number of records it has an impact on performance, because all the records are stored in the same index files and both reading and writing are affected.

Every document has a unique identifier within an index, called the UID; it's stored in the special _uid field of the document. It's automatically computed by concatenating the type of the document with the _id value. (In our example, the _uid value will be mytype#1.)

The _id value can be provided at the time of indexing, or it can be assigned automatically by ElasticSearch if it's missing.
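To make the distinction concrete, here is a minimal sketch (the document bodies are made up) that reuses the test index from the earlier steps; a PUT with an ID in the URL sets the _id explicitly, while a POST without one lets ElasticSearch generate it:

#index with an explicit _id
curl -XPUT http://127.0.0.1:9200/test/mytype/2 -d '{"name":"Mary", "age":30}'

#let ElasticSearch assign the _id automatically
curl -XPOST http://127.0.0.1:9200/test/mytype/ -d '{"name":"Mary", "age":30}'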

When a mapping type is created or changed, ElasticSearch automatically propagates the mapping changes to all the nodes in the cluster, so that all the shards are aligned and able to process that particular type.

See also

· The Creating an index recipe in Chapter 4, Basic Operations

· The Putting a mapping in an index recipe in Chapter 4, Basic Operations

Mapping base types

Using explicit mapping creation lets you start inserting data quickly with a schema-less approach, without worrying about field types. However, in order to achieve better results and better performance when indexing, it's necessary to manually define a mapping.

Fine-tuning the mapping has some advantages, as follows:

· Reduces the size of the index on disk (disabling functionalities for custom fields)

· Indexes only interesting fields (a general boost to performance)

· Precooks data for a fast search or real-time analytics (such as aggregations)

· Correctly defines whether a field must be analyzed in multiple tokens or whether it should be considered as a single token

ElasticSearch also allows you to use base fields with a wide range of configurations.

Getting ready

You need a working ElasticSearch cluster and an index named test (refer to the Creating an index recipe in Chapter 4, Basic Operations) where you can put the mappings.

How to do it...

Let's use a semi-real-world example of a shop order for our ebay-like shop.

Initially, we define the following order:

Name        | Type         | Description
id          | Identifier   | Order identifier
date        | Date (time)  | Date of the order
customer_id | Id reference | Customer ID reference
name        | String       | Name of the item
quantity    | Integer      | Number of items
vat         | Double       | VAT for the item
sent        | Boolean      | Status, whether the order was sent

Our order record must be converted to an ElasticSearch mapping definition:

{
  "order" : {
    "properties" : {
      "id" : {"type" : "string", "store" : "yes", "index" : "not_analyzed"},
      "date" : {"type" : "date", "store" : "no", "index" : "not_analyzed"},
      "customer_id" : {"type" : "string", "store" : "yes", "index" : "not_analyzed"},
      "sent" : {"type" : "boolean", "index" : "not_analyzed"},
      "name" : {"type" : "string", "index" : "analyzed"},
      "quantity" : {"type" : "integer", "index" : "not_analyzed"},
      "vat" : {"type" : "double", "index" : "no"}
    }
  }
}

Now the mapping is ready to be put in the index. We'll see how to do this in the Putting a mapping in an index recipe in Chapter 4, Basic Operations.
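As a quick preview of that recipe, a minimal sketch of storing this mapping (the file name is illustrative and the test index is assumed to exist already):

#save the mapping shown above in order_mapping.json, then:
curl -XPUT 'http://127.0.0.1:9200/test/order/_mapping' -d @order_mapping.json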

How it works...

The field type must be mapped to one of ElasticSearch's base types, adding options for how the field must be indexed.

The next table is a reference of the mapping types:

Type                  | ES type | Description
String, VarChar, Text | string  | A text field: such as a nice text or CODE0011
Integer               | integer | An integer (32 bit): such as 1, 2, 3, 4
Long                  | long    | A long value (64 bit)
Float                 | float   | A floating-point number (32 bit): such as 1.2, 4.5
Double                | double  | A floating-point number (64 bit)
Boolean               | boolean | A Boolean value: such as true, false
Date/Datetime         | date    | A date or datetime value: such as 2013-12-25, 2013-12-25T22:21:20
Bytes/Binary          | binary  | Binary data such as a file or stream of bytes

Depending on the data type, it's possible to give explicit directives to ElasticSearch on processing the field for better management. The most-used options are:

· store: This marks the field to be stored in a separate index fragment for fast retrieval. Storing a field consumes disk space, but it reduces computation if you need to extract the field from a document (that is, in scripting and aggregations). The possible values for this option are no and yes (the default value is no).

Note

Stored fields are faster than others at faceting.

· index: This configures the field to be indexed (the default value is analyzed). The following are the possible values for this parameter:

· no: This field is not indexed at all. It is useful to hold data that must not be searchable.

· analyzed: This field is analyzed with the configured analyzer. It is generally lowercased and tokenized, using the default ElasticSearch configuration (StandardAnalyzer).

· not_analyzed: This field is processed and indexed, but without being changed by an analyzer. The default ElasticSearch configuration uses the KeywordAnalyzer field, which processes the field as a single token.

· null_value: This defines a default value if the field is missing.

· boost: This is used to change the importance of a field (the default value is 1.0).

· index_analyzer: This defines an analyzer to be used in order to process a field. If it is not defined, the analyzer of the parent object is used (the default value is null).

· search_analyzer: This defines an analyzer to be used during the search. If it is not defined, the analyzer of the parent object is used (the default value is null).

· analyzer: This sets both the index_analyzer and search_analyzer field to the defined value (the default value is null).

· include_in_all: This marks the current field to be indexed in the special _all field (a field that contains the concatenated text of all the fields). The default value is true.

· index_name: This is the name of the field to be stored in the index. This property allows you to rename the field at indexing time; it can be used to manage data migrations over time without breaking the application layer.

· norms: This controls the Lucene norms. Norms are used to better score queries; if the field is used only for filtering, it's best practice to disable them in order to reduce resource usage (the default value is true for analyzed fields and false for not_analyzed ones).
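As a small sketch of that last point (the field name is illustrative), an analyzed field that is used only for filtering can have its norms disabled like this:

"tag" : {
  "type" : "string",
  "index" : "analyzed",
  "norms" : { "enabled" : false }
}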

There's more...

In this recipe, we saw the most-used options for the base types, but there are many other options that are useful for advanced usage.

An important parameter, available only for string mappings, is term_vector (the vector of the terms that compose a string; check out the Lucene documentation for further details at http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/index/Terms.html). Its possible values are:

· no: This is the default value; the term_vector field is skipped

· yes: This stores the term_vector field

· with_offsets: This stores term_vector with a token offset (the start or end position in a block of characters)

· with_positions: This stores the position of the token in the term_vector field

· with_positions_offsets: This stores all the term_vector data

Note

Term vectors allow fast highlighting but consume a lot of disk space due to the storage of additional text information. It's best practice to activate them only in the fields that require highlighting, such as title or document content.
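Following that advice, a minimal sketch (the field name is illustrative) of a field meant to be highlighted with full term vector data:

"content" : {
  "type" : "string",
  "index" : "analyzed",
  "term_vector" : "with_positions_offsets"
}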

See also

· The ElasticSearch online documentation provides a full description of all the properties for the different mapping fields at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-core-types.html

· The Specifying a different analyzer recipe in this chapter shows alternative analyzers to the standard one.

Mapping arrays

An array or a multivalue field is very common in data models (such as multiple phone numbers, addresses, names, aliases, and so on), but it is not natively supported in traditional SQL solutions.

In SQL, multivalue fields require the creation of accessory tables that must be joined in order to gather all the values, leading to poor performance when the cardinality of records is huge.

ElasticSearch, which works natively in JSON, provides support for multivalue fields transparently.

Getting ready

You need a working ElasticSearch cluster.

How to do it...

Every field is automatically managed as an array. For example, in order to store tags for a document, this is how the mapping must be:

{
  "document" : {
    "properties" : {
      "name" : {"type" : "string", "index" : "analyzed"},
      "tag" : {"type" : "string", "store" : "yes", "index" : "not_analyzed"},
      …
    }
  }
}

This mapping is valid for indexing this document:

{"name": "document1", "tag": "awesome"}

It can also be used for the following document:

{"name": "document2", "tag": ["cool", "awesome", "amazing"]}

How it works...

ElasticSearch transparently manages the array; there is no difference whether you declare a single value or multiple values, due to its Lucene core nature.

Multiple values for a field are managed in Lucene by adding them to a document with the same field name (index_name in ES). If the index_name field is not defined in the mapping, it is taken from the name of the field. It can also be set to other values for custom behaviors, such as renaming a field at the indexing level or merging two or more JSON fields into a single Lucene field. Redefining the index_name field must be done with caution, as it impacts searching too.

For people with a SQL background, this behavior might seem strange, but it is a key point in the NoSQL world, as it reduces the need for join queries and for accessory tables to manage multiple values. An array of embedded objects behaves in the same way as an array of simple fields.

Mapping an object

The object is the base structure (analogous to a record in SQL). ElasticSearch extends the traditional use of objects, allowing the use of recursive embedded objects.

Getting ready

You need a working ElasticSearch cluster.

How to do it...

You can rewrite the mapping of the order type from the Mapping base types recipe using an array of items:

{
  "order" : {
    "properties" : {
      "id" : {"type" : "string", "store" : "yes", "index" : "not_analyzed"},
      "date" : {"type" : "date", "store" : "no", "index" : "not_analyzed"},
      "customer_id" : {"type" : "string", "store" : "yes", "index" : "not_analyzed"},
      "sent" : {"type" : "boolean", "store" : "no", "index" : "not_analyzed"},
      "item" : {
        "type" : "object",
        "properties" : {
          "name" : {"type" : "string", "store" : "no", "index" : "analyzed"},
          "quantity" : {"type" : "integer", "store" : "no", "index" : "not_analyzed"},
          "vat" : {"type" : "double", "store" : "no", "index" : "not_analyzed"}
        }
      }
    }
  }
}

How it works...

ElasticSearch speaks native JSON, so every complex JSON structure can be mapped into it.

When ElasticSearch is parsing an object type, it tries to extract the fields and process them according to the defined mapping; otherwise, it learns the structure of the object using reflection.

The following are the most important attributes for an object:

· properties: This is a collection of fields or objects (we consider them as columns in the SQL world).

· enabled: This defines whether the object needs to be processed. If it's set to false, the data contained in the object is not indexed and cannot be searched (the default value is true).

· dynamic: This allows ElasticSearch to add new field names to the object using reflection on the values of inserted data (the default value is true). If it's set to false, new fields in an indexed object are silently ignored. If it's set to strict, an error is raised when a new field is present in the object and the indexing of the document is skipped. Controlling the dynamic parameter allows you to be safe about changes in the document structure.

· include_in_all: This adds the object values (the default value is true) to the special _all field (used to aggregate the text of all the document fields).

The most-used attribute is properties, which allows you to map the fields of the object in ElasticSearch fields.

Disabling the indexing part of the document reduces the index size; however, the data cannot be searched. In other words, you end up with a smaller file on disk, but there is a cost incurred in functionality.
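For instance, a minimal sketch (the field name is illustrative) of an object that is kept in the source but neither indexed nor searchable:

"metadata" : {
  "type" : "object",
  "enabled" : false
}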

There's more...

There are also other, rarely used properties, such as index_name and path, which change how Lucene indexes the object by modifying the index's inner structure.

See also

Special objects are described in the Mapping a document, Managing a child document, and Managing nested objects recipes in this chapter.

Mapping a document

The document is also referred to as the root object. It has special parameters that control its behavior, which are mainly used internally to do special processing, such as routing or managing the time-to-live of documents.

In this recipe, we'll take a look at these special fields and learn how to use them.

Getting ready

You need a working ElasticSearch cluster.

How to do it...

You can extend the preceding order example by adding some special fields, as follows:

{

"order": {

"_id": {

"path": "order_id"

},

"_type": {

"store": "yes"

},

"_source": {

"store": "yes"

},

"_all": {

"enable": false

},

"_analyzer": {

"path": "analyzer_field"

},

"_boost": {

"null_value": 1.0

},

"_routing": {

"path": "customer_id",

"required": true

},

"_index": {

"enabled": true

},

"_size": {

"enabled": true,

"store": "yes"

},

"_timestamp": {

"enabled": true,

"store": "yes",

"path": "date"

},

"_ttl": {

"enabled": true,

"default": "3y"

},

"properties": {

… truncated ….

}

}

}

How it works...

Every special field has its own parameters and value options, as follows:

· _id (by default, it's not indexed or stored): This allows you to index only the ID part of the document. It can be associated with a path field that will be used to extract the id from the source of the document:

· "_id" : {

· "path" : "order_id"

},

· _type (by default, it's indexed and not stored): This allows you to index the type of the document.

· _index (the default value is enabled=false): This controls whether the index must be stored as part of the document. It can be enabled by setting the parameter as enabled=true.

· _boost (the default value is null_value=1.0): This controls the boost (the value used to increment the score) level of the document. It can be overridden in the boost parameter for the field.

· _size (the default value is enabled=false): This controls whether the size of the source record must be stored.

· _timestamp (by default, enabled=false): This enables the automatic indexing of the document's timestamp. If a path value is given, the timestamp is extracted from that field of the document source. It can be queried as a standard datetime.

· _ttl (by default, enabled=false): The time-to-live parameter sets the expiration time of the document. When a document expires, it is removed from the index. The optional default parameter provides a default expiration value at the type level.

· _all (the default is enabled=true): This controls the creation of the _all field (a special field that aggregates all the text of all the document fields). Because this functionality requires a lot of CPU and storage, if it is not required it is better to disable it.

· _source (by default, enabled=true): This controls the storage of the document source. Storing the source is very useful, but it's a storage overhead; so, if it is not required, it's better to turn it off.

· _parent: This defines the parent document (see the Mapping a child document recipe in this chapter).

· _routing: This controls in which shard the document is to be stored. It supports the following additional parameters:

· path: This is used to provide a field to be used for routing (customer_id in the earlier example).

· required (true/false): This is used to force the presence of the routing value, raising an exception if it is not provided

· _analyzer: This allows you to define a document field that contains the name of the analyzer to be used for fields that do not explicitly define an analyzer or an index_analyzer.

The ability to control how a document is indexed and processed is very important and allows you to resolve issues related to complex data types.

Every special field has parameters to set a particular configuration, and some of their behaviors may change in different releases of ElasticSearch.
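To see some of these settings in action, here is a hypothetical indexing call (the values are made up, and the use of POST without an ID relies on the _id path extraction described above); order_id feeds _id, customer_id feeds _routing, date feeds _timestamp, and the ttl query parameter overrides the default _ttl:

curl -XPOST 'http://127.0.0.1:9200/test/order?ttl=90d' -d '{
  "order_id" : "1",
  "customer_id" : "cust_0042",
  "date" : "2014-11-16T20:35:00",
  "sent" : false
}'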

See also

· The Using dynamic templates in document mapping recipe in this chapter

· The Putting a mapping in an index recipe in Chapter 4, Basic Operations

Using dynamic templates in document mapping

In the Using explicit mapping creation recipe, we saw how ElasticSearch is able to guess the field type using reflection. In this recipe, we'll see how we can help it to improve its guessing capabilities via dynamic templates.

The dynamic template feature is very useful, for example, if you need to create several indices with similar types, because it removes the need to define mappings in custom initialization code and lets the document mappings be created automatically. A typical use is to define types for logstash log indices.

Getting ready

You need a working ElasticSearch cluster.

How to do it...

You can extend the previous mapping by adding document-related settings:

{

"order" : {

"index_analyzer":"standard",

"search_analyzer":"standard",

"dynamic_date_formats":["yyyy-MM-dd", "dd-MM-yyyy"],

"date_detection":true,

"numeric_detection":true,

"dynamic_templates":[

{"template1":{

"match":"*",

"match_mapping_type":"long",

"mapping":{"type":" {dynamic_type}", "store":true}

}}

],

"properties" : {…}

}

}

How it works...

The Root object (document) controls the behavior of its fields and all its child object fields.

In the document mapping, you can define the following fields:

· index_analyzer: This defines the analyzer to be used for indexing within this document. If a field doesn't define an index_analyzer, this document-level one is used as the default.

· search_analyzer: This defines the analyzer to be used for searching. If a field doesn't define an analyzer, the search_analyzer field of the document, if available, is taken.

Tip

If you need to set the index_analyzer and search_analyzer field with the same value, you can use the analyzer property.

· date_detection (by default true): This enables the extraction of a date from a string.

· dynamic_date_formats: This is a list of valid date formats; it's used if date_detection is active.

· numeric_detection (by default false): This enables you to convert strings to numbers, if it is possible.

· dynamic_templates: This is a list of the templates used to change the explicit mapping, if one of these templates is matched. The rules defined in it are used to build the final mapping.

A dynamic template is composed of two parts: the matcher and the mapping.

In order to match a field to activate the template, several types of matchers are available:

· match: This allows you to define a match on the field name. The expression is a standard glob pattern (http://en.wikipedia.org/wiki/Glob_(programming)).

· unmatch (optional): This allows you to define the expression to be used to exclude matches.

· match_mapping_type (optional): This controls the types of the matched fields. For example, string, integer, and so on.

· path_match (optional): This allows you to match the dynamic template against the full dot notation of the field. For example, obj1.*.value.

· path_unmatch (optional): This does the opposite of path_match, that is, excluding the matched fields.

· match_pattern (optional): This allows you to switch the matchers to regex (regular expression); otherwise, the glob pattern match is used.

The dynamic template mapping part is standard, but with the ability to use special placeholders as follows:

· {name}: This will be replaced with the actual dynamic field name

· {dynamic_type}: This will be replaced with the type of the matched field

Tip

The order of the dynamic templates is very important. Only the first one that matches is executed. It is a good practice to order the ones with stricter rules first, followed by the other templates.
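As a sketch of this ordering (the template names and patterns are illustrative), a stricter rule that keeps *_id string fields not analyzed is listed before a catch-all rule for the remaining strings:

"dynamic_templates" : [
  {"ids_not_analyzed" : {
    "match" : "*_id",
    "match_mapping_type" : "string",
    "mapping" : {"type" : "string", "index" : "not_analyzed"}
  }},
  {"strings_analyzed" : {
    "match" : "*",
    "match_mapping_type" : "string",
    "mapping" : {"type" : "string", "index" : "analyzed"}
  }}
]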

There's more...

The dynamic template is very handy when you need to set a mapping configuration for all the fields. This action can be performed by adding a dynamic template similar to this one:

"dynamic_templates" : [

{

"store_generic" : {

"match" : "*",

"mapping" : {

"store" : "yes"

}

}

}

]

In this example, all the new fields, which will be added with the explicit mapping, will be stored.

See also

· The Using explicit mapping creation recipe in this chapter

· The Mapping a document recipe in this chapter

· The Glob pattern at http://en.wikipedia.org/wiki/Glob_pattern

Managing nested objects

There is a special type of embedded object: the nested object. This resolves problems related to Lucene indexing architecture, in which all the fields of the embedded objects are viewed as a single object. During a search in Lucene, it is not possible to distinguish between the values of different embedded objects in the same multivalued array.

If we consider the previous order example, it's not possible to distinguish between an item name and its quantity with the same query, as Lucene puts them in the same Lucene document object. We need to index them as separate documents and then join them; this is managed transparently by nested objects and nested queries.

Getting ready

You need a working ElasticSearch cluster.

How to do it...

A nested object is defined as a standard object with the type nested.

From the example in the Mapping an object recipe in this chapter, we can change the type from object to nested as follows:

{
  "order" : {
    "properties" : {
      "id" : {"type" : "string", "store" : "yes", "index" : "not_analyzed"},
      "date" : {"type" : "date", "store" : "no", "index" : "not_analyzed"},
      "customer_id" : {"type" : "string", "store" : "yes", "index" : "not_analyzed"},
      "sent" : {"type" : "boolean", "store" : "no", "index" : "not_analyzed"},
      "item" : {
        "type" : "nested",
        "properties" : {
          "name" : {"type" : "string", "store" : "no", "index" : "analyzed"},
          "quantity" : {"type" : "integer", "store" : "no", "index" : "not_analyzed"},
          "vat" : {"type" : "double", "store" : "no", "index" : "not_analyzed"}
        }
      }
    }
  }
}

How it works...

When a document is indexed, if an embedded object is marked as nested, it's extracted from the original document and indexed as a separate document.

In the preceding example, we reused the mapping from the previous recipe, Mapping an object, changing only the type of item from object to nested. No other action needs to be taken to convert an embedded object into a nested one.

Nested objects are special Lucene documents that are saved in the same block of data as their parents — this approach allows faster joining with the parent document.

Nested objects are not searchable with standard queries, but only with nested ones. They are not shown in standard query results.

The lives of nested objects are related to their parents; deleting/updating a parent automatically deletes/updates all the nested children. Changing the parent means ElasticSearch will do the following:

· Mark the old document as deleted

· Mark all its nested documents as deleted

· Index the new version of the document

· Index all its nested documents

There's more...

Sometimes, it is necessary to propagate information about nested objects to their parents or their root objects, mainly to build simpler queries about their parents. To achieve this goal, the following two special properties of nested objects can be used:

· include_in_parent: This allows you to automatically add the nested fields to the immediate parent

· include_in_root: This adds the nested objects' fields to the root object

These settings add to data redundancy, but they reduce the complexity of some queries, improving performance.
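A minimal sketch of the item mapping from this recipe with parent propagation enabled:

"item" : {
  "type" : "nested",
  "include_in_parent" : true,
  "properties" : {
    "name" : {"type" : "string", "store" : "no", "index" : "analyzed"},
    "quantity" : {"type" : "integer", "store" : "no", "index" : "not_analyzed"},
    "vat" : {"type" : "double", "store" : "no", "index" : "not_analyzed"}
  }
}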

See also

· The Managing a child document recipe in this chapter

Managing a child document

In the previous recipe, you saw how it's possible to manage relationships between objects with the nested object type. The disadvantage of using nested objects is their dependency on their parent. If you need to change the value of a nested object, you need to reindex the parent (this brings about a potential performance overhead if the nested objects change too quickly). To solve this problem, ElasticSearch allows you to define child documents.

Getting ready

You need a working ElasticSearch cluster.

How to do it...

You can modify the mapping of the order example from the Mapping a document recipe by indexing the items as separate child documents.

You need to extract the item object and create a new type of document item with the _parent property set:

{

"order": {

"properties": {

"id": {

"type": "string",

"store": "yes",

"index": "not_analyzed"

},

"date": {

"type": "date",

"store": "no",

"index": "not_analyzed"

},

"customer_id": {

"type": "string",

"store": "yes",

"index": "not_analyzed"

},

"sent": {

"type": "boolean",

"store": "no",

"index": "not_analyzed"

}

}

},

"item": {

"_parent": {

"type": "order"

},

"properties": {

"name": {

"type": "string",

"store": "no",

"index": "analyzed"

},

"quantity": {

"type": "integer",

"store": "no",

"index": "not_analyzed"

},

"vat": {

"type": "double",

"store": "no",

"index": "not_analyzed"

}

}

}

}

The preceding mapping is similar to the mapping shown in the previous recipes. The item object is extracted from the order (in the previous example, it was nested) and added as a new mapping. The only differences are that "type": "nested" becomes "type": "object" (it can be omitted) and there is a new special field, _parent, which defines the parent-child relation.

How it works...

The child object is a standard root object (document) with an extra property defined, which is _parent.

The type property of _parent refers to the type of parent document.

The child document must be indexed in the same shard as its parent, so when it is indexed, an extra parameter must be passed: the parent ID. (We'll see how to do this in later chapters.)
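As a hedged sketch (IDs and values are made up), indexing an item child bound to the order with ID 1 looks like this; the parent query parameter provides the parent ID and drives the routing:

curl -XPUT 'http://127.0.0.1:9200/test/item/100?parent=1' -d '{
  "name" : "tv set",
  "quantity" : 1,
  "vat" : 22.0
}'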

Child documents don't require you to reindex the parent document when you want to change their values, so they are faster for indexing, reindexing (updating), and deleting.

There's more...

In ElasticSearch, there are different ways in which you can manage relationships between objects:

· Embedding with type=object: This is implicitly managed by ElasticSearch, and it considers the embedded as part of the main document. It's fast but you need to reindex the main document to change a value of the embedded object.

· Nesting with type=nested: This allows a more accurate search and filtering of the parent, using a nested query on the children. Everything works as in the case of an embedded object, except for the query.

· External child documents: This is a document in which the children are external documents, with a _parent property to bind them to the parent. They must be indexed in the same shard as the parent. The join with the parent is a bit slower than with the nested one, because the nested objects are in the same data block as the parent in the Lucene index and they are loaded with the parent; otherwise the child documents require more read operations.

Choosing how to model the relationship between objects depends on your application scenario.

There is another approach that can be used, a decoupled join relation, but it performs poorly on big data sets: you have to execute the join query in two steps, first collecting the IDs of the children/other documents and then searching for them in a field of their parent.

See also

· The Using a has_child query/filter, Using a top_children query, and Using a has_parent query/filter recipes in Chapter 5, Search, Queries, and Filters, for more information on child/parent queries.

Adding a field with multiple mappings

Often, a field must be processed with several core types or in different ways. For example, a string field must be processed as analyzed for search and as not_analyzed for sorting. To do this, you need to define a multifield mapping using the special fields property.

Note

In previous ElasticSearch versions (prior to 1.x), there was the multi_field type, but it has now been deprecated and will be removed in favor of the fields property.

The fields property is a very powerful feature of mapping because it allows you to use the same field in different ways.

Getting ready

You need a working ElasticSearch cluster.

How to do it...

To define a multifield property, you need to:

1. Define the main field type, as we saw in the previous sections.

2. Define a dictionary of subfields called fields. The subfield with the same name as the parent field is the default one.

If you consider the item of your order example, you can index the name in this way:

"name": {

"type": "string",

"index": "not_analyzed",

"fields": {

"name": {

"type": "string",

"index": "not_analyzed"

},

"tk": {

"type": "string",

"index": "analyzed"

},

"code": {

"type": "string",

"index": "analyzed",

"analyzer": "code_analyzer"

}

}

},

If you already have a mapping stored in ElasticSearch and want to move a field to a fields property, it's enough to save the updated mapping, and ElasticSearch merges it automatically. New subfields can be added to the fields property at any moment without problems, but the new subfields will only be populated for newly indexed documents.

How it works...

During indexing, when ElasticSearch processes a fields property, it reprocesses the same field for every subfield defined in the mapping.

To access the subfields of a multifield, we have a new path value built on the base field plus the subfield name. If you consider the earlier example, you have:

· name: This points to the default field subfield (the not_analyzed subfield)

· name.tk: This points to the standard analyzed (tokenized) field

· name.code: This points to a field analyzed with a code extractor analyzer

In the earlier example, we changed the analyzer to introduce a code extractor analyzer that allows you to extract the item code from a string.

Using the fields property, if you index a string such as Good Item to buy - ABC1234 you'll have:

· name = "Good Item to buy - ABC1234" (useful for sorting)

· name.tk=["good", "item", "to", "buy", "abc1234"] (useful for searching)

· name.code = ["ABC1234"] (useful for searching and faceting)

There's more...

The fields property is very useful for data processing, because it allows you to define several ways to process a field's data.

For example, if you are working on a document content, you can define analyzers to extract names, places, date/time, geolocation, and so on as subfields.

The subfields of a multifield are standard core type fields; you can perform every process you want on them such as search, filter, facet, and scripting.

See also

· The Specifying a different analyzer recipe in this chapter

Mapping a geo point field

ElasticSearch natively supports the use of geolocation types: special types that allow you to localize your document in geographic coordinates (latitude and longitude) around the world.

There are two main document types used in the geographic world: point and shape. In this recipe, we'll see geo point, the base element of geolocation.

Getting ready

You need a working ElasticSearch cluster.

How to do it...

The type of the field must be set to geo_point in order to define a geo point.

You can extend the earlier order example by adding a new field that stores the location of a customer. The following will be the result:

{

"order": {

"properties": {

"id": {

"type": "string",

"store": "yes",

"index": "not_analyzed"

},

"date": {

"type": "date",

"store": "no",

"index": "not_analyzed"

},

"customer_id": {

"type": "string",

"store": "yes",

"index": "not_analyzed"

},

"customer_ip": {

"type": "ip",

"store": "yes",

"index": "not_analyzed"

},

"customer_location": {

"type": "geo_point",

"store": "yes"

},

"sent": {

"type": "boolean",

"store": "no",

"index": "not_analyzed"

}

}

}

}

How it works...

When ElasticSearch indexes a document with a geo point field (latitude, longitude), it processes the latitude and longitude coordinates and creates special accessory field data to quickly query these coordinates.

Depending on the properties, given a latitude and longitude, it's possible to compute the geohash value (http://en.wikipedia.org/wiki/Geohash). The index process also optimizes these values for special computations, such as distance, range, and shape matching.

Geo point has special parameters that allow you to store additional geographic data:

· lat_lon (by default, false): This allows you to store the latitude and longitude in the .lat and .lon fields. Storing these values improves performance in many memory algorithms used in distance and shape calculus.

Note

It makes sense to store these values only if the field holds a single point value; it does not work with multiple values.

· geohash (by default, false): This allows you to store the computed geohash value.

· geohash_precision (by default, 12): This defines the precision to be used in a geohash calculus. For example, given a geo point value [45.61752, 9.08363], it will store:

customer_location = "45.61752, 9.08363"
customer_location.lat = 45.61752
customer_location.lon = 9.08363
customer_location.geohash = "u0n7w8qmrfj"

There's more...

Geo point is a special type and can accept several formats as input:

· Latitude and longitude as properties:

· "customer_location": {

· "lat": 45.61752,

· "lon": 9.08363

},

· Latitude and longitude as a string:

"customer_location": "45.61752,9.08363",

· Latitude and longitude as geohash string

· Latitude and longitude as a GeoJSON array (note that in this format, longitude comes first and then latitude):

"customer_location": [9.08363, 45.61752]

Mapping a geo shape field

An extension to the concept of point is shape. ElasticSearch provides a type that facilitates the management of arbitrary polygons: the geo shape.

Getting ready

You need a working ElasticSearch cluster with Spatial4J (v0.3) and JTS (v1.12) in the classpath to use this type.

How to do it...

In order to map a geo_shape type, a user must explicitly provide some parameters:

· tree (by default, geohash): This is the name of the prefix tree implementation: geohash for GeohashPrefixTree and quadtree for QuadPrefixTree.

· precision: This is used instead of tree_levels to provide a more human-readable value for the tree level. The precision number can be followed by a unit, such as 10 m, 10 km, 10 miles, and so on.

· tree_levels: This is the maximum number of layers to be used in the prefix tree.

· distance_error_pct (the default is 0.025% and the maximum value is 0.5%): This sets the maximum error allowed in the PrefixTree.

The customer_location mapping, which we have seen in the previous recipe using geo_shape, will be:

"customer_location": {

"type": "geo_shape",

"tree": "quadtree",

"precision": "1m"

},

How it works...

When a shape is indexed or searched internally, a path tree is created and used.

A path tree is a list of terms that contain geographic information, computed to improve performance in evaluating geometric calculus.

The path tree also depends on the shape type, such as point, linestring, polygon, multipoint, and multipolygon.

See also

· To fully understand the logic behind geo shapes, some good resources are the ElasticSearch page about the geo shape type and the sites of the libraries used for geographic calculus (https://github.com/spatial4j/spatial4j and http://www.vividsolutions.com/jts/jtshome.htm).

Mapping an IP field

ElasticSearch is used to collect and search logs in a lot of systems, such as Kibana (http://www.elasticsearch.org/overview/kibana/ or http://kibana.org/) and logstash (http://www.elasticsearch.org/overview/logstash/ or http://logstash.net/). To improve searching in these scenarios, it provides the IPv4 type that can be used to store IP addresses in an optimized way.

Getting ready

You need a working ElasticSearch cluster.

How to do it...

You need to define the type of the field that contains an IP address as "ip".

Using the preceding order example, you can extend it by adding the customer IP:

"customer_ip": {

"type": "ip",

"store": "yes"

}

The IP must be in the standard dotted notation form, as shown in the following code:

"customer_ip":"19.18.200.201"

How it works...

When ElasticSearch is processing a document, if a field is an IP one, it tries to convert its value to a numerical form and generate tokens for fast value searching.

The IP has special properties:

· index: This defines whether the field should be indexed; to disable indexing, set it to no

· precision_step (by default, 4): This defines the number of terms that must be generated for its original value

The other properties (store, boost, null_value, and include_in_all) work as in the other base types.

The advantages of using IP fields over string fields are faster range and filter queries and lower resource usage (disk and memory).

Mapping an attachment field

ElasticSearch allows you to extend its core types to cover new requirements with native plugins that provide new mapping types. The most-used custom field type is the attachment mapping type.

It allows you to index and search the contents of common document files, such as Microsoft Office formats, open document formats, PDF, epub, and many others.

Getting ready

You need a working ElasticSearch cluster with the attachment plugin (https://github.com/elasticsearch/elasticsearch-mapper-attachments) installed.

It can be installed from the command line with the following command:

bin/plugin -install elasticsearch/elasticsearch-mapper-attachments/1.9.0

The plugin version is related to the current ElasticSearch version; check the GitHub page for further details.

How to do it...

To map a field as an attachment, it's necessary to set the type field to attachment.

Internally, the attachment field defines the fields property as a multifield that takes some binary data (encoded base64) and extracts useful information such as author, content, title, date, and so on.

If you want to create a mapping for an e-mail storing attachment, it should be as follows:

{

"email": {

"properties": {

"sender": {

"type": "string",

"store": "yes",

"index": "not_analyzed"

},

"date": {

"type": "date",

"store": "no",

"index": "not_analyzed"

},

"document": {

"type": "attachment",

"fields": {

"file": {

"store": "yes",

"index": "analyzed"

},

"date": {

"store": "yes"

},

"author": {

"store": "yes"

},

"keywords": {

"store": "yes"

},

"content_type": {

"store": "yes"

},

"title": {

"store": "yes"

}

}

}

}

}

}

How it works...

The attachment plugin uses Apache Tika internally, a library that specializes in text extraction from documents. The list of supported document types is available on the Apache Tika site (http://tika.apache.org/1.5/formats.html), but it covers all the common file types.

The attachment type field receives a base64 binary stream that is processed by Tika metadata and text extractor. The field can be seen as a multifield that stores different contents in its subfields:

· file: This stores the content of the file

· date: This stores the file creation data extracted by Tika metadata

· author: This stores the file's author extracted by Tika metadata

· keywords: This stores the file's keywords extracted by Tika metadata

· content_type: This stores the file's content type

· title: This stores the file's title extracted by Tika metadata

The default setting for an attachment plugin is to extract 100,000 characters. This value can be changed globally by setting the index settings to index.mappings.attachment.indexed_chars or by passing a value to the _indexed_chars property when indexing the element.
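As a hedged sketch of basic usage (the file name, sender, and date are made up), the attachment field receives the base64-encoded content of the file directly as its value:

#encode the file and index it
export DOC64=$(base64 report.pdf | tr -d '\n')
curl -XPUT 'http://127.0.0.1:9200/test/email/1' -d '{
  "sender" : "john@example.com",
  "date" : "2014-11-16",
  "document" : "'"$DOC64"'"
}'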

There's more...

The attachment type is an example of how it's possible to extend ElasticSearch with custom types.

The attachment plugin is very useful for indexing documents, e-mails, and all types of unstructured documents. A good example of an application that uses this plugin is ScrutMyDocs (http://www.scrutmydocs.org/).

See also

· The official attachment plugin page at https://github.com/elasticsearch/elasticsearch-mapper-attachments

· The Tika library page at http://tika.apache.org

· The ScrutMyDocs website at http://www.scrutmydocs.org/

Adding metadata to a mapping

Sometimes, when working with a mapping, you need to store some additional data to be used for display purposes, ORM facilities, and permissions, or you simply need to track them in the mapping.

ElasticSearch allows you to store any kind of JSON data you want in the mapping with the _meta special field.

Getting ready

You need a working ElasticSearch cluster.

How to do it...

The _meta mapping field can be populated with any data you want:

{

"order": {

"_meta": {

"attr1": ["value1", "value2"],

"attr2": {

"attr3": "value3"

}

}

}

}

How it works...

When ElasticSearch processes a mapping and finds a _meta field, it stores the field in the global mapping status and propagates the information to all the cluster nodes.

The _meta field is only used for storage purposes; it's not indexed or searchable. It can be used to do the following:

· Storing type metadata

· Storing ORM (Object Relational Mapping) related information

· Storing type permission information

· Storing extra type information (such as the icon or filename used to display the type)

· Storing template parts to render web interfaces
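Since the _meta content travels with the mapping, a quick way to read it back (a sketch reusing the test index from the earlier recipes) is to retrieve the mapping itself:

curl -XGET 'http://127.0.0.1:9200/test/order/_mapping?pretty=true'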

Specifying a different analyzer

In the previous recipes, we saw how to map different fields and objects in ElasticSearch and described how easy it is to change the standard analyzer with the analyzer, index_analyzer, and search_analyzer properties.

In this recipe, we will see several analyzers and how to use them in order to improve the quality of indexing and searching.

Getting ready

You need a working ElasticSearch cluster.

How to do it...

Every core type field allows you to specify a custom analyzer for indexing and searching as field parameters.

For example, if you want the name field to use a standard analyzer for indexing and a simple analyzer for searching, the mapping will be as follows:

{

"name": {

"type": "string",

"index": "analyzed",

"index_analyzer": "standard",

"search_analyzer": "simple"

}

}

How it works...

The concept of an analyzer comes from Lucene (the core of ElasticSearch). An analyzer is a Lucene element that is composed of a tokenizer, which splits a text into tokens, and one or more token filters, which perform token manipulation – such as lowercasing, normalization, removing stopwords, stemming, and so on.

During the indexing phase, when ElasticSearch processes a field that must be indexed, an analyzer is chosen by first checking whether it is defined in the field (index_analyzer), then in the document, and finally in the index.

Choosing the correct analyzer is essential to get good results during the query phase.

ElasticSearch provides several analyzers in its standard installation. In the following table, the most common analyzers are described:

Name       | Description
standard   | This divides text using a standard tokenizer, normalizes and lowercases the tokens, and removes unwanted tokens
simple     | This divides text at non-letter characters and converts the tokens to lowercase
whitespace | This divides text at whitespace
stop       | This works like the simple analyzer, but also removes stopwords (a custom stopword list can be configured)
keyword    | This considers the whole text as a single token
pattern    | This divides text using a regular expression
snowball   | This works like the standard analyzer, plus stemming at the end of processing

For special language purposes, ElasticSearch supports a set of analyzers aimed at analyzing text in a specific language, such as Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, Chinese, CJK, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Italian, Norwegian, Persian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish, and Thai.
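To use one of these analyzers, a field simply references it by name; a brief sketch (the field name is illustrative):

"description" : {
  "type" : "string",
  "index" : "analyzed",
  "analyzer" : "english"
}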

See also

There are several ElasticSearch plugins that extend the list of available analyzers. Check out the plugins on GitHub. The following are the most famous ones:

· ICU analysis plugin (https://github.com/elasticsearch/elasticsearch-analysis-icu)

· Morphological Analysis Plugin (https://github.com/imotov/elasticsearch-analysis-morphology)

· Phonetic Analysis Plugin (https://github.com/elasticsearch/elasticsearch-analysis-phonetic)

· Smart Chinese Analysis Plugin (https://github.com/elasticsearch/elasticsearch-analysis-smartcn)

· Japanese (kuromoji) Analysis Plugin (https://github.com/elasticsearch/elasticsearch-analysis-kuromoji)

Mapping a completion suggester

In order to provide search functionalities for your user, one of the most common requirements is to provide text suggestions for your query.

ElasticSearch provides a helper to achieve this functionality via a special type of mapping called completion.

Getting ready

You need a working ElasticSearch cluster.

How to do it...

The definition of a completion field is similar to that of the previous core type fields. For example, to provide suggestions for a name with an alias, you can write a similar mapping:

{

"name": {"type": "string", "copy_to":["suggest"]},

"alias": {"type": "string", "copy_to":["suggest"]},

"suggest": {

"type": "complection",

"payloads": true,

"index_analyzer": "simple",

"search_analyzer": "simple"

}

}

In this example, we have defined two string fields, name and alias, and a suggest completer for them.

How it works...

There are several ways in which you can provide a suggestion in ElasticSearch. You can have the same functionality using some queries with wildcards or prefixes, but the completion fields are much faster due to the natively optimized structures used.

Internally, ElasticSearch builds a Finite state transducer (FST) structure to suggest terms. (The topic is described in great detail on its Wikipedia page at http://en.wikipedia.org/wiki/Finite_state_transducer.)

The following are the most important properties that can be configured to use the completion field:

· index_analyzer (by default, simple): This defines the analyzer to be used for indexing within the document. The default is simple, in order to keep stopwords, such as at, the, of, and so on, in the suggested terms.

· search_analyzer (by default, simple): This defines the analyzer to be used for searching.

· preserve_separators (by default, true): This controls how tokens are processed. If it is disabled, the spaces are trimmed in the suggestion, which allows fightc to match fight club, for example.

· max_input_length (by default, 50): This property limits the length of the input string in order to reduce the suggester size; trying to suggest on very long text is usually pointless and hurts usability.

· payloads (by default, false): This allows you to store payloads (additional items' values to be returned). For example, it can be used to return a product in an SKU:

curl -X PUT 'http://localhost:9200/myindex/mytype/1' -d '{
  "name" : "ElasticSearch Cookbook",
  "suggest" : {
    "input": ["ES", "ElasticSearch", "Elastic Search", "Elastic Search Cookbook"],
    "output": "ElasticSearch Cookbook",
    "payload" : { "isbn" : "1782166629" },
    "weight" : 34
  }
}'

In the previous example, you can see the following functionalities that are available during indexing for the completion field:

· input: This manages a list of provided values that can be used for suggesting. If you are able to enrich your data, this can improve the quality of your suggester.

· output (optional): This is the result to be shown from the desired suggester.

· payload (optional): This is some extra data to be returned.

· weight (optional): This is a weight boost to be used to score the suggester.

At the start of the recipe, I showed a shortcut by using the copy_to field property to populate the completion field from several fields. The copy_to property simply copies the content of one field into one or more other fields.

See also

In this recipe, we only discussed the mapping and indexing functionality of completion; the search part will be discussed in the Suggesting a correct query recipe in Chapter 5, Search, Queries, and Filters.