
Elasticsearch: The Definitive Guide (2015)

Part I. Getting Started

Chapter 6. Mapping and Analysis

While playing around with the data in our index, we notice something odd. Something seems to be broken: we have 12 tweets in our indices, and only one of them contains the date 2014-09-15, but have a look at the total hits for the following queries:

GET /_search?q=2014              # 12 results
GET /_search?q=2014-09-15        # 12 results !
GET /_search?q=date:2014-09-15   # 1  result
GET /_search?q=date:2014         # 0  results !

Why does querying the _all field for the full date return all tweets, and querying the date field for just the year return no results? Why do our results differ when searching within the _all field or the date field?

Presumably, it is because the way our data has been indexed in the _all field is different from how it has been indexed in the date field. So let’s take a look at how Elasticsearch has interpreted our document structure, by requesting the mapping (or schema definition) for the tweet type in the gb index:

GET /gb/_mapping/tweet

This gives us the following:

{
   "gb": {
      "mappings": {
         "tweet": {
            "properties": {
               "date": {
                  "type": "date",
                  "format": "dateOptionalTime"
               },
               "name": {
                  "type": "string"
               },
               "tweet": {
                  "type": "string"
               },
               "user_id": {
                  "type": "long"
               }
            }
         }
      }
   }
}

Elasticsearch has dynamically generated a mapping for us, based on what it could guess about our field types. The response shows us that the date field has been recognized as a field of type date. The _all field isn’t mentioned because it is a default field, but we know that the _all field is of type string.

So fields of type date and fields of type string are indexed differently, and can thus be searched differently. That’s not entirely surprising. You might expect that each of the core data types—strings, numbers, Booleans, and dates—might be indexed slightly differently. And this is true: there are slight differences.

But by far the biggest difference is between fields that represent exact values (which can include string fields) and fields that represent full text. This distinction is really important—it’s the thing that separates a search engine from all other databases.

Exact Values Versus Full Text

Data in Elasticsearch can be broadly divided into two types: exact values and full text.

Exact values are exactly what they sound like. Examples are a date or a user ID, but can also include exact strings such as a username or an email address. The exact value Foo is not the same as the exact value foo. The exact value 2014 is not the same as the exact value 2014-09-15.

Full text, on the other hand, refers to textual data—usually written in some human language — like the text of a tweet or the body of an email.

NOTE

Full text is often referred to as unstructured data, which is a misnomer—natural language is highly structured. The problem is that the rules of natural languages are complex, which makes them difficult for computers to parse correctly. For instance, consider this sentence:

May is fun but June bores me.

Does it refer to months or to people?

Exact values are easy to query. The decision is binary; a value either matches the query, or it doesn’t. This kind of query is easy to express with SQL:

WHERE name    = "John Smith"
  AND user_id = 2
  AND date    > "2014-09-15"

Querying full-text data is much more subtle. We are not just asking, “Does this document match the query?” but “How well does this document match the query?” In other words, how relevant is this document to the given query?

We seldom want to match the whole full-text field exactly. Instead, we want to search within text fields. Not only that, but we expect search to understand our intent:

§ A search for UK should also return documents mentioning the United Kingdom.

§ A search for jump should also match jumped, jumps, jumping, and perhaps even leap.

§ johnny walker should match Johnnie Walker, and johnnie depp should match Johnny Depp.

§ fox news hunting should return stories about hunting on Fox News, while fox hunting news should return news stories about fox hunting.

To facilitate these types of queries on full-text fields, Elasticsearch first analyzes the text, and then uses the results to build an inverted index. We will discuss the inverted index and the analysis process in the next two sections.

Inverted Index

Elasticsearch uses a structure called an inverted index, which is designed to allow very fast full-text searches. An inverted index consists of a list of all the unique words that appear in any document, and for each word, a list of the documents in which it appears.

For example, let’s say we have two documents, each with a content field containing the following:

1. The quick brown fox jumped over the lazy dog

2. Quick brown foxes leap over lazy dogs in summer

To create an inverted index, we first split the content field of each document into separate words (which we call terms, or tokens), create a sorted list of all the unique terms, and then list in which document each term appears. The result looks something like this:

Term      Doc_1  Doc_2
-------------------------
Quick   |       |  X
The     |   X   |
brown   |   X   |  X
dog     |   X   |
dogs    |       |  X
fox     |   X   |
foxes   |       |  X
in      |       |  X
jumped  |   X   |
lazy    |   X   |  X
leap    |       |  X
over    |   X   |  X
quick   |   X   |
summer  |       |  X
the     |   X   |
------------------------

Now, if we want to search for quick brown, we just need to find the documents in which each term appears:

Term      Doc_1  Doc_2
-------------------------
brown   |   X   |  X
quick   |   X   |
------------------------
Total   |   2   |  1

Both documents match, but the first document has more matches than the second. If we apply a naive similarity algorithm that just counts the number of matching terms, then we can say that the first document is a better match—is more relevant to our query—than the second document.
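The core idea is small enough to sketch in a few lines of code. The following Python snippet is purely illustrative (it is not how Lucene actually stores an index): it builds an inverted index for the two example documents and scores a query by counting matching terms.

import re
from collections import defaultdict

docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}

def tokenize(text):
    # Naive tokenizer: split on anything that isn't a letter or digit.
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]

# Build the inverted index: term -> set of doc IDs containing that term.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in tokenize(text):
        index[term].add(doc_id)

def search(query):
    # Naive relevance: count how many query terms appear in each document.
    scores = defaultdict(int)
    for term in tokenize(query):
        for doc_id in index.get(term, set()):
            scores[doc_id] += 1
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(search("quick brown"))   # [(1, 2), (2, 1)] -- doc 1 matches both terms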

But there are a few problems with our current inverted index:

§ Quick and quick appear as separate terms, while the user probably thinks of them as the same word.

§ fox and foxes are pretty similar, as are dog and dogs; they share the same root word.

§ jumped and leap, while not from the same root word, are similar in meaning. They are synonyms.

With the preceding index, a search for +Quick +fox wouldn’t match any documents. (Remember, a preceding + means that the word must be present.) Both the term Quick and the term fox have to be in the same document in order to satisfy the query, but the first doc contains quick fox and the second doc contains Quick foxes.

Our user could reasonably expect both documents to match the query. We can do better.

If we normalize the terms into a standard format, then we can find documents that contain terms that are not exactly the same as the user requested, but are similar enough to still be relevant. For instance:

§ Quick can be lowercased to become quick.

§ foxes can be stemmed (reduced to its root form) to become fox. Similarly, dogs could be stemmed to dog.

§ jumped and leap are synonyms and can be indexed as just the single term jump.

Now the index looks like this:

Term      Doc_1  Doc_2
-------------------------
brown   |   X   |  X
dog     |   X   |  X
fox     |   X   |  X
in      |       |  X
jump    |   X   |  X
lazy    |   X   |  X
over    |   X   |  X
quick   |   X   |  X
summer  |       |  X
the     |   X   |  X
------------------------

But we’re not there yet. Our search for +Quick +fox would still fail, because we no longer have the exact term Quick in our index. However, if we apply the same normalization rules that we used on the content field to our query string, it would become a query for +quick +fox, which would match both documents!

NOTE

This is very important. You can find only terms that exist in your index, so both the indexed text and the query string must be normalized into the same form.
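Continuing the illustrative sketch from before, normalization is just one more step applied to every term, and the same step is applied to the query string. The stemming and synonym rules below are hard-coded stand-ins for what a real analyzer would do:

import re
from collections import defaultdict

docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}

# Toy stand-ins for lowercasing, stemming, and synonym handling.
STEMS    = {"foxes": "fox", "dogs": "dog"}
SYNONYMS = {"jumped": "jump", "leap": "jump"}

def analyze(text):
    terms = [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]
    terms = [t.lower() for t in terms]                  # Quick -> quick
    terms = [STEMS.get(t, t) for t in terms]            # foxes -> fox
    terms = [SYNONYMS.get(t, t) for t in terms]         # leap  -> jump
    return terms

# The indexed text and the query string go through exactly the same rules.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in analyze(text):
        index[term].add(doc_id)

query_terms = analyze("Quick fox")
matches = set.intersection(*(index[t] for t in query_terms))
print(matches)   # {1, 2} -- both documents now satisfy +Quick +fox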

This process of tokenization and normalization is called analysis, which we discuss in the next section.

Analysis and Analyzers

Analysis is a process that consists of the following:

§ First, tokenizing a block of text into individual terms suitable for use in an inverted index,

§ Then normalizing these terms into a standard form to improve their “searchability,” or recall

This job is performed by analyzers. An analyzer is really just a wrapper that combines three functions into a single package:

Character filters

First, the string is passed through any character filters in turn. Their job is to tidy up the string before tokenization. A character filter could be used to strip out HTML, or to convert & characters to the word and.

Tokenizer

Next, the string is tokenized into individual terms by a tokenizer. A simple tokenizer might split the text into terms whenever it encounters whitespace or punctuation.

Token filters

Last, each term is passed through any token filters in turn, which can change terms (for example, lowercasing Quick), remove terms (for example, stopwords such as a, and, the) or add terms (for example, synonyms like jump and leap).

Elasticsearch provides many character filters, tokenizers, and token filters out of the box. These can be combined to create custom analyzers suitable for different purposes. We discuss these in detail in “Custom Analyzers”.
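As a rough mental model, an analyzer is just these three stages applied in order. The Python sketch below is illustrative only, with simplistic stand-ins for each stage; a real Elasticsearch analyzer is configured declaratively, not written in code like this.

import re

def html_strip_char_filter(text):
    # Character filter: tidy the raw string, e.g. strip HTML tags.
    return re.sub(r"<[^>]+>", "", text)

def whitespace_tokenizer(text):
    # Tokenizer: split the cleaned string into individual terms.
    return text.split()

def lowercase_token_filter(tokens):
    # Token filter: change terms (here, lowercase them).
    return [t.lower() for t in tokens]

def stop_token_filter(tokens, stopwords=frozenset({"a", "and", "the"})):
    # Token filter: remove terms (here, common stopwords).
    return [t for t in tokens if t not in stopwords]

def analyze(text):
    text = html_strip_char_filter(text)
    tokens = whitespace_tokenizer(text)
    for token_filter in (lowercase_token_filter, stop_token_filter):
        tokens = token_filter(tokens)
    return tokens

print(analyze("<em>The</em> QUICK brown fox"))   # ['quick', 'brown', 'fox']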

Built-in Analyzers

However, Elasticsearch also ships with prepackaged analyzers that you can use directly. We list the most important ones next and, to demonstrate the difference in behavior, we show what terms each would produce from this string:

"Set the shape to semi-transparent by calling set_trans(5)"

Standard analyzer

The standard analyzer is the default analyzer that Elasticsearch uses. It is the best general choice for analyzing text that may be in any language. It splits the text on word boundaries, as defined by the Unicode Consortium, and removes most punctuation. Finally, it lowercases all terms. It would produce

set, the, shape, to, semi, transparent, by, calling, set_trans, 5

Simple analyzer

The simple analyzer splits the text on anything that isn’t a letter, and lowercases the terms. It would produce

set, the, shape, to, semi, transparent, by, calling, set, trans

Whitespace analyzer

The whitespace analyzer splits the text on whitespace. It doesn’t lowercase. It would produce

Set, the, shape, to, semi-transparent, by, calling, set_trans(5)

Language analyzers

Language-specific analyzers are available for many languages. They are able to take the peculiarities of the specified language into account. For instance, the english analyzer comes with a set of English stopwords (common words like and or the that don’t have much impact on relevance), which it removes. This analyzer also is able to stem English words because it understands the rules of English grammar.

The english analyzer would produce the following:

set, shape, semi, transpar, call, set_tran, 5

Note how transparent, calling, and set_trans have been stemmed to their root form.

When Analyzers Are Used

When we index a document, its full-text fields are analyzed into terms that are used to create the inverted index. However, when we search on a full-text field, we need to pass the query string through the same analysis process, to ensure that we are searching for terms in the same form as those that exist in the index.

Full-text queries, which we discuss later, understand how each field is defined, and so they can do the right thing:

§ When you query a full-text field, the query will apply the same analyzer to the query string to produce the correct list of terms to search for.

§ When you query an exact-value field, the query will not analyze the query string, but instead search for the exact value that you have specified.

Now you can understand why the queries that we demonstrated at the start of this chapter return what they do:

§ The date field contains an exact value: the single term 2014-09-15.

§ The _all field is a full-text field, so the analysis process has converted the date into the three terms: 2014, 09, and 15.

When we query the _all field for 2014, it matches all 12 tweets, because all of them contain the term 2014:

GET /_search?q=2014 # 12 results

When we query the _all field for 2014-09-15, it first analyzes the query string to produce a query that matches any of the terms 2014, 09, or 15. This also matches all 12 tweets, because all of them contain the term 2014:

GET /_search?q=2014-09-15 # 12 results !

When we query the date field for 2014-09-15, it looks for that exact date, and finds one tweet only:

GET /_search?q=date:2014-09-15 # 1 result

When we query the date field for 2014, it finds no documents because none contain that exact date:

GET /_search?q=date:2014 # 0 results !
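To make the contrast concrete, here is a toy illustration (not Elasticsearch internals) of what terms each field holds for that one tweet and why each query matches what it does. The extra terms in the _all field are made up for the example; the key point is that only the full-text field analyzes the query string.

# Terms indexed for one tweet (illustrative only).
all_field_terms = {"2014", "09", "15", "my", "first", "tweet"}   # _all: analyzed full text
date_field_term = "2014-09-15"                                   # date: one exact value

def query_all(query_string):
    # Full-text field: the query string is analyzed into terms first.
    terms = query_string.replace("-", " ").split()
    return any(t in all_field_terms for t in terms)

def query_date(query_string):
    # Exact-value field: the query string is compared as-is.
    return query_string == date_field_term

print(query_all("2014"))          # True  -- the term 2014 is in _all
print(query_all("2014-09-15"))    # True  -- analyzed into 2014, 09, 15
print(query_date("2014-09-15"))   # True  -- exact match
print(query_date("2014"))         # False -- no exact term 2014 in the date field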

Testing Analyzers

Especially when you are new to Elasticsearch, it is sometimes difficult to understand what is actually being tokenized and stored into your index. To better understand what is going on, you can use the analyze API to see how text is analyzed. Specify which analyzer to use in the query-string parameters, and the text to analyze in the body:

GET /_analyze?analyzer=standard

Text to analyze

Each element in the result represents a single term:

{
   "tokens": [
      {
         "token":        "text",
         "start_offset": 0,
         "end_offset":   4,
         "type":         "<ALPHANUM>",
         "position":     1
      },
      {
         "token":        "to",
         "start_offset": 5,
         "end_offset":   7,
         "type":         "<ALPHANUM>",
         "position":     2
      },
      {
         "token":        "analyze",
         "start_offset": 8,
         "end_offset":   15,
         "type":         "<ALPHANUM>",
         "position":     3
      }
   ]
}

The token is the actual term that will be stored in the index. The position indicates the order in which the terms appeared in the original text. The start_offset and end_offset indicate the character positions that the original word occupied in the original string.

TIP

The type values like <ALPHANUM> vary per analyzer and can be ignored. The only place that they are used in Elasticsearch is in the keep_types token filter.

The analyze API is a useful tool for understanding what is happening inside Elasticsearch indices, and we will talk more about it as we progress.
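If you prefer to drive the analyze API from a script, something like the following sketch reproduces the term lists from “Built-in Analyzers.” It assumes an Elasticsearch 1.x node listening on localhost:9200 and uses the third-party Python requests library; adjust the URL for your own cluster.

import requests

TEXT = "Set the shape to semi-transparent by calling set_trans(5)"

# Compare the built-in analyzers on the same input string.
for analyzer in ("standard", "simple", "whitespace", "english"):
    resp = requests.get(
        "http://localhost:9200/_analyze",
        params={"analyzer": analyzer, "text": TEXT},
    )
    tokens = [t["token"] for t in resp.json()["tokens"]]
    print(analyzer, "->", ", ".join(tokens))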

Specifying Analyzers

When Elasticsearch detects a new string field in your documents, it automatically configures it as a full-text string field and analyzes it with the standard analyzer.

You don’t always want this. Perhaps you want to apply a different analyzer that suits the language your data is in. And sometimes you want a string field to be just a string field—to index the exact value that you pass in, without any analysis, such as a string user ID or an internal status field or tag.

To achieve this, we have to configure these fields manually by specifying the mapping.

Mapping

In order to be able to treat date fields as dates, numeric fields as numbers, and string fields as full-text or exact-value strings, Elasticsearch needs to know what type of data each field contains. This information is contained in the mapping.

As explained in Chapter 3, each document in an index has a type. Every type has its own mapping, or schema definition. A mapping defines the fields within a type, the datatype for each field, and how the field should be handled by Elasticsearch. A mapping is also used to configure metadata associated with the type.

We discuss mappings in detail in “Types and Mappings”. In this section, we’re going to look at just enough to get you started.

Core Simple Field Types

Elasticsearch supports the following simple field types:

§ String: string

§ Whole number: byte, short, integer, long

§ Floating-point: float, double

§ Boolean: boolean

§ Date: date

When you index a document that contains a new field—one previously not seen—Elasticsearch will use dynamic mapping to try to guess the field type from the basic datatypes available in JSON, using the following rules:

JSON type                            Field type
-------------------------------------------------
Boolean: true or false               boolean
Whole number: 123                    long
Floating point: 123.45               double
String, valid date: 2014-09-15       date
String: foo bar                      string

NOTE

This means that if you index a number in quotes ("123"), it will be mapped as type string, not type long. However, if the field is already mapped as type long, then Elasticsearch will try to convert the string into a long, and throw an exception if it can’t.
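The guessing logic amounts to a few checks on the JSON value. Here is a rough Python approximation of the rules in the preceding table, for illustration only; it is not Elasticsearch’s actual implementation, and real date detection handles many more formats.

from datetime import datetime

def guess_field_type(value):
    # Mirror the dynamic-mapping rules: booleans, whole numbers, floats,
    # date-like strings, and everything else as plain strings.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        try:
            datetime.strptime(value, "%Y-%m-%d")   # simplified date detection
            return "date"
        except ValueError:
            return "string"
    return None

print(guess_field_type(True))           # boolean
print(guess_field_type(123))            # long
print(guess_field_type(123.45))         # double
print(guess_field_type("2014-09-15"))   # date
print(guess_field_type("foo bar"))      # string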

Viewing the Mapping

We can view the mapping that Elasticsearch has for one or more types in one or more indices by using the /_mapping endpoint. At the start of this chapter, we already retrieved the mapping for type tweet in index gb:

GET /gb/_mapping/tweet

This shows us the mapping for the fields (called properties) that Elasticsearch generated dynamically from the documents that we indexed:

{
   "gb": {
      "mappings": {
         "tweet": {
            "properties": {
               "date": {
                  "type": "date",
                  "format": "dateOptionalTime"
               },
               "name": {
                  "type": "string"
               },
               "tweet": {
                  "type": "string"
               },
               "user_id": {
                  "type": "long"
               }
            }
         }
      }
   }
}

TIP

Incorrect mappings, such as having an age field mapped as type string instead of integer, can produce confusing results for your queries.

Instead of assuming that your mapping is correct, check it!

Customizing Field Mappings

While the basic field datatypes are sufficient for many cases, you will often need to customize the mapping for individual fields, especially string fields. Custom mappings allow you to do the following:

§ Distinguish between full-text string fields and exact value string fields

§ Use language-specific analyzers

§ Optimize a field for partial matching

§ Specify custom date formats

§ And much more

The most important attribute of a field is the type. For fields other than string fields, you will seldom need to map anything other than type:

{
    "number_of_clicks": {
        "type": "integer"
    }
}

Fields of type string are, by default, considered to contain full text. That is, their value will be passed through an analyzer before being indexed, and a full-text query on the field will pass the query string through an analyzer before searching.

The two most important mapping attributes for string fields are index and analyzer.

index

The index attribute controls how the string will be indexed. It can contain one of three values:

analyzed

First analyze the string and then index it. In other words, index this field as full text.

not_analyzed

Index this field, so it is searchable, but index the value exactly as specified. Do not analyze it.

no

Don’t index this field at all. This field will not be searchable.

The default value of index for a string field is analyzed. If we want to map the field as an exact value, we need to set it to not_analyzed:

{
    "tag": {
        "type": "string",
        "index": "not_analyzed"
    }
}

NOTE

The other simple types (such as long, double, and date) also accept the index parameter, but the only relevant values are no and not_analyzed, as their values are never analyzed.

analyzer

For analyzed string fields, use the analyzer attribute to specify which analyzer to apply both at search time and at index time. By default, Elasticsearch uses the standard analyzer, but you can change this by specifying one of the built-in analyzers, such as whitespace, simple, or english:

{
    "tweet": {
        "type": "string",
        "analyzer": "english"
    }
}

In “Custom Analyzers”, we show you how to define and use custom analyzers as well.

Updating a Mapping

You can specify the mapping for a type when you first create an index. Alternatively, you can add the mapping for a new type (or update the mapping for an existing type) later, using the /_mapping endpoint.

NOTE

Although you can add to an existing mapping, you can’t change it. If a field already exists in the mapping, the data from that field probably has already been indexed. If you were to change the field mapping, the already indexed data would be wrong and would not be properly searchable.

We can update a mapping to add a new field, but we can’t change an existing field from analyzed to not_analyzed.

To demonstrate both ways of specifying mappings, let’s first delete the gb index:

DELETE /gb

Then create a new index, specifying that the tweet field should use the english analyzer:

PUT /gb 1
{
  "mappings": {
    "tweet" : {
      "properties" : {
        "tweet" : {
          "type" :    "string",
          "analyzer": "english"
        },
        "date" : {
          "type" :    "date"
        },
        "name" : {
          "type" :    "string"
        },
        "user_id" : {
          "type" :    "long"
        }
      }
    }
  }
}

1

This creates the index with the mappings specified in the body.

Later on, we decide to add a new not_analyzed text field called tag to the tweet mapping, using the _mapping endpoint:

PUT /gb/_mapping/tweet
{
  "properties" : {
    "tag" : {
      "type" :    "string",
      "index":    "not_analyzed"
    }
  }
}

Note that we didn’t need to list all of the existing fields again, as we can’t change them anyway. Our new field has been merged into the existing mapping.
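The same workflow can also be scripted. Here is a minimal sketch using the Python requests library against a local node; it assumes Elasticsearch is running on localhost:9200 and a reasonably recent version of requests (for the json= keyword).

import requests

ES = "http://localhost:9200"

# Delete the old index and recreate it with an explicit mapping for the tweet field ...
requests.delete(ES + "/gb")
requests.put(ES + "/gb", json={
    "mappings": {
        "tweet": {
            "properties": {
                "tweet":   {"type": "string", "analyzer": "english"},
                "date":    {"type": "date"},
                "name":    {"type": "string"},
                "user_id": {"type": "long"},
            }
        }
    }
})

# ... then add the not_analyzed tag field to the existing mapping.
requests.put(ES + "/gb/_mapping/tweet", json={
    "properties": {
        "tag": {"type": "string", "index": "not_analyzed"}
    }
})

# Fetch the merged mapping to confirm that the new field was added.
print(requests.get(ES + "/gb/_mapping/tweet").json())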

Testing the Mapping

You can use the analyze API to test the mapping for string fields by name. Compare the output of these two requests:

GET /gb/_analyze?field=tweet

Black-cats 1

GET /gb/_analyze?field=tag

Black-cats 1

1

The text we want to analyze is passed in the body.

The tweet field produces the two terms black and cat, while the tag field produces the single term Black-cats. In other words, our mapping is working correctly.

Complex Core Field Types

Besides the simple scalar datatypes that we have mentioned, JSON also has null values, arrays, and objects, all of which are supported by Elasticsearch.

Multivalue Fields

It is quite possible that we want our tag field to contain more than one tag. Instead of a single string, we could index an array of tags:

{ "tag": [ "search", "nosql" ]}

There is no special mapping required for arrays. Any field can contain zero, one, or more values, in the same way as a full-text field is analyzed to produce multiple terms.

By implication, this means that all the values of an array must be of the same datatype. You can’t mix dates with strings. If you create a new field by indexing an array, Elasticsearch will use the datatype of the first value in the array to determine the type of the new field.

NOTE

When you get a document back from Elasticsearch, any arrays will be in the same order as when you indexed the document. The _source field that you get back contains exactly the same JSON document that you indexed.

However, arrays are indexed—made searchable—as multivalue fields, which are unordered. At search time, you can’t refer to “the first element” or “the last element.” Rather, think of an array as a bag of values.

Empty Fields

Arrays can, of course, be empty. This is the equivalent of having zero values. In fact, there is no way of storing a null value in Lucene, so a field with a null value is also considered to be an empty field.

These three fields would all be considered to be empty, and would not be indexed:

"null_value":               null,
"empty_array":              [],
"array_with_null_value":    [ null ]

Multilevel Objects

The last native JSON datatype that we need to discuss is the object — known in other languages as a hash, hashmap, dictionary or associative array.

Inner objects are often used to embed one entity or object inside another. For instance, instead of having fields called user_name and user_id inside our tweet document, we could write it as follows:

{
    "tweet":            "Elasticsearch is very flexible",
    "user": {
        "id":           "@johnsmith",
        "gender":       "male",
        "age":          26,
        "name": {
            "full":     "John Smith",
            "first":    "John",
            "last":     "Smith"
        }
    }
}

Mapping for Inner Objects

Elasticsearch will detect new object fields dynamically and map them as type object, with each inner field listed under properties:

{
   "gb": {
      "tweet": { 1
         "properties": {
            "tweet":            { "type": "string" },
            "user": { 2
               "type":             "object",
               "properties": {
                  "id":           { "type": "string" },
                  "gender":       { "type": "string" },
                  "age":          { "type": "long"   },
                  "name": { 2
                     "type":         "object",
                     "properties": {
                        "full":     { "type": "string" },
                        "first":    { "type": "string" },
                        "last":     { "type": "string" }
                     }
                  }
               }
            }
         }
      }
   }
}

1

Root object

2

Inner objects

The mapping for the user and name fields has a similar structure to the mapping for the tweet type itself. In fact, the type mapping is just a special type of object mapping, which we refer to as the root object. It is just the same as any other object, except that it has some special top-level fields for document metadata, such as _source, and the _all field.

How Inner Objects are Indexed

Lucene doesn’t understand inner objects. A Lucene document consists of a flat list of key-value pairs. In order for Elasticsearch to index inner objects usefully, it converts our document into something like this:

{
    "tweet":            [elasticsearch, flexible, very],
    "user.id":          [@johnsmith],
    "user.gender":      [male],
    "user.age":         [26],
    "user.name.full":   [john, smith],
    "user.name.first":  [john],
    "user.name.last":   [smith]
}

Inner fields can be referred to by name (for example, first). To distinguish between two fields that have the same name, we can use the full path (for example, user.name.first) or even the type name plus the path (tweet.user.name.first).

NOTE

In the preceding simple flattened document, there is no field called user and no field called user.name. Lucene indexes only scalar or simple values, not complex data structures.

Arrays of Inner Objects

Finally, consider how an array containing inner objects would be indexed. Let’s say we have a followers array that looks like this:

{
    "followers": [
        { "age": 35, "name": "Mary White"},
        { "age": 26, "name": "Alex Jones"},
        { "age": 19, "name": "Lisa Smith"}
    ]
}

This document will be flattened as we described previously, but the result will look like this:

{
    "followers.age":    [19, 26, 35],
    "followers.name":   [alex, jones, lisa, smith, mary, white]
}

The correlation between {age: 35} and {name: Mary White} has been lost as each multivalue field is just a bag of values, not an ordered array. This is sufficient for us to ask, “Is there a follower who is 26 years old?”

But we can’t get an accurate answer to this: “Is there a follower who is 26 years old and who is called Alex Jones?”
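A rough sketch of this flattening (again illustrative only, not Lucene’s actual data structures) makes the loss of correlation visible: every scalar value ends up in one unordered bag per dotted field path.

from collections import defaultdict

doc = {
    "followers": [
        {"age": 35, "name": "Mary White"},
        {"age": 26, "name": "Alex Jones"},
        {"age": 19, "name": "Lisa Smith"},
    ]
}

def flatten(value, path="", out=None):
    # Collect every scalar value under its dotted field path.
    out = out if out is not None else defaultdict(list)
    if isinstance(value, dict):
        for key, child in value.items():
            flatten(child, f"{path}.{key}" if path else key, out)
    elif isinstance(value, list):
        for child in value:
            flatten(child, path, out)
    else:
        # A real analyzer would also tokenize and lowercase string values.
        values = str(value).lower().split() if isinstance(value, str) else [value]
        out[path].extend(values)
    return out

print(dict(flatten(doc)))
# {'followers.age': [35, 26, 19],
#  'followers.name': ['mary', 'white', 'alex', 'jones', 'lisa', 'smith']}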

Correlated inner objects, which are able to answer queries like these, are called nested objects, and we cover them later, in Chapter 41.