
Part III. Dealing with Human Language

Chapter 20. Normalizing Tokens

Breaking text into tokens is only half the job. To make those tokens more easily searchable, they need to go through a normalization process to remove insignificant differences between otherwise identical words, such as uppercase versus lowercase. Perhaps we also need to remove significant differences, to make esta, ésta, and está all searchable as the same word. Would you search for déjà vu, or just for deja vu?

This is the job of the token filters, which receive a stream of tokens from the tokenizer. You can have multiple token filters, each doing its particular job. Each receives the new token stream as output by the token filter before it.

In That Case

The most frequently used token filter is the lowercase filter, which does exactly what you would expect; it transforms each token into its lowercase form:

GET /_analyze?tokenizer=standard&filters=lowercase
The QUICK Brown FOX! 1

1. Emits tokens the, quick, brown, fox

It doesn’t matter whether users search for fox or FOX, as long as the same analysis process is applied at index time and at search time. The lowercase filter will transform a query for FOX into a query for fox, which is the same token that we have stored in our inverted index.

To use token filters as part of the analysis process, we can create a custom analyzer:

PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_lowercaser": {
                    "tokenizer": "standard",
                    "filter":  [ "lowercase" ]
                }
            }
        }
    }
}

And we can test it out with the analyze API:

GET /my_index/_analyze?analyzer=my_lowercaser
The QUICK Brown FOX! 1

1. Emits tokens the, quick, brown, fox
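To have the same analysis applied at both index time and search time, you would reference the analyzer from a field mapping, along these lines (a minimal sketch; the my_type type and title field are just placeholders):

PUT /my_index/_mapping/my_type
{
    "properties": {
        "title": {
            "type":     "string",
            "analyzer": "my_lowercaser"
        }
    }
}

Any query against title is then run through my_lowercaser as well, so FOX and fox produce the same token.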

You Have an Accent

English uses diacritics (like ´, ^, and ¨) only for imported words—like rôle, déjà, and daïs—but usually they are optional. Other languages require diacritics in order to be correct. Of course, just because words are spelled correctly in your index doesn’t mean that the user will search for the correct spelling.

It is often useful to strip diacritics from words, allowing rôle to match role, and vice versa. With Western languages, this can be done with the asciifolding token filter. Actually, it does more than just strip diacritics. It tries to convert many Unicode characters into a simpler ASCII representation:

§ ß ⇒ ss

§ æ ⇒ ae

§ ł ⇒ l

§ ɰ ⇒ m

§ ⁇ ⇒ ??

§ ❷ ⇒ 2

§ ⁶ ⇒ 6

Like the lowercase filter, the asciifolding filter doesn’t require any configuration but can be included directly in a custom analyzer:

PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "folding": {
                    "tokenizer": "standard",
                    "filter":  [ "lowercase", "asciifolding" ]
                }
            }
        }
    }
}

GET /my_index/_analyze?analyzer=folding
My œsophagus caused a débâcle 1

1. Emits my, oesophagus, caused, a, debacle

Retaining Meaning

Of course, when you strip diacritical marks from a word, you lose meaning. For instance, consider these three Spanish words:

esta

Feminine form of the adjective this, as in esta silla (this chair) or esta (this one).

ésta

An archaic form of esta.

está

The third-person form of the verb estar (to be), as in está feliz (he is happy).

While we would like to conflate the first two forms, they differ in meaning from the third form, which we would like to keep separate. Similarly:

sé

The first-person form of the verb saber (to know), as in Yo sé (I know).

se

The third-person reflexive pronoun used with many verbs, such as se sabe (it is known).

Unfortunately, there is no easy way to separate words that should have their diacritics removed from words that shouldn’t. And it is quite likely that your users won’t know either.

Instead, we index the text twice: once in the original form and once with diacritics removed:

PUT /my_index/_mapping/my_type
{
    "properties": {
        "title": { 1
            "type":     "string",
            "analyzer": "standard",
            "fields": {
                "folded": { 2
                    "type":     "string",
                    "analyzer": "folding"
                }
            }
        }
    }
}

1. The title field uses the standard analyzer and will contain the original word with diacritics in place.

2. The title.folded field uses the folding analyzer, which strips the diacritical marks.

You can test the field mappings by using the analyze API on the sentence Esta está loca (This woman is crazy):

GET /my_index/_analyze?field=title 1
Esta está loca

GET /my_index/_analyze?field=title.folded 2
Esta está loca

1. Emits esta, está, loca

2. Emits esta, esta, loca

Let’s index some documents to test it out:

PUT /my_index/my_type/1
{ "title": "Esta loca!" }

PUT /my_index/my_type/2
{ "title": "Está loca!" }

Now we can search across both fields, using the multi_match query in most_fields mode to combine the scores from each field:

GET /my_index/_search
{
    "query": {
        "multi_match": {
            "type":   "most_fields",
            "query":  "esta loca",
            "fields": [ "title", "title.folded" ]
        }
    }
}

Running this query through the validate-query API helps to explain how the query is executed:

GET /my_index/_validate/query?explain
{
    "query": {
        "multi_match": {
            "type":   "most_fields",
            "query":  "está loca",
            "fields": [ "title", "title.folded" ]
        }
    }
}

The multi-match query searches for the original form of the word (está) in the title field, and for the form without diacritics (esta) in the title.folded field:

(title:está        title:loca)
(title.folded:esta title.folded:loca)

It doesn’t matter whether the user searches for esta or está; both documents will match because the form without diacritics exists in the title.folded field. However, only the original form exists in the title field. This extra match will push the document containing the original form of the word to the top of the results list.

We use the title.folded field to widen the net in order to match more documents, and use the original title field to push the most relevant document to the top. This same technique can be used wherever an analyzer is used, to increase matches at the expense of meaning.
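If you wanted to favor the exact, accented form even more strongly, the same multi_match query could boost the original field (a sketch; the ^2 boost factor is arbitrary):

GET /my_index/_search
{
    "query": {
        "multi_match": {
            "type":   "most_fields",
            "query":  "está loca",
            "fields": [ "title^2", "title.folded" ]
        }
    }
}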

TIP

The asciifolding filter does have an option called preserve_original that allows you to index the original token and the folded token in the same position in the same field. With this option enabled, you would end up with something like this:

Position 1     Position 2
-------------  ----------
(ésta,esta)    loca
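For reference, such a filter could be configured along these lines (a sketch; the folding_original filter and folding_preserved analyzer names are made up for illustration):

PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "folding_original": {
                    "type":              "asciifolding",
                    "preserve_original": true
                }
            },
            "analyzer": {
                "folding_preserved": {
                    "tokenizer": "standard",
                    "filter":  [ "lowercase", "folding_original" ]
                }
            }
        }
    }
}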

While this appears to be a nice way to save space, it does mean that you have no way of saying, “Give me an exact match on the original word.” Mixing tokens with and without diacritics can also end up interfering with term-frequency counts, resulting in less-reliable relevance calculations.

As a rule, it is cleaner to index each field variant into a separate field, as we have done in this section.

Living in a Unicode World

When Elasticsearch compares one token with another, it does so at the byte level. In other words, for two tokens to be considered the same, they need to consist of exactly the same bytes. Unicode, however, allows you to write the same letter in different ways.

For instance, what’s the difference between é and é? It depends on who you ask. According to Elasticsearch, the first one consists of the two bytes 0xC3 0xA9, and the second one consists of three bytes, 0x65 0xCC 0x81.

According to Unicode, the difference in how they are represented as bytes is irrelevant; they are the same letter. The first one is the single letter é, while the second is a plain e combined with an acute accent ´.

If you get your data from more than one source, it may happen that you have the same letters encoded in different ways, which may result in one form of déjà not matching another!

Fortunately, a solution is at hand. There are four Unicode normalization forms, all of which convert Unicode characters into a standard format, making all characters comparable at a byte level: nfc, nfd, nfkc, nfkd.

UNICODE NORMALIZATION FORMS

The composed forms—nfc and nfkc—represent characters in the fewest bytes possible. So é is represented as the single letter é. The decomposed forms—nfd and nfkd—represent characters by their constituent parts, that is e + ´.

The canonical forms—nfc and nfd—represent ligatures like ffi or œ as a single character, while the compatibility forms—nfkc and nfkd—break down these composed characters into a simpler multiletter equivalent: f + f + i or o + e.

It doesn’t really matter which normalization form you choose, as long as all your text is in the same form. That way, the same tokens consist of the same bytes. That said, the compatibility forms allow you to compare ligatures like ffi with their simpler representation, ffi.

You can use the icu_normalizer token filter to ensure that all of your tokens are in the same form:

PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "nfkc_normalizer": { 1
                    "type": "icu_normalizer",
                    "name": "nfkc"
                }
            },
            "analyzer": {
                "my_normalizer": {
                    "tokenizer": "icu_tokenizer",
                    "filter":  [ "nfkc_normalizer" ]
                }
            }
        }
    }
}

1. Normalize all tokens into the nfkc normalization form.

TIP

Besides the icu_normalizer token filter mentioned previously, there is also an icu_normalizer character filter, which does the same job as the token filter, but does so before the text reaches the tokenizer. When using the standard tokenizer or icu_tokenizer, this doesn’t really matter. These tokenizers know how to deal with all forms of Unicode correctly.

However, if you plan on using a different tokenizer, such as the ngram, edge_ngram, or pattern tokenizers, it would make sense to use the icu_normalizer character filter in preference to the token filter.
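For instance, such an analyzer might be sketched like this (the my_ngram_normalizer name and the choice of the ngram tokenizer are purely illustrative):

PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_ngram_normalizer": {
                    "char_filter": [ "icu_normalizer" ],
                    "tokenizer":   "ngram"
                }
            }
        }
    }
}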

Usually, though, you will want to not only normalize the byte order of tokens, but also lowercase them. This can be done with icu_normalizer, using the custom normalization form nfkc_cf, which we discuss in the next section.

Unicode Case Folding

Humans are nothing if not inventive, and human language reflects that. Changing the case of a word seems like such a simple task, until you have to deal with multiple languages.

Take, for example, the lowercase German letter ß. Converting that to upper case gives you SS, which converted back to lowercase gives you ss. Or consider the Greek letter ς (sigma, when used at the end of a word). Converting it to uppercase results in Σ, which converted back to lowercase, gives you σ.

The whole point of lowercasing terms is to make them more likely to match, not less! In Unicode, this job is done by case folding rather than by lowercasing. Case folding is the act of converting words into a (usually lowercase) form that does not necessarily result in the correct spelling, but does allow case-insensitive comparisons.

For instance, the letter ß, which is already lowercase, is folded to ss. Similarly, the lowercase ς is folded to σ, to make σ, ς, and Σ comparable, no matter where the letter appears in a word.

The default normalization form that the icu_normalizer token filter uses is nfkc_cf. Like the nfkc form, this does the following:

§ Composes characters into the shortest byte representation

§ Uses compatibility mode to convert characters like ffi into the simpler ffi

But it also does this:

§ Case-folds characters into a form suitable for case comparison

In other words, nfkc_cf is the equivalent of the lowercase token filter, but suitable for use with all languages. The on-steroids equivalent of the standard analyzer would be the following:

PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_lowercaser": {
                    "tokenizer": "icu_tokenizer",
                    "filter":  [ "icu_normalizer" ] 1
                }
            }
        }
    }
}

1. The icu_normalizer defaults to the nfkc_cf form.

We can compare the results of running Weißkopfseeadler and WEISSKOPFSEEADLER (the uppercase equivalent) through the standard analyzer and through our Unicode-aware analyzer:

GET /_analyze?analyzer=standard 1
Weißkopfseeadler WEISSKOPFSEEADLER

GET /my_index/_analyze?analyzer=my_lowercaser 2
Weißkopfseeadler WEISSKOPFSEEADLER

1. Emits tokens weißkopfseeadler, weisskopfseeadler

2. Emits tokens weisskopfseeadler, weisskopfseeadler

The standard analyzer emits two different, incomparable tokens, while our custom analyzer produces tokens that are comparable, regardless of the original case.

Unicode Character Folding

Just as the lowercase token filter is a good starting point for many languages but falls short when exposed to the entire tower of Babel, so the asciifolding token filter requires a more effective Unicode character-folding counterpart for dealing with the many languages of the world.

The icu_folding token filter (provided by the icu plug-in) does the same job as the asciifolding filter, but extends the transformation to scripts that are not ASCII-based, such as Greek, Hebrew, and Han. It also converts numbers in other scripts into their Latin equivalents, and applies various other numeric, symbolic, and punctuation transformations.

The icu_folding token filter applies Unicode normalization and case folding from nfkc_cf automatically, so the icu_normalizer is not required:

PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_folder": {
                    "tokenizer": "icu_tokenizer",
                    "filter":  [ "icu_folding" ]
                }
            }
        }
    }
}

GET /my_index/_analyze?analyzer=my_folder
١٢٣٤٥ 1

1. The Arabic numerals ١٢٣٤٥ are folded to their Latin equivalent: 12345.

If there are particular characters that you would like to protect from folding, you can use a UnicodeSet (much like a character class in regular expressions) to specify which Unicode characters may be folded. For instance, to exclude the Swedish letters å, ä, ö, Å, Ä, and Ö from folding, you would specify a character class representing all Unicode characters, except for those letters: [^åäöÅÄÖ] (^ means everything except).

PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "swedish_folding": { 1
                    "type": "icu_folding",
                    "unicodeSetFilter": "[^åäöÅÄÖ]"
                }
            },
            "analyzer": {
                "swedish_analyzer": { 2
                    "tokenizer": "icu_tokenizer",
                    "filter":  [ "swedish_folding", "lowercase" ]
                }
            }
        }
    }
}

1. The swedish_folding token filter customizes the icu_folding token filter to exclude Swedish letters, both uppercase and lowercase.

2. The swedish_analyzer first tokenizes words, then folds each token by using the swedish_folding filter, and then lowercases each token in case it includes some of the uppercase excluded letters: Å, Ä, or Ö.
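To check the exclusion, you could run some sample text through the analyzer (a sketch; the words are arbitrary and the output is what we would expect, not a captured response):

GET /my_index/_analyze?analyzer=swedish_analyzer
På börsen, déjà vu 1

1. Should emit på, börsen, deja, vu: the protected å and ö survive, while é is still folded to e.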

Sorting and Collations

So far in this chapter, we have looked at how to normalize tokens for the purposes of search. The final use case to consider in this chapter is that of string sorting.

In “String Sorting and Multifields”, we explained that Elasticsearch cannot sort on an analyzed string field, and demonstrated how to use multifields to index the same field once as an analyzed field for search, and once as a not_analyzed field for sorting.

The problem with sorting on an analyzed field is not that it uses an analyzer, but that the analyzer tokenizes the string value into multiple tokens, like a bag of words, and Elasticsearch doesn’t know which token to use for sorting.

Relying on a not_analyzed field for sorting is inflexible: it allows us to sort on only the exact value of the original string. However, we can use analyzers to achieve other sort orders, as long as our chosen analyzer always emits only a single token for each string value.

Case-Insensitive Sorting

Imagine that we have three user documents whose name fields contain Boffey, BROWN, and bailey, respectively. First we will apply the technique described in “String Sorting and Multifields” of using a not_analyzed field for sorting:

PUT /my_index
{
    "mappings": {
        "user": {
            "properties": {
                "name": { 1
                    "type": "string",
                    "fields": {
                        "raw": { 2
                            "type":  "string",
                            "index": "not_analyzed"
                        }
                    }
                }
            }
        }
    }
}

1. The analyzed name field is used for search.

2. The not_analyzed name.raw field is used for sorting.

We can index some documents and try sorting:

PUT /my_index/user/1
{ "name": "Boffey" }

PUT /my_index/user/2
{ "name": "BROWN" }

PUT /my_index/user/3
{ "name": "bailey" }

GET /my_index/user/_search?sort=name.raw

The preceding search request would return the documents in this order: BROWN, Boffey, bailey. This is known as lexicographical order as opposed to alphabetical order. Essentially, the bytes used to represent capital letters have a lower value than the bytes used to represent lowercase letters, and so the names are sorted with the lowest bytes first.

That may make sense to a computer, but doesn’t make much sense to human beings who would reasonably expect these names to be sorted alphabetically, regardless of case. To achieve this, we need to index each name in a way that the byte ordering corresponds to the sort order that we want.

In other words, we need an analyzer that will emit a single lowercase token:

PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "case_insensitive_sort": {
                    "tokenizer": "keyword",    1
                    "filter":  [ "lowercase" ] 2
                }
            }
        }
    }
}

1. The keyword tokenizer emits the original input string as a single unchanged token.

2. The lowercase token filter lowercases the token.
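As a quick check, you can run a name through the new analyzer with the analyze API (a sketch; any value will do):

GET /my_index/_analyze?analyzer=case_insensitive_sort
Boffey 1

1. Emits the single token boffey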

With the case_insensitive_sort analyzer in place, we can now use it in our multifield:

PUT /my_index/_mapping/user
{
    "properties": {
        "name": {
            "type": "string",
            "fields": {
                "lower_case_sort": { 1
                    "type":     "string",
                    "analyzer": "case_insensitive_sort"
                }
            }
        }
    }
}

PUT /my_index/user/1
{ "name": "Boffey" }

PUT /my_index/user/2
{ "name": "BROWN" }

PUT /my_index/user/3
{ "name": "bailey" }

GET /my_index/user/_search?sort=name.lower_case_sort

1. The name.lower_case_sort field will provide us with case-insensitive sorting.

The preceding search request returns our documents in the order that we expect: bailey, Boffey, BROWN.

But is this order correct? It appears to be correct as it matches our expectations, but our expectations have probably been influenced by the fact that this book is in English and all of the letters used in our example belong to the English alphabet.

What if we were to add the German name Böhm?

Now our names would be returned in this order: bailey, Boffey, BROWN, Böhm. The reason that böhm comes after brown is that these tokens are still being sorted by the values of the bytes used to represent them: an r is stored as the byte 0x72, while ö is stored as the bytes 0xC3 0xB6, and so is sorted last. The byte value of each character is an accident of history.

Clearly, the default sort order is meaningless for anything other than plain English. In fact, there is no “right” sort order. It all depends on the language you speak.

Differences Between Languages

Every language has its own sort order, and sometimes even multiple sort orders. Here are a few examples of how our four names from the previous section would be sorted in different contexts:

§ English: bailey, boffey, böhm, brown

§ German: bailey, boffey, böhm, brown

§ German phonebook: bailey, böhm, boffey, brown

§ Swedish: bailey, boffey, brown, böhm

NOTE

The reason that the German phonebook sort order places böhm before boffey is that ö and oe are considered synonyms when dealing with names and places, so böhm is sorted as if it had been written as boehm.

Unicode Collation Algorithm

Collation is the process of sorting text into a predefined order. The Unicode Collation Algorithm, or UCA (see www.unicode.org/reports/tr10) defines a method of sorting strings into the order defined in a Collation Element Table (usually referred to just as a collation).

The UCA also defines the Default Unicode Collation Element Table, or DUCET, which defines the default sort order for all Unicode characters, regardless of language. As you have already seen, there is no single correct sort order, so DUCET is designed to annoy as few people as possible as seldom as possible, but it is far from being a panacea for all sorting woes.

Instead, language-specific collations exist for pretty much every language under the sun. Most use DUCET as their starting point and add a few custom rules to deal with the peculiarities of each language.

The UCA takes a string and a collation as inputs and outputs a binary sort key. Sorting a collection of strings according to the specified collation then becomes a simple comparison of their binary sort keys.

Unicode Sorting

TIP

The approach described in this section will probably change in a future version of Elasticsearch. Check the icu plugin documentation for the latest information.

The icu_collation token filter defaults to using the DUCET collation for sorting. This is already an improvement over the default sort. To use it, all we need to do is to create an analyzer that uses the default icu_collation filter:

PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "ducet_sort": {
                    "tokenizer": "keyword",
                    "filter":  [ "icu_collation" ] 1
                }
            }
        }
    }
}

1. Use the default DUCET collation.

Typically, the field that we want to sort on is also a field that we want to search on, so we use the same multifield approach as we used in “Case-Insensitive Sorting”:

PUT /my_index/_mapping/user
{
    "properties": {
        "name": {
            "type": "string",
            "fields": {
                "sort": {
                    "type":     "string",
                    "analyzer": "ducet_sort"
                }
            }
        }
    }
}

With this mapping, the name.sort field will contain a sort key that will be used only for sorting. We haven’t specified a language, so it defaults to using the DUCET collation.

Now, we can reindex our example docs and test the sorting:

PUT /my_index/user/_bulk
{ "index": { "_id": 1 }}
{ "name": "Boffey" }
{ "index": { "_id": 2 }}
{ "name": "BROWN" }
{ "index": { "_id": 3 }}
{ "name": "bailey" }
{ "index": { "_id": 4 }}
{ "name": "Böhm" }

GET /my_index/user/_search?sort=name.sort

NOTE

Note that the sort key returned with each document, which in earlier examples looked like brown and böhm, now looks like gobbledygook: ᖔ乏昫倈⠀\u0001. The reason is that the icu_collation filter emits keys intended only for efficient sorting, not for any other purposes.

The preceding search returns our docs in this order: bailey, Boffey, Böhm, BROWN. This is already an improvement, as the sort order is now correct for English and German, but it is still incorrect for German phonebooks and Swedish. The next step is to customize our mapping for different languages.

Specifying a Language

The icu_collation filter can be configured to use the collation table for a specific language, a country-specific version of a language, or some other subset such as German phonebooks. This can be done by creating a custom version of the token filter by using the language, country, and variant parameters as follows:

English

{ "language": "en" }

German

{ "language": "de" }

Austrian German

{ "language": "de", "country": "AT" }

German phonebooks

{ "language": "en", "variant": "@collation=phonebook" }

TIP

You can read more about the locales supported by ICU at: http://bit.ly/1u9LEdp.

This example shows how to set up the German phonebook sort order:

PUT /my_index
{
    "settings": {
        "number_of_shards": 1,
        "analysis": {
            "filter": {
                "german_phonebook": { 1
                    "type":     "icu_collation",
                    "language": "de",
                    "country":  "DE",
                    "variant":  "@collation=phonebook"
                }
            },
            "analyzer": {
                "german_phonebook": { 2
                    "tokenizer": "keyword",
                    "filter":  [ "german_phonebook" ]
                }
            }
        }
    },
    "mappings": {
        "user": {
            "properties": {
                "name": {
                    "type": "string",
                    "fields": {
                        "sort": { 3
                            "type":     "string",
                            "analyzer": "german_phonebook"
                        }
                    }
                }
            }
        }
    }
}

1. First we create a version of the icu_collation token filter customized for the German phonebook collation.

2. Then we wrap that up in a custom analyzer.

3. And we apply it to our name.sort field.

Reindex the data and repeat the same search as we used previously:

PUT /my_index/user/_bulk
{ "index": { "_id": 1 }}
{ "name": "Boffey" }
{ "index": { "_id": 2 }}
{ "name": "BROWN" }
{ "index": { "_id": 3 }}
{ "name": "bailey" }
{ "index": { "_id": 4 }}
{ "name": "Böhm" }

GET /my_index/user/_search?sort=name.sort

This now returns our docs in this order: bailey, Böhm, Boffey, BROWN. In the German phonebook collation, Böhm is the equivalent of Boehm, which comes before Boffey.

Multiple sort orders

The same field can support multiple sort orders by using a multifield for each language:

PUT /my_index/_mapping/user
{
    "properties": {
        "name": {
            "type": "string",
            "fields": {
                "default": {
                    "type":     "string",
                    "analyzer": "ducet" 1
                },
                "french": {
                    "type":     "string",
                    "analyzer": "french" 1
                },
                "german": {
                    "type":     "string",
                    "analyzer": "german_phonebook" 1
                },
                "swedish": {
                    "type":     "string",
                    "analyzer": "swedish" 1
                }
            }
        }
    }
}

1. We would need to create the corresponding analyzers for each of these collations.
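As a sketch, those analyzers could be set up along these lines (the ducet, french, and swedish names match the mapping above; fr and sv are the standard ICU locale codes, but check the plug-in documentation for your version):

PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "french_collation":  { "type": "icu_collation", "language": "fr" },
                "swedish_collation": { "type": "icu_collation", "language": "sv" }
            },
            "analyzer": {
                "ducet":   { "tokenizer": "keyword", "filter": [ "icu_collation" ] },
                "french":  { "tokenizer": "keyword", "filter": [ "french_collation" ] },
                "swedish": { "tokenizer": "keyword", "filter": [ "swedish_collation" ] }
            }
        }
    }
}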

With this mapping in place, results can be ordered correctly for French, German, and Swedish users, just by sorting on the name.french, name.german, or name.swedish fields. Unsupported languages can fall back to using the name.default field, which uses the DUCET sort order.

Customizing Collations

The icu_collation token filter takes many more options than just language, country, and variant, which can be used to tailor the sorting algorithm. Options are available that will do the following:

§ Ignore diacritics

§ Order uppercase first or last, or ignore case

§ Take punctuation and whitespace into account or ignore it

§ Sort numbers as strings or by their numeric value

§ Customize existing collations or define your own custom collations
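For instance, a custom filter that sorts lowercase letters before uppercase and orders numbers by their numeric value might be sketched like this (the custom_collation name is made up; verify the parameter names against the ICU plug-in documentation for your version):

PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "custom_collation": {
                    "type":      "icu_collation",
                    "language":  "en",
                    "caseFirst": "lower",
                    "numeric":   true
                }
            }
        }
    }
}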

Details of these options are beyond the scope of this book, but more information can be found in the ICU plug-in documentation and in the ICU project collation documentation.