
Part III. Dealing with Human Language

Chapter 19. Identifying Words

A word in English is relatively simple to spot: words are separated by whitespace or (some) punctuation. Even in English, though, there can be controversy: is you’re one word or two? What about o’clock, cooperate, half-baked, or eyewitness?

Languages like German or Dutch combine individual words to create longer compound words like Weißkopfseeadler (white-headed sea eagle), but in order to be able to return Weißkopfseeadler as a result for the query Adler (eagle), we need to understand how to break up compound words into their constituent parts.

Asian languages are even more complex: some have no whitespace between words, sentences, or even paragraphs. Some words can be represented by a single character, but the same single character, when placed next to other characters, can form just one part of a longer word with a quite different meaning.

It should be obvious that there is no silver-bullet analyzer that will miraculously deal with all human languages. Elasticsearch ships with dedicated analyzers for many languages, and more language-specific analyzers are available as plug-ins.

However, not all languages have dedicated analyzers, and sometimes you won’t even be sure which language(s) you are dealing with. For these situations, we need good standard tools that do a reasonable job regardless of language.

standard Analyzer

The standard analyzer is used by default for any full-text analyzed string field. If we were to reimplement the standard analyzer as a custom analyzer, it would be defined as follows:

{
    "type":      "custom",
    "tokenizer": "standard",
    "filter":    [ "lowercase", "stop" ]
}
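The snippet above is just the analyzer definition. To use it, you would register it under a name of your own choosing in the index settings, in the same way we do for the custom analyzers later in this chapter. A minimal sketch, where my_index and std_rebuilt are arbitrary names:

PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "std_rebuilt": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter":    [ "lowercase", "stop" ]
                }
            }
        }
    }
}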

In Chapter 20 and Chapter 22, we talk about the lowercase and stop token filters, but for the moment, let’s focus on the standard tokenizer.

standard Tokenizer

A tokenizer accepts a string as input, processes the string to break it into individual words, or tokens (perhaps discarding some characters like punctuation), and emits a token stream as output.

What is interesting is the algorithm that is used to identify words. The whitespace tokenizer simply breaks on whitespace—spaces, tabs, line feeds, and so forth—and assumes that contiguous nonwhitespace characters form a single token. For instance:

GET /_analyze?tokenizer=whitespace
You're the 1st runner home!

This request would return the following terms: You're, the, 1st, runner, home!

The letter tokenizer, on the other hand, breaks on any character that is not a letter, and so would return the following terms: You, re, the, st, runner, home.
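You can check this for yourself by passing the same sentence to the letter tokenizer via the analyze API:

GET /_analyze?tokenizer=letter
You're the 1st runner home!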

The standard tokenizer uses the Unicode Text Segmentation algorithm (as defined in Unicode Standard Annex #29) to find the boundaries between words, and emits everything in-between. Its knowledge of Unicode allows it to successfully tokenize text containing a mixture of languages.

Punctuation may or may not be considered part of a word, depending on where it appears:

GET /_analyze?tokenizer=standard
You're my 'favorite'.

In this example, the apostrophe in You're is treated as part of the word, while the single quotes in 'favorite' are not, resulting in the following terms: You're, my, favorite.

TIP

The uax_url_email tokenizer works in exactly the same way as the standard tokenizer, except that it recognizes email addresses and URLs and emits them as single tokens. The standard tokenizer, on the other hand, would try to break them into individual words. For instance, the email address joe-bloggs@foo-bar.com would result in the tokens joe, bloggs, foo, bar.com.
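As a quick comparison, you can run the same address through both tokenizers with the analyze API, using the same query-string form as the earlier examples:

GET /_analyze?tokenizer=standard
joe-bloggs@foo-bar.com

GET /_analyze?tokenizer=uax_url_email
joe-bloggs@foo-bar.com

The first request breaks the address into the tokens listed above, while the second emits joe-bloggs@foo-bar.com as a single token.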

The standard tokenizer is a reasonable starting point for tokenizing most languages, especially Western languages. In fact, it forms the basis of most of the language-specific analyzers like the english, french, and spanish analyzers. Its support for Asian languages, however, is limited, and you should consider using the icu_tokenizer instead, which is available in the ICU plug-in.

Installing the ICU Plug-in

The ICU analysis plug-in for Elasticsearch uses the International Components for Unicode (ICU) libraries (see site.icu-project.org) to provide a rich set of tools for dealing with Unicode. These include the icu_tokenizer, which is particularly useful for Asian languages, and a number of token filters that are essential for correct matching and sorting in all languages other than English.

NOTE

The ICU plug-in is an essential tool for dealing with languages other than English, and it is highly recommended that you install and use it. Unfortunately, because it is based on the external ICU libraries, different versions of the plug-in may not be compatible with one another. When upgrading, you may need to reindex your data.

To install the plug-in, first shut down your Elasticsearch node and then run the following command from the Elasticsearch home directory:

./bin/plugin -install elasticsearch/elasticsearch-analysis-icu/$VERSION 1

1

The current $VERSION can be found at https://github.com/elasticsearch/elasticsearch-analysis-icu.

Once installed, restart Elasticsearch, and you should see a line similar to the following in the startup logs:

[INFO][plugins] [Mysterio] loaded [marvel, analysis-icu], sites [marvel]

If you are running a cluster with multiple nodes, you will need to install the plug-in on every node in the cluster.

icu_tokenizer

The icu_tokenizer uses the same Unicode Text Segmentation algorithm as the standard tokenizer, but adds better support for some Asian languages by using a dictionary-based approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and using custom rules to break Myanmar and Khmer text into syllables.

For instance, compare the tokens produced by the standard and icu_tokenizers, respectively, when tokenizing “Hello. I am from Bangkok.” in Thai:

GET /_analyze?tokenizer=standard
สวัสดี ผมมาจากกรุงเทพฯ

The standard tokenizer produces two tokens, one for each sentence: สวัสดี, ผมมาจากกรุงเทพฯ. That is useful only if you want to search for the whole sentence “I am from Bangkok.”, but not if you want to search for just “Bangkok.”

GET /_analyze?tokenizer=icu_tokenizer
สวัสดี ผมมาจากกรุงเทพฯ

The icu_tokenizer, on the other hand, is able to break up the text into the individual words (สวัสดี, ผม, มา, จาก, กรุงเทพฯ), making them easier to search.

In contrast, the standard tokenizer “over-tokenizes” Chinese and Japanese text, often breaking up whole words into single characters. Because there are no spaces between words, it can be difficult to tell whether consecutive characters are separate words or form a single word. For instance:

• 向 means facing, 日 means sun, and 葵 means hollyhock. When written together, 向日葵 means sunflower.

• 五 means five or fifth, 月 means month, and 雨 means rain. The first two characters written together as 五月 mean the month of May, and adding the third character, 五月雨 means continuous rain. When combined with a fourth character, 式, meaning style, the word 五月雨式 becomes an adjective for anything consecutive or unrelenting.

Although each character may be a word in its own right, tokens are more meaningful when they retain the bigger original concept instead of just the component parts:

GET /_analyze?tokenizer=standard
向日葵

GET /_analyze?tokenizer=icu_tokenizer
向日葵

The standard tokenizer in the preceding example would emit each character as a separate token: 向, 日, 葵. The icu_tokenizer would emit the single token 向日葵 (sunflower).

Another difference between the standard tokenizer and the icu_tokenizer is that the latter will break a word containing characters written in different scripts (for example, βeta) into separate tokens—β, eta—while the former will emit the word as a single token: βeta.
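You can see this difference with the analyze API:

GET /_analyze?tokenizer=standard
βeta

GET /_analyze?tokenizer=icu_tokenizer
βeta

The first request emits the single token βeta, while the second emits the two tokens β and eta.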

Tidying Up Input Text

Tokenizers produce the best results when the input text is clean, valid text, where valid means that it follows the punctuation rules that the Unicode algorithm expects. Quite often, though, the text we need to process is anything but clean. Cleaning it up before tokenization improves the quality of the output.

Tokenizing HTML

Passing HTML through the standard tokenizer or the icu_tokenizer produces poor results. These tokenizers just don’t know what to do with the HTML tags. For example:

GET /_analyze?tokenizer=standard
<p>Some d&eacute;j&agrave; vu <a href="http://somedomain.com>">website</a>

The standard tokenizer confuses HTML tags and entities, and emits the following tokens: p, Some, d, eacute, j, agrave, vu, a, href, http, somedomain.com, website, a. Clearly not what was intended!

Character filters can be added to an analyzer to preprocess the text before it is passed to the tokenizer. In this case, we can use the html_strip character filter to remove HTML tags and to decode HTML entities such as &eacute; into the corresponding Unicode character é.

Character filters can be tested out via the analyze API by specifying them in the query string:

GET /_analyze?tokenizer=standard&char_filters=html_strip
<p>Some d&eacute;j&agrave; vu <a href="http://somedomain.com>">website</a>

To use them as part of the analyzer, they should be added to a custom analyzer definition:

PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_html_analyzer": {
                    "tokenizer":   "standard",
                    "char_filter": [ "html_strip" ]
                }
            }
        }
    }
}

Once created, our new my_html_analyzer can be tested with the analyze API:

GET /my_index/_analyze?analyzer=my_html_analyzer
<p>Some d&eacute;j&agrave; vu <a href="http://somedomain.com>">website</a>

This emits the tokens that we expect: Some, déjà, vu, website.

Tidying Up Punctuation

The standard tokenizer and icu_tokenizer both understand that an apostrophe within a word should be treated as part of the word, while single quotes that surround a word should not. Tokenizing the text You're my 'favorite'. would correctly emit the tokens You're, my, favorite.

Unfortunately, Unicode lists a few characters that are sometimes used as apostrophes:

U+0027

Apostrophe (')—the original ASCII character

U+2018

Left single-quotation mark (‘)—opening quote when single-quoting

U+2019

Right single-quotation mark (’)—closing quote when single-quoting, but also the preferred character to use as an apostrophe

Both tokenizers treat these three characters as an apostrophe (and thus as part of the word) when they appear within a word. Then there are another three apostrophe-like characters:

U+201B

Single high-reversed-9 quotation mark (‛)—same as U+2018 but differs in appearance

U+0091

Left single-quotation mark in ISO-8859-1—should not be used in Unicode

U+0092

Right single-quotation mark in ISO-8859-1—should not be used in Unicode

Both tokenizers treat these three characters as word boundaries—a place to break text into tokens. Unfortunately, some publishers use U+201B as a stylized way to write names like M‛Coy, and the last two characters may well be produced by your word processor, depending on its age.

Even when using the “acceptable” quotation marks, a word written with a single right quotation mark—You’re—is not the same as the word written with an apostrophe—You're—which means that a query for one variant will not find the other.
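You can see the problem with the analyze API. The standard tokenizer keeps the U+2019 character as part of the word, so the emitted token contains the right single-quotation mark rather than a plain apostrophe:

GET /_analyze?tokenizer=standard
You’re my ‘favorite’.

This request emits the tokens You’re (still containing U+2019), my, and favorite, so a query that uses the plain-apostrophe form You're will not match.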

Fortunately, it is possible to sort out this mess with the mapping character filter, which allows us to replace all instances of one character with another. In this case, we will replace all apostrophe variants with the simple U+0027 apostrophe:

PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": { 1
                "quotes": {
                    "type": "mapping",
                    "mappings": [ 2
                        "\\u0091=>\\u0027",
                        "\\u0092=>\\u0027",
                        "\\u2018=>\\u0027",
                        "\\u2019=>\\u0027",
                        "\\u201B=>\\u0027"
                    ]
                }
            },
            "analyzer": {
                "quotes_analyzer": {
                    "tokenizer":   "standard",
                    "char_filter": [ "quotes" ] 3
                }
            }
        }
    }
}

1

We define a custom char_filter called quotes that maps all apostrophe variants to a simple apostrophe.

2

For clarity, we have used the JSON Unicode escape syntax for each character, but we could just have used the characters themselves: "‘=>'".

3

We use our custom quotes character filter to create a new analyzer called quotes_analyzer.

As always, we test the analyzer after creating it:

GET /my_index/_analyze?analyzer=quotes_analyzer
You’re my ‘favorite’ M‛Coy

This example returns the following tokens, with all of the in-word quotation marks replaced by apostrophes: You're, my, favorite, M'Coy.

The more effort that you put into ensuring that the tokenizer receives good-quality input, the better your search results will be.