
Part III. Dealing with Human Language

I know all those words, but that sentence makes no sense to me.

Matt Groening

Full-text search is a battle between precision—returning as few irrelevant documents as possible—and recall—returning as many relevant documents as possible. While matching only the exact words that the user has queried would be precise, it is not enough. We would miss out on many documents that the user would consider to be relevant. Instead, we need to spread the net wider, to also search for words that are not exactly the same as the original but are related.

Wouldn’t you expect a search for “quick brown fox” to match a document containing “fast brown foxes,” “Johnny Walker” to match “Johnnie Walker,” or “Arnolt Schwarzenneger” to match “Arnold Schwarzenegger”?

If documents exist that do contain exactly what the user has queried, those documents should appear at the top of the result set, but weaker matches can be included further down the list. If no documents match exactly, at least we can show the user potential matches; they may even be what the user originally intended!

There are several lines of attack:

§ Remove diacritics like ´, ^, and ¨ so that a search for rôle will also match role, and vice versa. See Chapter 20.

§ Remove the distinction between singular and plural—fox versus foxes—or between tenses—jumping versus jumped versus jumps—by stemming each word to its root form. See Chapter 21.

§ Remove commonly used words or stopwords like the, and, and or to improve search performance. See Chapter 22.

§ Include synonyms so that a query for quick could also match fast, or UK could match United Kingdom. See Chapter 23.

§ Check for misspellings or alternate spellings, or match on homophones—words that sound the same, like their versus there, meat versus meet versus mete. See Chapter 24.

Before we can manipulate individual words, we need to divide text into words, which means that we need to know what constitutes a word. We will tackle this in Chapter 19.

But first, let’s take a look at how to get started quickly and easily.

Chapter 18. Getting Started with Languages

Elasticsearch ships with a collection of language analyzers that provide good, basic, out-of-the-box support for many of the world’s most common languages:

Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, Chinese, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Korean, Kurdish, Norwegian, Persian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish, and Thai.

These analyzers typically perform four roles:

§ Tokenize text into individual words:

The quick brown foxes → [The, quick, brown, foxes]

§ Lowercase tokens:

The → the

§ Remove common stopwords:

[The, quick, brown, foxes] → [quick, brown, foxes]

§ Stem tokens to their root form:

foxes → fox

Each analyzer may also apply other transformations specific to its language in order to make words from that language more searchable:

§ The english analyzer removes the possessive 's:

John's → john

§ The french analyzer removes elisions like l' and qu' and diacritics like ¨ or ^:

l'église → eglis

§ The german analyzer normalizes terms, replacing ä and ae with a, or ß with ss, among others:

äußerst → ausserst
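
A quick way to see these transformations for yourself is the _analyze API, which is used throughout this chapter. As a rough sketch (token output abbreviated, and may vary slightly by Elasticsearch version):

GET /_analyze?analyzer=english
John's jumping foxes

GET /_analyze?analyzer=german
äußerst

The first request should emit tokens like john, jump, fox; the second should emit ausserst, as described above.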

Using Language Analyzers

The built-in language analyzers are available globally and don’t need to be configured before being used. They can be specified directly in the field mapping:

PUT /my_index
{
  "mappings": {
    "blog": {
      "properties": {
        "title": {
          "type":     "string",
          "analyzer": "english" 1
        }
      }
    }
  }
}

1

The title field will use the english analyzer instead of the default standard analyzer.

Of course, by passing text through the english analyzer, we lose information:

GET /my_index/_analyze?field=title 1
I'm not happy about the foxes

1

Emits tokens: i'm, happi, about, fox

We can’t tell if the document mentions one fox or many foxes; the word not is a stopword and is removed, so we can’t tell whether the document is happy about foxes or not. By using the english analyzer, we have increased recall as we can match more loosely, but we have reduced our ability to rank documents accurately.
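
For comparison, the same sentence passed through the standard analyzer keeps all the information, but in unstemmed form (the standard analyzer applies no stemming, and its stopword list defaults to none):

GET /my_index/_analyze?analyzer=standard
I'm not happy about the foxes

This emits the tokens i'm, not, happy, about, the, foxes. Nothing is lost, but a query for fox would no longer match.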

To get the best of both worlds, we can use multifields to index the title field twice: once with the english analyzer and once with the standard analyzer:

PUT /my_index
{
  "mappings": {
    "blog": {
      "properties": {
        "title": { 1
          "type": "string",
          "fields": {
            "english": { 2
              "type":     "string",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}

1

The main title field uses the standard analyzer.

2

The title.english subfield uses the english analyzer.

With this mapping in place, we can index some test documents to demonstrate how to use both fields at query time:

PUT /my_index/blog/1
{ "title": "I'm happy for this fox" }

PUT /my_index/blog/2
{ "title": "I'm not happy about my fox problem" }

GET /_search
{
  "query": {
    "multi_match": {
      "type":   "most_fields", 1
      "query":  "not happy foxes",
      "fields": [ "title", "title.english" ]
    }
  }
}

1

Use the most_fields query type to match the same text in as many fields as possible.

Even though neither of our documents contains the word foxes, both documents are returned as results, thanks to the word stemming on the title.english field. The second document is ranked as more relevant, because the word not matches on the title field.
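
You can confirm why foxes matches by analyzing the query term against each field, using the same _analyze API as before (a quick sanity check):

GET /my_index/_analyze?field=title
foxes

GET /my_index/_analyze?field=title.english
foxes

The first request emits foxes, the second fox, which is why only the title.english field matches the stemmed form.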

Configuring Language Analyzers

While the language analyzers can be used out of the box without any configuration, most of them do allow you to control aspects of their behavior, specifically:

Stem-word exclusion

Imagine, for instance, that users searching for the “World Health Organization” are instead getting results for “organ health.” The reason for this confusion is that both “organ” and “organization” are stemmed to the same root word: organ. Often this isn’t a problem, but in this particular collection of documents, this leads to confusing results. We would like to prevent the words organization and organizations from being stemmed.

Custom stopwords

The default list of stopwords used in English is as follows:

a, an, and, are, as, at, be, but, by, for, if, in, into, is, it,
no, not, of, on, or, such, that, the, their, then, there, these,
they, this, to, was, will, with

The unusual thing about no and not is that they invert the meaning of the words that follow them. Perhaps we decide that these two words are important and that we shouldn’t treat them as stopwords.

To customize the behavior of the english analyzer, we need to create a custom analyzer that uses the english analyzer as its base but adds some configuration:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stem_exclusion": [ "organization", "organizations" ], 1
          "stopwords": [ 2
            "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
            "if", "in", "into", "is", "it", "of", "on", "or", "such", "that",
            "the", "their", "then", "there", "these", "they", "this", "to",
            "was", "will", "with"
          ]
        }
      }
    }
  }
}

GET /my_index/_analyze?analyzer=my_english 3
The World Health Organization does not sell organs.

1

Prevents organization and organizations from being stemmed

2

Specifies a custom list of stopwords

3

Emits tokens: world, health, organization, does, not, sell, organ
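
For comparison, try running the same sentence through the stock english analyzer. Without stem_exclusion, organization is stemmed down to organ, and without the custom stopword list, not is removed — exactly the confusion described earlier:

GET /_analyze?analyzer=english
The World Health Organization does not sell organs.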

We discuss stemming and stopwords in much more detail in Chapter 21 and Chapter 22, respectively.

Pitfalls of Mixing Languages

If you have to deal with only a single language, count yourself lucky. Finding the right strategy for handling documents written in several languages can be challenging.

At Index Time

Multilingual documents come in three main varieties:

§ One predominant language per document, which may contain snippets from other languages (See “One Language per Document”.)

§ One predominant language per field, which may contain snippets from other languages (See “One Language per Field”.)

§ A mixture of languages per field (See “Mixed-Language Fields”.)

The goal, although not always achievable, should be to keep languages separate. Mixing languages in the same inverted index can be problematic.

Incorrect stemming

The stemming rules for German are different from those for English, French, Swedish, and so on. Applying the same stemming rules to different languages will result in some words being stemmed correctly, some incorrectly, and some not being stemmed at all. It may even result in words from different languages with different meanings being stemmed to the same root word, conflating their meanings and producing confusing search results for the user.

Applying multiple stemmers in turn to the same text is likely to result in rubbish, as the next stemmer may try to stem an already stemmed word, compounding the problem.

STEMMER PER SCRIPT

The one exception to the only-one-stemmer rule occurs when each language is written in a different script. For instance, in Israel it is quite possible that a single document may contain Hebrew, Arabic, Russian (Cyrillic), and English:

אזהרה - Предупреждение - تحذير - Warning

Each language uses a different script, so the stemmer for one language will not interfere with another, allowing multiple stemmers to be applied to the same text.
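
As a sketch of what this might look like (the index and analyzer names here are made up; the stemmer token filter and its language values are standard Elasticsearch features), you could chain one stemmer per script into a single custom analyzer. Each stemmer simply leaves tokens from the other script untouched:

PUT /multiscript_index
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stemmer": { "type": "stemmer", "language": "english" },
        "russian_stemmer": { "type": "stemmer", "language": "russian" }
      },
      "analyzer": {
        "latin_plus_cyrillic": {
          "type":      "custom",
          "tokenizer": "standard",
          "filter":    [ "lowercase", "english_stemmer", "russian_stemmer" ]
        }
      }
    }
  }
}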

Incorrect inverse document frequencies

In “What Is Relevance?”, we explained that the more frequently a term appears in a collection of documents, the less weight that term has. For accurate relevance calculations, you need accurate term-frequency statistics.

A short snippet of German appearing in predominantly English text would give more weight to the German words, given that they are relatively uncommon. But mix those with documents that are predominantly German, and the short German snippets now have much less weight.

At Query Time

It is not sufficient just to think about your documents, though. You also need to think about how your users will query those documents. Often you will be able to identify the main language of the user either from the language of that user’s chosen interface (for example, mysite.de versus mysite.fr) or from the accept-language HTTP header from the user’s browser.

User searches also come in three main varieties:

§ Users search for words in their main language.

§ Users search for words in a different language, but expect results in their main language.

§ Users search for words in a different language, and expect results in that language (for example, a bilingual person, or a foreign visitor in a web cafe).

Depending on the type of data that you are searching, it may be appropriate to return results in a single language (for example, a user searching for products on the Spanish version of the website) or to combine results in the identified main language of the user with results from other languages.

Usually, it makes sense to give preference to the user’s language. An English-speaking user searching the Web for “deja vu” would probably prefer to see the English Wikipedia page rather than the French Wikipedia page.

Identifying Language

You may already know the language of your documents. Perhaps your documents are created within your organization and translated into a list of predefined languages. Human pre-identification is probably the most reliable method of classifying language correctly.

Perhaps, though, your documents come from an external source without any language classification, or possibly with incorrect classification. In these cases, you need to use a heuristic to identify the predominant language. Fortunately, libraries are available in several languages to help with this problem.

Of particular note is the chromium-compact-language-detector library from Mike McCandless, which uses the open source (Apache License 2.0) Compact Language Detector (CLD) from Google. It is small, fast, and accurate, and can detect 160+ languages from as little as two sentences. It can even detect multiple languages within a single block of text. Bindings exist for several languages including Python, Perl, JavaScript, PHP, C#/.NET, and R.

Identifying the language of the user’s search request is not quite as simple. The CLD is designed for text that is at least 200 characters in length. Shorter amounts of text, such as search keywords, produce much less accurate results. In these cases, it may be preferable to take simple heuristics into account such as the country of origin, the user’s selected language, and the HTTP accept-language headers.

One Language per Document

A single predominant language per document requires a relatively simple setup. Documents from different languages can be stored in separate indices—blogs-en, blogs-fr, and so forth—that use the same type and the same fields for each index, just with different analyzers:

PUT /blogs-en
{
  "mappings": {
    "post": {
      "properties": {
        "title": {
          "type": "string", 1
          "fields": {
            "stemmed": {
              "type":     "string",
              "analyzer": "english" 2
            }
          }
        }
      }
    }
  }
}

PUT /blogs-fr
{
  "mappings": {
    "post": {
      "properties": {
        "title": {
          "type": "string", 1
          "fields": {
            "stemmed": {
              "type":     "string",
              "analyzer": "french" 2
            }
          }
        }
      }
    }
  }
}

1

Both blogs-en and blogs-fr have a type called post that contains the field title.

2

The title.stemmed subfield uses a language-specific analyzer.

This approach is clean and flexible. New languages are easy to add—just create a new index—and because each language is completely separate, we don’t suffer from the term-frequency and stemming problems described in “Pitfalls of Mixing Languages”.

The documents of a single language can be queried independently, or queries can target multiple languages by querying multiple indices. We can even specify a preference for particular languages with the indices_boost parameter:

GET /blogs-*/post/_search 1
{
  "query": {
    "multi_match": {
      "query":  "deja vu",
      "fields": [ "title", "title.stemmed" ], 2
      "type":   "most_fields"
    }
  },
  "indices_boost": { 3
    "blogs-en": 3,
    "blogs-fr": 2
  }
}

1

This search is performed on any index beginning with blogs-.

2

The title.stemmed fields are queried using the analyzer specified in each index.

3

Perhaps the user’s accept-language headers showed a preference for English, and then French, so we boost results from each index accordingly. Any other languages will have a neutral boost of 1.

Foreign Words

Of course, these documents may contain words or sentences in other languages, and these words are unlikely to be stemmed correctly. With predominant-language documents, this is not usually a major problem. The user will often search for the exact words—for instance, of a quotation from another language—rather than for inflections of a word. Recall can be improved by using techniques explained in Chapter 20.

Perhaps some words like place names should be queryable in the predominant language and in the original language, such as Munich and München. These words are effectively synonyms, which we discuss in Chapter 23.
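
As a preview of Chapter 23, a synonym token filter for this kind of place name might look something like this (the filter and analyzer names are illustrative):

PUT /blogs-en
{
  "settings": {
    "analysis": {
      "filter": {
        "place_synonyms": {
          "type":     "synonym",
          "synonyms": [ "münchen,munich" ]
        }
      },
      "analyzer": {
        "english_with_places": {
          "type":      "custom",
          "tokenizer": "standard",
          "filter":    [ "lowercase", "place_synonyms" ]
        }
      }
    }
  }
}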

DON’T USE TYPES FOR LANGUAGES

You may be tempted to use a separate type for each language, instead of a separate index. For best results, you should avoid using types for this purpose. As explained in “Types and Mappings”, fields from different types but with the same field name are indexed into the same inverted index. This means that the term frequencies from each type (and thus each language) are mixed together.

To ensure that the term frequencies of one language don’t pollute those of another, either use a separate index for each language, or a separate field, as explained in the next section.

One Language per Field

For documents that represent entities like products, movies, or legal notices, it is common for the same text to be translated into several languages. Although each translation could be represented in a single document in an index per language, another reasonable approach is to keep all translations in the same document:

{
  "title":    "Fight club",
  "title_br": "Clube de Luta",
  "title_cz": "Klub rváčů",
  "title_en": "Fight club",
  "title_es": "El club de la lucha",
  ...
}

Each translation is stored in a separate field, which is analyzed according to the language it contains:

PUT /movies
{
  "mappings": {
    "movie": {
      "properties": {
        "title": { 1
          "type": "string"
        },
        "title_br": { 2
          "type":     "string",
          "analyzer": "brazilian"
        },
        "title_cz": { 2
          "type":     "string",
          "analyzer": "czech"
        },
        "title_en": { 2
          "type":     "string",
          "analyzer": "english"
        },
        "title_es": { 2
          "type":     "string",
          "analyzer": "spanish"
        }
      }
    }
  }
}

1

The title field contains the original title and uses the standard analyzer.

2

Each of the other fields uses the appropriate analyzer for that language.

Like the index-per-language approach, the field-per-language approach maintains clean term frequencies. It is not quite as flexible as having separate indices. Although it is easy to add a new field by using the update-mapping API, those new fields may require new custom analyzers, which can only be set up at index creation time. As a workaround, you can close the index, add the new analyzers with the update-settings API, then reopen the index, but closing the index means that it will require some downtime.
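
A sketch of that workaround, using a hypothetical Swedish title field (the analyzer and field names here are illustrative; _close, _settings, _open, and the put-mapping API are the standard index APIs):

POST /movies/_close

PUT /movies/_settings
{
  "analysis": {
    "analyzer": {
      "swedish_title": { "type": "swedish" }
    }
  }
}

POST /movies/_open

PUT /movies/_mapping/movie
{
  "properties": {
    "title_sv": {
      "type":     "string",
      "analyzer": "swedish_title"
    }
  }
}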

The documents of a single language can be queried independently, or queries can target multiple languages by querying multiple fields. We can even specify a preference for particular languages by boosting that field:

GET /movies/movie/_search
{
  "query": {
    "multi_match": {
      "query":  "club de la lucha",
      "fields": [ "title*", "title_es^2" ], 1
      "type":   "most_fields"
    }
  }
}

1

This search queries any field beginning with title but boosts the title_es field by 2. All other fields have a neutral boost of 1.

Mixed-Language Fields

Usually, documents that mix multiple languages in a single field come from sources beyond your control, such as pages scraped from the Web:

{ "body": "Page not found / Seite nicht gefunden / Page non trouvée" }

They are the most difficult type of multilingual document to handle correctly. Although you can simply use the standard analyzer on all fields, your documents will be less searchable than if you had used an appropriate stemmer. But of course, you can’t choose just one stemmer—stemmers are language specific. Or rather, stemmers are language and script specific. As discussed in “Stemmer per Script”, if every language uses a different script, then stemmers can be combined.

Assuming that your mix of languages uses the same script such as Latin, you have three choices available to you:

§ Split into separate fields

§ Analyze multiple times

§ Use n-grams

Split into Separate Fields

The Compact Language Detector mentioned in “Identifying Language” can tell you which parts of the document are in which language. You can split up the text based on language and use the same approach as was used in “One Language per Field”.
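
For instance, the "Page not found" document shown earlier might be split into one field per detected language (the index, type, and field naming scheme here are hypothetical):

PUT /webpages/page/1
{
  "body_en": "Page not found",
  "body_de": "Seite nicht gefunden",
  "body_fr": "Page non trouvée"
}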

Analyze Multiple Times

If you primarily deal with a limited number of languages, you could use multi-fields to analyze the text once per language:

PUT /movies
{
  "mappings": {
    "movie": {
      "properties": {
        "title": { 1
          "type": "string",
          "fields": {
            "de": { 2
              "type":     "string",
              "analyzer": "german"
            },
            "en": { 2
              "type":     "string",
              "analyzer": "english"
            },
            "fr": { 2
              "type":     "string",
              "analyzer": "french"
            },
            "es": { 2
              "type":     "string",
              "analyzer": "spanish"
            }
          }
        }
      }
    }
  }
}

1

The main title field uses the standard analyzer.

2

Each subfield applies a different language analyzer to the text in the title field.
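
At query time, you would query the main field plus the per-language subfields, just as in the earlier most_fields examples (a minimal sketch):

GET /movies/movie/_search
{
  "query": {
    "multi_match": {
      "query":  "club de la lucha",
      "fields": [ "title", "title.de", "title.en", "title.fr", "title.es" ],
      "type":   "most_fields"
    }
  }
}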

Use n-grams

You could index all words as n-grams, using the same approach as described in “Ngrams for Compound Words”. Most inflections involve adding a suffix (or in some languages, a prefix) to a word, so by breaking each word into n-grams, you have a good chance of matching words that are similar but not exactly the same. This can be combined with the analyze-multiple-times approach to provide a catchall field for unsupported languages:

PUT /movies
{
  "settings": {
    "analysis": {...} 1
  },
  "mappings": {
    "movie": {
      "properties": {
        "title": {
          "type": "string",
          "fields": {
            "de": {
              "type":     "string",
              "analyzer": "german"
            },
            "en": {
              "type":     "string",
              "analyzer": "english"
            },
            "fr": {
              "type":     "string",
              "analyzer": "french"
            },
            "es": {
              "type":     "string",
              "analyzer": "spanish"
            },
            "general": { 2
              "type":     "string",
              "analyzer": "trigrams"
            }
          }
        }
      }
    }
  }
}

1

In the analysis section, we define the same trigrams analyzer as described in “Ngrams for Compound Words”.

2

The title.general field uses the trigrams analyzer to index any language.
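
For reference, the elided analysis section would contain something like the trigrams analyzer from “Ngrams for Compound Words” — an ngram token filter with min_gram and max_gram both set to 3 (names as used in that chapter):

"analysis": {
  "filter": {
    "trigrams_filter": {
      "type":     "ngram",
      "min_gram": 3,
      "max_gram": 3
    }
  },
  "analyzer": {
    "trigrams": {
      "type":      "custom",
      "tokenizer": "standard",
      "filter":    [ "lowercase", "trigrams_filter" ]
    }
  }
}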

When querying the catchall general field, you can use minimum_should_match to reduce the number of low-quality matches. It may also be necessary to boost the other fields slightly more than the general field, so that matches on the main language fields are given more weight than those on the general field:

GET /movies/movie/_search
{
  "query": {
    "multi_match": {
      "query":                "club de la lucha",
      "fields":               [ "title*^1.5", "title.general" ], 1
      "type":                 "most_fields",
      "minimum_should_match": "75%" 2
    }
  }
}

1

All title or title.* fields are given a slight boost over the title.general field.

2

The minimum_should_match parameter reduces the number of low-quality matches returned, especially important for the title.general field.