
Part III. Dealing with Human Language

Chapter 21. Reducing Words to Their Root Form

Most languages of the world are inflected, meaning that words can change their form to express differences in the following:

§ Number: fox, foxes

§ Tense: pay, paid, paying

§ Gender: waiter, waitress

§ Person: hear, hears

§ Case: I, me, my

§ Aspect: ate, eaten

§ Mood: so be it, were it so

While inflection aids expressivity, it interferes with retrievability, as a single root word sense (or meaning) may be represented by many different sequences of letters. English is a weakly inflected language (you could ignore inflections and still get reasonable search results), but some other languages are highly inflected and need extra work in order to achieve high-quality search results.

Stemming attempts to remove the differences between inflected forms of a word, in order to reduce each word to its root form. For instance, foxes may be reduced to the root fox, to remove the difference between singular and plural in the same way that we removed the difference between lowercase and uppercase.

The root form of a word may not even be a real word. The words jumping and jumpiness may both be stemmed to jumpi. It doesn’t matter—as long as the same terms are produced at index time and at search time, search will just work.
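Because stemmers differ in their output, it is worth checking which terms an analyzer actually produces. A quick way to do so is to run sample words through the analyze API (the exact terms depend on which stemmer the analyzer uses):

GET /_analyze?analyzer=english
jumping jumpiness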

If stemming were easy, there would be only one implementation. Unfortunately, stemming is an inexact science that suffers from two issues: understemming and overstemming.

Understemming is the failure to reduce words with the same meaning to the same root. For example, jumped and jumps may be reduced to jump, while jumping may be reduced to jumpi. Understemming reduces retrieval: relevant documents are not returned.

Overstemming is the failure to keep two words with distinct meanings separate. For instance, general and generate may both be stemmed to gener. Overstemming reduces precision: irrelevant documents are returned when they shouldn’t be.

LEMMATIZATION

A lemma is the canonical, or dictionary, form of a set of related words—the lemma of paying, paid, and pays is pay. Usually the lemma resembles the words it is related to but sometimes it doesn’t — the lemma of is, was, am, and being is be.

Lemmatization, like stemming, tries to group related words, but it goes one step further than stemming in that it tries to group words by their word sense, or meaning. The same word may represent two meanings—for example, wake can mean to wake up or a funeral. While lemmatization would try to distinguish these two word senses, stemming would incorrectly conflate them.

Lemmatization is a much more complicated and expensive process that needs to understand the context in which words appear in order to make decisions about what they mean. In practice, stemming appears to be just as effective as lemmatization, but with a much lower cost.

First we will discuss the two classes of stemmers available in Elasticsearch—“Algorithmic Stemmers” and “Dictionary Stemmers”—and then look at how to choose the right stemmer for your needs in “Choosing a Stemmer”. Finally, we will discuss options for tailoring stemming in “Controlling Stemming” and “Stemming in situ”.

Algorithmic Stemmers

Most of the stemmers available in Elasticsearch are algorithmic in that they apply a series of rules to a word in order to reduce it to its root form, such as stripping the final s or es from plurals. They don’t have to know anything about individual words in order to stem them.

These algorithmic stemmers have the advantage that they are available out of the box, are fast, use little memory, and work well for regular words. The downside is that they don’t cope well with irregular words like be, are, and am, or mice and mouse.

One of the earliest stemming algorithms is the Porter stemmer for English, which is still the recommended English stemmer today. Martin Porter subsequently went on to create the Snowball language for creating stemming algorithms, and a number of the stemmers available in Elasticsearch are written in Snowball.

TIP

The kstem token filter is a stemmer for English which combines the algorithmic approach with a built-in dictionary. The dictionary contains a list of root words and exceptions in order to avoid conflating words incorrectly. kstem tends to stem less aggressively than the Porter stemmer.

Using an Algorithmic Stemmer

While you can use the porter_stem or kstem token filter directly, or create a language-specific Snowball stemmer with the snowball token filter, all of the algorithmic stemmers are exposed via a single unified interface: the stemmer token filter, which accepts the language parameter.
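By way of illustration, each of the following filter definitions produces an English stemmer; the last, unified form takes the language as a parameter (a sketch only, with arbitrary filter names):

"filter": {
    "my_porter":   { "type": "porter_stem" },
    "my_snowball": { "type": "snowball", "language": "English" },
    "my_stemmer":  { "type": "stemmer",  "language": "english" }
}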

For instance, perhaps you find the default stemmer used by the english analyzer to be too aggressive and you want to make it less aggressive. The first step is to look up the configuration for the english analyzer in the language analyzers documentation, which shows the following:

{
    "settings": {
        "analysis": {
            "filter": {
                "english_stop": {
                    "type":      "stop",
                    "stopwords": "_english_"
                },
                "english_keywords": {
                    "type":     "keyword_marker", 1
                    "keywords": []
                },
                "english_stemmer": {
                    "type":     "stemmer",
                    "language": "english" 2
                },
                "english_possessive_stemmer": {
                    "type":     "stemmer",
                    "language": "possessive_english" 2
                }
            },
            "analyzer": {
                "english": {
                    "tokenizer": "standard",
                    "filter": [
                        "english_possessive_stemmer",
                        "lowercase",
                        "english_stop",
                        "english_keywords",
                        "english_stemmer"
                    ]
                }
            }
        }
    }
}

1

The keyword_marker token filter lists words that should not be stemmed. This defaults to the empty list.

2

The english analyzer uses two stemmers: the possessive_english and the english stemmer. The possessive stemmer removes 's from any words before passing them on to the english_stop, english_keywords, and english_stemmer.
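You can see the possessive stemmer at work by running a possessive form through the built-in english analyzer; the 's should be stripped before stemming, leaving the single term john:

GET /_analyze?analyzer=english
John's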

Having reviewed the current configuration, we can use it as the basis for a new analyzer, with the following changes:

§ Change the english_stemmer from english (which maps to the porter_stem token filter) to light_english (which maps to the less aggressive kstem token filter).

§ Add the asciifolding token filter to remove any diacritics from foreign words.

§ Remove the keyword_marker token filter, as we don’t need it. (We discuss this in more detail in “Controlling Stemming”.)

Our new custom analyzer would look like this:

PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "english_stop": {
                    "type":      "stop",
                    "stopwords": "_english_"
                },
                "light_english_stemmer": {
                    "type":     "stemmer",
                    "language": "light_english" 1
                },
                "english_possessive_stemmer": {
                    "type":     "stemmer",
                    "language": "possessive_english"
                }
            },
            "analyzer": {
                "english": {
                    "tokenizer": "standard",
                    "filter": [
                        "english_possessive_stemmer",
                        "lowercase",
                        "english_stop",
                        "light_english_stemmer", 1
                        "asciifolding" 2
                    ]
                }
            }
        }
    }
}

1

Replaced the english stemmer with the less aggressive light_english stemmer

2

Added the asciifolding token filter
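To check the effect of both changes, you could compare the output of the index's customized english analyzer with that of the built-in version (a rough check; the exact terms depend on kstem's built-in dictionary):

GET /my_index/_analyze?analyzer=english
organizations résumé

The light_english stemmer should keep organizations close to its dictionary form, and asciifolding should strip the accents from résumé.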

Dictionary Stemmers

Dictionary stemmers work quite differently from algorithmic stemmers. Instead of applying a standard set of rules to each word, they simply look up the word in the dictionary. Theoretically, they could produce much better results than an algorithmic stemmer. A dictionary stemmer should be able to do the following:

§ Return the correct root word for irregular forms such as feet and mice

§ Recognize the distinction between words that are similar but have different word senses—for example, organ and organization

In practice, a good algorithmic stemmer usually outperforms a dictionary stemmer. There are a couple of reasons why this should be so:

Dictionary quality

A dictionary stemmer is only as good as its dictionary. The Oxford English Dictionary website estimates that the English language contains approximately 750,000 words (when inflections are included). Most English dictionaries available for computers contain about 10% of those.

The meaning of words changes with time. While stemming mobility to mobil may have made sense previously, it now conflates the idea of mobility with a mobile phone. Dictionaries need to be kept current, which is a time-consuming task. Often, by the time a dictionary has been made available, some of its entries are already out-of-date.

If a dictionary stemmer encounters a word not in its dictionary, it doesn’t know how to deal with it. An algorithmic stemmer, on the other hand, will apply the same rules as before, correctly or incorrectly.

Size and performance

A dictionary stemmer needs to load all words, all prefixes, and all suffixes into memory. This can use a significant amount of RAM. Finding the right stem for a word is often considerably more complex than the equivalent process with an algorithmic stemmer.

Depending on the quality of the dictionary, the process of removing prefixes and suffixes may be more or less efficient. Less-efficient forms can slow the stemming process significantly.

Algorithmic stemmers, on the other hand, are usually simple, small, and fast.

TIP

If a good algorithmic stemmer exists for your language, it is usually a better choice than a dictionary-based stemmer. Languages with poor (or nonexistent) algorithmic stemmers can use the Hunspell dictionary stemmer, which we discuss in the next section.

Hunspell Stemmer

Elasticsearch provides dictionary-based stemming via the hunspell token filter. Hunspell (hunspell.sourceforge.net) is the spell checker used by OpenOffice, LibreOffice, Chrome, Firefox, Thunderbird, and many other open and closed source projects.

Hunspell dictionaries can be obtained from the following:

§ extensions.openoffice.org: Download and unzip the .oxt extension file.

§ addons.mozilla.org: Download and unzip the .xpi addon file.

§ OpenOffice archive: Download and unzip the .zip file.

A Hunspell dictionary consists of two files with the same base name—such as en_US—but with one of two extensions:

.dic

Contains all the root words, in alphabetical order, plus a code representing all possible suffixes and prefixes (which collectively are known as affixes)

.aff

Contains the actual prefix or suffix transformation for each code listed in the .dic file

Installing a Dictionary

The Hunspell token filter looks for dictionaries within a dedicated Hunspell directory, which defaults to ./config/hunspell/. The .dic and .aff files should be placed in a subdirectory whose name represents the language or locale of the dictionaries. For instance, we could create a Hunspell stemmer for American English with the following layout:

config/
  └ hunspell/ 1
      └ en_US/ 2
          ├ en_US.dic
          ├ en_US.aff
          └ settings.yml 3

1

The location of the Hunspell directory can be changed by setting indices.analysis.hunspell.dictionary.location in the config/elasticsearch.yml file.

2

en_US will be the name of the locale or language that we pass to the hunspell token filter.

3

Per-language settings file, described in the following section.

Per-Language Settings

The settings.yml file contains settings that apply to all of the dictionaries within the language directory, such as these:

---
ignore_case:          true
strict_affix_parsing: true

The meaning of these settings is as follows:

ignore_case

Hunspell dictionaries are case sensitive by default: the surname Booker is a different word from the noun booker, and so should be stemmed differently. It may seem like a good idea to use the hunspell stemmer in case-sensitive mode, but that can complicate things:

§ A word at the beginning of a sentence will be capitalized, and thus appear to be a proper noun.

§ The input text may be all uppercase, in which case almost no words will be found.

§ The user may search for names in all lowercase, in which case no capitalized words will be found.

As a general rule, it is a good idea to set ignore_case to true.

strict_affix_parsing

The quality of dictionaries varies greatly. Some dictionaries that are available online have malformed rules in the .aff file. By default, Lucene will throw an exception if it can’t parse an affix rule. If you need to deal with a broken affix file, you can set strict_affix_parsing to false to tell Lucene to ignore the broken rules.

CUSTOM DICTIONARIES

If multiple dictionaries (.dic files) are placed in the same directory, they will be merged together at load time. This allows you to tailor the downloaded dictionaries with your own custom word lists:

config/
  └ hunspell/
      └ en_US/ 1
          ├ en_US.dic
          ├ en_US.aff 2
          ├ custom.dic
          └ settings.yml

1

The custom and en_US dictionaries will be merged.

2

Multiple .aff files are not allowed, as they could use conflicting rules.

The format of the .dic and .aff files is discussed in “Hunspell Dictionary Format”.
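For example, a small custom.dic that adds domain-specific terms might look like this (a hypothetical file; the first line is the approximate number of entries, and the /S flag refers to the plural suffix rules defined in en_US.aff):

2
elasticsearch
kibana/S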

Creating a Hunspell Token Filter

Once your dictionaries are installed on all nodes, you can define a hunspell token filter that uses them:

PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "en_US": {
                    "type":     "hunspell",
                    "language": "en_US" 1
                }
            },
            "analyzer": {
                "en_US": {
                    "tokenizer": "standard",
                    "filter":    [ "lowercase", "en_US" ]
                }
            }
        }
    }
}

1

The language has the same name as the directory where the dictionary lives.

You can test the new analyzer with the analyze API, and compare its output to that of the english analyzer:

GET /my_index/_analyze?analyzer=en_US 1
reorganizes

GET /_analyze?analyzer=english 2
reorganizes

1

Returns organize

2

Returns reorgan

An interesting property of the hunspell stemmer, as can be seen in the preceding example, is that it can remove prefixes as well as suffixes. Most algorithmic stemmers remove suffixes only.

TIP

Hunspell dictionaries can consume a few megabytes of RAM. Fortunately, Elasticsearch creates only a single instance of a dictionary per node. All shards that use the same Hunspell analyzer share the same instance.

Hunspell Dictionary Format

While it is not necessary to understand the format of a Hunspell dictionary in order to use the hunspell token filter, understanding the format will help you write your own custom dictionaries. It is quite simple.

For instance, in the US English dictionary, the en_US.dic file contains an entry for the word analyze, which looks like this:

analyze/ADSG

The en_US.aff file contains the prefix or suffix rules for the A, G, D, and S flags. Each flag consists of a number of rules, only one of which should match. Each rule has the following format:

[type] [flag] [letters to remove] [letters to add] [condition]

For instance, the following is one of the suffix (SFX) rules for the D flag. It says that, when a word ends in a consonant (anything but a, e, i, o, or u) followed by a y, it can have the y removed and ied added (for example, ready → readied).

SFX D y ied [^aeiou]y

The rules for the A, G, D, and S flags mentioned previously are as follows:

SFX D Y 4
SFX D   0   d     e 1
SFX D   y   ied   [^aeiou]y
SFX D   0   ed    [^ey]
SFX D   0   ed    [aeiou]y

SFX S Y 4
SFX S   y   ies   [^aeiou]y
SFX S   0   s     [aeiou]y
SFX S   0   es    [sxzh]
SFX S   0   s     [^sxzhy] 2

SFX G Y 2
SFX G   e   ing   e 3
SFX G   0   ing   [^e]

PFX A Y 1
PFX A   0   re    . 4

1

analyze ends in an e, so it can become analyzed by adding a d.

2

analyze does not end in s, x, z, h, or y, so it can become analyzes by adding an s.

3

analyze ends in an e, so it can become analyzing by removing the e and adding ing.

4

The prefix re can be added to form reanalyze. This rule can be combined with the suffix rules to form reanalyzes, reanalyzed, reanalyzing.

More information about the Hunspell syntax can be found on the Hunspell documentation site.

Choosing a Stemmer

The documentation for the stemmer token filter lists multiple stemmers for some languages. For English we have the following:

english

The porter_stem token filter.

light_english

The kstem token filter.

minimal_english

The EnglishMinimalStemmer in Lucene, which removes plurals.

lovins

The Snowball-based Lovins stemmer, the first stemmer ever produced.

porter

The Snowball-based Porter stemmer.

porter2

The Snowball-based Porter2 stemmer.

possessive_english

The EnglishPossessiveFilter in Lucene, which removes 's.

Add to that list the Hunspell stemmer with the various English dictionaries that are available.

One thing is for sure: whenever more than one solution exists for a problem, it means that none of the solutions solves the problem adequately. This certainly applies to stemming — each stemmer uses a different approach that overstems and understems words to a different degree.

The stemmer documentation page highlights the recommended stemmer for each language in bold, usually because it offers a reasonable compromise between performance and quality. That said, the recommended stemmer may not be appropriate for all use cases. There is no single right answer to the question of which is the best stemmer — it depends very much on your requirements. There are three factors to take into account when making a choice: performance, quality, and degree.

Stemmer Performance

Algorithmic stemmers are typically four or five times faster than Hunspell stemmers. “Handcrafted” algorithmic stemmers are usually, but not always, faster than their Snowball equivalents. For instance, the porter_stem token filter is significantly faster than the Snowball implementation of the Porter stemmer.

Hunspell stemmers have to load all words, prefixes, and suffixes into memory, which can consume a few megabytes of RAM. Algorithmic stemmers, on the other hand, consist of a small amount of code and consume very little memory.

Stemmer Quality

All languages, except Esperanto, are irregular. While more-formal words tend to follow a regular pattern, the most commonly used words often have irregular rules. Some stemming algorithms have been developed over years of research and produce reasonably high-quality results. Others have been assembled more quickly with less research and deal only with the most common cases.

While Hunspell offers the promise of dealing precisely with irregular words, it often falls short in practice. A dictionary stemmer is only as good as its dictionary. If Hunspell comes across a word that isn’t in its dictionary, it can do nothing with it. Hunspell requires an extensive, high-quality, up-to-date dictionary in order to produce good results; dictionaries of this caliber are few and far between. An algorithmic stemmer, on the other hand, will happily deal with new words that didn’t exist when the designer created the algorithm.

If a good algorithmic stemmer is available for your language, it makes sense to use it rather than Hunspell. It will be faster, will consume less memory, and will generally be as good or better than the Hunspell equivalent.

If accuracy and customizability are important to you, and you need (and have the resources) to maintain a custom dictionary, then Hunspell gives you greater flexibility than the algorithmic stemmers. (See “Controlling Stemming” for customization techniques that can be used with any stemmer.)

Stemmer Degree

Different stemmers overstem and understem to a different degree. The light_ stemmers stem less aggressively than the standard stemmers, and the minimal_ stemmers less aggressively still. Hunspell stems aggressively.

Whether you want aggressive or light stemming depends on your use case. If your search results are being consumed by a clustering algorithm, you may prefer to match more widely (and, thus, stem more aggressively). If your search results are intended for human consumption, lighter stemming usually produces better results. Stemming nouns and adjectives is more important for search than stemming verbs, but this also depends on the language.

The other factor to take into account is the size of your document collection. With a small collection such as a catalog of 10,000 products, you probably want to stem more aggressively to ensure that you match at least some documents. If your collection is large, you likely will get good matches with lighter stemming.
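One practical way to gauge degree is to expose several stemmers side by side in a single index and run the same words through each (a minimal sketch; the index, filter, and analyzer names are arbitrary):

PUT /stemmer_degree
{
    "settings": {
        "analysis": {
            "filter": {
                "minimal_stem": { "type": "stemmer", "language": "minimal_english" },
                "light_stem":   { "type": "stemmer", "language": "light_english" },
                "full_stem":    { "type": "stemmer", "language": "english" }
            },
            "analyzer": {
                "minimal": { "tokenizer": "standard", "filter": [ "lowercase", "minimal_stem" ] },
                "light":   { "tokenizer": "standard", "filter": [ "lowercase", "light_stem" ] },
                "full":    { "tokenizer": "standard", "filter": [ "lowercase", "full_stem" ] }
            }
        }
    }
}

Running the same sample text through each analyzer with the analyze API then makes the difference in degree easy to see.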

Making a Choice

Start out with a recommended stemmer. If it works well enough, there is no need to change it. If it doesn’t, you will need to spend some time investigating and comparing the stemmers available for your language in order to find the one that best suits your purposes.

Controlling Stemming

Out-of-the-box stemming solutions are never perfect. Algorithmic stemmers, especially, will blithely apply their rules to any words they encounter, perhaps conflating words that you would prefer to keep separate. Maybe, for your use case, it is important to keep skies and skiing as distinct words rather than stemming them both down to ski (as would happen with the english analyzer).

The keyword_marker and stemmer_override token filters allow us to customize the stemming process.

Preventing Stemming

The stem_exclusion parameter for language analyzers (see “Configuring Language Analyzers”) allowed us to specify a list of words that should not be stemmed. Internally, these language analyzers use the keyword_marker token filter to mark the listed words as keywords, which prevents subsequent stemming token filters from touching those words.
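As a reminder, configuring stem_exclusion on a language analyzer looks like this (a brief sketch; the analyzer name is arbitrary):

PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_english": {
                    "type":           "english",
                    "stem_exclusion": [ "skies" ]
                }
            }
        }
    }
}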

For instance, we can create a simple custom analyzer that uses the porter_stem token filter, but prevents the word skies from being stemmed:

PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "no_stem": {
                    "type":     "keyword_marker",
                    "keywords": [ "skies" ] 1
                }
            },
            "analyzer": {
                "my_english": {
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "no_stem",
                        "porter_stem"
                    ]
                }
            }
        }
    }
}

1

The keywords parameter can accept multiple words.

Testing it with the analyze API shows that just the word skies has been excluded from stemming:

GET /my_index/_analyze?analyzer=my_english
sky skies skiing skis 1

1

Returns: sky, skies, ski, ski

TIP

While the language analyzers allow us only to specify an array of words in the stem_exclusion parameter, the keyword_marker token filter also accepts a keywords_path parameter that allows us to store all of our keywords in a file. The file should contain one word per line, and must be present on every node in the cluster. See “Updating Stopwords” for tips on how to update this file.
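A filter definition using this parameter might look like the following (the path is hypothetical and is resolved relative to the config directory):

"no_stem": {
    "type":          "keyword_marker",
    "keywords_path": "stemming/keywords.txt"
}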

Customizing Stemming

In the preceding example, we prevented skies from being stemmed, but perhaps we would prefer it to be stemmed to sky instead. The stemmer_override token filter allows us to specify our own custom stemming rules. At the same time, we can handle some irregular forms like stemming mice to mouse and feet to foot:

PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "custom_stem": {
                    "type": "stemmer_override",
                    "rules": [ 1
                        "skies=>sky",
                        "mice=>mouse",
                        "feet=>foot"
                    ]
                }
            },
            "analyzer": {
                "my_english": {
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "custom_stem", 2
                        "porter_stem"
                    ]
                }
            }
        }
    }
}

GET /my_index/_analyze?analyzer=my_english
The mice came down from the skies and ran over my feet 3

1

Rules take the form original=>stem.

2

The stemmer_override filter must be placed before the stemmer.

3

Returns the, mouse, came, down, from, the, sky, and, ran, over, my, foot.

TIP

Just as for the keyword_marker token filter, rules can be stored in a file whose location should be specified with the rules_path parameter.
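For example (again, the path is hypothetical and relative to the config directory):

"custom_stem": {
    "type":       "stemmer_override",
    "rules_path": "stemming/rules.txt"
}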

Stemming in situ

For the sake of completeness, we will finish this chapter by explaining how to index stemmed words into the same field as unstemmed words. As an example, analyzing the sentence The quick foxes jumped would produce the following terms:

Pos 1: (the)
Pos 2: (quick)
Pos 3: (foxes,fox) 1
Pos 4: (jumped,jump) 1

1

The stemmed and unstemmed forms occupy the same position.

WARNING

Read “Is Stemming in situ a Good Idea” before using this approach.

To achieve stemming in situ, we will use the keyword_repeat token filter, which, like the keyword_marker token filter (see “Preventing Stemming”), marks each term as a keyword to prevent the subsequent stemmer from touching it. However, it also repeats the term in the same position, and this repeated term is stemmed.

Using the keyword_repeat token filter alone would result in the following:

Pos 1: (the,the) 1
Pos 2: (quick,quick) 1
Pos 3: (foxes,fox)
Pos 4: (jumped,jump)

1

The stemmed and unstemmed forms are the same, and so are repeated needlessly.

To prevent the useless repetition of terms that are the same in their stemmed and unstemmed forms, we add the unique token filter into the mix:

PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "unique_stem": {
                    "type":                  "unique",
                    "only_on_same_position": true 1
                }
            },
            "analyzer": {
                "in_situ": {
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "keyword_repeat", 2
                        "porter_stem",
                        "unique_stem" 3
                    ]
                }
            }
        }
    }
}

1

The unique token filter is set to remove duplicate tokens only when they occur in the same position.

2

The keyword_repeat token filter must appear before the stemmer.

3

The unique_stem filter removes duplicate terms after the stemmer has done its work.
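You can confirm the behavior with the analyze API; analyzing the example sentence from the start of this section should return the interleaved terms the, quick, foxes, fox, jumped, jump:

GET /my_index/_analyze?analyzer=in_situ
The quick foxes jumped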

Is Stemming in situ a Good Idea

People like the idea of stemming in situ: “Why use an unstemmed field and a stemmed field if I can just use one combined field?” But is it a good idea? The answer is almost always no. There are two problems.

The first is the inability to separate exact matches from inexact matches. In this chapter, we have seen that words with different meanings are often conflated to the same stem word: organs and organization both stem to organ.

In “Using Language Analyzers”, we demonstrated how to combine a query on a stemmed field (to increase recall) with a query on an unstemmed field (to improve relevance). When the stemmed and unstemmed fields are separate, the contribution of each field can be tuned by boosting one field over another (see “Prioritizing Clauses”). If, instead, the stemmed and unstemmed forms appear in the same field, there is no way to tune your search results.

The second issue has to do with how the relevance score is calculated. In “What Is Relevance?”, we explained that part of the calculation depends on the inverse document frequency — how often a word appears in all the documents in our index. Using in situ stemming for a document that contains the text jump jumped jumps would result in these terms:

Pos 1: (jump)
Pos 2: (jumped,jump)
Pos 3: (jumps,jump)

While jumped and jumps appear once each and so would have the correct IDF, jump appears three times, greatly reducing its value as a search term in comparison with the unstemmed forms.

For these reasons, we recommend against using stemming in situ.