
Elasticsearch: The Definitive Guide (2015)

Part III. Dealing with Human Language

Chapter 22. Stopwords: Performance Versus Precision

Back in the early days of information retrieval, disk space and memory were limited to a tiny fraction of what we are accustomed to today. It was essential to make your index as small as possible. Every kilobyte saved meant a significant improvement in performance. Stemming (see Chapter 21) was important, not just for making searches broader and increasing retrieval in the same way that we use it today, but also as a tool for compressing index size.

Another way to reduce index size is simply to index fewer words. For search purposes, some words are more important than others. A significant reduction in index size can be achieved by indexing only the more important terms.

So which terms can be left out? We can divide terms roughly into two groups:

Low-frequency terms

Words that appear in relatively few documents in the collection. Because of their rarity, they have a high value, or weight.

High-frequency terms

Common words that appear in many documents in the index, such as the, and, and is. These words have a low weight and contribute little to the relevance score.

TIP

Of course, frequency is really a scale rather than just two points labeled low and high. We just draw a line at some arbitrary point and say that any terms below that line are low frequency and above the line are high frequency.

Which terms are low or high frequency depends on the documents themselves. The word and may be a low-frequency term if all the documents are in Chinese. In a collection of documents about databases, the word database may be a high-frequency term with little value as a search term for that particular collection.

That said, for any language there are words that occur very commonly and that seldom add value to a search. The default English stopwords used in Elasticsearch are as follows:

a, an, and, are, as, at, be, but, by, for, if, in, into, is, it,
no, not, of, on, or, such, that, the, their, then, there, these,
they, this, to, was, will, with

These stopwords can usually be filtered out before indexing with little negative impact on retrieval. But is it a good idea to do so?

Pros and Cons of Stopwords

We have more disk space, more RAM, and better compression algorithms than existed back in the day. Excluding the preceding 33 common words from the index will save only about 4MB per million documents. Reducing index size is no longer a valid reason for using stopwords. (However, there is one caveat to this statement, which we discuss in “Stopwords and Phrase Queries”.)

On top of that, by removing words from the index, we are reducing our ability to perform certain types of searches. Filtering out the words listed previously prevents us from doing the following:

§ Distinguishing happy from not happy.

§ Searching for the band The The.

§ Finding Shakespeare’s quotation “To be, or not to be”

§ Using the country code for Norway: no

The primary advantage of removing stopwords is performance. Imagine that we search an index with one million documents for the word fox. Perhaps fox appears in only 20 of them, which means that Elasticsearch has to calculate the relevance _score for 20 documents in order to return the top 10. Now, we change that to a search for the OR fox. The word the probably occurs in almost all the documents, which means that Elasticsearch has to calculate the _score for all one million documents. This second query simply cannot perform as well as the first.

Fortunately, there are techniques that we can use to keep common words searchable, while still maintaining good performance. First, we’ll start with how to use stopwords.

Using Stopwords

The removal of stopwords is handled by the stop token filter, which can be used when creating a custom analyzer (see “Using the stop Token Filter”). However, some out-of-the-box analyzers come with the stop filter pre-integrated:

Language analyzers

Each language analyzer defaults to using the appropriate stopwords list for that language. For instance, the english analyzer uses the _english_ stopwords list.

standard analyzer

Defaults to the empty stopwords list: _none_, essentially disabling stopwords.

pattern analyzer

Defaults to _none_, like the standard analyzer.
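You can see the effect of a built-in stopword list with the analyze API, used in the same way as the examples later in this chapter. For instance, a quick check against the english analyzer (the exact tokens returned also reflect stemming):

GET /_analyze?analyzer=english
The quick and the dead

Only the terms quick and dead should come back; the and and are removed by the _english_ stopwords list.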

Stopwords and the Standard Analyzer

To use custom stopwords in conjunction with the standard analyzer, all we need to do is to create a configured version of the analyzer and pass in the list of stopwords that we require:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": { 1
          "type": "standard", 2
          "stopwords": [ "and", "the" ] 3
        }
      }
    }
  }
}

1

This is a custom analyzer called my_analyzer.

2

This analyzer is the standard analyzer with some custom configuration.

3

The stopwords to filter out are and and the.

TIP

This same technique can be used to configure custom stopword lists for any of the language analyzers.
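For example, here is a sketch of the same configuration applied to the english analyzer, with a deliberately tiny custom list (the my_english name is arbitrary):

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stopwords": [ "and", "the" ]
        }
      }
    }
  }
}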

Maintaining Positions

The output from the analyze API is quite interesting:

GET /my_index/_analyze?analyzer=my_analyzer
The quick and the dead

{
  "tokens": [
    {
      "token": "quick",
      "start_offset": 4,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 2 1
    },
    {
      "token": "dead",
      "start_offset": 18,
      "end_offset": 22,
      "type": "<ALPHANUM>",
      "position": 5 1
    }
  ]
}

1

Note the position of each token.

The stopwords have been filtered out, as expected, but the interesting part is that the position of the two remaining terms is unchanged: quick is the second word in the original sentence, and dead is the fifth. This is important for phrase queries—if the positions of each term had been adjusted, a phrase query for quick dead would have matched the preceding example incorrectly.
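For example, assuming a text field that is indexed with my_analyzer, an exact phrase query for quick dead would not match the sentence above, but a proximity query that allows for the two removed stopwords would (a sketch; the field name and slop value are illustrative):

GET /my_index/_search
{
  "query": {
    "match_phrase": {
      "text": {
        "query": "quick dead",
        "slop": 2
      }
    }
  }
}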

Specifying Stopwords

Stopwords can be passed inline, as we did in the previous example, by specifying an array:

"stopwords": [ "and", "the" ]

The default stopword list for a particular language can be specified using the _lang_ notation:

"stopwords": "_english_"

TIP

The predefined language-specific stopword lists available in Elasticsearch can be found in the stop token filter documentation.

Stopwords can be disabled by specifying the special list: _none_. For instance, to use the english analyzer without stopwords, you can do the following:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english", 1
          "stopwords": "_none_" 2
        }
      }
    }
  }
}

1

The my_english analyzer is based on the english analyzer.

2

But stopwords are disabled.

Finally, stopwords can also be listed in a file with one word per line. The file must be present on all nodes in the cluster, and the path can be specified with the stopwords_path parameter:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english": {
          "type": "english",
          "stopwords_path": "stopwords/english.txt" 1
        }
      }
    }
  }
}

1

The path to the stopwords file, relative to the Elasticsearch config directory
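The file itself is plain text with one word per line. For example, the first few lines of a hypothetical config/stopwords/english.txt might look like this:

a
an
and
are
as
at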

Using the stop Token Filter

The stop token filter can be combined with a tokenizer and other token filters when you need to create a custom analyzer. For instance, let’s say that we wanted to create a Spanish analyzer with the following:

§ A custom stopwords list

§ The light_spanish stemmer

§ The asciifolding filter to remove diacritics

We could set that up as follows:

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "spanish_stop": {
          "type": "stop",
          "stopwords": [ "si", "esta", "el", "la" ] 1
        },
        "light_spanish": { 2
          "type": "stemmer",
          "language": "light_spanish"
        }
      },
      "analyzer": {
        "my_spanish": {
          "tokenizer": "standard",
          "filter": [ 3
            "lowercase",
            "asciifolding",
            "spanish_stop",
            "light_spanish"
          ]
        }
      }
    }
  }
}

1

The stop token filter takes the same stopwords and stopwords_path parameters as the standard analyzer.

2

See “Algorithmic Stemmers”.

3

The order of token filters is important, as explained next.

We have placed the spanish_stop filter after the asciifolding filter. This means that esta, ésta, and está will first have their diacritics removed to become just esta, which will then be removed as a stopword. If, instead, we wanted to remove esta and ésta, but not está, we would have to put the spanish_stop filter before the asciifolding filter, and specify both words in the stopwords list.
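That alternative ordering would look something like this; only the parts that change from the preceding example are shown:

"spanish_stop": {
  "type": "stop",
  "stopwords": [ "si", "esta", "ésta", "el", "la" ]
},
...
"filter": [
  "lowercase",
  "spanish_stop",
  "asciifolding",
  "light_spanish"
]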

Updating Stopwords

A few techniques can be used to update the list of stopwords used by an analyzer. Analyzers are instantiated at index creation time, when a node is restarted, or when a closed index is reopened.

If you specify stopwords inline with the stopwords parameter, your only option is to close the index and update the analyzer configuration with the update index settings API, then reopen the index.
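The sequence looks roughly like this (a sketch; the settings body simply repeats the analyzer definition with the updated word list):

POST /my_index/_close

PUT /my_index/_settings
{
  "analysis": {
    "analyzer": {
      "my_analyzer": {
        "type": "standard",
        "stopwords": [ "and", "the", "or" ]
      }
    }
  }
}

POST /my_index/_open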

Updating stopwords is easier if you specify them in a file with the stopwords_path parameter. You can just update the file (on every node in the cluster) and then force the analyzers to be re-created by either of these actions:

§ Closing and reopening the index (see open/close index), or

§ Restarting each node in the cluster, one by one

Of course, updating the stopwords list will not change any documents that have already been indexed. It will apply only to searches and to new or updated documents. To apply the changes to existing documents, you will need to reindex your data. See “Reindexing Your Data”.

Stopwords and Performance

The biggest disadvantage of keeping stopwords is that of performance. When Elasticsearch performs a full-text search, it has to calculate the relevance _score on all matching documents in order to return the top 10 matches.

While most words typically occur in much fewer than 0.1% of all documents, a few words such as the may occur in almost all of them. Imagine you have an index of one million documents. A query for quick brown fox may match fewer than 1,000 documents. But a query for the quick brown fox has to score and sort almost all of the one million documents in your index, just in order to return the top 10!

The problem is that the quick brown fox is really a query for the OR quick OR brown OR fox—any document that contains nothing more than the almost meaningless term the is included in the result set. What we need is a way of reducing the number of documents that need to be scored.

and Operator

The easiest way to reduce the number of documents is simply to use the and operator with the match query, in order to make all words required.

A match query like this:

{
  "match": {
    "text": {
      "query": "the quick brown fox",
      "operator": "and"
    }
  }
}

is rewritten as a bool query like this:

{
  "bool": {
    "must": [
      { "term": { "text": "the" }},
      { "term": { "text": "quick" }},
      { "term": { "text": "brown" }},
      { "term": { "text": "fox" }}
    ]
  }
}

The bool query is intelligent enough to execute each term query in the optimal order—it starts with the least frequent term. Because all terms are required, only documents that contain the least frequent term can possibly match. Using the and operator greatly speeds up multiterm queries.

minimum_should_match

In “Controlling Precision”, we discussed using the minimum_should_match operator to trim the long tail of less-relevant results. It is useful for this purpose alone but, as a nice side effect, it offers a similar performance benefit to the and operator:

{
  "match": {
    "text": {
      "query": "the quick brown fox",
      "minimum_should_match": "75%"
    }
  }
}

In this example, at least three out of the four terms must match. This means that the only docs that need to be considered are those that contain either the least or second least frequent terms.

This offers a huge performance gain over a simple query with the default or operator! But we can do better yet…

Divide and Conquer

The terms in a query string can be divided into more-important (low-frequency) and less-important (high-frequency) terms. Documents that match only the less important terms are probably of very little interest. Really, we want documents that match as many of the more important terms as possible.

The match query accepts a cutoff_frequency parameter, which allows it to divide the terms in the query string into a low-frequency and a high-frequency group. The low-frequency group (of more-important terms) forms the bulk of the query, while the high-frequency group (of less-important terms) is used only for scoring, not for matching. By treating these two groups differently, we can gain a real boost of speed on previously slow queries.

DOMAIN-SPECIFIC STOPWORDS

One of the benefits of cutoff_frequency is that you get domain-specific stopwords for free. For instance, a website about movies may use the words movie, color, black, and white so often that they could be considered almost meaningless. With the stop token filter, these domain-specific terms would have to be added to the stopwords list manually. However, because the cutoff_frequency looks at the actual frequency of terms in the index, these words would be classified as high frequency automatically.

Take this query as an example:

{
  "match": {
    "text": {
      "query": "Quick and the dead",
      "cutoff_frequency": 0.01 1
    }
  }
}

1

Any term that occurs in more than 1% of documents is considered to be high frequency. The cutoff_frequency can be specified as a fraction (0.01) or as an absolute number (5).

This query uses the cutoff_frequency to first divide the query terms into a low-frequency group (quick, dead) and a high-frequency group (and, the). Then, the query is rewritten to produce the following bool query:

{
  "bool": {
    "must": { 1
      "bool": {
        "should": [
          { "term": { "text": "quick" }},
          { "term": { "text": "dead" }}
        ]
      }
    },
    "should": { 2
      "bool": {
        "should": [
          { "term": { "text": "and" }},
          { "term": { "text": "the" }}
        ]
      }
    }
  }
}

1

At least one low-frequency/high-importance term must match.

2

High-frequency/low-importance terms are entirely optional.

The must clause means that at least one of the low-frequency terms—quick or dead—must be present for a document to be considered a match. All other documents are excluded. The should clause then looks for the high-frequency terms and and the, but only in the documents collected by the must clause. The sole job of the should clause is to score a document like “Quick and the dead” higher than “The quick but dead”. This approach greatly reduces the number of documents that need to be examined and scored.

TIP

Setting the operator parameter to and would make all low-frequency terms required, and score documents that contain all high-frequency terms higher. However, matching documents would not be required to contain all high-frequency terms. If you would prefer all low- and high-frequency terms to be required, you should use a bool query instead. As we saw in “and Operator”, this is already an efficient query.
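For instance, a sketch of the combination described above, requiring every low-frequency term while leaving the high-frequency terms for scoring only:

{
  "match": {
    "text": {
      "query": "Quick and the dead",
      "cutoff_frequency": 0.01,
      "operator": "and"
    }
  }
}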

Controlling Precision

The minimum_should_match parameter can be combined with cutoff_frequency but it applies to only the low-frequency terms. This query:

{
  "match": {
    "text": {
      "query": "Quick and the dead",
      "cutoff_frequency": 0.01,
      "minimum_should_match": "75%"
    }
  }
}

would be rewritten as follows:

{
  "bool": {
    "must": {
      "bool": {
        "should": [
          { "term": { "text": "quick" }},
          { "term": { "text": "dead" }}
        ],
        "minimum_should_match": 1 1
      }
    },
    "should": { 2
      "bool": {
        "should": [
          { "term": { "text": "and" }},
          { "term": { "text": "the" }}
        ]
      }
    }
  }
}

1

Because there are only two low-frequency terms, the original 75% is rounded down to 1; that is, one of the two low-frequency terms must match.

2

The high-frequency terms are still optional and used only for scoring.

Only High-Frequency Terms

An or query for high-frequency terms only—“To be, or not to be”—is the worst case for performance. It is pointless to score all the documents that contain only one of these terms in order to return just the top 10 matches. We are really interested only in documents in which the terms all occur together, so in the case where there are no low-frequency terms, the query is rewritten to make all high-frequency terms required:

{
  "bool": {
    "must": [
      { "term": { "text": "to" }},
      { "term": { "text": "be" }},
      { "term": { "text": "or" }},
      { "term": { "text": "not" }},
      { "term": { "text": "to" }},
      { "term": { "text": "be" }}
    ]
  }
}
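For reference, this is the rewrite you would expect from an ordinary match query with a cutoff_frequency when every term in the query string falls into the high-frequency group, along these lines:

{
  "match": {
    "text": {
      "query": "to be or not to be",
      "cutoff_frequency": 0.01
    }
  }
}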

More Control with Common Terms

While the high/low frequency functionality in the match query is useful, sometimes you want more control over how the high- and low-frequency groups should be handled. The match query exposes a subset of the functionality available in the common terms query.

For instance, we could make all low-frequency terms required, and score only documents that have 75% of all high-frequency terms with a query like this:

{
  "common": {
    "text": {
      "query": "Quick and the dead",
      "cutoff_frequency": 0.01,
      "low_freq_operator": "and",
      "minimum_should_match": {
        "high_freq": "75%"
      }
    }
  }
}

See the common terms query reference page for more options.

Stopwords and Phrase Queries

About 5% of all queries are phrase queries (see “Phrase Matching”), but they often account for the majority of slow queries. Phrase queries can perform poorly, especially if the phrase includes very common words; a phrase like “To be, or not to be” could be considered pathological. The reason for this has to do with the amount of data that is necessary to support proximity matching.

In “Pros and Cons of Stopwords”, we said that removing stopwords saves only a small amount of space in the inverted index. That was only partially true. A typical index may contain, among other data, some or all of the following:

Terms dictionary

A sorted list of all terms that appear in the documents in the index, and a count of the number of documents that contain each term.

Postings list

A list of which documents contain each term.

Term frequency

How often each term appears in each document.

Positions

The position of each term within each document, for phrase and proximity queries.

Offsets

The start and end character offsets of each term in each document, for snippet highlighting. Disabled by default.

Norms

A factor used to normalize fields of different lengths, to give shorter fields more weight.

Removing stopwords from the index may save a small amount of space in the terms dictionary and the postings list, but positions and offsets are another matter. Positions and offsets data can easily double, triple, or quadruple index size.

Positions Data

Positions are enabled on analyzed string fields by default, so that phrase queries will work out of the box. The more often that a term appears, the more space is needed to store its position data. Very common words, by definition, appear very commonly, and their positions data can run to megabytes or gigabytes on large collections.

Running a phrase query on a high-frequency word like the might result in gigabytes of data being read from disk. That data will be stored in the kernel filesystem cache to speed up later access, which seems like a good thing, but it might cause other data to be evicted from the cache, which will slow subsequent queries.

This is clearly a problem that needs solving.

Index Options

The first question you should ask yourself is: Do you need phrase or proximity queries?

Often, the answer is no. For many use cases, such as logging, you need to know whether a term appears in a document—information that is provided by the postings list—but not where it appears. Or perhaps you need to use phrase queries on one or two fields, but you can disable positions data on all of the other analyzed string fields.

The index_options parameter allows you to control what information is stored in the index for each field. Valid values are as follows:

docs

Only store which documents contain which terms. This is the default for not_analyzed string fields.

freqs

Store docs information, plus how often each term appears in each document. Term frequencies are needed for complete TF/IDF relevance calculations, but they are not required if you just need to know whether a document contains a particular term.

positions

Store docs and freqs, plus the position of each term in each document. This is the default for analyzed string fields, but can be disabled if phrase/proximity matching is not needed.

offsets

Store docs, freqs, positions, and the start and end character offsets of each term in the original string. This information is used by the postings highlighter but is disabled by default.

You can set index_options on fields added at index creation time, or when adding new fields by using the put-mapping API. This setting can’t be changed on existing fields:

PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": { 1
          "type": "string"
        },
        "content": { 2
          "type": "string",
          "index_options": "freqs"
        }
      }
    }
  }
}

1

The title field uses the default setting of positions, so it is suitable for phrase/proximity queries.

2

The content field has positions disabled and so cannot be used for phrase/proximity queries.
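Adding a new field with reduced index options to an existing index would use the put-mapping API. A sketch, in which the comments field name is purely illustrative:

PUT /my_index/_mapping/my_type
{
  "properties": {
    "comments": {
      "type": "string",
      "index_options": "freqs"
    }
  }
}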

Stopwords

Removing stopwords is one way of reducing the size of the positions data quite dramatically. An index with stopwords removed can still be used for phrase queries because the original positions of the remaining terms are maintained, as we saw in “Maintaining Positions”. But of course, excluding terms from the index reduces searchability. We wouldn’t be able to differentiate between the two phrases Man in the moon and Man on the moon.

Fortunately, there is a way to have our cake and eat it: the common_grams token filter.

common_grams Token Filter

The common_grams token filter is designed to make phrase queries with stopwords more efficient. It is similar to the shingles token filter (see “Finding Associated Words”), which creates bigrams out of every pair of adjacent words. It is most easily explained by example.

The common_grams token filter produces different output depending on whether query_mode is set to false (for indexing) or to true (for searching), so we have to create two separate analyzers:

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "index_filter": { 1
          "type": "common_grams",
          "common_words": "_english_" 2
        },
        "search_filter": { 1
          "type": "common_grams",
          "common_words": "_english_", 2
          "query_mode": true
        }
      },
      "analyzer": {
        "index_grams": { 3
          "tokenizer": "standard",
          "filter": [ "lowercase", "index_filter" ]
        },
        "search_grams": { 3
          "tokenizer": "standard",
          "filter": [ "lowercase", "search_filter" ]
        }
      }
    }
  }
}

1

First we create two token filters based on the common_grams token filter: index_filter for index time (with query_mode set to the default false), and search_filter for query time (with query_mode set to true).

2

The common_words parameter accepts the same options as the stopwords parameter (see “Specifying Stopwords”). The filter also accepts a common_words_path parameter, which allows you to maintain the common words list in a file.

3

Then we use each filter to create an analyzer for index time and another for query time.

With our custom analyzers in place, we can create a field that will use the index_grams analyzer at index time:

PUT /my_index/_mapping/my_type
{
  "properties": {
    "text": {
      "type": "string",
      "index_analyzer": "index_grams", 1
      "search_analyzer": "standard" 1
    }
  }
}

1

The text field uses the index_grams analyzer at index time, but defaults to using the standard analyzer at search time, for reasons we will explain next.

At Index Time

If we were to analyze the phrase The quick and brown fox with shingles, it would produce these terms:

Pos 1: the_quick
Pos 2: quick_and
Pos 3: and_brown
Pos 4: brown_fox

Our new index_grams analyzer produces the following terms instead:

Pos 1: the, the_quick
Pos 2: quick, quick_and
Pos 3: and, and_brown
Pos 4: brown
Pos 5: fox

All terms are output as unigrams—the, quick, and so forth—but if a word is a common word or is followed by a common word, then it also outputs a bigram in the same position as the unigram—the_quick, quick_and, and_brown.
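You can check this output for yourself with the analyze API, just as earlier in the chapter (the offsets and full response format are omitted here):

GET /my_index/_analyze?analyzer=index_grams
The quick and brown fox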

Unigram Queries

Because the index contains unigrams, the field can be queried using the same techniques that we have used for any other field, for example:

GET /my_index/_search
{
  "query": {
    "match": {
      "text": {
        "query": "the quick and brown fox",
        "cutoff_frequency": 0.01
      }
    }
  }
}

The preceding query string is analyzed by the search_analyzer configured for the text field—the standard analyzer in this example—to produce the terms the, quick, and, brown, fox.

Because the index for the text field contains the same unigrams as produced by the standard analyzer, search functions as it would for any normal field.

Bigram Phrase Queries

However, when we come to do phrase queries, we can use the specialized search_grams analyzer to make the process much more efficient:

GET /my_index/_search
{
  "query": {
    "match_phrase": {
      "text": {
        "query": "The quick and brown fox",
        "analyzer": "search_grams" 1
      }
    }
  }
}

1

For phrase queries, we override the default search_analyzer and use the search_grams analyzer instead.

The search_grams analyzer would produce the following terms:

Pos 1: the_quick
Pos 2: quick_and
Pos 3: and_brown
Pos 4: brown
Pos 5: fox

The analyzer has stripped out all of the common word unigrams, leaving the common word bigrams and the low-frequency unigrams. Bigrams like the_quick are much less common than the single term the. This has two advantages:

§ The positions data for the_quick is much smaller than for the, so it is faster to read from disk and has less of an impact on the filesystem cache.

§ The term the_quick is much less common than the, so it drastically decreases the number of documents that have to be examined.

Two-Word Phrases

There is one further optimization. By far the majority of phrase queries consist of only two words. If one of those words happens to be a common word, such as

GET /my_index/_search
{
  "query": {
    "match_phrase": {
      "text": {
        "query": "The quick",
        "analyzer": "search_grams"
      }
    }
  }
}

then the search_grams analyzer outputs a single token: the_quick. This transforms what originally could have been an expensive phrase query for the and quick into a very efficient single-term lookup.

Stopwords and Relevance

The last topic to cover before moving on from stopwords is that of relevance. Leaving stopwords in your index could make the relevance calculation less accurate, especially if your documents are very long.

As we have already discussed in “Term-frequency saturation”, the reason for this is that term-frequency/inverse document frequency doesn’t impose an upper limit on the impact of term frequency. Very common words may have a low weight because of inverse document frequency but, in long documents, the sheer number of occurrences of stopwords in a single document may lead to their weight being artificially boosted.

You may want to consider using the Okapi BM25 similarity on long fields that include stopwords instead of the default Lucene similarity.
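Switching an individual field over to BM25 is a per-field mapping setting. A sketch, in which the body field name is illustrative:

PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "body": {
          "type": "string",
          "similarity": "BM25"
        }
      }
    }
  }
}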