Synonyms - Dealing with Human Language - Elasticsearch: The Definitive Guide (2015)

Elasticsearch: The Definitive Guide (2015)

Part III. Dealing with Human Language

Chapter 23. Synonyms

While stemming helps to broaden the scope of search by simplifying inflected words to their root form, synonyms broaden the scope by relating concepts and ideas. Perhaps no documents match a query for “English queen,” but documents that contain “British monarch” would probably be considered a good match.

A user might search for “the US” and expect to find documents that contain United States, USA, U.S.A., America, or the States. However, they wouldn’t expect to see results about the states of matter or state machines.

This example provides a valuable lesson. It demonstrates how simple it is for a human to distinguish between separate concepts, and how tricky it can be for mere machines. The natural tendency is to try to provide synonyms for every word in the language, to ensure that any document is findable with even the most remotely related terms.

This is a mistake. In the same way that we prefer light or minimal stemming to aggressive stemming, synonyms should be used only where necessary. Users understand why their results are limited to the words in their search query. They are less understanding when their results seem almost random.

Synonyms can be used to conflate words that have pretty much the same meaning, such as jump, leap, and hop, or pamphlet, leaflet, and brochure. Alternatively, they can be used to make a word more generic. For instance, bird could be used as a more general synonym for owl or pigeon, and adult could be used for man or woman.

Synonyms appear to be a simple concept but they are quite tricky to get right. In this chapter, we explain the mechanics of using synonyms and discuss the limitations and gotchas.

TIP

Synonyms are used to broaden the scope of what is considered a matching document. Just as with stemming or partial matching, synonym fields should not be used alone but should be combined with a query on a main field that contains the original text in unadulterated form. See “Most Fields” for an explanation of how to maintain relevance when using synonyms.
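To make that advice concrete, here is a sketch (the field name title and its synonyms subfield are illustrative, not from this chapter) of keeping the original text in the main field, adding a synonym-analyzed subfield, and querying both with most_fields. It assumes the my_synonyms analyzer defined later in this chapter exists in the index settings:

```json
PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "string",
          "fields": {
            "synonyms": {
              "type": "string",
              "analyzer": "my_synonyms"
            }
          }
        }
      }
    }
  }
}

GET /my_index/_search
{
  "query": {
    "multi_match": {
      "query": "English queen",
      "type": "most_fields",
      "fields": [ "title", "title.synonyms" ]
    }
  }
}
```

The main title field preserves the unadulterated text, while title.synonyms broadens recall; most_fields combines both scores, so exact matches still rank highest.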

Using Synonyms

Synonyms can replace existing tokens or be added to the token stream by using the synonym token filter:

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym", 1
          "synonyms": [ 2
            "british,english",
            "queen,monarch"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter" 3
          ]
        }
      }
    }
  }
}

1. First, we define a token filter of type synonym.

2. We discuss synonym formats in “Formatting Synonyms”.

3. Then we create a custom analyzer that uses the my_synonym_filter.

TIP

Synonyms can be specified inline with the synonyms parameter, or in a synonyms file that must be present on every node in the cluster. The path to the synonyms file should be specified with the synonyms_path parameter, and should be either absolute or relative to the Elasticsearch config directory. See “Updating Stopwords” for techniques that can be used to refresh the synonyms list.
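As a sketch of the file-based approach (the path analysis/synonyms.txt is illustrative), the filter definition would reference the file instead of listing synonyms inline:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms_path": "analysis/synonyms.txt"
        }
      }
    }
  }
}
```

Here the path is resolved relative to the Elasticsearch config directory, and the file, with one rule per line, must be present on every node in the cluster.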

Testing our analyzer with the analyze API shows the following:

GET /my_index/_analyze?analyzer=my_synonyms

Elizabeth is the English queen

Pos 1: (elizabeth)

Pos 2: (is)

Pos 3: (the)

Pos 4: (british,english) 1

Pos 5: (queen,monarch) 1

1. All synonyms occupy the same position as the original term.

A document like this will match queries for any of the following: English queen, British queen, English monarch, or British monarch. Even a phrase query will work, because the position of each term has been preserved.

TIP

Using the same synonym token filter at both index time and search time is redundant. If, at index time, we replace English with the two terms english and british, then at search time we need to search for only one of those terms. Alternatively, if we don’t use synonyms at index time, then at search time, we would need to convert a query for English into a query for english OR british.

Whether to do synonym expansion at search or index time can be a difficult choice. We will explore the options more in “Expand or contract”.

Formatting Synonyms

In their simplest form, synonyms are listed as comma-separated values:

"jump,leap,hop"

If any of these terms is encountered, it is replaced by all of the listed synonyms. For instance:

Original terms:   Replaced by:
────────────────────────────────
jump            → (jump,leap,hop)
leap            → (jump,leap,hop)
hop             → (jump,leap,hop)

Alternatively, with the => syntax, it is possible to specify a list of terms to match (on the left side), and a list of one or more replacements (on the right side):

"u s a,united states,united states of america => usa"

"g b,gb,great britain => britain,england,scotland,wales"

Original terms:   Replaced by:
────────────────────────────────
u s a           → (usa)
united states   → (usa)
great britain   → (britain,england,scotland,wales)

If multiple rules for the same synonyms are specified, they are merged together. The order of rules is not respected. Instead, the longest matching rule wins. Take the following rules as an example:

"united states => usa",

"united states of america => usa"

If these rules conflicted, Elasticsearch would turn United States of America into the terms (usa),(of),(america). Instead, the longest sequence wins, and we end up with just the term (usa).

Expand or contract

In “Formatting Synonyms”, we have seen that it is possible to replace synonyms by simple expansion, simple contraction, or generic expansion. We will look at the trade-offs of each of these techniques in this section.

TIP

This section deals with single-word synonyms only. Multiword synonyms add another layer of complexity and are discussed later in “Multiword Synonyms and Phrase Queries”.

Simple Expansion

With simple expansion, any of the listed synonyms is expanded into all of the listed synonyms:

"jump,hop,leap"

Expansion can be applied either at index time or at query time. Each has advantages (⬆)︎ and disadvantages (⬇)︎. When to use which comes down to performance versus flexibility.

Index size
  Index time: ⬇︎ Bigger index, because all synonyms must be indexed.
  Query time: ⬆︎ Normal.

Relevance
  Index time: ⬇︎ All synonyms will have the same IDF (see “What Is Relevance?”), meaning that more commonly used words will have the same weight as less commonly used words.
  Query time: ⬆︎ The IDF for each synonym will be correct.

Performance
  Index time: ⬆︎ A query needs to find only the single term specified in the query string.
  Query time: ⬇︎ A query for a single term is rewritten to look up all of its synonyms, which decreases performance.

Flexibility
  Index time: ⬇︎ The synonym rules can’t be changed for existing documents. For new rules to take effect, existing documents have to be reindexed.
  Query time: ⬆︎ Synonym rules can be updated without reindexing documents.
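For example, to apply expansion only at query time, the mapping can name different analyzers for indexing and for searching. This is a sketch using the 1.x mapping parameters index_analyzer and search_analyzer, and it assumes the my_synonyms analyzer has been defined in the index settings:

```json
PUT /my_index/_mapping/my_type
{
  "properties": {
    "text": {
      "type": "string",
      "index_analyzer": "standard",
      "search_analyzer": "my_synonyms"
    }
  }
}
```

Documents are indexed with the plain standard analyzer, while queries against text are expanded with synonyms, giving correct IDF at the cost of slower queries.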

Simple Contraction

Simple contraction maps a group of synonyms on the left side to a single value on the right side:

"leap,hop => jump"

It must be applied both at index time and at query time, to ensure that query terms are mapped to the same single value that exists in the index.

This approach has some advantages and some disadvantages compared to the simple expansion approach:

Index size
  ⬆︎ The index size is normal, as only a single term is indexed.

Relevance
  ⬇︎ The IDF for all terms is the same, so you can’t distinguish between more commonly used words and less commonly used words.

Performance
  ⬆︎ A query needs to find only the single term that appears in the index.

Flexibility
  ⬆︎ New synonyms can be added to the left side of the rule and applied at query time. For instance, imagine that we wanted to add the word bound to the rule specified previously. The following rule would work for queries that contain bound or for newly added documents that contain bound:

"leap,hop,bound => jump"

But we could expand the effect to also take into account existing documents that contain bound by writing the rule as follows:

"leap,hop,bound => jump,bound"

When you reindex your documents, you could revert to the previous rule to gain the performance benefit of querying only a single term.

Genre Expansion

Genre expansion is quite different from simple contraction or expansion. Instead of treating all synonyms as equal, genre expansion widens the meaning of a term to be more generic. Take these rules, for example:

"cat => cat,pet",
"kitten => kitten,cat,pet",
"dog => dog,pet",
"puppy => puppy,dog,pet"

By applying genre expansion at index time:

§ A query for kitten would find just documents about kittens.

§ A query for cat would find documents about kittens and cats.

§ A query for pet would find documents about kittens, cats, puppies, dogs, or pets.

Alternatively, by applying genre expansion at query time, a query for kitten would be expanded to return documents that mention kittens, cats, or pets specifically.

You could also have the best of both worlds by applying expansion at index time to ensure that the genres are present in the index. Then, at query time, you can choose to not apply synonyms (so that a query for kitten returns only documents about kittens) or to apply synonyms in order to match kittens, cats and pets (including the canine variety).

With the preceding rules, the IDF for kitten will be correct, while the IDF for cat and pet will be artificially deflated. However, this works in your favor—a genre-expanded query for kitten OR cat OR pet will rank documents with kitten highest, followed by documents with cat, and documents with pet would be right at the bottom.
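One way to get that best-of-both-worlds behavior is to index with the genre-expansion analyzer and then override the analyzer per query. This is a sketch, assuming the text field was indexed with an analyzer containing the genre-expansion rules above:

```json
GET /my_index/_search
{
  "query": {
    "match": {
      "text": {
        "query": "kitten",
        "analyzer": "standard"
      }
    }
  }
}
```

Overriding with the standard analyzer skips synonym expansion, so kitten matches only documents about kittens; dropping the analyzer override reapplies the synonyms and widens the match to cats and pets.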

Synonyms and The Analysis Chain

The example we showed in “Formatting Synonyms” used u s a as a synonym. Why did we use that instead of U.S.A.? The reason is that the synonym token filter sees only the terms that the previous token filter or tokenizer has emitted.

Imagine that we have an analyzer that consists of the standard tokenizer, with the lowercase token filter followed by a synonym token filter. The analysis process for the text U.S.A. would look like this:

original string        → "U.S.A."
standard tokenizer     → (U),(S),(A)
lowercase token filter → (u),(s),(a)
synonym token filter   → (usa)

If we had specified the synonym as U.S.A., it would never match anything because, by the time my_synonym_filter sees the terms, the periods have been removed and the letters have been lowercased.

This is an important point to consider. What if we want to combine synonyms with stemming, so that jumps, jumped, jump, leaps, leaped, and leap are all indexed as the single term jump? We could place the synonyms filter before the stemmer and list all inflections:

"jumps,jumped,leap,leaps,leaped => jump"

But the more concise way would be to place the synonyms filter after the stemmer, and to list just the root words that would be emitted by the stemmer:

"leap => jump"
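A sketch of that second approach (the analyzer name and the choice of the built-in porter_stem filter are illustrative) places the synonym filter after the stemmer in the chain:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [ "leap => jump" ]
        }
      },
      "analyzer": {
        "stemmed_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "porter_stem",
            "my_synonym_filter"
          ]
        }
      }
    }
  }
}
```

By the time my_synonym_filter runs, jumps, jumped, leaps, and leaped have already been reduced to jump or leap, so the single rule leap => jump covers every inflection.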

Case-Sensitive Synonyms

Normally, synonym filters are placed after the lowercase token filter and so all synonyms are written in lowercase, but sometimes that can lead to odd conflations. For instance, a CAT scan and a cat are quite different, as are PET (positron emission tomography) and a pet. For that matter, the surname Little is distinct from the adjective little (although if a sentence starts with the adjective, it will be uppercased anyway).

If you need to use case to distinguish between word senses, you will need to place your synonym filter before the lowercase filter. Of course, that means that your synonym rules would need to list all of the case variations that you want to match (for example, Little,LITTLE,little).

Instead of that, you could have two synonym filters: one to catch the case-sensitive synonyms and one for all the case-insensitive synonyms. For instance, the case-sensitive rules could look like this:

"CAT,CAT scan => cat_scan"
"PET,PET scan => pet_scan"
"Johnny Little,J Little => johnny_little"
"Johnny Small,J Small => johnny_small"

And the case-insensitive rules could look like this:

"cat => cat,pet"
"dog => dog,pet"
"cat scan,cat_scan scan => cat_scan"
"pet scan,pet_scan scan => pet_scan"
"little,small"

The case-sensitive rules would match not only CAT scan but sometimes just the CAT in CAT scan, leaving us with cat_scan scan. For this reason, we have the odd-looking rule cat_scan scan in the case-insensitive list to mop up any bad replacements.
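Putting the pieces together, here is a sketch of the two-filter chain (the filter and analyzer names are illustrative), with the case-sensitive filter before lowercase and the case-insensitive filter after it:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "case_sensitive_syns": {
          "type": "synonym",
          "synonyms": [ "CAT,CAT scan => cat_scan" ]
        },
        "case_insensitive_syns": {
          "type": "synonym",
          "synonyms": [
            "cat => cat,pet",
            "cat scan,cat_scan scan => cat_scan"
          ]
        }
      },
      "analyzer": {
        "my_case_aware": {
          "tokenizer": "standard",
          "filter": [
            "case_sensitive_syns",
            "lowercase",
            "case_insensitive_syns"
          ]
        }
      }
    }
  }
}
```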

TIP

You can see how quickly it can get complicated. As always, the analyze API is your friend—use it to check that your analyzers are configured correctly. See “Testing Analyzers”.

Multiword Synonyms and Phrase Queries

So far, synonyms appear to be quite straightforward. Unfortunately, this is where things start to go wrong. For phrase queries to function correctly, Elasticsearch needs to know the position that each term occupies in the original text. Multiword synonyms can play havoc with term positions, especially when the injected synonyms are of differing lengths.

To demonstrate, we’ll create a synonym token filter that uses this rule:

"usa,united states,u s a,united states of america"

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "usa,united states,u s a,united states of america"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter"
          ]
        }
      }
    }
  }
}

GET /my_index/_analyze?analyzer=my_synonyms

The United States is wealthy

The tokens emitted by the analyze request look like this:

Pos 1: (the)

Pos 2: (usa,united,u,united)

Pos 3: (states,s,states)

Pos 4: (is,a,of)

Pos 5: (wealthy,america)

If we were to index a document analyzed with synonyms as above, and then run a phrase query without synonyms, we’d have some surprising results. These phrases would not match:

§ The usa is wealthy

§ The united states of america is wealthy

§ The U.S.A. is wealthy

However, these phrases would:

§ United states is wealthy

§ Usa states of wealthy

§ The U.S. of wealthy

§ U.S. is america

If we were to use synonyms at query time instead, we would see even more-bizarre matches. Look at the output of this validate-query request:

GET /my_index/_validate/query?explain
{
  "query": {
    "match_phrase": {
      "text": {
        "query": "usa is wealthy",
        "analyzer": "my_synonyms"
      }
    }
  }
}

The explanation is as follows:

"(usa united u united) (is states s states) (wealthy a of) america"

This would match documents containing u is of america but wouldn’t match any document that didn’t contain the term america.

TIP

Multiword synonyms affect highlighting in a similar way. A query for USA could end up returning a highlighted snippet such as: “The United States is wealthy”.

Use Simple Contraction for Phrase Queries

The way to avoid this mess is to use simple contraction to inject a single term that represents all synonyms, and to use the same synonym token filter at query time:

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "united states,u s a,united states of america=>usa"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter"
          ]
        }
      }
    }
  }
}

GET /my_index/_analyze?analyzer=my_synonyms

The United States is wealthy

The result of the preceding analyze request looks much more sane:

Pos 1: (the)

Pos 2: (usa)

Pos 3: (is)

Pos 5: (wealthy)

And repeating the validate-query request that we made previously yields a simple, sane explanation:

"usa is wealthy"

The downside of this approach is that, by reducing united states of america down to the single term usa, you can’t use the same field to find just the word united or states. You would need to use a separate field with a different analysis chain for that purpose.
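A sketch of that separate-field approach (the subfield name contracted is illustrative) uses a multi-field: the main field keeps the default analysis chain while the subfield applies the contraction, assuming the my_synonyms analyzer above is defined in the index settings:

```json
PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "string",
          "fields": {
            "contracted": {
              "type": "string",
              "analyzer": "my_synonyms"
            }
          }
        }
      }
    }
  }
}
```

Phrase queries for country names would target text.contracted, while queries for united or states on their own would target the plain text field.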

Synonyms and the query_string Query

We have tried to avoid discussing the query_string query because we don’t recommend using it. In More-Complicated Queries, we said that, because the query_string query supports a terse mini search-syntax, it could frequently lead to surprising results or even syntax errors.

One of the gotchas of this query involves multiword synonyms. To support its search-syntax, it has to parse the query string to recognize special operators like AND, OR, +, -, field:, and so forth. (See the full query_string syntax here.)

As part of this parsing process, it breaks up the query string on whitespace, and passes each word that it finds to the relevant analyzer separately. This means that your synonym analyzer will never receive a multiword synonym. Instead of seeing United States as a single string, the analyzer will receive United and States separately.

Fortunately, the trustworthy match query supports no such syntax, and multiword synonyms will be passed to the analyzer in their entirety.
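For instance, this match query hands the whole string to the field’s analyzer in one piece, so a multiword rule like united states of america=>usa can fire, where the query_string query would have split the words apart first:

```json
GET /my_index/_search
{
  "query": {
    "match": {
      "text": "The United States is wealthy"
    }
  }
}
```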

Symbol Synonyms

The final part of this chapter is devoted to symbol synonyms, which are unlike the synonyms we have discussed until now. Symbol synonyms are string aliases used to represent symbols that would otherwise be removed during tokenization.

While most punctuation is seldom important for full-text search, character combinations like emoticons may be very significant, even changing the meaning of the text. Compare these:

§ I am thrilled to be at work on Sunday.

§ I am thrilled to be at work on Sunday :(

The standard tokenizer would simply strip out the emoticon in the second sentence, conflating two sentences that have quite different intent.

We can use the mapping character filter to replace emoticons with symbol synonyms like emoticon_happy and emoticon_sad before the text is passed to the tokenizer:

PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [ 1
            ":)=>emoticon_happy",
            ":(=>emoticon_sad"
          ]
        }
      },
      "analyzer": {
        "my_emoticons": {
          "char_filter": [ "emoticons" ],
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}

GET /my_index/_analyze?analyzer=my_emoticons

I am :) not :( 2

1. The mappings filter replaces the characters to the left of => with those to the right.

2. Emits tokens i, am, emoticon_happy, not, emoticon_sad.

It is unlikely that anybody would ever search for emoticon_happy, but ensuring that important symbols like emoticons are included in the index can be helpful when doing sentiment analysis. Of course, we could equally have used real words, like happy and sad.

TIP

The mapping character filter is useful for simple replacements of exact character sequences. For more-flexible pattern matching, you can use regular expressions with the pattern_replace character filter.
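As a sketch of that alternative (the char filter name and the regular expression are illustrative), a pattern_replace character filter could normalize several happy-face variants at once:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "happy_emoticons": {
          "type": "pattern_replace",
          "pattern": ":-?\\)",
          "replacement": " emoticon_happy "
        }
      },
      "analyzer": {
        "my_emoticons": {
          "char_filter": [ "happy_emoticons" ],
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}
```

The pattern :-?\) matches both :) and :-), and the surrounding spaces in the replacement keep the symbol synonym separated from neighboring words.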