Part III. Dealing with Human Language

Chapter 24. Typoes and Mispelings

We expect a query on structured data like dates and prices to return only documents that match exactly. However, good full-text search shouldn’t have the same restriction. Instead, we can widen the net to include words that may match, but use the relevance score to push the better matches to the top of the result set.

In fact, full-text search that only matches exactly will probably frustrate your users. Wouldn’t you expect a search for “quick brown fox” to match a document containing “fast brown foxes,” “Johnny Walker” to match “Johnnie Walker,” or “Arnold Shcwarzenneger” to match “Arnold Schwarzenegger”?

If documents exist that do contain exactly what the user has queried, they should appear at the top of the result set, but weaker matches can be included further down the list. If no documents match exactly, at least we can show the user potential matches; they may even be what the user originally intended!

We have already looked at diacritic-free matching in Chapter 20, word stemming in Chapter 21, and synonyms in Chapter 23, but all of those approaches presuppose that words are spelled correctly, or that there is only one way to spell each word.

Fuzzy matching allows for query-time matching of misspelled words, while phonetic token filters at index time can be used for sounds-like matching.

Fuzziness

Fuzzy matching treats two words that are “fuzzily” similar as if they were the same word. First, we need to define what we mean by fuzziness.

In 1965, Vladimir Levenshtein developed the Levenshtein distance, which measures the number of single-character edits required to transform one word into the other. He proposed three types of one-character edits:

§ Substitution of one character for another: _f_ox → _b_ox

§ Insertion of a new character: sic → sic_k_

§ Deletion of a character: b_l_ack → back

Frederick Damerau later expanded these operations to include one more:

§ Transposition of two adjacent characters: _st_ar → _ts_ar

For example, to convert the word bieber into beaver requires the following steps:

1. Substitute v for b: bie_b_er → bie_v_er

2. Substitute a for i: b_i_ever → b_a_ever

3. Transpose a and e: b_ae_ver → b_ea_ver

These three steps represent a Damerau-Levenshtein edit distance of 3.

Clearly, bieber is a long way from beaver—they are too far apart to be considered a simple misspelling. Damerau observed that 80% of human misspellings have an edit distance of 1. In other words, 80% of misspellings could be corrected with a single edit to the original string.

Elasticsearch supports a maximum edit distance, specified with the fuzziness parameter, of 2.

Of course, the impact that a single edit has on a string depends on the length of the string. Two edits to the word hat can produce mad, so allowing two edits on a string of length 3 is overkill. The fuzziness parameter can be set to AUTO, which results in the following maximum edit distances:

§ 0 for strings of one or two characters

§ 1 for strings of three, four, or five characters

§ 2 for strings of more than five characters

Of course, you may find that an edit distance of 2 is still overkill, and returns results that don’t appear to be related. You may get better results, and better performance, with a maximum fuzziness of 1.

Fuzzy Query

The fuzzy query is the fuzzy equivalent of the term query. You will seldom use it directly yourself, but understanding how it works will help you to use fuzziness in the higher-level match query.

To understand how it works, we will first index some documents:

POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "text": "Surprise me!"}
{ "index": { "_id": 2 }}
{ "text": "That was surprising."}
{ "index": { "_id": 3 }}
{ "text": "I wasn't surprised."}

Now we can run a fuzzy query for the term surprize:

GET /my_index/my_type/_search
{
  "query": {
    "fuzzy": {
      "text": "surprize"
    }
  }
}

The fuzzy query is a term-level query, so it doesn’t do any analysis. It takes a single term and finds all terms in the term dictionary that are within the specified fuzziness. The default fuzziness is AUTO.

In our example, surprize is within an edit distance of 2 from both surprise and surprised, so documents 1 and 3 match. We could reduce the matches to just surprise with the following query:

GET /my_index/my_type/_search
{
  "query": {
    "fuzzy": {
      "text": {
        "value": "surprize",
        "fuzziness": 1
      }
    }
  }
}

Improving Performance

The fuzzy query works by taking the original term and building a Levenshtein automaton—like a big graph representing all the strings that are within the specified edit distance of the original string.

The fuzzy query then uses the automaton to step efficiently through all of the terms in the term dictionary to see if they match. Once it has collected all of the matching terms that exist in the term dictionary, it can compute the list of matching documents.

Of course, depending on the type of data stored in the index, a fuzzy query with an edit distance of 2 can match a very large number of terms and perform very badly. Two parameters can be used to limit the performance impact:

prefix_length

The number of initial characters that will not be “fuzzified.” Most spelling errors occur toward the end of the word, not toward the beginning. By using a prefix_length of 3, for example, you can significantly reduce the number of matching terms.

max_expansions

If a fuzzy query expands to three or four fuzzy options, the new options may be meaningful. If it produces 1,000 options, they are essentially meaningless. Use max_expansions to limit the total number of options that will be produced. The fuzzy query will collect matching terms until it runs out of terms or reaches the max_expansions limit.
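
For example, both parameters can be combined in a single fuzzy query. The following is only a sketch against the index used earlier in this chapter; the values chosen for prefix_length and max_expansions are illustrative, not recommendations:

GET /my_index/my_type/_search
{
  "query": {
    "fuzzy": {
      "text": {
        "value": "surprize",
        "fuzziness": 2,
        "prefix_length": 3,
        "max_expansions": 50
      }
    }
  }
}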

Fuzzy match Query

The match query supports fuzzy matching out of the box:

GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "text": {
        "query": "SURPRIZE ME!",
        "fuzziness": "AUTO",
        "operator": "and"
      }
    }
  }
}

The query string is first analyzed, to produce the terms [surprize, me], and then each term is fuzzified using the specified fuzziness.

Similarly, the multi_match query also supports fuzziness, but only when executing with type best_fields or most_fields:

GET /my_index/my_type/_search
{
  "query": {
    "multi_match": {
      "fields": [ "text", "title" ],
      "query": "SURPRIZE ME!",
      "fuzziness": "AUTO"
    }
  }
}

Both the match and multi_match queries also support the prefix_length and max_expansions parameters.
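
For instance, the fuzzy match query from earlier could be constrained in the same way. Again, this is just a sketch with illustrative values:

GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "text": {
        "query": "SURPRIZE ME!",
        "fuzziness": "AUTO",
        "operator": "and",
        "prefix_length": 2,
        "max_expansions": 50
      }
    }
  }
}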

TIP

Fuzziness works only with the basic match and multi_match queries. It doesn’t work with phrase matching, common terms, or cross_fields matches.

Scoring Fuzziness

Users love fuzzy queries. They assume that these queries will somehow magically find the right combination of proper spellings. Unfortunately, the truth is somewhat more prosaic.

Imagine that we have 1,000 documents containing “Schwarzenegger,” and just one document with the misspelling “Schwarzeneger.” According to the theory of term frequency/inverse document frequency, the misspelling is much more relevant than the correct spelling, because it appears in far fewer documents!

In other words, if we were to treat fuzzy matches like any other match, we would favor misspellings over correct spellings, which would make for grumpy users.

TIP

Fuzzy matching should not be used for scoring purposes—only to widen the net of matching terms in case there are misspellings.

By default, the match query gives all fuzzy matches the constant score of 1. This is sufficient to add potential matches onto the end of the result list, without interfering with the relevance scoring of nonfuzzy queries.
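
If you want more explicit control, one possible pattern (a sketch, not something prescribed by this chapter) is to let a plain, non-fuzzy clause drive the relevance score, while a fuzzy clause merely widens the net:

GET /my_index/my_type/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "text": "surprize me" }},
        {
          "match": {
            "text": {
              "query": "surprize me",
              "fuzziness": "AUTO"
            }
          }
        }
      ]
    }
  }
}

Documents that contain the terms as written satisfy both clauses and score higher; documents that match only fuzzily are still returned, but drift toward the end of the list.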

TIP

Fuzzy queries alone are much less useful than they initially appear. They are better used as part of a “bigger” feature, such as the search-as-you-type completion suggester or the did-you-mean phrase suggester.

Phonetic Matching

In a last, desperate, attempt to match something, anything, we could resort to searching for words that sound similar, even if their spelling differs.

Several algorithms exist for converting words into a phonetic representation. The Soundex algorithm is the granddaddy of them all, and most other phonetic algorithms are improvements or specializations of Soundex, such as:

§ Metaphone and Double Metaphone (which expands phonetic matching to languages other than English)

§ Caverphone for matching names in New Zealand

§ The Beider-Morse algorithm, which adapts the Soundex algorithm for better matching of German and Yiddish names

§ The Kölner Phonetik for better handling of German words

The thing to take away from this list is that phonetic algorithms are fairly crude, and very specific to the languages they were designed for, usually either English or German. This limits their usefulness. Still, for certain purposes, and in combination with other techniques, phonetic matching can be a useful tool.

First, you will need to install the Phonetic Analysis plug-in from http://bit.ly/1CreKJQ on every node in the cluster, and restart each node.

Then, you can create a custom analyzer that uses one of the phonetic token filters and try it out:

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "dbl_metaphone": { (1)
          "type": "phonetic",
          "encoder": "double_metaphone"
        }
      },
      "analyzer": {
        "dbl_metaphone": {
          "tokenizer": "standard",
          "filter": "dbl_metaphone" (2)
        }
      }
    }
  }
}

(1) First, configure a custom phonetic token filter that uses the double_metaphone encoder.

(2) Then use the custom token filter in a custom analyzer.

Now we can test it with the analyze API:

GET /my_index/_analyze?analyzer=dbl_metaphone
Smith Smythe

Each of Smith and Smythe produces two tokens in the same position: SM0 and XMT. Running John, Jon, and Johnnie through the analyzer produces the same two tokens for each, JN and AN, while Jonathon results in the tokens JN0N and ANTN.
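
If you want to verify those outputs yourself, the same analyze request works with the other names:

GET /my_index/_analyze?analyzer=dbl_metaphone
John Jon Johnnie Jonathon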

The phonetic analyzer can be used just like any other analyzer. First map a field to use it, and then index some data:

PUT /my_index/_mapping/my_type
{
  "properties": {
    "name": {
      "type": "string",
      "fields": {
        "phonetic": { (1)
          "type": "string",
          "analyzer": "dbl_metaphone"
        }
      }
    }
  }
}

PUT /my_index/my_type/1
{
  "name": "John Smith"
}

PUT /my_index/my_type/2
{
  "name": "Jonnie Smythe"
}

(1) The name.phonetic field uses the custom dbl_metaphone analyzer.

The match query can be used for searching:

GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "name.phonetic": {
        "query": "Jahnnie Smeeth",
        "operator": "and"
      }
    }
  }
}

This query returns both documents, demonstrating just how coarse phonetic matching is. Scoring with a phonetic algorithm is pretty much worthless. The purpose of phonetic matching is not to increase precision, but to increase recall—to spread the net wide enough to catch any documents that might possibly match.

It usually makes more sense to use phonetic algorithms when retrieving results which will be consumed and post-processed by another computer, rather than by human users.
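
For instance, one way to combine phonetic matching with other techniques (a sketch, not taken from this chapter) is to require a phonetic match for recall, while letting a match on the original name field influence the score so that correctly spelled names rise to the top:

GET /my_index/my_type/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "name.phonetic": {
            "query": "Jahnnie Smeeth",
            "operator": "and"
          }
        }
      },
      "should": {
        "match": {
          "name": "Jahnnie Smeeth"
        }
      }
    }
  }
}

With this particular misspelling the should clause adds nothing, but for queries that are spelled correctly it pushes the exact matches above the merely phonetic ones.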