Elasticsearch: The Definitive Guide (2015)

Part IV. Aggregations

Chapter 33. Significant Terms

The significant_terms (SigTerms) aggregation is rather different from the rest of the aggregations. All the aggregations we have seen so far are essentially simple math operations. By combining the various building blocks, you can build sophisticated aggregations and reports about your data.

significant_terms has a different agenda. To some, it may even look a bit like machine learning. The significant_terms aggregation finds uncommonly common terms in your dataset.

What do we mean by uncommonly common? These are terms that are statistically unusual — data that appears more frequently than the background rate would suggest. These statistical anomalies are usually indicative of something interesting in your data.

For example, imagine you are in charge of detecting and tracking down credit card fraud. Customers call and complain about unusual transactions appearing on their credit card — their account has been compromised. These transactions are just symptoms of a larger problem. Somewhere in the recent past, a merchant has either knowingly stolen the customers’ credit card information, or has unknowingly been compromised themselves.

Your job is to find the common point of compromise. If you have 100 customers complaining of unusual transactions, those customers likely share a single merchant—and it is this merchant that is likely the source of blame.

Of course, it is a little more nuanced than just finding a merchant that all customers share. For example, many of the customers will have large merchants like Amazon in their recent transaction history. We can rule out Amazon, however, since many uncompromised credit cards also have Amazon as a recent merchant.

This is an example of a commonly common merchant. Everyone, whether compromised or not, shares the merchant. This makes it of little interest to us.

On the opposite end of the spectrum, you have tiny merchants such as the corner drug store. These are commonly uncommon: only one or two customers have transactions from the merchant. We can rule these out as well. Since most of the compromised cards never interacted with the merchant, we can be sure it was not to blame for the security breach.

What we want are uncommonly common merchants. These are merchants that every compromised card shares, but that are not well represented in the background noise of uncompromised cards. These merchants are statistical anomalies; they appear more frequently than they should. It is highly likely that these uncommonly common merchants are to blame.

The significant_terms aggregation does just this. It analyzes your data and finds terms that appear with a frequency that is statistically anomalous compared to the background data.
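To make the three categories concrete, here is a minimal Python sketch over made-up data. The merchant names, the card IDs, and the `min_share` and `min_lift` thresholds are all hypothetical, and the plain ratio used here only illustrates the idea; it is not the scoring that significant_terms itself applies.

```python
def classify_merchants(compromised, all_cards, min_share=0.8, min_lift=2.0):
    """Label each merchant by comparing its share among compromised cards
    (the foreground) with its share among all cards (the background).

    Both arguments map a card ID to the set of merchants that card used.
    The compromised cards are assumed to be a subset of all_cards.
    """
    labels = {}
    merchants = set().union(*all_cards.values())
    for merchant in sorted(merchants):
        fg = sum(merchant in used for used in compromised.values()) / len(compromised)
        bg = sum(merchant in used for used in all_cards.values()) / len(all_cards)
        if fg < min_share:
            # Few compromised cards touched it, so it cannot explain the breach
            labels[merchant] = "commonly uncommon"
        elif fg / bg < min_lift:
            # Nearly everyone uses it, compromised or not
            labels[merchant] = "commonly common"
        else:
            # Shared by the compromised cards, rare in the background
            labels[merchant] = "uncommonly common"
    return labels


# Hypothetical data: ten cards, three of which (c1 to c3) were compromised.
all_cards = {
    "c1": {"amazon", "gas_station_x", "corner_drug_store"},
    "c2": {"amazon", "gas_station_x"},
    "c3": {"amazon", "gas_station_x"},
    "c4": {"amazon", "gas_station_x"},
    "c5": {"amazon"},
    "c6": {"amazon"},
    "c7": {"amazon"},
    "c8": {"amazon"},
    "c9": {"amazon"},
    "c10": set(),
}
compromised = {card: all_cards[card] for card in ("c1", "c2", "c3")}

labels = classify_merchants(compromised, all_cards)
for merchant, label in labels.items():
    print(merchant, "->", label)
```

In this toy dataset every compromised card used both Amazon and the gas station, but only the gas station is rare among the remaining cards, so it is the one flagged as uncommonly common.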

What you do with this statistical anomaly depends on the data. With the credit card data, you might be looking for fraud. With ecommerce, you might be looking for an unidentified demographic so you can market to them more efficiently. If you are analyzing logs, you might find one server that throws a certain type of error more often than it should. The applications of significant_terms are nearly endless.

significant_terms Demo

Because the significant_terms aggregation works by analyzing statistics, you need to have a certain threshold of data for it to become effective. That means we won’t be able to index a small amount of example data for the demo.

Instead, we have a pre-prepared dataset of around 80,000 documents. This is saved as a snapshot (for more information about snapshots and restore, see “Backing Up Your Cluster”) in our public demo repository. You can “restore” this dataset into your cluster by using these commands:

PUT /_snapshot/sigterms
{
    "type": "url",
    "settings": {
        "url": "http://download.elasticsearch.org/definitiveguide/sigterms_demo/"
    }
}

GET /_snapshot/sigterms/_all

POST /_snapshot/sigterms/snapshot/_restore

GET /mlmovies,mlratings/_recovery

(Optional) Inspect the repository to learn details about available snapshots

Begin the Restore process. This will download two indices into your cluster: mlmovies and mlratings

(Optional) Monitor the Restore process using the Recovery API

NOTE

The dataset is around 50 MB and may take some time to download.
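If you would rather script the restore than paste the commands into a console, the same four requests can be built with Python's standard library. This is only a sketch: the `http://localhost:9200` base URL is an assumption about where your node listens, and `build_restore_requests` and `run` are hypothetical helper names, not part of any Elasticsearch client.

```python
import json
import urllib.request

DEMO_REPO_URL = "http://download.elasticsearch.org/definitiveguide/sigterms_demo/"


def build_restore_requests(base_url="http://localhost:9200"):
    """Build the four requests from the text, in order, without sending them."""
    repo_body = {"type": "url", "settings": {"url": DEMO_REPO_URL}}
    steps = [
        ("PUT", "/_snapshot/sigterms", repo_body),                 # register repository
        ("GET", "/_snapshot/sigterms/_all", None),                 # (optional) inspect
        ("POST", "/_snapshot/sigterms/snapshot/_restore", None),   # begin restore
        ("GET", "/mlmovies,mlratings/_recovery", None),            # (optional) monitor
    ]
    built = []
    for method, path, body in steps:
        data = json.dumps(body).encode("utf-8") if body is not None else None
        headers = {"Content-Type": "application/json"} if body is not None else {}
        built.append(
            urllib.request.Request(base_url + path, data=data, headers=headers, method=method)
        )
    return built


def run(built):
    """Send each request in turn and print the raw response body."""
    for req in built:
        with urllib.request.urlopen(req) as resp:
            print(req.get_method(), req.full_url, resp.status)
            print(resp.read().decode("utf-8"))


restore_requests = build_restore_requests()
# run(restore_requests)  # uncomment once a node is reachable at the assumed base URL
```

Building the requests separately from sending them makes it easy to check the method and path of each step before touching the cluster.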

In this demo, we are going to look at movie ratings by users of MovieLens. At MovieLens, users make movie recommendations so other users can find new movies to watch. For this demo, we are going to recommend movies by using significant_terms based on an input movie.

Let’s take a look at some sample data, to get a feel for what we are working with. There are two indices in this dataset, mlmovies and mlratings. Let’s look at mlmovies first:

GET mlmovies/_search

{
    "took": 4,
    "timed_out": false,
    "_shards": {...},
    "hits": {
        "total": 10681,
        "max_score": 1,
        "hits": [
            {
                "_index": "mlmovies",
                "_type": "mlmovie",
                "_id": "2",
                "_score": 1,
                "_source": {
                    "offset": 2,
                    "bytes": 34,
                    "title": "Jumanji (1995)"
                }
            },
            ....

Execute a search without a query, so that we can see a random sampling of docs.

Each document in mlmovies represents a single movie. The two important pieces of data are the _id of the movie and the title of the movie. You can ignore offset and bytes; they are artifacts of the process used to extract this data from the original CSV files. There are 10,681 movies in this dataset.

Now let’s look at mlratings:

GET mlratings/_search

{
    "took": 3,
    "timed_out": false,
    "_shards": {...},
    "hits": {
        "total": 69796,
        "max_score": 1,
        "hits": [
            {
                "_index": "mlratings",
                "_type": "mlrating",
                "_id": "00IC-2jDQFiQkpD6vhbFYA",
                "_score": 1,
                "_source": {
                    "offset": 1,
                    "bytes": 108,
                    "movie": [122,185,231,292,
                        316,329,355,356,362,364,370,377,420,
                        466,480,520,539,586,588,589,594,616
                    ],
                    "user": 1
                }
            },
            ...

Here we can see the recommendations of individual users. Each document represents a single user, identified by the user field. The movie field holds a list of movies that this user watched and recommended.

Recommending Based on Popularity

The first strategy we could take is trying to recommend movies based on popularity. Given a particular movie, we find all users who recommended that movie. Then we aggregate all their recommendations and take the top five most popular.

We can express that easily with a terms aggregation and some filtering. Let’s look at Talladega Nights, a comedy about NASCAR racing starring Will Ferrell. Ideally, our recommender should find other comedies in a similar style (and more than likely also starring Will Ferrell).

First we need to find the Talladega Nights ID:

GET mlmovies/_search
{
    "query": {
        "match": {
            "title": "Talladega Nights"
        }
    }
}

    ...
    "hits": [
        {
            "_index": "mlmovies",
            "_type": "mlmovie",
            "_id": "46970",
            "_score": 3.658795,
            "_source": {
                "offset": 9575,
                "bytes": 74,
                "title": "Talladega Nights: The Ballad of Ricky Bobby (2006)"
            }
        },
    ...

Talladega Nights is ID 46970.

Armed with the ID, we can now filter the ratings and apply our terms aggregation to find the most popular movies from people who also like Talladega Nights:

GET mlratings/_search?search_type=count
{
    "query": {
        "filtered": {
            "filter": {
                "term": {
                    "movie": 46970
                }
            }
        }
    },
    "aggs": {
        "most_popular": {
            "terms": {
                "field": "movie",
                "size": 6
            }
        }
    }
}

We execute our query on mlratings this time, and specify search_type=count since we are interested only in the aggregation results.

Apply a filter on the ID corresponding to Talladega Nights.

Finally, find the most popular movies by using a terms bucket.

We perform the search on the mlratings index, and apply a filter for the ID of Talladega Nights. Since aggregations operate on query scope, this will effectively filter the aggregation results to only the users who recommended Talladega Nights. Finally, we execute a terms aggregation to bucket the most popular movies. We are requesting the top six results, since it is likely that Talladega Nights itself will be returned as a hit (and we don’t want to recommend the same movie).

The results come back like so:

{
...
    "aggregations": {
        "most_popular": {
            "buckets": [
                {
                    "key": 46970,
                    "key_as_string": "46970",
                    "doc_count": 271
                },
                {
                    "key": 2571,
                    "key_as_string": "2571",
                    "doc_count": 197
                },
                {
                    "key": 318,
                    "key_as_string": "318",
                    "doc_count": 196
                },
                {
                    "key": 296,
                    "key_as_string": "296",
                    "doc_count": 183
                },
                {
                    "key": 2959,
                    "key_as_string": "2959",
                    "doc_count": 183
                },
                {
                    "key": 260,
                    "key_as_string": "260",
                    "doc_count": 90
                }
            ]
        }
    }
...
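Conceptually, the filtered terms aggregation above is a filter followed by a per-value document count. The Python sketch below mimics that on a handful of made-up rating documents; the `top_terms` helper, the tiny dataset, and the tie-breaking rule are illustrative assumptions and say nothing about how Elasticsearch executes the aggregation internally.

```python
from collections import Counter


def top_terms(docs, field, must_contain=None, size=5):
    """Count how many documents contain each value of `field`, optionally
    restricted to documents whose `field` contains `must_contain`.

    Returns (value, doc_count) pairs, most frequent first.
    """
    counts = Counter()
    for doc in docs:
        values = set(doc[field])  # a document counts once per distinct value
        if must_contain is not None and must_contain not in values:
            continue  # the filter narrows the scope the aggregation sees
        counts.update(values)
    # Descending count; ties broken by ascending value (an arbitrary choice)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:size]


# Hypothetical documents shaped like those in mlratings
ratings = [
    {"user": 1, "movie": [10, 20, 30]},
    {"user": 2, "movie": [10, 20]},
    {"user": 3, "movie": [20, 30]},
    {"user": 4, "movie": [20, 40]},
]

filtered = top_terms(ratings, "movie", must_contain=10, size=3)
overall = top_terms(ratings, "movie", size=3)
print("users who recommended movie 10:", filtered)
print("all users:", overall)
```

As in the real response, the filtered movie itself appears in the filtered list, which is why the request asks for one more bucket than it needs.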

We need to correlate these back to their original titles, which can be done with a simple filtered query:

GET mlmovies/_search
{
    "query": {
        "filtered": {
            "filter": {
                "ids": {
                    "values": [2571,318,296,2959,260]
                }
            }
        }
    }
}

And finally, we end up with the following list:

1. Matrix, The

2. Shawshank Redemption

3. Pulp Fiction

4. Fight Club

5. Star Wars Episode IV: A New Hope

OK, that is certainly a good list! I like all of those movies. But that’s the problem: almost everyone likes the movies on that list. They are universally well-liked, which means they are popular on everyone’s recommendations. The list is basically a recommendation of popular movies, not recommendations related to Talladega Nights.

This is easily verified by running the aggregation again, but without the filter on Talladega Nights. This will give a top-five most popular movie list:

GET mlratings/_search?search_type=count
{
    "aggs": {
        "most_popular": {
            "terms": {
                "field": "movie",
                "size": 5
            }
        }
    }
}

This returns a list that is very similar:

1. Shawshank Redemption

2. Silence of the Lambs, The

3. Pulp Fiction

4. Forrest Gump

5. Star Wars Episode IV: A New Hope

Clearly, just checking the most popular movies is not sufficient to build a good, discriminating recommender.
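The overlap between the two lists can be counted directly. This small Python sketch uses only the titles shown above; the variable names are illustrative.

```python
# Top five for users who recommended Talladega Nights (from the filtered aggregation)
related_to_talladega = [
    "Matrix, The",
    "Shawshank Redemption",
    "Pulp Fiction",
    "Fight Club",
    "Star Wars Episode IV: A New Hope",
]

# Top five across all users (from the unfiltered aggregation)
overall_top_five = [
    "Shawshank Redemption",
    "Silence of the Lambs, The",
    "Pulp Fiction",
    "Forrest Gump",
    "Star Wars Episode IV: A New Hope",
]

# Titles that appear in both lists, in the order of the first list
shared = [title for title in related_to_talladega if title in overall_top_five]
overlap = len(shared) / len(related_to_talladega)

print(shared)
print(f"{overlap:.0%} of the filtered list is also in the overall top five")
```

More than half of the supposedly tailored recommendations are simply the globally popular titles, and none of the remaining ones is a comedy in the style of Talladega Nights.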

Recommending Based on Statistics

Now that the scene is set, let’s try using significant_terms. significant_terms will analyze the group of people who enjoy Talladega Nights (the foreground group) and determine what movies are most popular. It will then construct a list of popular films for everyone (the background group) and compare the two.

The statistical anomalies will be the movies that are over-represented in the foreground compared to the background. Theoretically, this should be a list of comedies, since people who enjoy Will Ferrell comedies will recommend them at a higher rate than the background population of people.

Let’s give it a shot:

GET mlratings/_search?search_type=count
{
    "query": {
        "filtered": {
            "filter": {
                "term": {
                    "movie": 46970
                }
            }
        }
    },
    "aggs": {
        "most_sig": {
            "significant_terms": {
                "field": "movie",
                "size": 6
            }
        }
    }
}

The setup is nearly identical — we just use significant_terms instead of terms.

As you can see, the query is nearly the same. We filter for users who liked Talladega Nights; this forms the foreground group. By default, significant_terms will use the entire index as the background, so we don’t need to do anything special.

The results come back as a list of buckets similar to terms, but with some extra metadata:

...
    "aggregations": {
        "most_sig": {
            "doc_count": 271,
            "buckets": [
                {
                    "key": 46970,
                    "key_as_string": "46970",
                    "doc_count": 271,
                    "score": 256.549815498155,
                    "bg_count": 271
                },
                {
                    "key": 52245,
                    "key_as_string": "52245",
                    "doc_count": 59,
                    "score": 17.66462367106966,
                    "bg_count": 185
                },
                {
                    "key": 8641,
                    "key_as_string": "8641",
                    "doc_count": 107,
                    "score": 13.884387742677438,
                    "bg_count": 762
                },
                {
                    "key": 58156,
                    "key_as_string": "58156",
                    "doc_count": 17,
                    "score": 9.746428133759462,
                    "bg_count": 28
                },
                {
                    "key": 52973,
                    "key_as_string": "52973",
                    "doc_count": 95,
                    "score": 9.65770100311672,
                    "bg_count": 857
                },
                {
                    "key": 35836,
                    "key_as_string": "35836",
                    "doc_count": 128,
                    "score": 9.199001116457955,
                    "bg_count": 1610
                }
            ]
...

The top-level doc_count shows the number of docs in the foreground group.

Each bucket lists the key (for example, movie ID) being aggregated.

A doc_count for that bucket.

And a background count (bg_count), which shows how many documents in the entire background contain this value.

You can see that the first bucket we get back is Talladega Nights. It is found in all 271 documents, which is not surprising. Let’s look at the next bucket: key 52245.

This ID corresponds to Blades of Glory, a comedy about male figure skating that also stars Will Ferrell. We can see that it was recommended 59 times by the people who also liked Talladega Nights. This means that about 22% of the foreground group recommended Blades of Glory (59 / 271 = 0.2177).

In contrast, Blades of Glory was recommended only 185 times in the entire dataset, which equates to a mere 0.27% (185 / 69796 = 0.00265). Blades of Glory is therefore a statistical anomaly: it is uncommonly common in the group of people who like Talladega Nights. We just found a good recommendation!
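The same comparison can be run for every bucket in the response. The text does not say how the score field is computed, but the reported values are reproduced by multiplying the absolute change in rate by the relative change in rate, a heuristic the Elasticsearch reference documentation calls the JLH score. Treat `jlh_score` below as an assumption that happens to match these numbers, not as a description of Elasticsearch internals.

```python
FG_TOTAL = 271     # top-level doc_count: users who recommended Talladega Nights
BG_TOTAL = 69796   # hits.total for mlratings: all users


def jlh_score(doc_count, bg_count, fg_total=FG_TOTAL, bg_total=BG_TOTAL):
    """Absolute change in rate multiplied by relative change in rate."""
    fg_rate = doc_count / fg_total
    bg_rate = bg_count / bg_total
    return (fg_rate - bg_rate) * (fg_rate / bg_rate)


# (key, doc_count, bg_count, score) copied from the response above
buckets = [
    (46970, 271, 271, 256.549815498155),
    (52245, 59, 185, 17.66462367106966),
    (8641, 107, 762, 13.884387742677438),
    (58156, 17, 28, 9.746428133759462),
    (52973, 95, 857, 9.65770100311672),
    (35836, 128, 1610, 9.199001116457955),
]

scores = {}
for key, doc_count, bg_count, reported in buckets:
    scores[key] = jlh_score(doc_count, bg_count)
    print(
        f"{key}: foreground {doc_count / FG_TOTAL:.2%}, "
        f"background {bg_count / BG_TOTAL:.2%}, "
        f"computed {scores[key]:.4f}, reported {reported:.4f}"
    )
```

A term scores highly only when it is both much more frequent in the foreground than in the background and frequent enough in the foreground to matter, which is why popular movies such as The Matrix do not appear in this list.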

If we look at the entire list, they are all comedies that would fit as good recommendations (many of which also star Will Ferrell):

1. Blades of Glory

2. Anchorman: The Legend of Ron Burgundy

3. Semi-Pro

4. Knocked Up

5. 40-Year-Old Virgin, The

This is just one example of the power of significant_terms. Once you start using significant_terms, you find many situations where you don’t want the most popular—you want the most uncommonly common. This simple aggregation can uncover some surprisingly sophisticated trends in your data.