Elasticsearch: The Definitive Guide (2015)
Part IV. Aggregations
Chapter 26. Aggregation Test-Drive
We could spend the next few pages defining the various aggregations and their syntax, but aggregations are truly best learned by example. Once you learn how to think about aggregations, and how to nest them appropriately, the syntax is fairly trivial.
NOTE
A complete list of aggregation buckets and metrics can be found in the online reference documentation. We’ll cover many of them in this chapter, but glance over the full list after finishing so you are familiar with the full range of capabilities.
So let’s just dive in and start with an example. We are going to build some aggregations that might be useful to a car dealer. Our data will be about car transactions: the car model, manufacturer, sale price, when it sold, and more.
First we will bulk-index some data to work with:
POST /cars/transactions/_bulk
{ "index": {}}
{ "price" : 10000, "color" : "red", "make" : "honda", "sold" : "2014-10-28" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 30000, "color" : "green", "make" : "ford", "sold" : "2014-05-18" }
{ "index": {}}
{ "price" : 15000, "color" : "blue", "make" : "toyota", "sold" : "2014-07-02" }
{ "index": {}}
{ "price" : 12000, "color" : "green", "make" : "toyota", "sold" : "2014-08-19" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 80000, "color" : "red", "make" : "bmw", "sold" : "2014-01-01" }
{ "index": {}}
{ "price" : 25000, "color" : "blue", "make" : "ford", "sold" : "2014-02-12" }
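If you load the sample data from a script rather than a console, the bulk body can be assembled programmatically. The following is a minimal sketch and not part of the original example: the helper name build_bulk_body is hypothetical, and it only builds the newline-delimited payload (an action line followed by a document line, ending with a newline). Sending the payload to the _bulk endpoint is left to whichever HTTP client you use.

```python
import json

# The eight sample transactions from the bulk request above.
CARS = [
    {"price": 10000, "color": "red", "make": "honda", "sold": "2014-10-28"},
    {"price": 20000, "color": "red", "make": "honda", "sold": "2014-11-05"},
    {"price": 30000, "color": "green", "make": "ford", "sold": "2014-05-18"},
    {"price": 15000, "color": "blue", "make": "toyota", "sold": "2014-07-02"},
    {"price": 12000, "color": "green", "make": "toyota", "sold": "2014-08-19"},
    {"price": 20000, "color": "red", "make": "honda", "sold": "2014-11-05"},
    {"price": 80000, "color": "red", "make": "bmw", "sold": "2014-01-01"},
    {"price": 25000, "color": "blue", "make": "ford", "sold": "2014-02-12"},
]


def build_bulk_body(docs):
    """Return a newline-delimited bulk payload (hypothetical helper)."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {}}))  # action line, no explicit ID
        lines.append(json.dumps(doc))            # document source line
    # The bulk payload must end with a newline character.
    return "\n".join(lines) + "\n"


body = build_bulk_body(CARS)
print(len(body.splitlines()), "lines in the bulk payload")
```

Keeping the documents in a plain list also makes it easy to check later aggregation results against the raw data.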
Now that we have some data, let’s construct our first aggregation. A car dealer may want to know which color car sells the best. This is easily accomplished using a simple aggregation. We will do this using a terms bucket:
GET /cars/transactions/_search?search_type=count
{
    "aggs" : {
        "colors" : {
            "terms" : {
                "field" : "color"
            }
        }
    }
}
Aggregations are placed under the top-level aggs parameter (the longer aggregations will also work if you prefer that).
We then name the aggregation whatever we want: colors, in this example.
Finally, we define a single bucket of type terms.
Aggregations are executed in the context of search results, which means that an aggregation is just another top-level parameter in a search request (for example, using the /_search endpoint). Aggregations can be paired with queries, but we’ll tackle that later in Chapter 29.
NOTE
You’ll notice that we used the count search_type. Because we don’t care about the search hits, only the aggregation totals, the count search_type will be faster because it omits the fetch phase.
Next we define a name for our aggregation. Naming is up to you; the response will be labeled with the name you provide so that your application can parse the results later.
Next we define the aggregation itself. For this example, we are defining a single terms bucket. The terms bucket will dynamically create a new bucket for every unique term it encounters. Since we are telling it to use the color field, the terms bucket will dynamically create a new bucket for each color.
Let’s execute that aggregation and take a look at the results:
{
...
    "hits": {
        "hits": []
    },
    "aggregations": {
        "colors": {
            "buckets": [
                {
                    "key": "red",
                    "doc_count": 4
                },
                {
                    "key": "blue",
                    "doc_count": 2
                },
                {
                    "key": "green",
                    "doc_count": 2
                }
            ]
        }
    }
}
No search hits are returned because we used the search_type=count parameter.
Our colors aggregation is returned as part of the aggregations field.
The key of each bucket corresponds to a unique term found in the color field. Each bucket also always includes doc_count, which tells us the number of docs containing the term.
The count of each bucket represents the number of documents with this color.
The response contains a list of buckets, each corresponding to a unique color (for example, red or green). Each bucket also includes a count of the number of documents that “fell into” that particular bucket. For example, there are four red cars.
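To make the bucketing concrete, here is a plain-Python analogy of what the terms bucket computes for the sample data. It is only an in-memory sketch under stated assumptions: the helper name terms_buckets is hypothetical, the tie-breaking by key is chosen to match the response shown above, and nothing here reflects how Elasticsearch executes the aggregation internally.

```python
from collections import Counter

# Color of each of the eight sample transactions, in indexing order.
CARS = [
    {"color": "red"}, {"color": "red"}, {"color": "green"}, {"color": "blue"},
    {"color": "green"}, {"color": "red"}, {"color": "red"}, {"color": "blue"},
]


def terms_buckets(docs, field):
    """Create one bucket per unique value of `field` (hypothetical helper)."""
    counts = Counter(doc[field] for doc in docs)
    # Largest buckets first; ties are ordered by key to match the response above.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered]


buckets = terms_buckets(CARS, "color")
for bucket in buckets:
    print(bucket)
```

No bucket is declared in advance: a bucket exists only because some document carries that value, which is why a new color would appear without any change to the request.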
The preceding example is operating entirely in real time: if the documents are searchable, they can be aggregated. This means you can take the aggregation results and pipe them straight into a graphing library to generate real-time dashboards. As soon as you sell a silver car, your graphs would dynamically update to include statistics about silver cars.
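As a rough illustration of feeding these results to a chart, the sketch below flattens the buckets into parallel label and value lists, the shape many graphing libraries accept. The helper name to_series is hypothetical, and the response dictionary is a hand-copied, trimmed version of the output shown earlier rather than a live result.

```python
# A trimmed copy of the response shown above, as a parsed JSON dictionary.
response = {
    "hits": {"hits": []},
    "aggregations": {
        "colors": {
            "buckets": [
                {"key": "red", "doc_count": 4},
                {"key": "blue", "doc_count": 2},
                {"key": "green", "doc_count": 2},
            ]
        }
    },
}


def to_series(response, agg_name):
    """Split a terms aggregation into label and value lists (hypothetical helper)."""
    buckets = response["aggregations"][agg_name]["buckets"]
    labels = [bucket["key"] for bucket in buckets]
    values = [bucket["doc_count"] for bucket in buckets]
    return labels, values


labels, values = to_series(response, "colors")
print(labels, values)
```

Because the aggregation name chosen in the request is also the key in the response, the same helper works for any terms aggregation you name.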
Voila! Your first aggregation!
Adding a Metric to the Mix
The previous example told us the number of documents in each bucket, which is useful. But often, our applications require more-sophisticated metrics about the documents. For example, what is the average price of cars in each bucket?
To get this information, we need to tell Elasticsearch which metrics to calculate, and on which fields. This requires nesting metrics inside the buckets. Metrics will calculate mathematical statistics based on the values of documents within a bucket.
Let’s go ahead and add an average metric to our car example:
GET /cars/transactions/_search?search_type=count
{
    "aggs": {
        "colors": {
            "terms": {
                "field": "color"
            },
            "aggs": {
                "avg_price": {
                    "avg": {
                        "field": "price"
                    }
                }
            }
        }
    }
}
We add a new aggs level to hold the metric.
We then give the metric a name: avg_price.
And finally, we define it as an avg metric over the price field.
As you can see, we took the previous example and tacked on a new aggs level. This new aggregation level allows us to nest the avg metric inside the terms bucket. Effectively, this means we will generate an average for each color.
Just like the colors example, we need to name our metric (avg_price) so we can retrieve the values later. Finally, we specify the metric itself (avg) and what field we want the average to be calculated on (price):
{
...
    "aggregations": {
        "colors": {
            "buckets": [
                {
                    "key": "red",
                    "doc_count": 4,
                    "avg_price": {
                        "value": 32500
                    }
                },
                {
                    "key": "blue",
                    "doc_count": 2,
                    "avg_price": {
                        "value": 20000
                    }
                },
                {
                    "key": "green",
                    "doc_count": 2,
                    "avg_price": {
                        "value": 21000
                    }
                }
            ]
        }
    }
...
}
New avg_price element in response.
Although the response has changed minimally, the data we get out of it has grown substantially. Before, we knew there were four red cars. Now we know that the average price of red cars is $32,500. This is something that you can plug directly into reports or graphs.
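The averages in the response can be checked by hand against the sample data. The following plain-Python sketch groups prices by color and averages each group; the helper name avg_by is hypothetical, and the code is an in-memory analogy of a metric nested in a bucket, not a description of how Elasticsearch computes it.

```python
from collections import defaultdict

# Price and color of each of the eight sample transactions.
CARS = [
    {"price": 10000, "color": "red"},
    {"price": 20000, "color": "red"},
    {"price": 30000, "color": "green"},
    {"price": 15000, "color": "blue"},
    {"price": 12000, "color": "green"},
    {"price": 20000, "color": "red"},
    {"price": 80000, "color": "red"},
    {"price": 25000, "color": "blue"},
]


def avg_by(docs, bucket_field, metric_field):
    """Average `metric_field` within each bucket of `bucket_field` (hypothetical helper)."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[bucket_field]].append(doc[metric_field])
    return {key: sum(values) / len(values) for key, values in groups.items()}


avg_price = avg_by(CARS, "color", "price")
print(avg_price)
```

The red average is (10000 + 20000 + 20000 + 80000) / 4 = 32500, which matches the avg_price value reported for the red bucket.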
Buckets Inside Buckets
The true power of aggregations becomes apparent once you start playing with different nesting schemes. In the previous examples, we saw how you could nest a metric inside a bucket, which is already quite powerful.
But the really exciting analytics come from nesting buckets inside other buckets. This time, we want to find out the distribution of car manufacturers for each color:
GET /cars/transactions/_search?search_type=count
{
    "aggs": {
        "colors": {
            "terms": {
                "field": "color"
            },
            "aggs": {
                "avg_price": {
                    "avg": {
                        "field": "price"
                    }
                },
                "make": {
                    "terms": {
                        "field": "make"
                    }
                }
            }
        }
    }
}
Notice that we can leave the previous avg_price metric in place.
Another aggregation named make is added to the color bucket.
This aggregation is a terms bucket and will generate unique buckets for each car make.
A few interesting things happened here. First, you’ll notice that the previous avg_price metric is left entirely intact. Each level of an aggregation can have many metrics or buckets. The avg_price metric tells us the average price for each car color. This is independent of other buckets and metrics that are also being built.
This is important for your application, since there are often many related, but entirely distinct, metrics that you need to collect. Aggregations allow you to collect all of them in a single pass over the data.
The other important thing to note is that the aggregation we added, make, is a terms bucket (nested inside the colors terms bucket). This means we will generate a (color, make) tuple for every unique combination in your dataset.
Let’s take a look at the response (truncated for brevity, since it is now growing quite long):
{
...
    "aggregations": {
        "colors": {
            "buckets": [
                {
                    "key": "red",
                    "doc_count": 4,
                    "make": {
                        "buckets": [
                            {
                                "key": "honda",
                                "doc_count": 3
                            },
                            {
                                "key": "bmw",
                                "doc_count": 1
                            }
                        ]
                    },
                    "avg_price": {
                        "value": 32500
                    }
                },
...
}
Our new aggregation is nested under each color bucket, as expected.
We now see a breakdown of car makes for each color.
Finally, you can see that our previous avg_price metric is still intact.
The response tells us the following:
• There are four red cars.
• The average price of a red car is $32,500.
• Three of the red cars are made by Honda, and one is a BMW.
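The (color, make) breakdown can be mirrored with a two-level grouping in plain Python. This is a sketch for intuition only: nested_terms is a hypothetical helper, and it counts documents per inner value within each outer value in memory, which is the same result shape as a terms bucket nested inside another terms bucket.

```python
from collections import Counter, defaultdict

# Color and make of each of the eight sample transactions.
CARS = [
    {"color": "red", "make": "honda"},
    {"color": "red", "make": "honda"},
    {"color": "green", "make": "ford"},
    {"color": "blue", "make": "toyota"},
    {"color": "green", "make": "toyota"},
    {"color": "red", "make": "honda"},
    {"color": "red", "make": "bmw"},
    {"color": "blue", "make": "ford"},
]


def nested_terms(docs, outer_field, inner_field):
    """Count inner values within each outer bucket (hypothetical helper)."""
    nested = defaultdict(Counter)
    for doc in docs:
        nested[doc[outer_field]][doc[inner_field]] += 1
    # Convert to plain dictionaries for easier printing and comparison.
    return {outer: dict(inner) for outer, inner in nested.items()}


by_color = nested_terms(CARS, "color", "make")
print(by_color["red"])
```

Only combinations that actually occur in the data produce an inner bucket, so there is no red Ford entry, for instance.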
One Final Modification
Just to drive the point home, let’s make one final modification to our example before moving on to new topics. Let’s add two metrics to calculate the min and max price for each make:
GET /cars/transactions/_search?search_type=count
{
    "aggs": {
        "colors": {
            "terms": {
                "field": "color"
            },
            "aggs": {
                "avg_price": {
                    "avg": { "field": "price" }
                },
                "make": {
                    "terms": {
                        "field": "make"
                    },
                    "aggs": {
                        "min_price": { "min": { "field": "price" } },
                        "max_price": { "max": { "field": "price" } }
                    }
                }
            }
        }
    }
}
We need to add another aggs level for nesting.
Then we include a min metric.
And a max metric.
This gives us the following output (again, truncated):
{
...
    "aggregations": {
        "colors": {
            "buckets": [
                {
                    "key": "red",
                    "doc_count": 4,
                    "make": {
                        "buckets": [
                            {
                                "key": "honda",
                                "doc_count": 3,
                                "min_price": {
                                    "value": 10000
                                },
                                "max_price": {
                                    "value": 20000
                                }
                            },
                            {
                                "key": "bmw",
                                "doc_count": 1,
                                "min_price": {
                                    "value": 80000
                                },
                                "max_price": {
                                    "value": 80000
                                }
                            }
                        ]
                    },
                    "avg_price": {
                        "value": 32500
                    }
                },
...
The min and max metrics that we added now appear under each make.
With those two metrics, we’ve expanded the information derived from this query to include the following:
• There are four red cars.
• The average price of a red car is $32,500.
• Three of the red cars are made by Honda, and one is a BMW.
• The cheapest red Honda is $10,000.
• The most expensive red Honda is $20,000.
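The per-make minimum and maximum can be verified with one more in-memory sketch. As before, this is an analogy under stated assumptions rather than Elasticsearch's implementation: min_max_by is a hypothetical helper that groups prices by a (color, make) tuple and reports the smallest and largest price in each group.

```python
from collections import defaultdict

# Price, color, and make of each of the eight sample transactions.
CARS = [
    {"price": 10000, "color": "red", "make": "honda"},
    {"price": 20000, "color": "red", "make": "honda"},
    {"price": 30000, "color": "green", "make": "ford"},
    {"price": 15000, "color": "blue", "make": "toyota"},
    {"price": 12000, "color": "green", "make": "toyota"},
    {"price": 20000, "color": "red", "make": "honda"},
    {"price": 80000, "color": "red", "make": "bmw"},
    {"price": 25000, "color": "blue", "make": "ford"},
]


def min_max_by(docs, bucket_fields, metric_field):
    """Compute min and max of a field per combination of bucket fields (hypothetical helper)."""
    groups = defaultdict(list)
    for doc in docs:
        key = tuple(doc[field] for field in bucket_fields)
        groups[key].append(doc[metric_field])
    return {key: {"min": min(values), "max": max(values)}
            for key, values in groups.items()}


stats = min_max_by(CARS, ("color", "make"), "price")
print(stats[("red", "honda")])
```

A group with a single document, such as the red BMW, reports the same value for both metrics, which is what the response above shows.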