High-Level Concepts - Aggregations - Elasticsearch: The Definitive Guide (2015)

Elasticsearch: The Definitive Guide (2015)

Part IV. Aggregations

Until this point, this book has been dedicated to search. With search, we have a query and we want to find a subset of documents that match the query. We are looking for the proverbial needle(s) in the haystack.

With aggregations, we zoom out to get an overview of our data. Instead of looking for individual documents, we want to analyze and summarize our complete set of data:

§ How many needles are in the haystack?

§ What is the average length of the needles?

§ What is the median length of the needles, broken down by manufacturer?

§ How many needles were added to the haystack each month?

Aggregations can answer more subtle questions too:

§ What are your most popular needle manufacturers?

§ Are there any unusual or anomalous clumps of needles?

Aggregations allow us to ask sophisticated questions of our data. And yet, while the functionality is completely different from search, it leverages the same data-structures. This means aggregations execute quickly and are near real-time, just like search.

This is extremely powerful for reporting and dashboards. Instead of performing rollups of your data (that crusty Hadoop job that takes a week to run), you can visualize your data in real time, allowing you to respond immediately. Your report changes as your data changes, rather than being pre-calculated, out of date and irrelevant.

Finally, aggregations operate alongside search requests. This means you can both search/filter documents and perform analytics at the same time, on the same data, in a single request. And because aggregations are calculated in the context of a user’s search, you’re not just displaying a count of four-star hotels—you’re displaying a count of four-star hotels that match their search criteria.

Aggregations are so powerful that many companies have built large Elasticsearch clusters solely for analytics.

Chapter 25. High-Level Concepts

Like the query DSL, aggregations have a composable syntax: independent units of functionality can be mixed and matched to provide the custom behavior that you need. This means that there are only a few basic concepts to learn, but nearly limitless combinations of those basic components.

To master aggregations, you need to understand only two main concepts:

Buckets

Collections of documents that meet a criterion

Metrics

Statistics calculated on the documents in a bucket

That’s it! Every aggregation is simply a combination of one or more buckets and zero or more metrics. To translate into rough SQL terms:

SELECT COUNT(color) 1

FROM table

GROUP BY color 2

1

COUNT(color) is equivalent to a metric.

2

GROUP BY color is equivalent to a bucket.

Buckets are conceptually similar to grouping in SQL, while metrics are similar to COUNT(), SUM(), MAX(), and so forth.

Let’s dig into both of these concepts and see what they entail.

Buckets

A bucket is simply a collection of documents that meet a certain criteria:

§ An employee would land in either the male or female bucket.

§ The city of Albany would land in the New York state bucket.

§ The date 2014-10-28 would land within the October bucket.

As aggregations are executed, the values inside each document are evaluated to determine whether they match a bucket’s criteria. If they match, the document is placed inside the bucket and the aggregation continues.

Buckets can also be nested inside other buckets, giving you a hierarchy or conditional partitioning scheme. For example, Cincinnati would be placed inside the Ohio state bucket, and the entire Ohio bucket would be placed inside the USA country bucket.

Elasticsearch has a variety of buckets, which allow you to partition documents in many ways (by hour, by most-popular terms, by age ranges, by geographical location, and more). But fundamentally they all operate on the same principle: partitioning documents based on a criteria.

Metrics

Buckets allow us to partition documents into useful subsets, but ultimately what we want is some kind of metric calculated on those documents in each bucket. Bucketing is the means to an end: it provides a way to group documents in a way that you can calculate interesting metrics.

Most metrics are simple mathematical operations (for example, min, mean, max, and sum) that are calculated using the document values. In practical terms, metrics allow you to calculate quantities such as the average salary, or the maximum sale price, or the 95th percentile for query latency.

Combining the Two

An aggregation is a combination of buckets and metrics. An aggregation may have a single bucket, or a single metric, or one of each. It may even have multiple buckets nested inside other buckets. For example, we can partition documents by which country they belong to (a bucket), and then calculate the average salary per country (a metric).

Because buckets can be nested, we can derive a much more complex aggregation:

1. Partition documents by country (bucket).

2. Then partition each country bucket by gender (bucket).

3. Then partition each gender bucket by age ranges (bucket).

4. Finally, calculate the average salary for each age range (metric)

This will give you the average salary per <country, gender, age> combination. All in one request and with one pass over the data!