Elasticsearch: The Definitive Guide (2015)

Part V. Geolocation

Chapter 38. Geo-aggregations

Although filtering or scoring results by geolocation is useful, it is often more useful to present information to the user on a map. A search may return far too many results to display each geo-point individually, but geo-aggregations can be used to cluster geo-points into more manageable buckets.

Three aggregations work with fields of type geo_point:

geo_distance

Groups documents into concentric circles around a central point.

geohash_grid

Groups documents by geohash cell, for display on a map.

geo_bounds

Returns the lat/lon coordinates of a bounding box that would encompass all of the geo-points. This is useful for choosing the correct zoom level when displaying a map.

geo_distance Aggregation

The geo_distance agg is useful for searches such as “find all pizza restaurants within 1km of me.” The search results should, indeed, be limited to the 1km radius specified by the user, but we can also tell the user about “another result found within 2km”:

GET /attractions/restaurant/_search
{
  "query": {
    "filtered": {
      "query": {
        "match": {
          "name": "pizza"
        }
      },
      "filter": {
        "geo_bounding_box": {
          "location": {
            "top_left": {
              "lat":  40.8,
              "lon": -74.1
            },
            "bottom_right": {
              "lat":  40.4,
              "lon": -73.7
            }
          }
        }
      }
    }
  },
  "aggs": {
    "per_ring": {
      "geo_distance": {
        "field":  "location",
        "unit":   "km",
        "origin": {
          "lat":  40.712,
          "lon": -73.988
        },
        "ranges": [
          { "from": 0, "to": 1 },
          { "from": 1, "to": 2 }
        ]
      }
    }
  },
  "post_filter": {
    "geo_distance": {
      "distance": "1km",
      "location": {
        "lat":  40.712,
        "lon": -73.988
      }
    }
  }
}

The main query looks for restaurants with pizza in the name.

The bounding box filters these results down to just those in the greater New York area.

The geo_distance agg counts the number of results within 1km of the user, and between 1km and 2km from the user.

Finally, the post_filter reduces the search results to just those restaurants within 1km of the user.

The response from the preceding request is as follows:

"hits": {
  "total":     1,
  "max_score": 0.15342641,
  "hits": [
    {
      "_index": "attractions",
      "_type":  "restaurant",
      "_id":    "3",
      "_score": 0.15342641,
      "_source": {
        "name": "Mini Munchies Pizza",
        "location": [
          -73.983,
          40.719
        ]
      }
    }
  ]
},
"aggregations": {
  "per_ring": {
    "buckets": [
      {
        "key":       "*-1.0",
        "from":      0,
        "to":        1,
        "doc_count": 1
      },
      {
        "key":       "1.0-2.0",
        "from":      1,
        "to":        2,
        "doc_count": 1
      }
    ]
  }
}

The post_filter has reduced the search hits to just the single pizza restaurant within 1km of the user.

The aggregation includes the search result plus the other pizza restaurant within 2km of the user.
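To make the ring counts concrete, here is a minimal Python sketch of the same bucketing done by hand. It assumes a plain haversine distance on a spherical Earth, which may differ slightly from the distance calculation Elasticsearch uses internally, and it treats each range as half-open (from inclusive, to exclusive). The second point is an assumed location for the other pizza restaurant; only the first appears in the response above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def count_per_ring(origin, points, ranges):
    """Count points falling in each [from, to) distance ring around origin."""
    counts = [0] * len(ranges)
    for lat, lon in points:
        distance = haversine_km(origin[0], origin[1], lat, lon)
        for i, (lower, upper) in enumerate(ranges):
            if lower <= distance < upper:
                counts[i] += 1
    return counts


origin = (40.712, -73.988)          # the user's position, as (lat, lon)
points = [
    (40.719, -73.983),              # "Mini Munchies Pizza", from the response
    (40.722, -73.989),              # assumed position of the second restaurant
]
rings = [(0, 1), (1, 2)]            # same ranges as the request, in km

ring_counts = count_per_ring(origin, points, rings)
print(ring_counts)
```

With these two points, the first lands in the 0-1km ring and the second in the 1-2km ring, which mirrors the doc_count values in the response.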

In this example, we have counted the number of restaurants that fall into each concentric ring. Of course, we could nest subaggregations under the per_ring aggregation to calculate the average price per ring, the maximum popularity, and more.
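As a rough illustration of such nesting, the following Python sketch builds the aggs portion of a request body with an avg subaggregation under each ring. The field name price and the aggregation name avg_price are hypothetical; our example documents only show name and location fields.

```python
import json


def per_ring_with_avg_price(origin, ranges, price_field="price"):
    """Build an aggs body: geo_distance rings, each with an average price.

    price_field is a hypothetical numeric field; adjust it to your mapping.
    """
    return {
        "aggs": {
            "per_ring": {
                "geo_distance": {
                    "field": "location",
                    "unit": "km",
                    "origin": origin,
                    "ranges": ranges,
                },
                # Subaggregations are declared alongside the bucketing agg
                # and are computed once per bucket (here, once per ring).
                "aggs": {
                    "avg_price": {"avg": {"field": price_field}}
                },
            }
        }
    }


body = per_ring_with_avg_price(
    {"lat": 40.712, "lon": -73.988},
    [{"from": 0, "to": 1}, {"from": 1, "to": 2}],
)
print(json.dumps(body, indent=2))
```

The printed JSON can be merged into the search request shown earlier in place of its aggs section.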

geohash_grid Aggregation

The number of results returned by a query may be far too many to display each geo-point individually on a map. The geohash_grid aggregation buckets nearby geo-points together by calculating the geohash for each point, at the level of precision that you define.

The result is a grid of cells—one cell per geohash—that can be displayed on a map. By changing the precision of the geohash, you can summarize information across the whole world, by country, or by city block.

The aggregation is sparse—it returns only cells that contain documents. If your geohashes are too precise and too many buckets are generated, it will return, by default, the 10,000 most populous cells—those containing the most documents. However, it still needs to generate all the buckets in order to figure out which are the most populous 10,000. You need to control the number of buckets generated by doing the following:

1. Limit the result with a geo_bounding_box filter.

2. Choose an appropriate precision for the size of your bounding box.

GET /attractions/restaurant/_search?search_type=count
{
  "query": {
    "filtered": {
      "filter": {
        "geo_bounding_box": {
          "location": {
            "top_left": {
              "lat":  40.8,
              "lon": -74.1
            },
            "bottom_right": {
              "lat":  40.4,
              "lon": -73.7
            }
          }
        }
      }
    }
  },
  "aggs": {
    "new_york": {
      "geohash_grid": {
        "field":     "location",
        "precision": 5
      }
    }
  }
}

The bounding box limits the scope of the search to the greater New York area.

Geohashes of precision 5 are approximately 5km x 5km.

Geohashes with precision 5 measure about 25km² each, so 10,000 cells at this precision would cover 250,000km². The bounding box that we specified measures approximately 44km x 33km, or about 1,452km², so we are well within safe limits; we definitely won’t create too many buckets in memory.
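The arithmetic above can be reproduced with a short Python sketch. It assumes roughly 111km per degree of latitude, scales longitude by the cosine of the mid-latitude, and takes the 5km cell size quoted above at face value, so the result is only a ballpark estimate of how many cells the bounding box can touch.

```python
import math

KM_PER_DEGREE_LAT = 111.0  # rough figure; assumed constant everywhere


def bbox_size_km(top_left, bottom_right):
    """Approximate (height, width) of a lat/lon bounding box in kilometres."""
    height = (top_left["lat"] - bottom_right["lat"]) * KM_PER_DEGREE_LAT
    mid_lat = math.radians((top_left["lat"] + bottom_right["lat"]) / 2)
    # A degree of longitude shrinks with the cosine of the latitude.
    width = ((bottom_right["lon"] - top_left["lon"])
             * KM_PER_DEGREE_LAT * math.cos(mid_lat))
    return height, width


def estimate_cells(top_left, bottom_right, cell_km=5.0):
    """Rough number of square cells of side cell_km needed to tile the box."""
    height, width = bbox_size_km(top_left, bottom_right)
    return math.ceil(height / cell_km) * math.ceil(width / cell_km)


top_left = {"lat": 40.8, "lon": -74.1}
bottom_right = {"lat": 40.4, "lon": -73.7}

height_km, width_km = bbox_size_km(top_left, bottom_right)
cells = estimate_cells(top_left, bottom_right)
print(round(height_km), round(width_km), cells)
```

The estimate comes out at a few dozen cells, far below the 10,000-bucket default, which is the point of choosing the precision to suit the bounding box.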

The response from the preceding request looks like this:

...
"aggregations": {
  "new_york": {
    "buckets": [
      {
        "key":       "dr5rs",
        "doc_count": 2
      },
      {
        "key":       "dr5re",
        "doc_count": 1
      }
    ]
  }
}
...

Each bucket contains the geohash as the key.

Again, we didn’t specify any subaggregations, so all we got back was the document count. We could have asked for popular restaurant types, average price, or other details.

TIP

To plot these buckets on a map, you need a library that understands how to convert a geohash into the equivalent bounding box or central point. Libraries exist in JavaScript and other languages that will perform this conversion for you, but you can also use information from “geo_bounds Aggregation” to perform a similar job.
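If you would rather not pull in a library, the conversion is short enough to sketch by hand. The following Python function decodes a geohash into its bounding box by using the standard geohash scheme: base-32 characters of 5 bits each, with bits alternating between longitude and latitude, starting with longitude. The function name and the shape of the returned dictionary are our own choices, not part of any Elasticsearch API.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_bounds(geohash):
    """Decode a geohash string into the bounding box of its cell."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    is_lon_bit = True  # bits alternate lon, lat, lon, ... starting with lon

    for char in geohash:
        value = BASE32.index(char)       # raises ValueError on invalid input
        for shift in range(4, -1, -1):   # 5 bits per character, high bit first
            bit = (value >> shift) & 1
            if is_lon_bit:
                mid = (lon_lo + lon_hi) / 2
                if bit:
                    lon_lo = mid         # 1 means the upper half
                else:
                    lon_hi = mid         # 0 means the lower half
            else:
                mid = (lat_lo + lat_hi) / 2
                if bit:
                    lat_lo = mid
                else:
                    lat_hi = mid
            is_lon_bit = not is_lon_bit

    return {
        "top_left": {"lat": lat_hi, "lon": lon_lo},
        "bottom_right": {"lat": lat_lo, "lon": lon_hi},
    }


cell = geohash_bounds("dr5rs")
print(cell)
```

The central point of the cell, if you prefer to plot a marker rather than a rectangle, is simply the midpoint of the two corners.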

geo_bounds Aggregation

In our previous example, we filtered our results by using a bounding box that covered the greater New York area. However, our results were all located in downtown Manhattan. When displaying a map for our user, it makes sense to zoom into the area of the map that contains the data; there is no point in showing lots of empty space.

The geo_bounds aggregation does exactly this: it calculates the smallest bounding box that is needed to encapsulate all of the geo-points:

GET /attractions/restaurant/_search?search_type=count
{
  "query": {
    "filtered": {
      "filter": {
        "geo_bounding_box": {
          "location": {
            "top_left": {
              "lat":  40.8,
              "lon": -74.1
            },
            "bottom_right": {
              "lat":  40.4,
              "lon": -73.9
            }
          }
        }
      }
    }
  },
  "aggs": {
    "new_york": {
      "geohash_grid": {
        "field":     "location",
        "precision": 5
      }
    },
    "map_zoom": {
      "geo_bounds": {
        "field": "location"
      }
    }
  }
}

The geo_bounds aggregation will calculate the smallest bounding box required to encapsulate all of the documents matching our query.

The response now includes a bounding box that we can use to zoom our map:

...
"aggregations": {
  "map_zoom": {
    "bounds": {
      "top_left": {
        "lat":  40.722,
        "lon": -74.011
      },
      "bottom_right": {
        "lat":  40.715,
        "lon": -73.983
      }
    }
  },
...
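Conceptually, the bounding box is just the extremes of the latitudes and longitudes. The Python sketch below shows the idea under a simplifying assumption: it ignores boxes that cross the international date line. The three sample points are illustrative values chosen to be consistent with the responses in this chapter; only the first is shown explicitly in a search hit.

```python
def geo_bounds(points):
    """Smallest lat/lon box containing all (lat, lon) points.

    Simplified: does not handle boxes that wrap across the date line.
    Returns None for an empty list of points.
    """
    if not points:
        return None
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return {
        "top_left": {"lat": max(lats), "lon": min(lons)},
        "bottom_right": {"lat": min(lats), "lon": max(lons)},
    }


sample_points = [
    (40.719, -73.983),   # "Mini Munchies Pizza", from the earlier response
    (40.722, -73.989),   # illustrative
    (40.715, -74.011),   # illustrative
]

bounds = geo_bounds(sample_points)
print(bounds)
```

For these points, the corners come out as 40.722, -74.011 and 40.715, -73.983, the same box as in the response above.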

In fact, we could even use the geo_bounds aggregation inside each geohash cell, in case the geo-points inside a cell are clustered in just a part of the cell:

GET /attractions/restaurant/_search?search_type=count
{
  "query": {
    "filtered": {
      "filter": {
        "geo_bounding_box": {
          "location": {
            "top_left": {
              "lat":  40.8,
              "lon": -74.1
            },
            "bottom_right": {
              "lat":  40.4,
              "lon": -73.9
            }
          }
        }
      }
    }
  },
  "aggs": {
    "new_york": {
      "geohash_grid": {
        "field":     "location",
        "precision": 5
      },
      "aggs": {
        "cell": {
          "geo_bounds": {
            "field": "location"
          }
        }
      }
    }
  }
}

The cell subaggregation is calculated for every geohash cell.

Now the points in each cell have a bounding box:

...
"aggregations": {
  "new_york": {
    "buckets": [
      {
        "key":       "dr5rs",
        "doc_count": 2,
        "cell": {
          "bounds": {
            "top_left": {
              "lat":  40.722,
              "lon": -73.989
            },
            "bottom_right": {
              "lat":  40.719,
              "lon": -73.983
            }
          }
        }
      },
...