Geo-Points - Geolocation - Elasticsearch: The Definitive Guide (2015)

Part V. Geolocation

Gone are the days when we wander around a city with paper maps. Thanks to smartphones, we now know exactly where we are all the time, and we expect websites to use that information. I’m not interested in restaurants in Greater London—I want to know about restaurants within a 5-minute walk of my current location.

But geolocation is only one part of the puzzle. The beauty of Elasticsearch is that it allows you to combine geolocation with full-text search, structured search, and analytics.

For instance: show me restaurants that mention vitello tonnato, are within a 5-minute walk, and are open at 11 p.m., and then rank them by a combination of user rating, distance, and price. Another example: show me a map of vacation rental properties available in August throughout the city, and calculate the average price per zone.

Elasticsearch offers two ways of representing geolocations: latitude-longitude points using the geo_point field type, and complex shapes defined in GeoJSON, using the geo_shape field type.

Geo-points allow you to find points within a certain distance of another point, to calculate distances between two points for sorting or relevance scoring, or to aggregate into a grid to display on a map. Geo-shapes, on the other hand, are used purely for filtering. They can be used to decide whether two shapes overlap, or whether one shape completely contains other shapes.

Chapter 36. Geo-Points

A geo-point is a single latitude/longitude point on the Earth’s surface. Geo-points can be used to calculate distance from a point, to determine whether a point falls within a bounding box, or in aggregations.

Geo-points cannot be automatically detected with dynamic mapping. Instead, geo_point fields should be mapped explicitly:

PUT /attractions
{
  "mappings": {
    "restaurant": {
      "properties": {
        "name": {
          "type": "string"
        },
        "location": {
          "type": "geo_point"
        }
      }
    }
  }
}

Lat/Lon Formats

With the location field defined as a geo_point, we can proceed to index documents containing latitude/longitude pairs, which can be formatted as strings, arrays, or objects:

PUT /attractions/restaurant/1
{
  "name": "Chipotle Mexican Grill",
  "location": "40.715, -74.011" (1)
}

PUT /attractions/restaurant/2
{
  "name": "Pala Pizza",
  "location": { (2)
    "lat": 40.722,
    "lon": -73.989
  }
}

PUT /attractions/restaurant/3
{
  "name": "Mini Munchies Pizza",
  "location": [ -73.983, 40.719 ] (3)
}

(1) A string representation, with "lat,lon".

(2) An object representation with lat and lon explicitly named.

(3) An array representation with [lon,lat].

Caution

Everybody gets caught at least once: string geo-points are "latitude,longitude", while array geo-points are [longitude,latitude]—the opposite order!

Originally, both strings and arrays in Elasticsearch used latitude followed by longitude. However, it was decided early on to switch the order for arrays in order to conform with GeoJSON.

The result is a bear trap that captures all unsuspecting users on their journey to full geolocation nirvana.
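One way to sidestep the ordering trap in client code is to normalize all three formats into a single (lat, lon) tuple before using them. The following is a minimal sketch in Python; `parse_geo_point` is a hypothetical helper, not part of Elasticsearch or any client library, and it handles only the three formats described in this section.

```python
def parse_geo_point(value):
    """Return (lat, lon) from a "lat,lon" string, a {"lat", "lon"} dict,
    or a [lon, lat] array. Hypothetical helper for illustration only."""
    if isinstance(value, str):
        # Strings are "latitude,longitude".
        lat, lon = (float(part) for part in value.split(","))
    elif isinstance(value, dict):
        # Objects name each coordinate explicitly.
        lat, lon = float(value["lat"]), float(value["lon"])
    elif isinstance(value, (list, tuple)):
        # Arrays are [longitude, latitude]: the opposite order.
        lon, lat = (float(part) for part in value)
    else:
        raise TypeError("unsupported geo-point format: %r" % (value,))
    return lat, lon


chipotle = parse_geo_point("40.715, -74.011")
pala = parse_geo_point({"lat": 40.722, "lon": -73.989})
munchies = parse_geo_point([-73.983, 40.719])
```

Whatever the input format, latitude always comes first in the returned tuple, so the rest of the application never has to remember which order applies.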

Filtering by Geo-Point

Four geo-point filters can be used to include or exclude documents by geolocation:

geo_bounding_box

Find geo-points that fall within the specified rectangle.

geo_distance

Find geo-points within the specified distance of a central point.

geo_distance_range

Find geo-points within a specified minimum and maximum distance from a central point.

geo_polygon

Find geo-points that fall within the specified polygon. This filter is very expensive. If you find yourself wanting to use it, you should be looking at geo-shapes instead.

All of these filters work in a similar way: the lat/lon values are loaded into memory for all documents in the index, not just the documents that match the query (see “Fielddata”). Each filter performs a slightly different calculation to check whether a point falls into the containing area.

TIP

Geo-filters are expensive — they should be used on as few documents as possible. First remove as many documents as you can with cheaper filters, like term or range filters, and apply the geo-filters last.

The bool filter will do this for you automatically. First it applies any bitset-based filters (see “All About Caching”) to exclude as many documents as it can as cheaply as possible. Then it applies the more expensive geo or script filters to each remaining document in turn.
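The principle of running cheap filters first can be illustrated outside Elasticsearch. In the sketch below, `is_pizza` and `is_nearby` are hypothetical stand-ins for a term filter and a geo-filter, the `cuisine` field is invented for the example, and a counter records how often the expensive check runs. It shows the ordering idea only, not how the bool filter is actually implemented.

```python
def filter_cheap_first(docs, cheap_filter, geo_filter):
    """Apply the cheap filter to every document, the geo-filter only to survivors."""
    survivors = [doc for doc in docs if cheap_filter(doc)]
    return [doc for doc in survivors if geo_filter(doc)]


geo_checks = 0  # counts how often the expensive check runs


def is_pizza(doc):
    # Cheap check, standing in for a term filter.
    return doc["cuisine"] == "pizza"


def is_nearby(doc):
    # Expensive check, standing in for a geo-filter (a crude box test here).
    global geo_checks
    geo_checks += 1
    return abs(doc["lat"] - 40.72) < 0.01 and abs(doc["lon"] + 73.99) < 0.01


docs = [
    {"name": "Chipotle Mexican Grill", "cuisine": "mexican", "lat": 40.715, "lon": -74.011},
    {"name": "Pala Pizza", "cuisine": "pizza", "lat": 40.722, "lon": -73.989},
    {"name": "Mini Munchies Pizza", "cuisine": "pizza", "lat": 40.719, "lon": -73.983},
]
matches = filter_cheap_first(docs, is_pizza, is_nearby)
```

Only the two documents that pass the cheap filter ever reach the geo check.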

geo_bounding_box Filter

This is by far the most efficient geo-filter because its calculation is very simple. You provide it with the top, bottom, left, and right coordinates of a rectangle, and all it does is compare the longitude with the left and right coordinates, and the latitude with the top and bottom coordinates:

GET /attractions/restaurant/_search
{
  "query": {
    "filtered": {
      "filter": {
        "geo_bounding_box": {
          "location": { (1)
            "top_left": {
              "lat": 40.8,
              "lon": -74.0
            },
            "bottom_right": {
              "lat": 40.7,
              "lon": -73.0
            }
          }
        }
      }
    }
  }
}

(1) These coordinates can also be specified as bottom_left and top_right.
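The comparison itself is simple enough to sketch in a few lines of Python. The function `in_bounding_box` is a hypothetical illustration, and it assumes a box that does not cross the antimeridian (180 degrees longitude).

```python
def in_bounding_box(lat, lon, top_left, bottom_right):
    """True if the point lies inside the rectangle (edges included).

    Assumes the box does not cross the antimeridian.
    """
    lat_ok = bottom_right["lat"] <= lat <= top_left["lat"]   # bottom <= lat <= top
    lon_ok = top_left["lon"] <= lon <= bottom_right["lon"]   # left <= lon <= right
    return lat_ok and lon_ok


top_left = {"lat": 40.8, "lon": -74.0}
bottom_right = {"lat": 40.7, "lon": -73.0}

restaurants = {
    "Chipotle Mexican Grill": (40.715, -74.011),
    "Pala Pizza": (40.722, -73.989),
    "Mini Munchies Pizza": (40.719, -73.983),
}
inside = [
    name
    for name, (lat, lon) in restaurants.items()
    if in_bounding_box(lat, lon, top_left, bottom_right)
]
```

With the box used in the query above, the first restaurant falls just west of the left edge, so only the other two match.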

Optimizing Bounding Boxes

The geo_bounding_box is the one geo-filter that doesn’t require all geo-points to be loaded into memory. Because all it has to do is check whether the lat and lon values fall within the specified ranges, it can use the inverted index to do a glorified range filter.

To use this optimization, the geo_point field must be mapped to index the lat and lon values separately:

PUT /attractions
{
  "mappings": {
    "restaurant": {
      "properties": {
        "name": {
          "type": "string"
        },
        "location": {
          "type": "geo_point",
          "lat_lon": true (1)
        }
      }
    }
  }
}

(1) The location.lat and location.lon fields will be indexed separately. These fields can be used for searching, but their values cannot be retrieved.

Now, when we run our query, we have to tell Elasticsearch to use the indexed lat and lon values:

GET /attractions/restaurant/_search
{
  "query": {
    "filtered": {
      "filter": {
        "geo_bounding_box": {
          "type": "indexed", (1)
          "location": {
            "top_left": {
              "lat": 40.8,
              "lon": -74.0
            },
            "bottom_right": {
              "lat": 40.7,
              "lon": -73.0
            }
          }
        }
      }
    }
  }
}

(1) Setting the type parameter to indexed (instead of the default memory) tells Elasticsearch to use the inverted index for this filter.

Caution

While a geo_point field can contain multiple geo-points, the lat_lon optimization can be used only on fields that contain a single geo-point.

geo_distance Filter

The geo_distance filter draws a circle around the specified location and finds all documents that have a geo-point within that circle:

GET /attractions/restaurant/_search
{
  "query": {
    "filtered": {
      "filter": {
        "geo_distance": {
          "distance": "1km", (1)
          "location": { (2)
            "lat": 40.715,
            "lon": -73.988
          }
        }
      }
    }
  }
}

(1) Find all location fields within 1km of the specified point. See Distance Units for a list of the accepted units.

(2) The central point can be specified as a string, an array, or (as in this example) an object. See “Lat/Lon Formats”.

A geo-distance calculation is expensive. To optimize performance, Elasticsearch draws a box around the circle and first uses the less expensive bounding-box calculation to exclude as many documents as it can. It runs the geo-distance calculation on only those points that fall within the bounding box.
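The two-step approach can be sketched as follows. This is a simplified illustration, not Elasticsearch's implementation: `haversine_km` and `within_distance` are hypothetical names, the Earth radius of 6371 km is an assumption, and the box is an approximation that is only suitable for small radii away from the poles.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km, treating the Earth as a sphere."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_distance(points, center, radius_km):
    """Names of points within radius_km of center, using a box pre-check."""
    clat, clon = center
    # Half-width of the enclosing box in degrees (small-radius approximation).
    dlat = math.degrees(radius_km / EARTH_RADIUS_KM)
    dlon = math.degrees(radius_km / (EARTH_RADIUS_KM * math.cos(math.radians(clat))))
    matches = []
    for name, (lat, lon) in points.items():
        if abs(lat - clat) > dlat or abs(lon - clon) > dlon:
            continue  # cheap rejection: outside the box, so outside the circle
        if haversine_km(clat, clon, lat, lon) <= radius_km:
            matches.append(name)  # expensive check only for points in the box
    return matches


restaurants = {
    "Chipotle Mexican Grill": (40.715, -74.011),
    "Pala Pizza": (40.722, -73.989),
    "Mini Munchies Pizza": (40.719, -73.983),
}
nearby = within_distance(restaurants, (40.715, -73.988), 1.0)
```

In this example the first restaurant is rejected by the box test alone, so the distance formula runs for only two of the three points.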

TIP

Do your users really require an accurate circular filter to be applied to their results? Using a rectangular bounding box is much more efficient than geo-distance and will usually serve their purposes just as well.

Faster Geo-Distance Calculations

The distance between two points can be calculated using one of three algorithms, which trade performance for accuracy:

arc

The slowest but most accurate is the arc calculation, which treats the world as a sphere. Accuracy is still limited because the world isn’t really a sphere.

plane

The plane calculation, which treats the world as if it were flat, is faster but less accurate. It is most accurate at the equator and becomes less accurate toward the poles.

sloppy_arc

So called because it uses the SloppyMath Lucene class to trade accuracy for speed, the sloppy_arc calculation uses the Haversine formula to calculate distance. It is four to five times as fast as arc, and distances are 99.9% accurate. This is the default calculation.

You can specify a different calculation as follows:

GET /attractions/restaurant/_search
{
  "query": {
    "filtered": {
      "filter": {
        "geo_distance": {
          "distance": "1km",
          "distance_type": "plane", (1)
          "location": {
            "lat": 40.715,
            "lon": -73.988
          }
        }
      }
    }
  }
}

(1) Use the faster but less accurate plane calculation.

TIP

Will your users really care if a restaurant is a few meters outside their specified radius? While some geo applications require great accuracy, less-accurate but faster calculations will suit the majority of use cases just fine.
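To get a feel for the trade-off, the sketch below compares a spherical (Haversine) distance with a flat-plane approximation. Both functions are illustrative assumptions: `plane_km` uses a simple equirectangular projection, which is not necessarily the formula Elasticsearch uses for its plane calculation, and the London coordinates are approximate.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def arc_km(lat1, lon1, lat2, lon2):
    """Haversine distance on a sphere."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def plane_km(lat1, lon1, lat2, lon2):
    """Flat-plane approximation: project onto a plane at the mean latitude."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    x = math.radians(lon2 - lon1) * math.cos(mean_lat)  # east-west component
    y = math.radians(lat2 - lat1)                       # north-south component
    return EARTH_RADIUS_KM * math.hypot(x, y)


# A short hop within Manhattan.
short_arc = arc_km(40.715, -73.988, 40.722, -73.989)
short_plane = plane_km(40.715, -73.988, 40.722, -73.989)

# A long haul from New York to (roughly) London.
long_arc = arc_km(40.715, -73.988, 51.507, -0.128)
long_plane = plane_km(40.715, -73.988, 51.507, -0.128)
```

Over the short hop the two results agree to well under a meter, while over the transatlantic distance they diverge by a large margin. For "restaurants near me" queries, the cheaper calculation is therefore usually good enough.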

geo_distance_range Filter

The only difference between the geo_distance and geo_distance_range filters is that the latter has a doughnut shape and excludes documents within the central hole.

Instead of specifying a single distance from the center, you specify a minimum distance (with gt or gte) and maximum distance (with lt or lte), just like a range filter:

GET /attractions/restaurant/_search
{
  "query": {
    "filtered": {
      "filter": {
        "geo_distance_range": {
          "gte": "1km", (1)
          "lt": "2km", (1)
          "location": {
            "lat": 40.715,
            "lon": -73.988
          }
        }
      }
    }
  }
}

(1) Matches locations that are at least 1km from the center, and less than 2km from the center.
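The doughnut test amounts to a range check on the computed distance. The following sketch assumes a spherical Earth with a radius of 6371 km; `in_distance_range` is a hypothetical helper that mirrors the gte and lt parameters of the query above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km, treating the Earth as a sphere."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def in_distance_range(point, center, gte_km, lt_km):
    """True if point is at least gte_km and less than lt_km from center."""
    distance = haversine_km(center[0], center[1], point[0], point[1])
    return gte_km <= distance < lt_km


center = (40.715, -73.988)
restaurants = {
    "Chipotle Mexican Grill": (40.715, -74.011),
    "Pala Pizza": (40.722, -73.989),
    "Mini Munchies Pizza": (40.719, -73.983),
}
in_ring = [
    name
    for name, point in restaurants.items()
    if in_distance_range(point, center, 1.0, 2.0)
]
```

Only the restaurant that lies between 1km and 2km from the center lands in the ring. The two closer ones fall into the hole of the doughnut.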

Caching geo-filters

The results of geo-filters are not cached by default, for two reasons:

§ Geo-filters are usually used to find entities that are near to a user’s current location. The problem is that users move, and no two users are in exactly the same location. A cached filter would have little chance of being reused.

§ Filters are cached as bitsets that represent all documents in a segment. Imagine that our query excludes all documents but one in a particular segment. An uncached geo-filter just needs to check the one remaining document, but a cached geo-filter would need to check all of the documents in the segment.

That said, caching can be used to good effect with geo-filters. Imagine that your index contains restaurants from all over the United States. A user in New York is not interested in restaurants in San Francisco. We can treat New York as a hot spot and draw a big bounding box around the city and neighboring areas.

This geo_bounding_box filter can be cached and reused whenever we have a user within the city limits of New York. It will exclude all restaurants from the rest of the country. We can then use an uncached, more specific geo_bounding_box or geo_distance filter to narrow the remaining results to those that are close to the user:

GET /attractions/restaurant/_search
{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "geo_bounding_box": {
                "type": "indexed",
                "_cache": true, (1)
                "location": {
                  "top_left": {
                    "lat": 40.8,
                    "lon": -74.1
                  },
                  "bottom_right": {
                    "lat": 40.4,
                    "lon": -73.7
                  }
                }
              }
            },
            {
              "geo_distance": { (2)
                "distance": "1km",
                "location": {
                  "lat": 40.715,
                  "lon": -73.988
                }
              }
            }
          ]
        }
      }
    }
  }
}

(1) The cached bounding box filter reduces all results down to those in the greater New York area.

(2) The more costly geo_distance filter narrows the results to those within 1km of the user.

Reducing Memory Usage

Each lat/lon pair requires 16 bytes of memory, and memory is in short supply. Elasticsearch uses this much memory in order to provide very accurate results. But as we have commented before, such exacting precision is seldom required.

You can reduce the amount of memory that is used by switching to a compressed fielddata format and by specifying how precise you need your geo-points to be. Even reducing precision to 1mm reduces memory usage by a third. A more realistic setting of 3m reduces usage by 62%, and 1km saves a massive 75%!

This setting can be changed on a live index with the update-mapping API:

POST /attractions/_mapping/restaurant
{
  "location": {
    "type": "geo_point",
    "fielddata": {
      "format": "compressed",
      "precision": "1km" (1)
    }
  }
}

(1) Each lat/lon pair will require only 4 bytes, instead of 16.
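A quick back-of-the-envelope calculation shows what these numbers mean in practice. The sketch below uses the per-point sizes quoted in this section (16 bytes uncompressed, 4 bytes at 1km precision). The index size of 10 million geo-points is a hypothetical figure chosen for illustration, and the calculation ignores any overhead beyond the raw coordinate storage.

```python
def fielddata_bytes(num_points, bytes_per_point):
    """Raw memory needed to hold num_points geo-points in fielddata."""
    return num_points * bytes_per_point


num_points = 10_000_000  # hypothetical index size

full = fielddata_bytes(num_points, 16)       # default, full precision
compressed = fielddata_bytes(num_points, 4)  # compressed, 1km precision
saving = 1 - compressed / full               # fraction of memory saved

print("full precision: %d MB" % (full // 1_000_000))
print("1km precision:  %d MB" % (compressed // 1_000_000))
print("saving:         %d%%" % round(saving * 100))
```

For this hypothetical index, the raw coordinate storage drops from 160 MB to 40 MB, which is the 75% saving mentioned above.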

Alternatively, you can avoid using memory for geo-points altogether, either by using the technique described in “Optimizing Bounding Boxes”, or by storing geo-points as doc values:

PUT /attractions
{
  "mappings": {
    "restaurant": {
      "properties": {
        "name": {
          "type": "string"
        },
        "location": {
          "type": "geo_point",
          "doc_values": true (1)
        }
      }
    }
  }
}

(1) Geo-points will not be loaded into memory, but instead stored on disk.

Mapping a geo-point to use doc values can be done only when the field is first created. There is a small performance cost in using doc values instead of fielddata, but with memory in such short supply, it is often worth doing.

Sorting by Distance

Search results can be sorted by distance from a point:

TIP

While you can sort by distance, “Scoring by Distance” is usually a better solution.

GET /attractions/restaurant/_search
{
  "query": {
    "filtered": {
      "filter": {
        "geo_bounding_box": {
          "type": "indexed",
          "location": {
            "top_left": {
              "lat": 40.8,
              "lon": -74.0
            },
            "bottom_right": {
              "lat": 40.4,
              "lon": -73.0
            }
          }
        }
      }
    }
  },
  "sort": [
    {
      "_geo_distance": {
        "location": { (1)
          "lat": 40.715,
          "lon": -73.998
        },
        "order": "asc",
        "unit": "km", (2)
        "distance_type": "plane" (3)
      }
    }
  ]
}

(1) Calculate the distance between the specified lat/lon point and the geo-point in the location field of each document.

(2) Return the distance in km in the sort keys for each result.

(3) Use the faster but less accurate plane calculation.

You may ask yourself: why do we specify the distance unit? For sorting, it doesn’t matter whether we compare distances in miles, kilometers, or light years. The reason is that the actual value used for sorting is returned with each result, in the sort element:

...
"hits": [
  {
    "_index": "attractions",
    "_type": "restaurant",
    "_id": "2",
    "_score": null,
    "_source": {
      "name": "New Malaysia",
      "location": {
        "lat": 40.715,
        "lon": -73.997
      }
    },
    "sort": [
      0.08425653647614346 (1)
    ]
  },
...

(1) This restaurant is 0.084km from the location we specified.

You can set the unit to return these values in whatever form makes sense for your application.
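The relationship between the sort order and the returned sort key can be sketched in plain Python. This is an illustration under stated assumptions: `sort_by_distance` is a hypothetical function, distances use a Haversine formula with an Earth radius of 6371 km rather than Elasticsearch's own calculation, and only three units are supported.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch
KM_PER_UNIT = {"km": 1.0, "m": 0.001, "mi": 1.609344}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km, treating the Earth as a sphere."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def sort_by_distance(docs, origin, unit="km"):
    """Return hits ordered nearest first, each carrying its sort key in `unit`."""
    hits = []
    for doc in docs:
        loc = doc["location"]
        km = haversine_km(origin["lat"], origin["lon"], loc["lat"], loc["lon"])
        hits.append({"_source": doc, "sort": [km / KM_PER_UNIT[unit]]})
    hits.sort(key=lambda hit: hit["sort"][0])
    return hits


docs = [
    {"name": "Mini Munchies Pizza", "location": {"lat": 40.719, "lon": -73.983}},
    {"name": "New Malaysia", "location": {"lat": 40.715, "lon": -73.997}},
    {"name": "Pala Pizza", "location": {"lat": 40.722, "lon": -73.989}},
]
origin = {"lat": 40.715, "lon": -73.998}

hits_km = sort_by_distance(docs, origin, unit="km")
hits_m = sort_by_distance(docs, origin, unit="m")
```

The order of the hits is the same whichever unit is chosen. Only the value in each hit's sort array changes, which is why the unit matters to the application that displays it rather than to the sort itself.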

TIP

Geo-distance sorting can also handle multiple geo-points, both in the document and in the sort parameters. Use the sort_mode to specify whether it should use the min, max, or avg distance between each combination of locations. This can be used to return “friends nearest to my work and home locations.”
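The effect of the mode can be sketched as a reduction over every pairing of a sort location with a document location. The function `multi_point_sort_key` below is a hypothetical illustration, not the Elasticsearch implementation, and it assumes a spherical Earth with a radius of 6371 km.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km, treating the Earth as a sphere."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def multi_point_sort_key(doc_points, origin_points, mode="min"):
    """Reduce the distances between every (origin, document) pair to one key."""
    distances = [
        haversine_km(o_lat, o_lon, d_lat, d_lon)
        for (o_lat, o_lon) in origin_points
        for (d_lat, d_lon) in doc_points
    ]
    if mode == "min":
        return min(distances)
    if mode == "max":
        return max(distances)
    if mode == "avg":
        return sum(distances) / len(distances)
    raise ValueError("unknown mode: %r" % (mode,))


home = (40.715, -73.988)
work = (40.722, -73.989)
friend = [(40.715, -73.988)]  # a friend located exactly at "home"

nearest = multi_point_sort_key(friend, [home, work], mode="min")
farthest = multi_point_sort_key(friend, [home, work], mode="max")
average = multi_point_sort_key(friend, [home, work], mode="avg")
```

Sorting ascending on the min key ranks friends by their distance to whichever of the two locations is closer to them.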

Scoring by Distance

It may be that distance is the only important factor in deciding the order in which results are returned, but more frequently we need to combine distance with other factors, such as full-text relevance, popularity, and price.

In these situations, we should reach for the function_score query that allows us to blend all of these factors into an overall score. See “The Closer, The Better” for an example that uses geo-distance to influence scoring.

The other drawback of sorting by distance is performance: the distance has to be calculated for all matching documents. The function_score query, on the other hand, can be executed during the rescore phase, limiting the number of calculations to just the top n results.
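The idea of adjusting only the top n results can be sketched as follows. Everything here is an assumption made for illustration: `gauss_decay` is modeled on the Gaussian decay curve offered by function_score (a score of 1.0 at the origin, falling to `decay` at a distance of `scale_km`), `rescore_top_n` is a hypothetical helper, and the hits are invented documents whose distances are taken as already known.

```python
import math


def gauss_decay(distance_km, scale_km, decay=0.5):
    """Gaussian decay: 1.0 at distance 0, `decay` at distance `scale_km`."""
    sigma_sq = -scale_km ** 2 / (2.0 * math.log(decay))
    return math.exp(-distance_km ** 2 / (2.0 * sigma_sq))


def rescore_top_n(hits, n, scale_km):
    """Blend distance into the score of the first n hits only.

    `hits` must already be ordered by relevance score, best first.
    Hits beyond the first n are returned untouched.
    """
    head = [
        dict(hit, score=hit["score"] * gauss_decay(hit["distance_km"], scale_km))
        for hit in hits[:n]
    ]
    head.sort(key=lambda hit: hit["score"], reverse=True)
    return head + hits[n:]


hits = [
    {"name": "far", "score": 2.0, "distance_km": 4.0},
    {"name": "near", "score": 1.5, "distance_km": 0.0},
    {"name": "tail", "score": 1.0, "distance_km": 0.0},
]
rescored = rescore_top_n(hits, n=2, scale_km=2.0)
```

The distant hit drops below the nearby one once distance is blended in, while the third hit is never touched. That is the saving: the decay function runs for n documents instead of for every match.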