In this article series I will describe some essential techniques for querying geospatial data in MongoDB, which can be useful if you want your app or API to provide access to information ordered based on distance from some specific location. For example:
- businesses ( restaurants, shops etc. ) or other places of interest,
- other users of the app ( as in the case of dating apps ).
The techniques will allow your service to scale and remain efficient, as they keep query times roughly constant ( regardless of the amount of data in your database ) and minimise caching required on the client side.
What you will learn
In the following sections I will show you how to:
- store location data ( longitude, latitude pairs ) in MongoDB documents
- query such documents using the location data, with results sorted by distance from a specified point ( from closest to furthest )
- efficiently page through results of such queries
You should be able to follow this tutorial even if you haven’t used MongoDB before. On the other hand, if the first 2 points sound familiar, you might want to skip straight to section 3.
If you want to run the code examples locally, clone the accompanying repo and follow the instructions it contains.
Storing location data.
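Here is a minimal sketch of such a document ( the name field and the coordinate values are illustrative ):

```javascript
// A document with a GeoJSON Point stored in the `location` field.
const doc = {
  name: 'Example place', // illustrative payload field
  location: {
    type: 'Point',
    coordinates: [ 0, 0 ] // [ longitude, latitude ]
  }
};
```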
doc above is what you'd insert into a mongo collection; it has a field called location whose value is a GeoJSON Point object. The name of the field is immaterial: we could have used any other valid key name instead of location. We could also nest the GeoJSON deeper into the object structure; however, putting coordinates at the top level of doc wouldn't work. It's also possible to store multiple GeoJSON objects in one document, for example:
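A sketch of such a document, with two Points stored under separate keys ( the key names entrance and parking are made up for illustration ):

```javascript
// One document holding two GeoJSON Points under different field names.
const docWithTwoPoints = {
  name: 'Example place',
  entrance: { type: 'Point', coordinates: [ 0, 0 ] },
  parking:  { type: 'Point', coordinates: [ 0.001, 0.001 ] }
};
```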
- longitude comes first ( this is the reverse of what you might be used to from, for example, Google Maps queries )
- longitude values need to be between -180 and 180 ( both inclusive )
- latitude values need to be between -90 and 90 ( both inclusive )
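The bounds above can be captured in a small validator, should you want to check coordinates before inserting them ( this helper is just a sketch, not part of the accompanying repo ):

```javascript
// Returns true when a [ longitude, latitude ] pair is within GeoJSON bounds.
function isValidCoordinatePair([ longitude, latitude ]) {
  return longitude >= -180 && longitude <= 180
      && latitude  >=  -90 && latitude  <=  90;
}
```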
Now that you know the basics, let's generate some data to work with.
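For instance, the six documents could be generated like this ( the 15° interval is an assumption; the original generation code lives in the accompanying repo ):

```javascript
// Six points on the 0 meridian: latitudes 0, 15, 30, 45, 60, 75.
const points = Array.from({ length: 6 }, (_, i) => ({
  location: { type: 'Point', coordinates: [ 0, i * 15 ] }
}));
// await collection.insertMany(points);
```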
We’ve created 6 documents, with locations starting at the equator, increasing the latitude in equal intervals, while keeping longitude fixed at 0. Here are the points plotted on a sphere:
Note that in the following code examples we will omit the boilerplate required to obtain a MongoDB collection object and insert documents into it. In the accompanying repo, this boilerplate has been factored out into the
with_collection_with_points helper function.
Querying for documents based on location.
To query documents based on their distance from a specific point we are going to use the
$near query operator. But before we do this, we need to create a 2dsphere index on the field containing the GeoJSON Point objects.
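Creating the index might look like this ( collection stands for a MongoDB driver collection object, as obtained via the repo's helper ):

```javascript
// Index specification: a 2dsphere index on the `location` field.
const indexSpec = { location: '2dsphere' };

// On a live database, run it as a background build:
// await collection.createIndex(indexSpec, { background: true });
```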
Setting the background option is very important when creating an index on a live database, since by default createIndex will block all other operations on the database while the index is being created ( which might take a while if the collection is big ). ( Note: as of MongoDB 4.2 the background option is deprecated, since index builds no longer hold exclusive locks for the whole build. )
Now the basic query, which returns all documents sorted by distance ( great-circle distance, to be precise ) from point
[ 0, 0 ], based on the data in the
location field of the documents:
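A sketch of that query ( collection is a MongoDB driver collection object ):

```javascript
// All documents, sorted by great-circle distance from [ 0, 0 ].
const near_query = {
  location: {
    $near: {
      $geometry: { type: 'Point', coordinates: [ 0, 0 ] }
    }
  }
};
// const docs = await collection.find(near_query).toArray();
```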
Unless your intention is to process all documents in the collection ( in which case you’ll probably call
.stream instead of
.toArray ), you'll want to limit the number of returned documents. Here's how it's done:
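A sketch, with an illustrative limit of 3:

```javascript
const near_query = {
  location: {
    $near: { $geometry: { type: 'Point', coordinates: [ 0, 0 ] } }
  }
};
const page_size = 3; // illustrative
// const docs = await collection.find(near_query).limit(page_size).toArray();
```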
In the illustration below, the white point marks the location used in the above query (
[ 0, 0 ] ), and the points circled in green represent the results of the query with the limit applied.
Paging through results, from closest to furthest (forwards).
Note: the technique discussed in this section has been previously described by A. Jesse Jiryu Davis, who implemented the MongoDB feature which makes this technique possible. His article goes into the details of why this method is performant, so it's worth a read; however, his code examples are in Python, so we’ll go through it step by step here for the benefit of the Node.js community.
Building on the query we defined in the previous section, the simplest way to implement paging would be to use
limit to control the page/batch size, and the
skip method to set the desired page offset. For example:
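For example ( a sketch; the page number and size are illustrative ):

```javascript
const page_size = 3;
const page_index = 1; // zero-based, i.e. the 2nd page
const near_query = {
  location: {
    $near: { $geometry: { type: 'Point', coordinates: [ 0, 0 ] } }
  }
};
// const docs = await collection
//   .find(near_query)
//   .skip(page_index * page_size)
//   .limit(page_size)
//   .toArray();
```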
The above query will return the 2nd page of results when querying our test data, as illustrated below.
However, the performance of
skip degrades linearly as the offset increases ( as demonstrated in the article mentioned above ), since the MongoDB server needs to scan through all the query results from the beginning until the offset is reached.
Assuming that we know the distance between the query point and the furthest document from a given page of results, we could query for the next page as follows:
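A sketch of such a query, using the $minDistance operator ( for 2dsphere queries the distance is in meters; the helper name is made up ):

```javascript
// Build a $near query that skips everything closer than min_distance.
function next_page_query(coordinates, min_distance) {
  return {
    location: {
      $near: {
        $geometry: { type: 'Point', coordinates },
        $minDistance: min_distance
      }
    }
  };
}
// const docs = await collection
//   .find(next_page_query([ 0, 0 ], distance_of_furthest_doc))
//   .limit(page_size)
//   .toArray();
```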
But where do we get the distance from? We could try to calculate it using a hand-rolled implementation of a formula for the distance between two points on a sphere. Or use something like
getDistance from Geolib. However, we'd have to make sure that our chosen implementation matches the implementation used by MongoDB. Fortunately we don't have to go through all that hassle, since we can ask MongoDB to attach a dynamically calculated distance to each document in the query results. We'll just have to convert our
find query into an equivalent aggregation pipeline, using the $geoNear stage with its distanceField option.
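A sketch of the equivalent pipeline ( the helper name and the calculated_distance field name follow the article's convention ):

```javascript
// $geoNear must be the first stage; it sorts by distance and writes
// the computed distance ( in meters ) to `distanceField`.
function geo_near_pipeline(coordinates, page_size) {
  return [
    {
      $geoNear: {
        near: { type: 'Point', coordinates },
        distanceField: 'calculated_distance',
        spherical: true
      }
    },
    { $limit: page_size }
  ];
}
// const docs = await collection.aggregate(geo_near_pipeline([ 0, 0 ], 3)).toArray();
```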
Now we can use the
calculated_distance property of the last document from the query results ( since they are sorted by distance, in ascending order ) to fetch the next page. Let's use the above function to fetch the first two pages.
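The original fetch_page implementation is in the accompanying repo; a sketch of what it might look like:

```javascript
// Builds the aggregation pipeline for one page. `last_distance` is the
// calculated_distance of the previous page's last document ( omit it
// for the first page ).
function page_pipeline(query_point, page_size, last_distance) {
  const geo_near = {
    near: { type: 'Point', coordinates: query_point },
    distanceField: 'calculated_distance',
    spherical: true
  };
  if (last_distance !== undefined) geo_near.minDistance = last_distance;
  return [ { $geoNear: geo_near }, { $limit: page_size } ];
}

async function fetch_page(collection, query_point, page_size, last_distance) {
  return collection
    .aggregate(page_pipeline(query_point, page_size, last_distance))
    .toArray();
}

// const page_1 = await fetch_page(collection, [ 0, 0 ], 3);
// const last_doc = page_1[page_1.length - 1];
// const page_2 = await fetch_page(collection, [ 0, 0 ], 3, last_doc.calculated_distance);
```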
And here is a visualisation of the results of the second call to
fetch_page, with the yellow circle enclosing the area excluded from the query by minDistance.
However, this is not exactly what we want: the last document of the previous page is included in the next page, since
minDistance only excludes documents whose distance is smaller than the given value. To prevent this, we need to add the following query to our
$geoNear aggregation stage.
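A sketch of that query, which goes into the $geoNear stage's query option ( last_doc being the last document of the previous page ):

```javascript
// Exclude the previous page's last document by _id.
function exclusion_query(last_doc) {
  return { _id: { $nin: [ last_doc._id ] } };
}
// ...and inside the $geoNear stage:
// {
//   $geoNear: {
//     near: { type: 'Point', coordinates: query_point },
//     distanceField: 'calculated_distance',
//     spherical: true,
//     minDistance: last_doc.calculated_distance,
//     query: exclusion_query(last_doc)
//   }
// }
```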
Notice that the 2nd and 3rd points lie at exactly the same distance from point
[ 0, 0 ]. Now, if we set the query point to
[ 0, 0 ] and the limit to
3, what would the result of fetching the second page be? Assuming that the points returned by the first query were ordered as in the array above (
[ 0, -15 ] last ), here is how it would look.
This is because neither minDistance nor
$nin excludes the document with coordinates
[ 0, 15 ]. Therefore, instead of just using the
_id of the last document, we need to collect the
_ids of all documents whose distance is equal to the distance of
last_doc. Putting it all together:
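The combined helpers might look like this ( a sketch; the original implementations are in the accompanying repo ):

```javascript
// Collect the _ids of every document in `page` whose distance equals
// the distance of the page's last document.
function get_ids_to_exclude(page) {
  const last_doc = page[page.length - 1];
  return page
    .filter(doc => doc.calculated_distance === last_doc.calculated_distance)
    .map(doc => doc._id);
}

// Pipeline for a page, skipping everything closer than `last_distance`
// and excluding the ids tied at that distance.
function page_pipeline(query_point, page_size, last_distance, exclude_ids = []) {
  const geo_near = {
    near: { type: 'Point', coordinates: query_point },
    distanceField: 'calculated_distance',
    spherical: true,
    query: { _id: { $nin: exclude_ids } }
  };
  if (last_distance !== undefined) geo_near.minDistance = last_distance;
  return [ { $geoNear: geo_near }, { $limit: page_size } ];
}

async function fetch_page(collection, query_point, page_size, last_distance, exclude_ids) {
  return collection
    .aggregate(page_pipeline(query_point, page_size, last_distance, exclude_ids))
    .toArray();
}
```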
Given the results of the current page in current_page, here's how you would fetch the next one:
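A sketch of that client-side step, with a made-up current_page ( the _ids and distances are illustrative; fetch_page and get_ids_to_exclude being the helpers discussed in this section ):

```javascript
// Last page fetched so far ( illustrative data ).
const current_page = [
  { _id: 1, calculated_distance: 1000 },
  { _id: 2, calculated_distance: 2000 },
  { _id: 3, calculated_distance: 2000 }
];

const last_doc = current_page[current_page.length - 1];
const exclude_ids = current_page
  .filter(doc => doc.calculated_distance === last_doc.calculated_distance)
  .map(doc => doc._id); // documents 2 and 3 are tied at the last distance

// const next_page = await fetch_page(
//   collection, [ 0, 0 ], 3, last_doc.calculated_distance, exclude_ids);
```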
Note that the logic in
get_ids_to_exclude will most likely be executed on the client, hence it was not included in the
fetch_page function, which expects a precomputed
last_distance to be passed in. This is to minimise the amount of data sent over the wire. Alternatively, the server could include an HTTP Link header with all the information necessary to fetch the next page, in which case all of the above code would be executed on the server.
Finally, we need to handle the case where there are so many documents with the same distance that the last document in a page ends up having the same distance as the
last_distance used to fetch that page. For example, when fetching the 3rd page from the following set of points ( with
query_point = [ 0, 0 ] and
page_size = 2 ):
Using the above implementation we would get points
[ [ 0, -15 ], [ 0, 30 ] ] instead of the desired
[ [ 0, 30 ], [ 0, 45 ] ], since the
_id of the last point from page 1 (
[ 0, -15 ] ) is not excluded, even though its distance is the same as
minDistance. Therefore, in such cases, we need to carry
exclude_ids from the previous query over to the next.
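One way to implement the carry-over ( a sketch; the names are assumptions ): when the new page's last distance equals the minDistance it was fetched with, merge the old exclude_ids into the new ones.

```javascript
// Compute the parameters for fetching the page after `page`.
// `prev` holds the { last_distance, exclude_ids } that were used to
// fetch `page` ( pass undefined when `page` is the first page ).
function next_page_params(prev, page) {
  const last_doc = page[page.length - 1];
  const tied_ids = page
    .filter(doc => doc.calculated_distance === last_doc.calculated_distance)
    .map(doc => doc._id);
  // The page boundary falls inside a run of equidistant documents, so
  // the previously excluded ids are still in range and must be kept.
  const carry = prev !== undefined
    && last_doc.calculated_distance === prev.last_distance;
  return {
    last_distance: last_doc.calculated_distance,
    exclude_ids: carry ? prev.exclude_ids.concat(tied_ids) : tied_ids
  };
}
```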
It is worth noting that in the extreme case of all documents having the same distance, the size of the
exclude_ids array, and hence the amount of data sent over the wire on each request, would increase linearly as we progress through the pages. On top of this, query performance itself would degrade in a similar fashion ( and my bet is that it would be worse than the naive implementation relying solely on
skip ). To mitigate this, instead of accumulating
exclude_ids, we could keep a count of documents to skip ( and use it in conjunction with
minDistance ). This would keep the size of the request constant, and the query performance within the bounds of the simple
skip solution. However, keeping track of
_ids to exclude has some advantages over using skip in cases where the queried data is highly dynamic ( when changes are expected to happen in between fetches ), which we might discuss in future articles.
Finally, I’d like to draw your attention to the fact that the above method doesn’t facilitate jumping directly to a specific page ( without fetching all pages in between ). However, this can be achieved by adding an appropriate
skip ( equal to
page_size * pages_to_skip_count ) to the query. Using
skip will obviously incur a performance penalty proportional to the number of documents skipped, but there is really no way around it ( unless, instead of specifying the number of pages/documents to skip, your use case can do with using
minDistance to skip an unknown number of documents ). Fortunately we only need to pay this price once per page jump, since once the desired page has been fetched using
skip, the next one can be queried for using only last_distance and exclude_ids.
We have demonstrated how to store and query geospatial data in MongoDB, and discussed how to efficiently page through large amounts of such data ( from the closest location to the furthest ). In Part 2 of this series, you will learn how to use a nifty trick to page through locations in reverse order.
Don’t forget to clone the accompanying repo and play with the code examples, which can easily be run from the command line.