Open Data + Node.JS + Elasticsearch — 13 million street addresses and counting

Written by MattHWebDesign | Published 2017/04/19
Tech Story Tags: elasticsearch | open-data | docker | javascript | technology


You can view the code for this project here: https://github.com/matthaywardwebdesign/aus-search

A few months ago, whilst working on my latest project, a property management SaaS startup, I was tasked with developing a fairly standard autocompleting address input. If you are anything like me your mind would have jumped to integrating with the Google Maps geocoding API and calling it a day. And that’s exactly what I did. It’s pretty straightforward, just a few lines of JavaScript.
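Something along these lines, give or take. This is a rough sketch rather than the exact code, and YOUR_API_KEY is obviously a placeholder:

```javascript
// A rough sketch of the kind of geocoding call I mean. The endpoint is the
// public Google Geocoding API; YOUR_API_KEY and the minimal error handling
// are placeholders, not production code.
const https = require('https');

const geocode = ( address ) => new Promise( ( resolve, reject ) => {
  const url = 'https://maps.googleapis.com/maps/api/geocode/json' +
    '?address=' + encodeURIComponent( address ) +
    '&key=YOUR_API_KEY';

  https.get( url, ( res ) => {
    let body = '';
    res.on( 'data', ( chunk ) => { body += chunk; } );
    res.on( 'end', () => resolve( JSON.parse( body ).results ) );
  } ).on( 'error', reject );
} );

geocode( '10 Example Street, Melbourne VIC' )
  .then( ( results ) => console.log( results ) );
```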

One has to wonder whether it should be this easy.

Google’s Geocoding API

Let me just say that Google’s Geocoding API is great; not awesome (as you will find out), but still pretty good. It’s easy to implement and super fast, just to name a few of the things it has going for it. However, like any third-party API, it comes with a few downsides.

  • Rate limits
  • Might not be easy to switch to another provider (don’t want to get locked in)
  • Accuracy is questionable in areas
  • Only free up to a certain point (~ 2500 requests per day)

Now, for something like verifying a user’s residential address on an eCommerce site or showing the nearest McDonald’s, most of these “downsides” can be ignored; however, for certain use cases they make a big difference. For the project I was working on we needed highly accurate address data so we could do cool things like generate legal lease agreements, where accuracy is pretty damn important. In my initial testing the Google Geocoding API worked great: I typed in my address and quickly got a list of 5 suggestions. However, after a not-so-large amount of user testing I soon discovered a slight issue.

Hmm we have a problem

Let’s say I live at Apartment 13, 342 Gravelpit Terrace (points for you if you get the reference). There are plenty of ways to enter this address, but let’s just say I enter it as follows.

13/342 Gravelpit Terrace

Now, one would expect that the same address with its associated metadata would be returned to me, but you never know what you are going to get. Instead, the response came back as:

342/13 Gravelpit Terrace

In addition to this issue we discovered a few cases where addresses couldn’t be found, apartment numbers didn’t show up correctly, and lots of other little problems. After leaving this item on the “todo” list for quite some time I decided to finally attempt to fix it. I asked myself, “Surely there is an open data address database out there somewhere?” and lo and behold…

G-NAF

“G-NAF is a database of all the physical addresses in Australia.”

Aha! This looks interesting! I’ve hit the motherlode.

Now, upon first look there were a few things that dropped my confidence in finding a free, open solution. The first was the fact that the word “products” is in the URL, and the second was the following line:

“Due to the complex nature of G-NAF, PSMA works with a network of value-added resellers…”

And I almost navigated away until I found this beautiful little sentence in the sidebar of the page.

G-NAF and Administrative Boundaries are also available from www.data.gov.au

Okay, now we are getting somewhere! This looks like exactly what we need. One Reddit user, PraiseB, summed up my excitement at finding the dataset pretty well…

“We were about to pay a metric f**kton for access to just a part of that data”

So I downloaded the fairly big 1.39GB zip file, opened it up and discovered that it was made up of roughly 180 pipe-separated value files. They can say “complex nature” again, that’s for sure. But who doesn’t love a challenge?

Too many files

There is a lot of data here

First things first, I wanted to get the data into a form that was a bit easier to understand, so I could then simplify it into the right shape and strip out the attributes that weren’t applicable to my use case. Knowing that my end goal was to index the documents, I decided to parse the files and insert each record into a MongoDB database. Whilst a relational database would possibly have been better for this task, I didn’t want to spend time setting up schemas when I knew my desired end goal was a single document for each address, and a non-relational database fits that requirement perfectly.

The code for that goes something like this…
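(A simplified sketch rather than the real script; it assumes the official mongodb Node driver and that the extracted G-NAF .psv files sit in a ./data directory.)

```javascript
// Parse each pipe-separated G-NAF file and batch insert its rows into Mongo.
// Reading whole files into memory is fine for a sketch; the bigger files
// really want streaming.
const fs = require('fs');
const path = require('path');
const { MongoClient } = require('mongodb');

const DATA_DIR = './data';

const parsePSV = ( file ) => {
  const lines = fs.readFileSync( file, 'utf8' ).split( '\n' ).filter( Boolean );
  const headers = lines[ 0 ].split( '|' );
  return lines.slice( 1 ).map( ( line ) => {
    const values = line.split( '|' );
    const doc = {};
    headers.forEach( ( header, i ) => { doc[ header ] = values[ i ]; } );
    return doc;
  } );
};

MongoClient.connect( 'mongodb://localhost:27017' ).then( async ( client ) => {
  const db = client.db( 'gnaf' );
  for ( const file of fs.readdirSync( DATA_DIR ) ) {
    const docs = parsePSV( path.join( DATA_DIR, file ) );
    // Batch inserts are far quicker than inserting one document at a time.
    if ( docs.length ) {
      await db.collection( path.basename( file, '.psv' ) ).insertMany( docs );
    }
  }
  await client.close();
} );
```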

In reality it’s a little bit more complex, because I standardise some of the data and insert it in batches, but you get the drift. So I whip up a quick MongoDB instance using the awesomeness of Docker, run my Node script, and a short time later (a long time initially, until I discovered batch inserts were a lot quicker) I end up with all the data files imported nicely into a Mongo database.

Getting better but lots and lots of collections

Now with the data in a slightly more user-friendly format I’m able to go through it and find the attributes and sections of data I’ll need to simplify everything into a format that’s ready to index for searching. A lot of the collections relate to each other: for example, addressDetail, addressGeocode, addressSite and addressType all relate to the attributes of a single address.

Based on this I decide to simplify the data down into 4 collections:

  • Address
  • Street
  • Locality
  • State

With the Address collection containing data from the Street, Locality and State collections.

So now that I’ve got my data into a more user-friendly form, and I’ve figured out which bits I want to throw away and which bits I need to pull from the different sections to create a simplified version of the ~180-file dataset, I can get to work wrangling the data with another Node script.

A quick side note

No doubt after reading this somebody will ask me “Why did you use Node.JS? Language X is so much better”, and without a doubt there are probably plenty of better, faster languages for doing data processing. However, I’m a very big believer in using the tool you are most comfortable with, and having written a lot of JavaScript lately it felt like a quick and easy solution.

https://xkcd.com/378/

Back to the code

So now that I’ve figured out my desired data format, I start working on another piece of code to simplify all the collections down into just a few.

This is a little snippet of the code that does the simplifying. I really like my solution for this so I thought I’d quickly go over it.
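Roughly, it boils down to something like the sketch below. Simplifier’s exact options, and the column names I’ve used, are illustrative assumptions rather than the project’s real API, but the shape is the same.

```javascript
// A sketch of what a helper like Simplifier boils down to. The real class,
// its options and the G-NAF column names below are assumptions for illustration.
class Simplifier {
  constructor( { db, input, output, process } ) {
    this.db = db;
    this.input = input;     // collection we read from
    this.output = output;   // simplified collection we write to
    this.process = process; // function that simplifies one document
  }

  async run() {
    const cursor = this.db.collection( this.input ).find();
    while ( await cursor.hasNext() ) {
      const doc = await cursor.next();
      // Hand each document to the supplied function, wait for it to call
      // resolve() with the simplified version, then insert that version.
      const simplified = await new Promise( ( resolve ) => this.process( doc, resolve ) );
      await this.db.collection( this.output ).insertOne( simplified );
    }
  }
}

// Usage for localities: read 'locality', write 'LocalitySimple'. The real
// version also pulls in related data from other collections at this point.
const simplifyLocalities = ( db ) => new Simplifier( {
  db,
  input: 'locality',
  output: 'LocalitySimple',
  process: ( doc, resolve ) => resolve( {
    name: doc.LOCALITY_NAME,
    postcode: doc.PRIMARY_POSTCODE,
    stateId: doc.STATE_PID,
  } ),
} ).run();
```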

I’ve created a few ES6 classes to represent the data (Locality, for example), as well as a few helper classes like Simplifier. In the example above my input collection is locality and my output collection is LocalitySimple, which as the name suggests is a simplified version of a locality (a fancy term for a town).

Every document in the collection we are reading from is passed into the function provided, where I simplify it and pull in the bits of data I need from other collections, before finally calling the resolve function with the simplified document. It is then inserted into the new simplified collection.

I run my script over all the data and quite some time later……

Finally some nice data!

After simplifying all our data we now have a collection of addresses in a structured format, with all the attributes in the one spot.

Note: This address was picked at random and some fields have been removed for brevity.

Finally some nice clean address data for ~13.5 million addresses that we can now index for searching.
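For a feel of the end result, a single simplified document looks something like the sketch below (the field names are an illustration rather than the exact schema, and the address itself is made up):

```javascript
// Illustrative only; not the project's exact field names or a real address.
{
  flatNumber: '13',
  streetNumber: '342',
  street: { name: 'GRAVELPIT', type: 'TERRACE' },
  locality: { name: 'SPRINGFIELD', postcode: '3999' },
  state: { name: 'VICTORIA', abbreviation: 'VIC' },
  location: { lat: -37.81, lon: 144.96 }
}
```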

It’s time to search

Now that I’ve got detailed information on pretty much every address in Australia all I have to do is index it for searching. For this part of this project I decided to use Elasticsearch.

elastic/elasticsearch: Open Source, Distributed, RESTful Search Engine. https://github.com/elastic/elasticsearch

As the name and the GitHub repo description suggest, Elasticsearch is great for searching data. I hadn’t used it before this project, so I was super keen to give it a try.

Getting Elasticsearch running was as simple as adding a few lines to my docker-compose.yml file and running docker-compose up. Gosh I love Docker!
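For reference, the compose file was along these lines. This is a minimal sketch; the image tags and settings here are examples rather than the project’s exact configuration.

```yaml
# Minimal docker-compose.yml sketch for local development;
# image versions are examples, not pinned by the project.
version: '2'
services:
  mongo:
    image: mongo:3.4
    ports:
      - "27017:27017"
  elasticsearch:
    image: elasticsearch:5
    ports:
      - "9200:9200"
```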

The only step left was to once again batch import all of the simplified addresses, this time into Elasticsearch, and let it index them. An hour or so later I finally had what I had set out to achieve: a fully searchable, super fast database of Australian addresses, with a super easy REST API that made querying all 13.5 million records as easy as a single GET request to the link below.

http://localhost:9200/address/singleAddress/_search?q=address_here
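From Node, the same query looks something like this sketch, using the (pre-7.x) elasticsearch client package; the index and type names follow the URL above.

```javascript
// A quick sketch of querying the index from Node with the legacy
// 'elasticsearch' client; treat the details as illustrative.
const elasticsearch = require('elasticsearch');

const client = new elasticsearch.Client( { host: 'localhost:9200' } );

client.search( {
  index: 'address',
  type: 'singleAddress',
  q: '13/342 gravelpit terrace',
} ).then( ( response ) => {
  // Each hit is one of the simplified address documents indexed earlier.
  response.hits.hits.forEach( ( hit ) => console.log( hit._source ) );
} );
```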

Final notes

With a combination of open source software and open datasets I was able to solve our team’s issues with the Google Geocoding API, whilst providing our users with a better experience and higher quality data, all without relying on a third-party provider.

With plenty of governments worldwide starting to realise the benefits to their businesses and citizens of releasing open data I can’t wait to see the kind of stuff we’ll be able to achieve in the coming years!

Wanna see the code?

All the code used in this project is available on GitHub:

matthaywardwebdesign/aus-search: A collection of Node JS scripts to create an Elasticsearch index of Australian addresses. https://github.com/matthaywardwebdesign/aus-search

Connect with me

Matt Hayward on LinkedIn: www.linkedin.com

matthaywardwebdesign (Matt Hayward) on GitHub: https://github.com/matthaywardwebdesign

Thanks for reading!

Incorporates or developed using G-NAF ©PSMA Australia Limited licensed by the Commonwealth of Australia under the Open Geo-coded National Address File (G-NAF) End User Licence Agreement.

