
Turning Users Invisible: Edge Machine Learning in Privacy-First Location Detection

by Kimmo Ihanus, December 19th, 2020

Too Long; Didn't Read

How to use edge machine learning in the browser for privacy-first location detection. Turn your users invisible while building location-based websites.

At the beginning of the pandemic, the discussions around location privacy in COVID-19 related software got me thinking.

Would it be possible to do location detection directly in the browser and thus skip sending any coordinate data, over the internet, to 3rd parties for geocoding? This way, developers and businesses could have one less third party handling their customer data. If it worked, it would also be the first truly private way of doing location detection.

In essence, it would be like an invisibility cloak for the users of location-based applications.

I am no data scientist, but having a background in building machine learning web applications gave me the hunch that a completely new and safer way of doing location detection could be possible.

A good starting point for machine learning

From the start, all the pieces for a machine learning solution seemed to be in place:

Training data (latitude and longitude coordinates connected to place names) is publicly available as open data, or can be purchased.
Browsers and mobile devices have a native interface, the Geolocation API, that delivers the input data directly on the device.
There are open-source machine learning libraries available for JavaScript.
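
To make the input side concrete, here is a minimal sketch of reading coordinates with the Geolocation API. The function name getCoordinates, the option values, and the injectable geolocation parameter are assumptions made for illustration, not part of the original project. The coordinates are only handed to a local callback and are never sent anywhere.

```javascript
// Reads the device position once and passes plain coordinates to a callback.
// `geolocation` defaults to the browser's navigator.geolocation and can be
// swapped for a stand-in object outside the browser (an illustrative choice).
function getCoordinates(
  onSuccess,
  onError,
  geolocation = globalThis.navigator && globalThis.navigator.geolocation
) {
  if (!geolocation) {
    onError(new Error("Geolocation API is not available"));
    return;
  }
  geolocation.getCurrentPosition(
    (position) =>
      onSuccess({
        latitude: position.coords.latitude,
        longitude: position.coords.longitude,
      }),
    (err) => onError(err),
    // Assumed options: coarse accuracy is enough for city-level detection.
    { enableHighAccuracy: false, timeout: 10000, maximumAge: 60000 }
  );
}
```

In a page this could be called as getCoordinates(coords => { /* run the local prediction */ }, err => console.warn(err)); the browser asks the user for permission first.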

First hack: chaining of neural networks

I chose to use Brain.js for machine learning. After some trial and error, I managed to get a good enough precision (98%) for continent and country predictions by processing my coordinate data with a geohash function and using a feed-forward neural network. However, I ran into problems when trying to predict at a finer granularity (e.g. going from a country to a city). The neural network was becoming so big and heavy that it wasn't practical anymore.
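
As an illustration of the preprocessing step, the sketch below encodes a coordinate pair as a standard geohash and then unpacks it into a vector of 0/1 values that a feed-forward network could take as input. The bit-vector representation and the function names encodeGeohash and geohashToBits are assumptions; the article does not say how the geohash was actually fed to the network.

```javascript
const BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

// Standard geohash encoding: interleave longitude and latitude bits,
// starting with longitude, and emit one base-32 character per 5 bits.
function encodeGeohash(lat, lon, precision = 6) {
  const latRange = [-90, 90];
  const lonRange = [-180, 180];
  let hash = "";
  let bits = 0;
  let bitCount = 0;
  let useLon = true;
  while (hash.length < precision) {
    const range = useLon ? lonRange : latRange;
    const value = useLon ? lon : lat;
    const mid = (range[0] + range[1]) / 2;
    if (value >= mid) {
      bits = (bits << 1) | 1; // upper half of the current interval
      range[0] = mid;
    } else {
      bits = bits << 1; // lower half of the current interval
      range[1] = mid;
    }
    useLon = !useLon;
    bitCount += 1;
    if (bitCount === 5) {
      hash += BASE32[bits];
      bits = 0;
      bitCount = 0;
    }
  }
  return hash;
}

// Turns a geohash string into an array of 0/1 numbers (5 per character),
// one possible numeric input format for a feed-forward network.
function geohashToBits(hash) {
  const out = [];
  for (const ch of hash) {
    const value = BASE32.indexOf(ch);
    if (value < 0) throw new Error(`Invalid geohash character: ${ch}`);
    for (let shift = 4; shift >= 0; shift--) {
      out.push((value >> shift) & 1);
    }
  }
  return out;
}
```

With Brain.js, a vector like geohashToBits(encodeGeohash(lat, lon)) could serve as the input of each training sample, with the place name as the output label.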

I needed to come up with a solution.

I reasoned that since location data is hierarchical by nature, I could split the neural network into smaller pieces. I could start by predicting only the continent and then proceed, step by step, to finer levels of detail. The outcome of the continent prediction would determine which neural network to use next: one that predicts the country and is trained only on data specific to that continent. I could repeat this process until I reached the smallest desired level of prediction (e.g. the city name).

It worked.

By forming this type of "chain", I was able to keep the training of single neural networks reasonably light and fast. Also, the chaining reduced the size of single neural networks significantly.
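
The chaining idea can be sketched roughly as follows. Everything here is hypothetical scaffolding: the level names, the "root/europe" style keys, and the getClassifier lookup are assumptions, and each classifier is modeled as a plain function so the sketch stays independent of any library.

```javascript
// Hypothetical hierarchy levels, ordered from coarse to fine.
const LEVELS = ["continent", "country", "city"];

// Picks the label with the highest score from an object such as
// { europe: 0.93, asia: 0.04 }, the shape Brain.js returns from run()
// when a network was trained with object outputs.
function topLabel(scores) {
  let best = null;
  for (const [label, score] of Object.entries(scores)) {
    if (best === null || score > scores[best]) best = label;
  }
  return best;
}

// Walks the chain: the label predicted at one level selects the
// (smaller) classifier used at the next level.
// getClassifier(key) returns a function input => scores, or undefined.
function predictChain(input, getClassifier, levels = LEVELS) {
  const result = {};
  let key = "root";
  for (const level of levels) {
    const classify = getClassifier(key);
    if (!classify) break; // no finer-grained network for this branch
    const label = topLabel(classify(input));
    result[level] = label;
    key = `${key}/${label}`;
  }
  return result;
}
```

Passing a shorter levels array (for example only continent and country) stops the chain early, so no network below that level is ever needed.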

In the end, I had over 5000 individual neural networks pre-trained for locating a user.

Hacking the problems in a browser environment

To make this work in browsers, I needed to integrate my neural networks into client-side JavaScript. My biggest issue was not getting the solution to work, but making it even faster and lighter.

The first issue was the size of my machine learning SDK. Using the script on a website might slow down the site's loading speed and, consequently, hurt its search ranking on Google.

To avoid this, I removed all the unnecessary parts of the Brain.js client-side code and reduced the unpacked and un-minified JavaScript file from 1400 KB to 170 KB.

Now my solution was significantly lighter and got a boost to its speed.

Distribution of neural networks from CDN

The final critical challenge was to develop a process for distributing the neural networks to the client.

First, I tried HTTPS calls to deliver the neural network JSON data from our server, but in some cases, the network calls failed because the response body was too large.

Eventually, I solved the puzzle by storing all my neural networks as JavaScript files in AWS S3 and using StackPath CDN for fast edge delivery. I tweaked my client-side script to use script injection, which places each network into the client-side variables that Brain.js uses to load the pre-trained neural networks.
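
A rough sketch of the script-injection step might look like the following. The global registry name __networks, the URL layout, and the env parameter are all assumptions made for illustration; the sketch presumes that each network file on the CDN is a small script that assigns its JSON to the registry under its own name.

```javascript
// Loads one pre-trained network by injecting a <script> tag.
// Assumed file format on the CDN (hypothetical):
//   window.__networks["europe"] = { /* network JSON */ };
// `env` defaults to the global object so the sketch can be exercised
// with a stand-in outside the browser.
function loadNetwork(name, baseUrl, onDone, env = globalThis) {
  const registry = (env.__networks = env.__networks || {});
  if (registry[name]) {
    onDone(null, registry[name]); // already loaded, skip the request
    return;
  }
  const script = env.document.createElement("script");
  script.src = `${baseUrl}/${encodeURIComponent(name)}.js`;
  script.async = true;
  script.onload = () => {
    const json = registry[name];
    if (json) onDone(null, json);
    else onDone(new Error(`Network "${name}" did not register itself`));
  };
  script.onerror = () => onDone(new Error(`Failed to load ${script.src}`));
  env.document.head.appendChild(script);
}
```

Once the JSON is available, Brain.js can restore the network with new brain.NeuralNetwork().fromJSON(json) and run predictions locally.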

The current setup loads the first, continent-level neural network during the page load and then fetches the needed neural networks from our server as the prediction moves down the chain.

The current state

Although we were able to solve these issues and build a system that locates the user on the client with machine learning, there are still details that need to be optimized.

How to achieve 100% data protection? Currently, we request the neural networks by name. This means that even though the user's coordinate data stays safe in the browser, we still pass some location information in the request URL. Although everything is sent over HTTPS, I would still prefer a completely clean delivery.
The speed. I was able to make my system fast, but in cases where the neural networks are large, it can take a few seconds to get the location.
Cross-site scripting. Script injection is often associated with XSS attacks, and some users' browsers might block the execution of scripts that are interpreted as threats.

The benefits

This kind of approach to solving location detection will benefit companies that want to optimize their users' online experience, simultaneously providing ultimate privacy. Most location detection use cases only require the approximate location of the user. I hope to provide a solution that does not include unnecessary data transfers or intrusive location detection accuracy for these use cases.

One good side effect of doing location detection via chaining is that it allows users to control the level at which their location is identified (country, state, city, district). We believe that features like this will be integrated into modern privacy-first browsers in the future, and that this approach is the right way towards user-centricity in location detection.

The author is Co-Founder at Grew, where our team is developing pointNG.io: a developer tool that uses machine learning for building location-aware websites with maximum security.