rukavitsya

@rukavitsya

User tracking: tales from the trenches

A skilful traveller leaves no track.
A skilful speaker makes no slip.
A skilful reckoner needs no counting rod.

Laozi, The Tao Te Ching, chapter 27

The process of development of humanity from a hunter-gatherer society to a society of buyers was completed.

Ancient merchants were developing new horizons for selling goods, as consequence, it resulted in the appearance of such phenomena as Silk Road, Dutch East India Company. The last one was the first corporation to be ever actually listed on an official stock exchange, it even contributed to the process of formation of a completely new field: equity market. A similar story occurred with the advent of the Internet. This has led to the emergence of new methods, which evolved over time. Initially, user tracking fostered to the increase in selling/revenue because merchants became able to discover target audience. Recently user tracking has begun to occupy wider niches: crime prediction, prevention of terrorist acts, and of course, a new era of ads.

The evolutionary process can not be stopped. As in the middle of the Devonian period, about 385 million years ago, early air-breathing fish could remain on land for extended periods, due to evolutionary processes, after a long, long time after that first humans appeared. Similar to the previous statement, tracking of users have undergone a process of evolution: from well-known cookies to those cookies which can’t be deleted, from naive collecting IP address to AI driven techniques.

Tracking methods

There are two fundamental reasons for tracking users: marketing and spying. Marketers need to correlate products to the appropriate audience, everything is clear with this case. Spying can be a primary objective for different purposes: deanonymization in TOR network or on usual site, some malicious actions, fraud/scam, or even intelligence. The second reason makes me think about how to deal with them.

Museum of spies in NYC: at one time the future generation will make an exhibit for user tracking tools

The main purpose of the article is to familiarize you with the main methods of tracking and to make your surfing through the Internet a little safer.

Let’s get down to business.

Cookies

The oldest and most intelligent tracking method, cookies were designed to identify a user on a specific site. The principle of operation is very simple: if there is no information about a user, the site decides that a user on the site for the first time, generates a unique identifier and writes it (with some additional information about the user) in the cookie. At the next visits, a site can recognize a user by leveraging an identifier that was recorded in cookies. In theory, after clearing cookies, a user becomes anonymous for a site.

In Java and PHP world, JSESSIONID and PHPSESSID accordingly are a famous example of general purpose cookies, the process of authentication is often not feasible without them in old-style MVC application. In addition to the HTTP Cookie, there are also Flash Cookies and Silverlight Cookies.

Canvas Fingerprinting

Canvas fingerprinting is a technique for tracking users, which allows websites to identify and track visitors by exploiting the HTML5 canvas element. The key idea under canvas fingerprinting is drawing a hidden line of text and transformation this data into a unique identifier which is used for tracking without persistence on the client. Variations in GPU/graphics driver cause the diversity in the fingerprint. Fetching as much as possible metadata from a browser fosters to better quality.

The next info is often used as a source for an identifier based on canvas fingerprinting:

  • Screen resolution
  • Color depth
  • Platform
  • Time zone
  • User agent
  • Languages
  • List of IE specific plugins
  • List of installed fonts
  • DNT tag
  • Presence of session and local storage, indexed DB
  • AdBlock presence
  • Value from drawing hidden 3D canvas

After collecting necessary info, hash function is applied to summarized metadata. Non cryptographic hash function, like MurmurHash3 is suitable for this aim. I created the basic implementation of canvas fingerprinting for testing purpose, check source code. For usage in production, take a look at fingerprintjs2 library.

Reddit attempted to fetch canvas fingerprint, but was locked by Firefox

In fact, this kind of tracking was widely used for deanonymization in Tor network. Tor’s developers made a patch for this exploit. Firefox notifies of an attempt to get canvas fingerprint and prevents this action, Chrome has special plugins for the same. In addition to the Canvas Fingerprint, WebGL Fingerprint and WebRTC Fingerprint also can be used in a similar manner.

Evercookie

Evercookie is kind of super tuned HTTP Cookie. Evercookie was created by Samy Kamkar, it produces zombie cookies in a web browser that are intentionally difficult to delete.

According to readme in repository:

Evercookie is a Javascript API that produces extremely persistent cookies in a browser. Its goal is to identify a client even after they've removed standard cookies, Flash cookies (Local Shared Objects or LSOs), and others.

This is accomplished by storing the cookie data on as many browser storage mechanisms as possible. If cookie data is removed from any of the storage mechanisms, evercookie aggressively re-creates it in each mechanism as long as one is still intact.

If the Flash LSO, Silverlight or Java mechanism is available, Evercookie can even propagate cookies between different browsers on the same client machine!
Evercookie in comparison to usual

Evercookie makes copy of itself to different places so prevent removing of itself, it uses the next places for this aim:

  • HTTP/Flash Cookies
  • HTML5 Local/Global/Session storage
  • Web SQL Database
  • Silverlight storage
  • Web Cache
  • Web History
  • HTTP ETags
  • RGB values of auto-generated, force-cached PNGs using HTML5 canvas tag — (OMG, cookie in PNG!)

Removing of Evercookie is very difficult. However, it has one drawback — as well as in the case of cookies, the data is saved to the hard disk (or to the NAND-memory of the phone). And this means that incognito mode is invulnerable to Evercookie.

This type of tracking can be used for deanonymization. According to leaked documents by Edward Snowden, Evercookie is a method of tracking Tor users.

IP Tracking

Using a client’s IP-address allow to find out the location and the name of a provider. However, due to the periodic IP change in both wired and wireless networks, this method is extremely unreliable and in practice is used only for approximate location determination, with static IP-address tracking has more sense. In an era of omnipresent VPN/Proxy services, IP tracking is useful for correlation of IP addresses to a client in the case on not blocked Canvas Fingerprint/Evercookie.

Cross device tracking

Schema of cross device tracking

Companies like Twitter, Facebook, Reddit are able to correlate users with their devices because their mobile applications require authentication. Other companies also want to have this kind of data, but often their site visitors are not authenticated. Except for information about smartphones, data about smart TV, watches, gaming consoles also is super valuable and brings broader customer experience. Cross device tracking comes to rescue in this situation.

There are two basic types of cross device tracking. The first approach is based on collecting a bunch of metadata: location, device IDs, behaviral analysis, web history. Building device graph is the final aim of this technique. The second type is based on using inaudible ultrasonic audio beacons, emitted by one device and detected and recognized by the microphone of another device.

Sneak peek on uXDT

Experts call this technology uXDT (ultrasound cross-device tracking). It’s feasible to apply uXDT for deanonymization in the Tor network.

Demo of deanonymization in the TOR network with uXDT

Behavioral analysis

Tracking based on the individual behavior of a user utilizes the next indicators: features of surfing, like a speed of scrolling, the speed of movement of a mouse, frequently used search filters, preferred products, frequency of clicks and etc. Heavyweight scripts that slow down a browser are an obvious drawback of this method.

Worth mentioning that behavioral analysis is often applied as an auxiliary technique with different AI-driven approaches. Not without reason, IT giants hire the best minds in the AI area with the target to increase KPI. Different ML methods are suitable for enrichment collected data from tracking to valuable information like CTR, conversion rates. It’s a wide range from simple Linear regression to more advanced methods like Hidden Markov model.

Voilà, that’s it!

More by rukavitsya

Topics of interest

More Related Stories