I want to talk a little about how you can use content-based addressing (aka data fingerprinting) as a general approach to make your applications faster and more secure with some practical JavaScript examples.
I find the concept of content-based addressing to be dope af.
It’s an extremely powerful tool for building services that are fundamentally more performant, scalable and secure. 💪
It’s related to immutability, decentralization, data integrity, and more buzzwords…
But it’s also so useful and under-appreciated in general that I wanted to write a practical intro to show how it works alongside some real-world JavaScript.
What the hell are you talking about?
You can think of content-based addressing as fingerprinting for data.
Just like how fingerprints allow you to:
Just replace “person” with “data” in the above descriptions and you have a rough overview of what content-based addressing enables.
Put another way, content-based addressing allows you to uniquely and efficiently reference data based on its actual content as opposed to something external like an ID or a URL.
Database-generated IDs, random GUIDs, and URLs are all useful in their own right, but they’re not quite as powerful as data fingerprinting.
Let’s see how this looks with some real-world code that I’ve used for reals:
const pick = require('lodash.pick')
const stableStringify = require('fast-json-stable-stringify')
const data = pick(myData, ['keyFoo', 'keyBar'])
const fingerprint = hash(stableStringify(data))
This tiny snippet is hiding so much power…
This snippet leaves out the hash function (more on that below), but it does represent the core algorithm pretty clearly.
It creates a content-based hash, fingerprint, of any JavaScript object myData that is a unique representation of that object based on the keys we care about: ['keyFoo', 'keyBar'].
In short, this fingerprint offers you a very efficient way of telling when two JavaScript objects are the same.
If two content-based IDs are the same, the data we picked from those objects is the same (barring a hash collision, which is vanishingly unlikely with a cryptographic hash).
No need for a deep comparison. No need for Redux. Just pure immutable goodness.
We’ll break this process into three distinct steps: 1) input data, 2) data cleaning, and 3) simplification.
Let’s take another look at our JavaScript code:
const pick = require('lodash.pick')
const stableStringify = require('fast-json-stable-stringify')
const data = pick(myData, ['keyFoo', 'keyBar'])
const fingerprint = hash(stableStringify(data))
First, we take as input any JavaScript object myData. This could be a model from your database or some object containing Redux-like app state, for instance.
Second, we clean our data to ensure that we’re only considering parts of the data we actually care about via lodash.pick. This step is optional but usually you'll want to clean your data like this before proceeding. I've found in practice that most of the time there will be parts of your data that aren't actually representative of the uniqueness of your model (we'll refer to this extra stuff as metadata 😉).
As an example, let’s say I want to create unique IDs for all of the rows in a SQL table. Most SQL implementations will add metadata to your table like the date an entry was created or modified, and it’s unlikely we’d want this metadata to affect our notion of uniqueness. In other words, if two rows were inserted into the table at different times but have the exact same values according to our application’s business logic, then we want to treat them as having the same fingerprint so we filter out this extra metadata.
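To make the cleaning step concrete, here is a minimal sketch of the SQL-row example. The rows, the column names, and the pickKeys helper are all made up for illustration; pickKeys is a simplified stand-in for lodash.pick that only handles flat objects.

```javascript
// Hypothetical stand-in for lodash.pick on flat objects:
// copies only the listed keys, in the order they are listed.
const pickKeys = (obj, keys) => {
  const out = {}
  for (const key of keys) {
    if (key in obj) out[key] = obj[key]
  }
  return out
}

// Two made-up rows: same business values, different metadata (id, created_at).
const rowA = { id: 1, email: 'ada@example.com', plan: 'pro', created_at: '2020-01-01T00:00:00Z' }
const rowB = { id: 2, plan: 'pro', email: 'ada@example.com', created_at: '2020-06-15T12:30:00Z' }

// The columns our (assumed) business logic treats as defining uniqueness.
const businessKeys = ['email', 'plan']
const cleanA = pickKeys(rowA, businessKeys)
const cleanB = pickKeys(rowB, businessKeys)

console.log(JSON.stringify(rowA) === JSON.stringify(rowB)) // false
console.log(JSON.stringify(cleanA) === JSON.stringify(cleanB)) // true
```

Once the metadata is gone, the two rows serialize identically, so whatever hash we apply afterwards will give them the same fingerprint.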
Third, we simplify our cleaned data into a stable, efficient representation that we can store and use for quick comparisons. Most of the time this step involves some sort of cryptographic hash to normalize the way we refer to our content in a unique, concise manner.
In the code above, we want to make sure that our hashing is stable, which is made easy for us by the fast-json-stable-stringify package.
This awesome package recursively makes sure that no matter how our JavaScript object was constructed or what order its keys may be in, it will always output the same string representation for any two objects that have deep equality.
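If you're curious why plain JSON.stringify isn't enough here, the sketch below shows the problem and one way to get around it. stableStringifySketch is a hypothetical, heavily simplified stand-in written for this post, not the actual implementation of fast-json-stable-stringify, and it assumes plain JSON-like data (no undefined values, functions, or cycles).

```javascript
// Hypothetical, simplified stable stringify: sorts object keys recursively
// so the output does not depend on key insertion order.
// Assumes plain objects, arrays, and JSON primitives only.
const stableStringifySketch = (value) => {
  if (Array.isArray(value)) {
    return '[' + value.map(stableStringifySketch).join(',') + ']'
  }
  if (value !== null && typeof value === 'object') {
    const body = Object.keys(value)
      .sort()
      .map((key) => JSON.stringify(key) + ':' + stableStringifySketch(value[key]))
      .join(',')
    return '{' + body + '}'
  }
  return JSON.stringify(value)
}

// Deeply equal objects built with different key orders.
const objA = { name: 'widget', size: { w: 2, h: 3 } }
const objB = { size: { h: 3, w: 2 }, name: 'widget' }

console.log(JSON.stringify(objA) === JSON.stringify(objB)) // false
console.log(stableStringifySketch(objA) === stableStringifySketch(objB)) // true
console.log(stableStringifySketch(objA)) // {"name":"widget","size":{"h":3,"w":2}}
```

In real code I'd still reach for the package, since it deals with the edge cases this sketch ignores.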
There are some details this explanation is glossing over, but that’s the beauty of the npm ecosystem — we don’t have to understand all the bits & pieces to take advantage of their abstractions.
Up until now, we’ve glossed over the hashing aspect of things, so let’s see what this looks like in code:
const hasha = require('hasha')
const hash = (input) => hasha(input, { algorithm: 'sha256' })
Note that there are lots of different ways you could define your hash function. This example uses the very common SHA-256 hash function and outputs a 64-character hex encoding of the digest.
Here is an example output fingerprint:
2d3ea73f0faacebbb4a437ff758c84c8ef7fd6cce45c07bee1ff59deae3f67f5
Here is an alternative hash implementation that uses the Node.js crypto package directly:
const crypto = require('crypto')
const hash = (d) => {
  const buffer = Buffer.isBuffer(d) ? d : Buffer.from(d.toString())
  return crypto.createHash('sha256').update(buffer).digest('hex')
}
Both of these hash implementations are equivalent for our purposes.
The most important thing to keep in mind here is that we want to use a cryptographic hash function to output a compact, unique fingerprint that changes if our input data changes and remains the same if our input data remains the same.
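Here is a small end-to-end sketch of those properties using only Node's built-in crypto module. fingerprintOf is a hypothetical helper for flat objects that folds the pick and stable-stringify steps into one by copying the chosen keys in sorted order; the objects and the updatedAt field are made up.

```javascript
const crypto = require('crypto')

// SHA-256 hex digest of a string, via Node's built-in crypto module.
const sha256Hex = (text) => crypto.createHash('sha256').update(text).digest('hex')

// Hypothetical fingerprint helper for flat objects: copies the chosen keys
// in sorted order so the serialized form ignores insertion order.
const fingerprintOf = (obj, keys) => {
  const cleaned = {}
  for (const key of [...keys].sort()) cleaned[key] = obj[key]
  return sha256Hex(JSON.stringify(cleaned))
}

const keys = ['keyFoo', 'keyBar']
const first = { keyFoo: 1, keyBar: 'x', updatedAt: 'yesterday' }
const second = { keyBar: 'x', keyFoo: 1, updatedAt: 'today' } // same content, new metadata
const third = { keyFoo: 2, keyBar: 'x', updatedAt: 'today' } // content changed

console.log(fingerprintOf(first, keys).length) // 64
console.log(fingerprintOf(first, keys) === fingerprintOf(second, keys)) // true
console.log(fingerprintOf(first, keys) === fingerprintOf(third, keys)) // false
```

The fingerprint stays put when only the key order or the metadata changes, and it moves as soon as one of the values we care about does.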
Once you start thinking about how data can be uniquely defined by its content, the applications are really endless.
Here are a few use cases where I’ve personally found this approach useful:
We’ve really only started to scratch the surface of what you can do with content-based addressing. Hopefully, I’ve shown how simple this mindset shift can be in JavaScript and touched a bit on the benefits this approach brings to the table.
If you enjoy this stuff, I would recommend checking out:
Thanks! 🙏
Previously published at https://blog.saasify.sh/content-based-addressing/