I was asked that question on stage at a recent conference. My answer was a single word: deanonymization. I firmly believe that identifying and understanding relevant groups of actors is a pivotal challenge to unlock the potential of blockchain analytics. At IntoTheBlock, we spend quite a bit of time thinking about this problem and identifying the right boundaries that don’t conflict with the ethos of the crypto movement. Today, I would like to explore this idea a bit further.
The architecture of most blockchains in the market relies on anonymity or pseudonymity as a mechanism to protect the privacy of its participants and enable decentralization. Those data obfuscation mechanisms allow crypto-asset transaction data to be recorded in public ledgers, available to everyone, but also make its analysis extremely difficult. Without identity, it is hard to develop meaningful semantics and layers of interpretation for blockchain datasets and, without those, blockchain analytics will remain relatively basic. However, it is important to understand that deanonymizing blockchain datasets does not imply knowing the identity of every address in the ledger. That approach is nearly impossible to scale. Instead, we can settle for identifying and understanding the behavior of known actors like exchanges, OTC desks, miners and other parties that constitute key elements of a blockchain ecosystem.
Not All Addresses are Created Equal
Network metrics are omnipresent in blockchain analytics and clearly illustrate the power of deanonymization. An address count is often a misleading metric for a simple reason: not all addresses are equal. An address created by an exchange for a temporary transfer is not the same as a wallet holding meaningful savings over a long timeframe. Similarly, the hot wallet of an exchange like Binance should not be interpreted using the same semantics as my personal wallet. Looking at all addresses through the same lens of anonymity leads to limited and often misleading interpretations.
Anonymity vs. Interpretability
Anonymous or pseudonymous identities are a key element of scalable decentralized architectures, but they also make it extremely hard to obtain meaningful information from blockchain datasets. One way to understand this argument is to think of anonymity as a counter-factor to the interpretability of blockchain analytics.
The friction between anonymity and interpretability in blockchain datasets is relatively straightforward: the more anonymous a blockchain dataset is, the more difficult it is to extract meaningful intelligence from it. Identities provide context, and context is a key building block of interpretability.
What You Are is More Important than Who You Are: Deanonymization vs. Labeling
Deanonymizing blockchain datasets does not entail knowing the specific identity of every single actor. That is not only a monumental task but also counterproductive past a certain scale. Instead, we can settle for understanding the key characteristics or features of a specific actor to bring a meaningful level of interpretability to our analytics. So instead of pinpointing the specific identity of an address, we can attach labels or metadata that allow us to contextualize its behavior.
At scale, labeling is often a more powerful concept than identity. Understanding the specific identities of actors in the blockchain ecosystem certainly enables interesting levels of personalization but remains relatively limited when it comes to understanding trends at the macro level.
So the challenge of deanonymization is more about identifying key labels or attributes of blockchain addresses than about uncovering specific identities. How can we go about that?
Machine Learning to the Rescue
The idea of labeling or deanonymizing blockchains to enable better analytics starts by understanding relevant patterns and characteristics of the known actors in the space. Intuitively, we would think of creating rules to qualify the different elements of a blockchain ecosystem. Something like this:
If an address holds a large position in Bitcoin and executes over 100 transactions per day, then it's an exchange…
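A rule like that could be sketched as a trivial classifier. The feature names and thresholds below are hypothetical, chosen only to illustrate the approach:

```python
# A naive rule-based labeler. The stats dict and all thresholds are
# hypothetical, purely for illustration of the rules-based approach.
def label_address(stats):
    # Hypothetical rule: large balance plus high activity -> exchange.
    if stats["balance_btc"] > 1_000 and stats["tx_per_day"] > 100:
        return "exchange"
    # Meaningful balance held with little movement -> long-term holder.
    if stats["balance_btc"] > 10 and stats["tx_per_day"] < 1:
        return "holder"
    return "unknown"

print(label_address({"balance_btc": 5_000, "tx_per_day": 250}))  # prints "exchange"
```

Even this toy version hints at the problem: every threshold is a guess baked in ahead of time.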
While tempting, the rules-based approach will quickly fall short of delivering any meaningful intelligence. A few reasons for this:
1. Preset Knowledge: A rules-based classification assumes that we have enough knowledge to exactly identify the different actors in the blockchain ecosystem. This is obviously not true.
2. Constant Changes: The architecture of blockchain solutions changes all the time, which will challenge any preset rules.
3. Number of Attributes: It is easy to create a rule with two or three parameters, but try it with twenty or a hundred. Identifying actors like exchanges or OTC desks requires multiple combinations of dozens of attributes.
Instead of preset rules, we need a mechanism that learns patterns from a blockchain dataset and extrapolates meaningful rules that allow us to label relevant actors. Conceptually, this is a textbook machine learning problem.
From the machine learning standpoint, we should think about the deanonymization challenge combining two main approaches:
· Unsupervised Learning: Unsupervised learning focuses on learning patterns in a given dataset and identifying relevant groups. In the context of blockchain datasets, an unsupervised model could be used to segment a group of addresses into relevant groups based on their activity and attach labels to those groups.
· Supervised Learning: Supervised learning methods leverage previous knowledge to learn new characteristics about a given dataset. In the context of blockchains, supervised learning methods can be trained on a known set of exchange addresses to detect new ones.
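The unsupervised idea can be sketched with a minimal k-means clustering over per-address features. The addresses, features, and values below are illustrative, not real chain data:

```python
import random

# Toy feature vectors per address: (tx_per_day, avg_holding_days).
# All addresses and values are hypothetical, for illustration only.
addresses = {
    "addr_a": (250.0, 0.5),   # busy, short holding -> exchange-like
    "addr_b": (310.0, 0.2),
    "addr_c": (0.1, 400.0),   # quiet, long holding -> holder-like
    "addr_d": (0.05, 700.0),
}

def kmeans(points, k=2, iters=20, seed=0):
    """Minimal k-means: returns a cluster index for each point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared distance).
        assign = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2
                                  for p, q in zip(pt, centroids[c])))
            for pt in points
        ]
        # Recompute each centroid as the mean of its assigned points.
        for c in range(k):
            members = [pt for pt, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(sum(d) / len(members)
                                     for d in zip(*members))
    return assign

labels = kmeans(list(addresses.values()))
clusters = dict(zip(addresses, labels))
# The exchange-like addresses land in one cluster, the holder-like in the other;
# an analyst could then attach labels to each discovered group.
```

The model discovers the groups on its own; the label names still come from a human or a supervised pass.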
Deanonymizing or labeling blockchain datasets is rarely a choice between supervised or unsupervised methods but rather a combination of the two. Machine learning models can effectively learn the characteristics of specific actors in the blockchain ecosystem and use that knowledge to understand their behavior.
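The supervised side can be sketched as a nearest-centroid classifier trained on addresses whose labels are already known. This is a toy sketch with hypothetical features and labels, not an actual production pipeline:

```python
# Training data: (features, label) pairs for addresses with known roles.
# Features are (tx_per_day, avg_holding_days); all values are hypothetical.
train = [
    ((250.0, 0.5), "exchange"),
    ((310.0, 0.2), "exchange"),
    ((0.1, 400.0), "holder"),
    ((0.05, 700.0), "holder"),
]

def fit_centroids(samples):
    """Compute one mean feature vector per known label."""
    by_label = {}
    for feats, label in samples:
        by_label.setdefault(label, []).append(feats)
    return {label: tuple(sum(d) / len(pts) for d in zip(*pts))
            for label, pts in by_label.items()}

def predict(centroids, feats):
    """Label a new address by its nearest class centroid."""
    return min(centroids,
               key=lambda lbl: sum((a - b) ** 2
                                   for a, b in zip(feats, centroids[lbl])))

model = fit_centroids(train)
print(predict(model, (400.0, 0.1)))  # prints "exchange"
```

In practice the two approaches feed each other: clusters discovered without labels become training data once a few members are identified, and the supervised model then extends those labels to new addresses.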
Bringing layers of labeling or identity information into blockchain datasets is a key challenge in enabling more meaningful analytics. Labels bring better context, and context enables better interpretability in analytic models. Despite the powerful arsenal of machine learning stacks at our disposal, deanonymization remains an incredibly difficult roadblock on the journey to better analytics for the blockchain ecosystem.