Using machine learning to analyze blockchain datasets is a fascinating challenge. Beyond the potential of uncovering insights that help us understand the behavior of crypto-assets, blockchain datasets present unique challenges to a machine learning practitioner. Many of these challenges translate into major roadblocks for most traditional machine learning techniques. However, the rapid evolution of machine intelligence technologies has enabled the creation of novel machine learning methods that are well suited to the analysis of blockchain datasets. At IntoTheBlock, we regularly experiment with these new methods to improve the efficiency of our market intelligence signals. Today, I would like to provide a brief overview of some novel ideas in the machine learning space that can yield interesting results in the analysis of blockchain data.

Blockchain datasets offer a unique universe of data related to the behavior of crypto-assets and, therefore, unique opportunities for the application of machine learning methods. However, the nature and structure of blockchain datasets bring their own set of challenges. While we might think that blockchain datasets are a paradise for machine learning applications, traditional methods typically encounter some unexpected obstacles:

· Lack of Labeled Data: Blockchain datasets contain very little labeled data that can be used to train machine learning models.

· Obfuscated Data: Blockchains are full of encrypted or obfuscated data that is nearly impossible to analyze.

· Lack of Models to Benchmark Against: Machine learning relies heavily on benchmarking models against other models. That is harder in a space with very few documented models that have produced credible results.

The Traditional Machine Learning School of Thought

Traditional machine learning practitioners divide the world into two types of models:

· Supervised Learning: Supervised learning, as the name indicates, involves a supervisor acting as a teacher. We train the machine using well-labeled data, meaning that each training example is already tagged with the correct answer.

· Unsupervised Learning: Unsupervised learning trains the machine on data that is neither classified nor labeled and lets the algorithm act on that data without guidance. The task of the machine is to group unsorted information according to similarities, patterns and differences without any labeled examples to learn from.

In the context of blockchain datasets, supervised learning applications are constrained by the limited availability of labeled datasets. Unsupervised methods can be very effective, but it's hard to judge their performance in the absence of other models or benchmarks to compare against.
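
To make the contrast between the two families concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the single feature (log10 of an address's average transfer size), the six values, and the "retail" and "whale" labels are invented and do not come from a real dataset.

```python
from statistics import mean

# Hypothetical feature: log10 of the average transfer size for six addresses.
values = [0.1, 0.2, 0.15, 5.0, 5.5, 4.8]

# Supervised: labels are available, so each class centroid is computed directly.
labels = ["retail", "retail", "retail", "whale", "whale", "whale"]  # assumed labels
class_centroids = {
    name: mean(v for v, lab in zip(values, labels) if lab == name)
    for name in set(labels)
}

def classify(value):
    """Assign the label whose centroid is closest to the value."""
    return min(class_centroids, key=lambda name: abs(value - class_centroids[name]))

# Unsupervised: no labels, so 2-means has to discover the groups on its own.
def two_means(data, iterations=10):
    """1-D k-means with k=2; assumes the data holds at least two distinct values."""
    low, high = min(data), max(data)  # deterministic initial centers
    for _ in range(iterations):
        near_low = [v for v in data if abs(v - low) <= abs(v - high)]
        near_high = [v for v in data if abs(v - low) > abs(v - high)]
        low, high = mean(near_low), mean(near_high)
    return low, high
```

The supervised version can name its predictions because the labels told it what each group means. The clustering version only returns two anonymous centers, which illustrates why unsupervised results are hard to judge without a benchmark.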

To help improve unsupervised and supervised methods in the analysis of blockchain data, we attempt to use some novel methods that have been gaining a lot of traction within the machine learning community in recent years.

New Machine Learning Methods That Can Help Us Understand Blockchain Datasets

We are living in the golden era of machine learning research and technology. Today, machine learning frameworks and platforms are rapidly incorporating many techniques that help to enable new capabilities beyond traditional supervised and unsupervised methods. We have found several of those techniques very relevant to the analysis of blockchain datasets.

Semi-Supervised Learning

Semi-supervised learning is one of the areas of machine learning that has received a lot of attention in recent years. Conceptually, semi-supervised learning is a variation of supervised learning that combines labeled and unlabeled data for training. The principle is that combining a small amount of labeled data with a larger amount of unlabeled data can, in many scenarios, yield better accuracy than a supervised model trained on the labeled data alone.

In the context of blockchain analysis, semi-supervised learning can be used to train models that can classify actors like exchanges or wallets without relying on a large labeled dataset for training. For instance, a classifier can learn to identify crypto exchanges using a few labeled addresses and expand its knowledge using a larger pool of unlabeled addresses.
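
One common way to do this is self-training: fit a simple model on the few labeled examples, pseudo-label the unlabeled points it is confident about, and refit. The sketch below is one possible illustration, not a description of any production pipeline. The two features (imagined as log-scaled transaction count and counterparty count), the cluster positions, and the confidence rule are all assumptions, and the data is synthetic.

```python
import math
import random

def centroid(points):
    """Component-wise mean of a list of equal-length feature tuples."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def self_train(labeled, unlabeled, rounds=5, ratio=0.5):
    """Self-training with a nearest-centroid classifier.

    labeled:   dict mapping a label to a list of feature tuples (two or more labels)
    unlabeled: list of feature tuples
    An unlabeled point receives a pseudo-label only when its nearest centroid
    is at most `ratio` times as far away as the second-nearest one.
    """
    pools = {label: list(points) for label, points in labeled.items()}
    remaining = list(unlabeled)
    for _ in range(rounds):
        centers = {label: centroid(points) for label, points in pools.items()}
        uncertain = []
        for point in remaining:
            ranked = sorted((math.dist(point, c), label) for label, c in centers.items())
            (nearest, label), (runner_up, _) = ranked[0], ranked[1]
            if nearest <= ratio * runner_up:
                pools[label].append(point)   # confident: accept the pseudo-label
            else:
                uncertain.append(point)      # ambiguous: try again next round
        if len(uncertain) == len(remaining):
            break  # no new pseudo-labels, so further rounds would change nothing
        remaining = uncertain
    return {label: centroid(points) for label, points in pools.items()}

def predict(centers, point):
    """Return the label of the closest centroid."""
    return min(centers, key=lambda label: math.dist(point, centers[label]))

# Synthetic stand-in data: two hypothetical, well-separated address profiles.
rng = random.Random(7)

def sample(center, n):
    return [tuple(rng.gauss(m, 0.3) for m in center) for _ in range(n)]

EXCHANGE, WALLET = (3.0, 2.5), (0.5, 0.3)  # assumed cluster centers
labeled = {"exchange": sample(EXCHANGE, 3), "wallet": sample(WALLET, 3)}
unlabeled = sample(EXCHANGE, 200) + sample(WALLET, 200)
centers = self_train(labeled, unlabeled)
```

With only three labeled addresses per class, the initial centroids are rough estimates. The pseudo-labeled points pull them toward the true cluster centers, which is the effect the paragraph above describes.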

Transfer Learning

Transfer learning is a form of representation learning based on the idea of mastering a new task by reusing knowledge from a previous task. Traditional learning is isolated: each model is trained on a specific task and dataset, and no knowledge is retained that can be transferred from one model to another. In transfer learning, you can leverage knowledge (features, weights, etc.) from previously trained models to train newer models, and even tackle problems like having less data for the newer task!

When it comes to blockchain data analysis, transfer learning can be used to build models that generalize knowledge from previous tasks. For instance, a model that identifies anomalous Bitcoin transfers can carry its knowledge over to the Ethereum blockchain.
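
As a rough sketch of that idea, the example below trains a logistic regression on a large synthetic "Bitcoin" dataset and then fine-tunes the same weights on a tiny synthetic "Ethereum" dataset. The features (standardized amount and fee), the labeling rule, and the size of the shift between the two chains are assumptions invented for the example. Real chains would differ in far more than a bias term.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train(data, weights=None, epochs=30, lr=0.1):
    """SGD for logistic regression; `weights` lets training start from a prior model."""
    w = list(weights) if weights is not None else [0.0, 0.0, 0.0]  # [bias, w1, w2]
    for _ in range(epochs):
        for (x1, x2), y in data:
            error = sigmoid(w[0] + w[1] * x1 + w[2] * x2) - y
            w[0] -= lr * error
            w[1] -= lr * error * x1
            w[2] -= lr * error * x2
    return w

def accuracy(w, data):
    """Fraction of rows where the sign of the linear score matches the label."""
    hits = sum((w[0] + w[1] * x1 + w[2] * x2 > 0) == (y == 1) for (x1, x2), y in data)
    return hits / len(data)

rng = random.Random(0)

def make_chain(n, bias):
    """Synthetic transfers: standardized (amount, fee) features, label 1 = anomalous."""
    rows = []
    for _ in range(n):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
        rows.append(((x1, x2), 1 if 2 * x1 - x2 + bias > 0 else 0))
    return rows

bitcoin = make_chain(500, bias=0.0)         # large source dataset
ethereum_train = make_chain(20, bias=0.3)   # tiny target dataset
ethereum_test = make_chain(500, bias=0.3)

source_model = train(bitcoin)
# Transfer: start from the source weights and fine-tune briefly on the target data.
transferred = train(ethereum_train, weights=source_model, epochs=10, lr=0.05)
# Baseline: same tiny dataset, but starting from zero weights.
from_scratch = train(ethereum_train, epochs=10, lr=0.05)
```

Comparing `accuracy(transferred, ethereum_test)` with `accuracy(from_scratch, ethereum_test)` shows how much the reused weights help. The gap depends on how similar the two tasks really are, so it should be measured rather than assumed.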

Neural Architecture Search and AutoML

Designing machine learning models is an incredibly subjective task that often relies on the experience of data scientists, and those intuitions are rarely tested objectively. A given machine learning problem can have a practically unlimited number of solutions, and it's very hard to know whether we have found the right one. What if we could make the design of machine learning models a machine learning problem? Clever, huh?

Neural architecture search (NAS), one of the core techniques within AutoML, looks to automate the creation of machine learning models. Given a dataset, a series of optimization metrics and some constraints in terms of time or resources, AutoML methods should be able to evaluate thousands of neural network architectures and return the best one they find. While an effective data science team might be able to evaluate a dozen models for a given problem, an AutoML method can search through far more architectures in a relatively manageable time.

In the context of blockchain datasets, NAS and AutoML can help us evaluate a large number of models for a given scenario. For instance, instead of hand-designing a specific neural network for predicting exchange fund flows, we could evaluate hundreds of models and arrive at a more polished architecture.
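
The simplest search strategy, and a common baseline for more sophisticated NAS methods, is random search under a fixed evaluation budget. The sketch below shows only the shape of that loop. The search space is hypothetical, and `placeholder_score` is an invented formula standing in for the expensive step of actually training each candidate network and measuring its validation performance.

```python
import random

# Hypothetical search space for a small feed-forward network.
SEARCH_SPACE = {
    "hidden_layers": [1, 2, 3, 4],
    "units_per_layer": [16, 32, 64, 128],
    "activation": ["relu", "tanh"],
    "dropout": [0.0, 0.2, 0.5],
}

def sample_architecture(rng):
    """Draw one candidate by picking a value for every dimension of the space."""
    return {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}

def random_search(evaluate, budget, seed=0):
    """Evaluate `budget` random candidates and return the best one with its score.

    `evaluate` maps an architecture dict to a validation score (higher is better).
    """
    rng = random.Random(seed)
    history = []
    for _ in range(budget):
        candidate = sample_architecture(rng)
        history.append((evaluate(candidate), candidate))
    best_score, best = max(history, key=lambda item: item[0])
    return best, best_score, history

def placeholder_score(arch):
    """Stand-in for 'train the network and measure validation accuracy'.

    The formula is invented purely so the example runs; it is not a real metric.
    """
    size_penalty = abs(arch["hidden_layers"] * arch["units_per_layer"] - 128) / 512
    return 1.0 - size_penalty - abs(arch["dropout"] - 0.2)
```

A call such as `random_search(placeholder_score, budget=50)` returns the best candidate, its score, and the full history. In practice, `evaluate` would wrap a real training run, and the budget would be the time or resource constraint mentioned above.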

Meta Learning

Meta-learning can be simply defined as learning how to learn: acquiring the versatility to pick up new tasks quickly. As humans, we are able to learn multiple tasks with minimal information. We can recognize a new type of object after seeing a single picture of it, and we can learn complex, multi-task activities such as driving or piloting an airplane. While AI agents can master really complex tasks, they require massive amounts of training on each atomic subtask and they remain incredibly bad at multi-tasking. One popular type of meta-learning technique is few-shot learning, which enables the creation of deep neural networks that can learn from very small datasets, mimicking, for instance, how babies can learn to identify objects by seeing only a picture or two.

In the context of blockchain analysis, we can use meta-learning to take models that identify patterns such as malicious transfers and adapt them, from only a few examples, to identify other useful patterns such as payment transactions.
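
One way to picture this is a Reptile-style algorithm, which learns a shared starting point from which a handful of gradient steps is enough to fit a new task. The sketch below is a toy: each "task" is just a line y = a*x + b with its own slope and intercept, and the task distribution, step counts and learning rates are assumptions chosen so the example runs quickly. It says nothing about how well the approach works on real transaction data.

```python
import random

def sample_task(rng):
    """A task is a line y = a*x + b with task-specific slope and intercept."""
    return rng.uniform(2.5, 3.5), rng.uniform(0.5, 1.5)

def sample_points(task, n, rng):
    """Draw n noise-free (x, y) examples from the task's line."""
    a, b = task
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, a * x + b) for x in xs]

def adapt(params, points, steps=5, lr=0.1):
    """A few full-batch gradient steps on mean squared error."""
    a, b = params
    for _ in range(steps):
        grad_a = sum(2 * (a * x + b - y) * x for x, y in points) / len(points)
        grad_b = sum(2 * (a * x + b - y) for x, y in points) / len(points)
        a, b = a - lr * grad_a, b - lr * grad_b
    return a, b

def reptile(meta_steps=2000, meta_lr=0.1, seed=0):
    """Move the shared initialization toward each task's adapted parameters."""
    rng = random.Random(seed)
    a, b = 0.0, 0.0
    for _ in range(meta_steps):
        task = sample_task(rng)
        task_a, task_b = adapt((a, b), sample_points(task, 5, rng))
        a += meta_lr * (task_a - a)
        b += meta_lr * (task_b - b)
    return a, b

def mse(params, task):
    """Error against the true line on an evenly spaced grid over [-1, 1]."""
    a, b = params
    true_a, true_b = task
    grid = [i / 10 for i in range(-10, 11)]
    return sum((a * x + b - (true_a * x + true_b)) ** 2 for x in grid) / len(grid)

meta_init = reptile()
```

To see the few-shot effect, draw a new task, give `adapt` only three points, and compare `mse(adapt(meta_init, shots), task)` with `mse(adapt((0.0, 0.0), shots), task)`. The meta-learned starting point already sits close to the family of tasks, so the same five gradient steps go much further.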