Machine learning models tend to overfit when used with blockchain datasets. What is overfitting and how can it be addressed?
The idea of using machine learning to analyze blockchain datasets sounds incredibly attractive at first glance, but it's a road full of challenges. Among those challenges, the lack of labeled datasets remains by far the biggest hurdle to overcome when applying machine learning methods to blockchain data.
These limitations force many machine learning models to train on very small data samples and over-optimize for them, causing a phenomenon known as overfitting. Today, I would like to take a deep dive into the overfitting challenge in blockchain analysis and propose a few ideas to address it.
Overfitting is considered one of the biggest challenges in modern deep learning applications. Conceptually, overfitting occurs when a model generates a hypothesis that is too tailored to a specific dataset, making it impossible to adapt to new datasets.
A useful analogy for understanding overfitting is to think of it as hallucination in the model. Essentially, a model hallucinates/overfits when it infers incorrect hypotheses from a dataset.
A lot has been written about overfitting since the early days of machine learning, so I won't presume to have any clever new ways to explain it. In the case of blockchain datasets, overfitting is a direct result of the lack of labeled data.
Blockchains are big, semi-anonymous data structures in which everything is represented using a common set of constructs such as transactions, addresses and blocks.
From that perspective, there is minimal information qualifying any given blockchain record. Is a transaction a transfer or a payment? Is an address an individual investor's wallet or an exchange's cold wallet? Those qualifiers are essential for machine learning models.
Imagine that we are creating a model to detect exchange addresses across a set of blockchains. This process requires us to train the model on an existing dataset of labeled blockchain addresses, and we all know those are not very common. If we use a small dataset from Etherscan or another source, the model is likely to overfit and make erroneous classifications.
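To make the failure mode concrete, here is a minimal sketch of that scenario. Everything in it is synthetic: the features are hypothetical per-address aggregates and the labels are random, standing in for a tiny scraped dataset. A flexible model memorizes the training split and collapses to chance on the test split:

```python
# A minimal sketch of the problem: a flexible classifier trained on a tiny,
# synthetic "labeled addresses" dataset. The feature columns are hypothetical
# stand-ins for per-address aggregates (tx count, mean value, counterparties...).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)

# Pretend we only collected ~60 labeled addresses (exchange vs. non-exchange).
n_samples, n_features = 60, 25
X = rng.normal(size=(n_samples, n_features))   # synthetic per-address features
y = rng.integers(0, 2, size=n_samples)         # 1 = exchange, 0 = other

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

# A deep, unconstrained forest memorizes the tiny training set...
model = RandomForestClassifier(n_estimators=200, max_depth=None, random_state=0)
model.fit(X_train, y_train)

# ...scoring near-perfectly on training data and near chance on held-out data.
print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("test accuracy: ", accuracy_score(y_test, model.predict(X_test)))
```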
One of the aspects that makes overfitting so challenging is that it is hard to generalize across different deep learning techniques. Convolutional neural networks tend to develop overfitting patterns that are different from those observed in recurrent neural networks, which are different again from those of generative models, and that pattern can be extrapolated to any class of deep learning model.
Somewhat ironically, the propensity to overfit has grown with the computational capacity of deep learning models. As deep learning agents can generate complex hypotheses at virtually no cost, the propensity to overfit increases.
Overfitting is a constant challenge in machine learning models, but it's almost a given when working with blockchain datasets. The obvious answer to fighting overfitting is to use larger training datasets, but that's not always an option. At IntoTheBlock, we regularly encounter overfitting challenges, and we rely on a series of basic recipes to address them.
The first rule of fighting overfitting is to recognize it. While there are no silver bullets to prevent overfitting, practical experience has shown some simple, almost common-sense rules that help prevent this phenomenon in deep learning applications.
From the dozens of best practices that have been published to prevent overfitting, there are three fundamental ideas that encompass most of them.
Overfitting typically occurs when a model produces too many hypotheses without the corresponding data to validate them. As a result, deep learning applications should try to keep a decent ratio between the test datasets and the hypotheses that need to be evaluated. However, this is not always an option.
There are many deep learning algorithms, such as inductive learning methods, that rely on constantly generating new and sometimes more complex hypotheses. In those scenarios, there are statistical techniques that can help estimate the number of hypotheses needed to optimize the chances of finding one that is close to correct.
While this approach does not provide an exact answer, it can help maintain a statistically balanced ratio between the number of hypotheses and the composition of the dataset. Harvard professor Leslie Valiant brilliantly explains this concept in his book Probably Approximately Correct.
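As a hedged, back-of-the-envelope illustration of the kind of estimate involved, the classic PAC bound for a finite hypothesis class can be computed in a few lines (the numbers below are purely illustrative):

```python
# Classic PAC sample-complexity bound for a finite hypothesis class H:
# a consistent learner needs about (ln|H| + ln(1/delta)) / epsilon labeled
# examples to find a hypothesis with error <= epsilon with probability >= 1 - delta.
import math

def pac_sample_size(hypothesis_count: int, epsilon: float, delta: float) -> int:
    return math.ceil((math.log(hypothesis_count) + math.log(1 / delta)) / epsilon)

# Searching ~1,000,000 candidate hypotheses, targeting 5% error at 95% confidence:
print(pac_sample_size(10**6, epsilon=0.05, delta=0.05))  # -> 337 labeled examples
```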
The data/hypothesis ratio is very visible when it comes to blockchain analysis. Let's imagine that we are building a prediction algorithm based on a year of blockchain transactions.
Because we are unsure which machine learning model to test, we use a neural architecture search (NAS) approach that tests hundreds of models against the blockchain dataset.
Given that the dataset only contains a year of transactions, the NAS method is likely to produce a model that is completely overfitted for the training dataset.
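One way to keep a search like that honest is to score every candidate with cross-validation and reserve a final test split that is touched exactly once. The sketch below is an assumption-laden stand-in for a real NAS run: the "transaction features" are synthetic and the candidates are simple sklearn models rather than neural architectures.

```python
# Guarding a model search against overfitting: cross-validate each candidate on
# a search split, and evaluate only the single winner on a held-out final split.
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))   # stand-in for one year of transaction features
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_search, X_final, y_search, y_final = train_test_split(
    X, y, test_size=0.2, random_state=0
)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "deep_tree": DecisionTreeClassifier(max_depth=None),  # flexible, overfit-prone
    "shallow_tree": DecisionTreeClassifier(max_depth=3),
}

# Cross-validated scores on the search split pick the winner.
scores = {name: cross_val_score(est, X_search, y_search, cv=5).mean()
          for name, est in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "->", best)

# The final split is used exactly once, for an honest report on that winner.
final_model = candidates[best].fit(X_search, y_search)
print("held-out score:", final_model.score(X_final, y_final))
```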
A conceptually trivial but technically difficult idea for preventing overfitting in deep learning models is to continuously generate simpler hypotheses. Of course! Simpler is always better, isn't it?
But what is a simpler hypothesis in the context of deep learning algorithms? If we need to reduce it to a quantitative factor, I would say that the number of attributes in a deep learning hypothesis is directly proportional to its complexity.
Simpler hypotheses tend to be easier to evaluate, both computationally and cognitively, than ones with a large number of attributes.
As a result, simpler models are typically less prone to overfitting than complex ones. Great! Now the next obvious headache is figuring out how to generate simpler hypotheses in deep learning models.
A not-so-obvious technique is to attach some form of penalty to an algorithm based on its estimated complexity. That mechanism tends to favor simpler, approximately accurate hypotheses over more complex, and sometimes more accurate, ones that could fall apart when new datasets appear.
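The most common incarnation of that penalty is L1/L2 regularization, which adds a cost term proportional to the size of the model's weights. Here is a minimal sklearn sketch on synthetic data; the dataset and hyperparameters are illustrative, not our production setup.

```python
# Penalizing complexity via regularization: an L1 penalty shrinks most weights
# to exactly zero, effectively choosing a simpler hypothesis.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 100))          # many features, few samples
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# C is the inverse regularization strength: smaller C = bigger complexity penalty.
unpenalized = LogisticRegression(C=1e6, max_iter=5000).fit(X_tr, y_tr)
penalized = LogisticRegression(C=0.1, penalty="l1", solver="liblinear").fit(X_tr, y_tr)

print("unpenalized test accuracy:", unpenalized.score(X_te, y_te))
print("penalized test accuracy:  ", penalized.score(X_te, y_te))
print("features kept by L1:", int((penalized.coef_ != 0).sum()), "of", X.shape[1])
```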
To explain this idea in the context of blockchain analysis, let's imagine that we are building a model for classifying payment transactions on a blockchain. The model uses a complex deep neural network that generates 1,000 features to perform the classification. If applied to a smaller blockchain such as Dash or Litecoin, that model is very likely to overfit.
Bias and variance are two key estimators in deep learning models. Conceptually, bias is the difference between the average prediction of our model and the correct value we are trying to predict. A model with high bias pays very little attention to the training data and oversimplifies the problem, which leads to high error on both training and test data.
Alternatively, variance refers to the variability of a model's prediction for a given data point, which tells us about the spread of our data. A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn't seen before. As a result, such models perform very well on training data but have high error rates on test data.
How are bias and variance related to overfitting? In super simple terms, the art of generalization can be summarized as reducing the bias of a model without increasing its variance.
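For squared-error prediction, this trade-off has a well-known closed form: the expected error decomposes into exactly these two quantities plus irreducible noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\mathrm{Bias}\big[\hat{f}(x)\big]^2}_{\text{wrong on average}}
  + \underbrace{\mathrm{Var}\big[\hat{f}(x)\big]}_{\text{unstable across training sets}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```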
A good practice in deep learning models is to regularly compare the produced hypotheses against test datasets and evaluate the results. If the hypotheses keep making the same mistakes, we have a big bias problem and need to tweak or replace the algorithm. If instead there is no clear pattern to the mistakes, the problem is variance and we need more data. In summary: consistent errors point to high bias and call for a better model; scattered errors point to high variance and call for more data.
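That rule of thumb is easy to encode. The helper below is a toy diagnostic with made-up thresholds, not the actual check we run at IntoTheBlock:

```python
# Rough bias/variance diagnosis from train/test error rates. The thresholds
# are illustrative placeholders, not recommended values.
def diagnose(train_error: float, test_error: float,
             high_error: float = 0.2, gap: float = 0.05) -> str:
    if train_error > high_error:
        return "high bias: model too simple -> tweak or replace the algorithm"
    if test_error - train_error > gap:
        return "high variance: model memorizing -> get more data or regularize"
    return "reasonable bias/variance balance"

print(diagnose(train_error=0.35, test_error=0.38))  # bias problem
print(diagnose(train_error=0.02, test_error=0.30))  # variance problem
```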
In the context of blockchain analysis, the bias-variance friction is present everywhere. Let's go back to our algorithm that attempts to predict price from a number of blockchain factors. If we use a simple linear regression method, the model is likely to underfit. However, if we use a super complex neural network with a small dataset, the model is likely to overfit.
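Both failure modes are easy to reproduce on a toy "price" series. In the sketch below, a straight line stands in for the linear regression, and a high-degree polynomial stands in for the over-parameterized network, since it overfits deterministically on a small sample:

```python
# Underfitting vs. overfitting on 40 synthetic points of a nonlinear signal.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=40)   # nonlinear "price" signal
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

underfit = LinearRegression().fit(X_tr, y_tr)          # too simple for sin(x)
overfit = make_pipeline(PolynomialFeatures(degree=15),
                        LinearRegression()).fit(X_tr, y_tr)  # memorizes 30 points

for name, model in [("linear (underfit)", underfit), ("degree-15 (overfit)", overfit)]:
    print(f"{name}: train R2 = {model.score(X_tr, y_tr):.2f}, "
          f"test R2 = {model.score(X_te, y_te):.2f}")
```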
Using machine learning to analyze blockchain data is a very nascent space. As a result, most models encounter the traditional challenges of machine learning applications.
Overfitting is one of those omnipresent challenges in blockchain analysis, fundamentally due to the lack of labeled data and trained models. There is no magic solution for fighting overfitting, but some of the principles outlined in this article have proven effective for us at IntoTheBlock.
(Disclaimer: the author is the CTO at IntoTheBlock.)