It is easy to be annoyed by strange anomalies when they are sighted within otherwise clean (or perhaps not-quite-so-clean) datasets. This annoyance is immediately followed by eagerness to filter them out and move on. Even though having clean, well-curated datasets is an important step in the process of creating robust models, one should resist the urge to purge all anomalies immediately — in doing so, there is a real risk of throwing away valuable insights that could lead to significant improvements in your models, products, or even business processes.
So what exactly do I mean by “data anomalies”? There is no single definition for what constitutes an anomaly, as it depends both on the nature of the data and one’s understanding of the processes generating that data (i.e., anomaly is in the eye of the beholder). Anomalies are essentially patterns that deviate significantly from the expected behaviour, leading one to believe that there’s either (1) an error somewhere or (2) a new, unknown cause for the observed deviation. Either possibility should give one pause before hitting the delete button and moving on. If it’s an error, is it random inconsequential noise or a systematic issue somewhere in the process? Could the underlying reason be causing other, less visible issues in the data? If it’s not an error but a new phenomenon, what are its implications? Does it herald a new trend in the market which the business would otherwise miss out on? If some of these questions could apply to your data, then anomalies may actually be valuable and deserve to be examined with due care.
At Vortexa we obtain vessel and cargo data from multiple sources in order to generate the most complete view into waterborne oil flows around the world. As in other industries, data quality can vary considerably across different sources, and thus to avoid the infamous GIGO (garbage in, garbage out) we have set up a process to clean and curate each training dataset used by our Machine Learning models. In this post, I describe some lessons we’ve learned as we’ve grappled with some anomalies in our datasets.
Anomalies can be detected using model-free or model-based approaches. Model-free methods rely on a distance metric to identify samples that are “far away” in some sense from other observations within a dataset. Some examples of model-free methods are clustering, nearest-neighbour, and information-theoretic approaches. These methods do not assume a particular structure or distribution in the data, other than the existence of groups of points that are relatively close to one another (clusters) and points that do not seem to belong to any cluster (anomalies). In contrast, model-based methods are based on a set of assumptions about the process generating the data. I will focus on model-based anomaly detection for the remainder of this post.
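For contrast with the model-based methods discussed in the rest of this post, here is a minimal sketch of the nearest-neighbour idea. It scores each point by the Euclidean distance to its k-th nearest neighbour, so isolated points receive large scores. The function name, the toy points, and the choice of k are illustrative assumptions rather than anything used in production.

```python
import math

def knn_distance_scores(points, k=2):
    """Score each point by the Euclidean distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Hypothetical 2-D observations: one tight group plus a single distant point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
scores = knn_distance_scores(points, k=2)

# The point with the largest score is the one furthest from any group.
most_isolated = max(range(len(points)), key=lambda i: scores[i])
print(points[most_isolated], round(scores[most_isolated], 2))
```

Note that no assumption is made about how the points were generated. The only ingredient is the distance metric, which is also the main design choice: a different metric can change which points look isolated.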
Let’s start by looking at a classic textbook example of a model-based anomaly detector. In this example, our observations are univariate real numbers, which we represent as the variable x. If we assume that the observations are independent random samples from a normal distribution with mean μ and standard deviation σ (i.e., x ∼ N(μ, σ²)), then we can define as anomalous all observations that are more than 3 standard deviations away from the mean (i.e., |x − μ| > 3σ). If our assumption is correct, the probability that any single observation is flagged by chance is about 0.27%, so in a dataset of n observations we expect roughly 0.0027·n anomalies. If the number of anomalies turns out to be significantly larger than this, it is strong evidence that they are generated by a different kind of process than the one represented by our model, and that they need further investigation.
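The textbook rule can be sketched in a few lines of Python. The example below assumes μ and σ are known, flags observations beyond 3σ, and compares the number flagged with the number expected by chance. The synthetic data, the random seed, and the helper names (`three_sigma_anomalies`, `excess_z_score`) are assumptions made purely for illustration.

```python
import math
import random

# Two-sided tail probability P(|x - mu| > 3*sigma) under the normal assumption.
P_TAIL = math.erfc(3 / math.sqrt(2))  # about 0.0027

def three_sigma_anomalies(xs, mu, sigma):
    """Return the observations more than 3 standard deviations from the mean."""
    return [x for x in xs if abs(x - mu) > 3 * sigma]

def excess_z_score(n_flagged, n_total, p=P_TAIL):
    """Standardised gap between the observed and expected anomaly counts,
    using the normal approximation to the binomial distribution."""
    expected = n_total * p
    return (n_flagged - expected) / math.sqrt(n_total * p * (1 - p))

rng = random.Random(0)
mu, sigma = 10.0, 2.0

# Synthetic data: samples from the assumed model, then the same samples
# plus 100 values produced by a different (shifted) process.
clean = [rng.gauss(mu, sigma) for _ in range(10_000)]
shifted = [rng.gauss(mu + 10 * sigma, sigma) for _ in range(100)]
contaminated = clean + shifted

flagged_clean = three_sigma_anomalies(clean, mu, sigma)
flagged_contaminated = three_sigma_anomalies(contaminated, mu, sigma)

for name, data, flagged in [
    ("clean", clean, flagged_clean),
    ("contaminated", contaminated, flagged_contaminated),
]:
    z = excess_z_score(len(flagged), len(data))
    print(f"{name}: {len(flagged)} flagged, "
          f"expected about {len(data) * P_TAIL:.1f}, z = {z:.1f}")
```

A large positive z-score for the contaminated sample signals more anomalies than chance alone would explain. In practice μ and σ are usually estimated from the data, and extreme values can distort those estimates, so the sketch is a best-case illustration.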
Machine Learning methods can be used to build efficient anomaly detectors. Assuming that one starts with a curated, anomaly-free training dataset D consisting of data points (xᵢ, yᵢ), where xᵢ are feature vectors and yᵢ are class labels, supervised learning methods such as logistic regression, Bayesian networks, and neural networks (among many others) can be used to estimate P(y|x), the conditional probability distribution of class labels given a set of features. This estimated distribution will reflect the patterns in D as well as the underlying assumptions in the chosen supervised learning algorithm. The resulting model can be used to detect potential anomalies among new, unseen data points (xᵢ’, yᵢ’) by checking for samples that carry an unlikely class label. In other words, for a given probability threshold τ, anomalies are defined as data points whose observed class label has a predicted probability below the threshold: P(y=yᵢ’ | xᵢ’) < τ.
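To make the thresholding rule concrete, here is a small self-contained sketch. It uses a categorical naive Bayes classifier (the simplest kind of Bayesian network) with Laplace smoothing as a stand-in for whichever supervised model you already have. The class, the feature names, the toy records, and the value of τ are all hypothetical and chosen only to illustrate the rule P(y=yᵢ’ | xᵢ’) < τ.

```python
from collections import Counter, defaultdict

class NaiveBayes:
    """Categorical naive Bayes with Laplace smoothing, estimating P(y | x)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength

    def fit(self, records):
        self.label_counts = Counter(y for _, y in records)
        self.feature_counts = defaultdict(Counter)  # (position, label) -> value counts
        self.values = defaultdict(set)              # position -> values seen in training
        for x, y in records:
            for j, v in enumerate(x):
                self.feature_counts[(j, y)][v] += 1
                self.values[j].add(v)
        return self

    def predict_proba(self, x):
        total = sum(self.label_counts.values())
        scores = {}
        for y, n_y in self.label_counts.items():
            score = n_y / total  # prior P(y)
            for j, v in enumerate(x):
                # Count the queried value too, in case it is new.
                n_values = len(self.values[j] | {v})
                count = self.feature_counts[(j, y)][v]
                score *= (count + self.alpha) / (n_y + self.alpha * n_values)
            scores[y] = score
        norm = sum(scores.values())
        return {y: s / norm for y, s in scores.items()}

def flag_anomalies(model, records, tau):
    """Return (x, y, p) for records whose observed label has P(y | x) < tau."""
    flagged = []
    for x, y in records:
        p = model.predict_proba(x).get(y, 0.0)  # labels never seen in training get 0
        if p < tau:
            flagged.append((x, y, p))
    return flagged

# Hypothetical curated training set: (vessel_size, load_region) -> cargo type.
training = (
    [(("large", "gulf"), "crude")] * 40
    + [(("large", "europe"), "crude")] * 10
    + [(("small", "europe"), "products")] * 40
    + [(("small", "gulf"), "products")] * 10
)
model = NaiveBayes().fit(training)

new_records = [
    (("large", "gulf"), "crude"),
    (("large", "gulf"), "products"),  # unlikely label for this feature pattern
    (("small", "europe"), "products"),
]
flagged = flag_anomalies(model, new_records, tau=0.05)
for x, y, p in flagged:
    print(x, y, f"P(y|x) = {p:.4f}")
```

The threshold τ trades off false alarms against missed anomalies, so it is worth tuning on records whose status is already known. The same `flag_anomalies` logic applies to any classifier that exposes predicted class probabilities.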
Anomaly detection is an old problem in statistics and a multitude of algorithms have been created over the years to address it, some of which are more appropriate in specific domains than others. The advantage of the model-based approach proposed above is that it can be readily applied if one has already built a classification model from a curated dataset. If, however, you do not have an anomaly-free training dataset, or your data does not contain categorical output labels, then you may try modifying the approach above (e.g., by using a density estimation method) or using a model-free approach.
Diagnosing the underlying issue(s) causing the anomalies is the most valuable step in the clean-up process, but also the hardest. In some cases, it may require deep expertise in the industry or process generating the data as well as a solid understanding of statistics and the assumptions inherent in your model. If you used the model-based approach proposed in the previous section, then all we know is that the detected anomalies deviate from the patterns in the training dataset as captured by the supervised learning method. We now need to understand what may be the cause for this deviation — there are several possibilities:
1. Expected noise in the data-generating process. This is the simplest explanation, and if it is the only reason for the anomalies, then the number of anomalies detected can be estimated theoretically (as in the “classic textbook example” above);
2. Unexpected noise or error in the data-generating process. This may be the case if the number of anomalies is larger than expected. Data processing errors sometimes go undetected, so it is always advisable to inspect the raw data together with the final processed records. Trivial issues in the data can often be identified visually, so eyeballing the anomalous records is usually a good first step.
3. A previously observed feature pattern x’ with a new class label y’. If the feature pattern x’ in the anomalous record has several similar instances x in the training dataset that crucially carry a different class label y ≠ y’, then this direct contradiction needs to be resolved by a domain expert. If the anomalous record is deemed to be an error, then it needs to be filtered out or corrected. If, however, the anomaly is found to be accurate, then it would signify a shift in the observed patterns (e.g. due to changing market dynamics). The model would have to be retrained with these new data points and additional context so that it can detect the new patterns and adjust its predictions.
4. A new feature pattern x’ not previously observed in the training dataset. When trained correctly, supervised learning models should generalise to unseen patterns. Even if a specific set of features had no equivalent in the model’s training dataset, learning algorithms can extrapolate from the existing patterns in the training dataset and predict the distribution of class labels P(y|x’) for the unseen pattern. If the predicted probability for the class label y’ was low (which caused the record to be flagged as anomalous), then there are two possibilities: (a) the model is correct and the data point (x’, y’) is indeed an anomaly, in which case we again need to determine whether it’s a data error or a legitimate shift in pattern (see point 3 above); or (b) the model is wrong and the data point is not an anomaly. When models fail to generalise to unseen patterns, it could be for a number of reasons: insufficient support in the training data, poorly tuned hyperparameters, an insufficient set of features, or wrong underlying assumptions (e.g. linearity, independence of factors). A large number of incorrect anomaly predictions may be an indication that the model needs to be revised.
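A first-pass triage of flagged records along the lines of the list above can be partly automated. The sketch below assumes discrete features, so that "previously observed" can be tested by exact match; with continuous features you would need a similarity threshold instead. The function name, category labels, and toy records are hypothetical, and the output is only a starting point for review by a domain expert.

```python
from collections import Counter, defaultdict

def triage(flagged, training):
    """Group flagged (x, y) records by how their feature pattern relates
    to the training data. Assumes hashable, discrete feature tuples."""
    labels_by_pattern = defaultdict(Counter)
    for x, y in training:
        labels_by_pattern[x][y] += 1

    report = []
    for x, y in flagged:
        seen = labels_by_pattern.get(x)
        if seen is None:
            # Point 4: the model had to extrapolate to this pattern.
            case = "new feature pattern"
        elif seen[y] == 0:
            # Point 3: same features, but a label never attached to them before.
            case = "known pattern, new label"
        else:
            # Seen before with this label, just rarely: possibly noise (points 1-2).
            case = "known pattern, rare label"
        report.append((x, y, case, dict(seen or {})))
    return report

# Hypothetical training data and flagged records.
training = (
    [(("large", "gulf"), "crude")] * 40
    + [(("small", "europe"), "products")] * 40
    + [(("small", "europe"), "crude")] * 1
)
flagged = [
    (("large", "gulf"), "products"),
    (("medium", "asia"), "crude"),
    (("small", "europe"), "crude"),
]
for x, y, case, seen in triage(flagged, training):
    print(x, y, "->", case, seen)
```

The label counts returned alongside each case show the reviewer what the training data said about that feature pattern, which is often enough to decide whether to suspect the record or the model.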
Detecting and diagnosing data anomalies can be challenging, especially as the amount and complexity of data continue to increase seemingly without bounds. A mix of Data Science and industry expertise may be needed to resolve the most complicated cases, when it is not clear whether the model prediction is incorrect, or whether the anomaly reflects a new, real-world phenomenon. Despite its challenges, your organisation could reap enormous benefits by setting up a process to review potential data anomalies periodically. Not only would it keep the datasets clean and models improving continuously, it could provide the business with invaluable early signals of shifts in market dynamics. When seen this way, data anomalies cease being a source of annoyance — they suddenly become a source of opportunities.