Methods to Tackle Common Problems with Machine Learning Models

Written by matelabs_ai | Published 2018/12/14
Tech Story Tags: machine-learning | bias | algorithms | predictive-analytics | machine-learning-models


Predictive Analytics models rely heavily on Regression, Classification and Clustering methods. When analysing the effectiveness of a predictive model, the closer the predictions are to the actual data, the better. This article aims to be a one-stop reference for the major problems and their most popular and effective solutions, without diving into the details of execution.

A clustering algorithm plot

Data selection and pruning happen primarily during the Data Preparation phase, where you take care to get rid of bad data in the first place. Even so, issues arise later: data that turns out to be irrelevant to the ML model's objectives during training, trouble with the choice of algorithms, and errors that creep into the data throughout. Before finalizing the model, it is therefore tested for bias, variance, autocorrelation, and similar errors, using well-defined test procedures designed to detect them.

After running these tests, you go back to the model, make the corrections, and approve the model as fit, or 'good'. Moreover, the best in the industry have figured out ways to avoid such errors in later iterations. There are multitudes of errors that can occur, but let's explore a few of them along with their most effective, well-defined tests and solutions:

Overfitting and Underfitting

The Overfitting and Underfitting problems can be explained with the Bias-Variance Tradeoff property:

Bias is the error of a learning algorithm that is too weak to learn from the data. In the case of high bias, the algorithm is unable to learn the relevant details in the data, so it performs poorly on the training data as well as on the test dataset. Variance, on the other side, is the error of a learning algorithm that over-learns from the dataset, trying to fit the training data as closely as possible. In the case of high variance, the algorithm performs well on the training dataset, but poorly on the test dataset.

The bias-variance tradeoff is a serious problem in machine learning: you cannot have both arbitrarily low bias and arbitrarily low variance. Instead, you have to strike a tradeoff by training a model that captures the regularities in the data well enough to be reasonably accurate, yet still generalizes to a different set of points from the same source, by having optimum bias and optimum variance.

Bias and variance are two components of the learning algorithm's total error; if you try to reduce one, the other may go up.

How do bias and variance contribute to Overfitting and Underfitting?

To determine a good fit for the model, we analyse how the data points were treated when fitting the model. When parsing through millions of rows, it is possible to include every data point, relevant or not, or to cross the threshold in the other direction and forgo too many of them. The crux is to fit the curve neither to every data point to perfection, nor so loosely that too many data points are neglected.

When the learning algorithm has a high-bias problem, working to reduce the bias will cause the variance to go up, causing an overfitting problem. And when the learning algorithm suffers from high variance, working to reduce the variance will cause the bias to go up, causing an underfitting problem. That is where the term 'tradeoff' comes in: reducing just the bias will not improve the model, and vice versa. The 'sweet spot' is a fit with optimum bias and optimum variance, one that finds the pattern without going to either extreme in a way that harms accuracy. Much of the time, finding this balance is one of the biggest challenges that data scientists and analysts face.

The best fit may not be the one that excludes outliers to the T, but is always a compromise

However, there are methods for testing the fit of the model. Some solutions provided to tackle these phenomena are:

Answer to Bias-Variance Tradeoff Problem:

Build A More Complex Model

The first and simplest solution to an underfitting problem is to train a more complex model. For an overfitting model, get more data in and apply regularization.

Cross Validation

In cross-validation, not all of the available or chosen data is used to train the model. The data is usually divided into three splits that help perform the cross-validation method: the training data, validation data, and test dataset. You can use a combination of training and test data alone, or use all three splits.

[Training data = for model training

Validation data = for model hyperparameter tuning

Test data = for final model validation and accuracy estimation]

There are many ways to work with these splits; typically the training data is 60% of the total dataset, the validation set is 20%, and the test set comprises the remaining 20%.

The quality of the trained model is tested by training it on the training data alone, and then measuring its performance on data it has not seen. By repeating this with different splits, we can check whether the model's predictions generalize beyond the points it was trained on. There are many variations of cross-validation:

Hold-Out: The data is split once into a training set and a test set that is kept on hold; the model is trained on the former and evaluated on the latter.

For example, with 100 samples you might use 60 for training, 20 for validation, and 20 for the test set. During training you monitor the model's accuracy on the validation data; the test set measures accuracy once training is complete.

K-Fold Cross Validation: Here the data is divided into k folds (say, k=10). In each iteration, one fold is held out as the validation set and the model is trained on the remaining k-1 folds; this is repeated k times, so every fold serves as the validation set exactly once. This method is effective, yet requires substantial computational power.

Example for k-fold cross validation with 10 folds

Leave-One-Out: This method is more painstaking, as a single data point is held out each time, so the model is trained and tested n times for n data points.
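To make this concrete, here is a minimal k-fold sketch using scikit-learn. The iris dataset and logistic regression model are stand-ins for illustration; cross_val_score handles the train/validate rotation across the folds:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Stand-in dataset; substitute your own X and y.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Five folds: each fold is held out once while the model
# trains on the remaining four.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold)

print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())
```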

Dropout:

The drop-out method is used when working with neural networks in deep learning. Dropout is an established technique, proven to improve model generalization, that deactivates some activations in a layer (sets them to 0). We choose the fraction of neurons to drop, usually in the range of 20 to 30 percent. With 30% dropout, for instance, the activations of a random 30% of the neurons in that particular layer are set to zero in each training pass. The deactivated neurons propagate nothing to the next layer of the network. We do this to avoid overfitting, as the added noise makes the model more robust.

Dropout method: Here, some neurons have been deactivated (red colored, right). If the activation is x, dropout sets it to zero

Intuitively, this forces the network to be accurate even in the absence of certain information. The dropout rate is decided in advance.
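As a minimal sketch, dropout in PyTorch is a single layer; the network shape and the 30% rate below are illustrative, not prescriptive:

```python
import torch
import torch.nn as nn

# A small feed-forward network with 30% dropout on the hidden layer.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.3),  # zeroes each hidden activation with probability 0.3
    nn.Linear(64, 2),
)

x = torch.randn(8, 20)  # a dummy batch of 8 samples, 20 features each

model.train()           # dropout is active only in training mode
out_train = model(x)

model.eval()            # at inference time dropout is switched off
out_eval = model(x)
```

Note that the layer is active only in training mode; calling model.eval() disables it, so the framework handles the training/inference distinction for you.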

Gradient Noise:

This method involves adding noise to the gradients during training, which has been shown to increase the accuracy of a model. Refer to the paper Adding Gradient Noise Improves Learning for Very Deep Networks.

Adding noise sampled from a Gaussian distribution:
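A minimal PyTorch sketch of the idea follows. The decaying-variance schedule sigma_t^2 = eta / (1 + t)^gamma and the defaults eta = 0.3, gamma = 0.55 follow the cited paper; treat the exact values as illustrative:

```python
import torch

def add_gradient_noise(model, step, eta=0.3, gamma=0.55):
    """Add zero-mean Gaussian noise to every gradient.

    The variance decays over time as sigma_t^2 = eta / (1 + step)^gamma,
    following the schedule described in the cited paper.
    """
    sigma = (eta / (1 + step) ** gamma) ** 0.5
    with torch.no_grad():
        for param in model.parameters():
            if param.grad is not None:
                param.grad += sigma * torch.randn_like(param.grad)

# Inside a training loop, call it between backward() and the update:
#   loss.backward()
#   add_gradient_noise(model, step=t)
#   optimizer.step()
```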

Regularization:

Regularization is yet another popular method of reducing overfitting. Used to resolve a high-variance problem, the technique penalises large coefficients and weights, in order to get higher accuracy on both the training data and the test data.

A typical L2-regularized loss looks like: Loss = Σi (yi − ŷi)² + λ Σj wj². Here, w is a weight value, the second term is the regularization term, and λ is the regularization parameter, a hyperparameter that is tuned via validation rather than learned during training. The first term is the usual least-squares loss.
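As a concrete example, scikit-learn's Ridge regression applies exactly this kind of L2 penalty; the synthetic data below is a stand-in, and alpha corresponds to the λ above:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # synthetic predictors, for illustration only
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=100)

# alpha plays the role of lambda above: larger values shrink the
# weights more aggressively, trading a little bias for less variance.
model = Ridge(alpha=1.0).fit(X, y)
print("Learned weights:", model.coef_)
```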

Vanishing and Exploding Gradients Problems

When training a deep neural network using back-propagation, you keep adding hidden layers to the network. This produces a highly complex model, but compromises the speed of training. The vanishing gradients problem occurs here when using a sigmoid or tanh activation function, two of the functions used to fire the neurons of a neural network; the choice of activation determines how the gradients behave as they pass backward through the layers.

During training, the gradient for each weight matrix is calculated and then subtracted from that very same matrix. However, if the model has many layers, some gradients eventually shrink to zero, so the weight values stop changing and those weights stop learning. This is a problem, as the model learns nothing from these vanished gradients. The shrinking typically worsens as you backpropagate through the layers, which is why the earlier layers are the ones that stop learning.

Gradient Descent and Vanishing/Exploding Gradients

To be clearer: the sigmoid activation function squashes its input into the range 0 to 1. For large positive inputs the output saturates near 1, and for large negative inputs near 0; in both saturated regions the derivative is close to zero, so during back-propagation almost no gradient flows through, and the information in those extreme values is effectively lost. To avoid such vanishing gradients, other activation functions such as ReLU, PReLU, SELU and ELU are used.

A Tanh function

A sigmoid function. Notice that the output is nearly constant for inputs beyond -6 and 6

Answer to Vanishing and Exploding Gradients Problem

Activation functions — ReLU, PReLU, RReLU, ELU

ReLU (Rectified Linear Unit): ReLU passes values above zero through unchanged, so positive inputs are never squashed; the function is linear and unbounded on that side. However, ReLU is faulted for setting values below zero to zero, which is not so good in some cases as those values are lost altogether, though it does increase speed. And when a unit's inputs stay saturated below zero, the unit 'dies' and ReLU prevents any training at all.

ReLU

PReLU (Parametric Rectified Linear Unit): An improvement over ReLU, PReLU is effective because it does not deactivate values below zero, yet keeps training fast. It alleviates saturation by replacing the fixed leaky slope of 0.01 with a learnable parameter α.

RReLU (Randomized Leaky Rectified Linear Unit): RReLU is said to beat every one of the above activation functions. It assigns a random value to the negative slope during training, thereby compromising neither speed nor accuracy.

ELU (Exponential Linear Unit): ELU passes positive values through unchanged and smoothly saturates to a negative constant for negative inputs, which pushes mean activations closer to zero. Often employed for higher accuracy in classification, ELU speeds up training.

Refer to the article here for equations and detailed explanation of these functions.
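For reference, all four functions are available as layers in PyTorch; this minimal sketch applies each one to the same inputs (the defaults noted in the comments are PyTorch's, not values prescribed by this article):

```python
import torch
import torch.nn as nn

x = torch.linspace(-3, 3, 7)

activations = {
    "ReLU": nn.ReLU(),    # 0 for negative inputs, identity for positive
    "PReLU": nn.PReLU(),  # learnable negative slope (initialized to 0.25)
    "RReLU": nn.RReLU(),  # negative slope drawn at random during training
    "ELU": nn.ELU(),      # smooth exponential curve for negative inputs
}

for name, fn in activations.items():
    print(f"{name}: {fn(x).detach().numpy().round(2)}")
```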

Normalization:

Normalization helps mitigate the overfitting, underfitting and vanishing gradient problems.

Batch Normalization: Batch normalization is used to improve the performance of back-propagation. It rescales each layer's activations across a mini-batch, preventing the values from becoming too big or too small.

Instance Normalization: Instance normalization normalizes each sample individually, instead of across a batch of samples as in batch normalization.
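Both variants are single layers in PyTorch; a minimal sketch on a dummy batch of feature maps (the shapes are illustrative):

```python
import torch
import torch.nn as nn

# A dummy batch: 8 samples, 16 channels, 32x32 feature maps.
x = torch.randn(8, 16, 32, 32)

batch_norm = nn.BatchNorm2d(16)        # normalizes each channel across the whole batch
instance_norm = nn.InstanceNorm2d(16)  # normalizes each channel within each sample

print(batch_norm(x).shape, instance_norm(x).shape)  # shapes are unchanged
```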

Multicollinearity

Multicollinearity occurs when predictor variables in a model are highly correlated with one another. The phenomenon is one most practitioners are familiar with, and it is very common in regression models. Multicollinearity becomes a problem when you need to know why a certain prediction happened, i.e., when the reason behind the prediction matters. A heavily correlated column can appear to be the cause of certain outcomes, when in reality it is only correlated with them.

Detecting multicollinearity within a dataset can prevent seriously wrong conclusions about certain results, such as the case of pneumonia patients with asthma, who appeared to have better outcomes than other pneumonia patients. The truth, however, was that the asthma patients were given immediate care when they contracted pneumonia, because without immediate treatment they are more prone to a fatal outcome.


Answer to Multicollinearity

Autocorrelation & Partial Autocorrelation Tests: These tests detect correlation phenomena in the data. They are usually used during Time Series Analysis and Forecasting. With these tests you can detect where correlation occurs, and remove highly correlated columns.

Autocorrelation: It detects the correlation of a series with lagged copies of itself, that is, the occurrence of repeated signals in the data, mostly in Time-Series Analysis and Forecasting. Ordinary correlation between two predictor variables, x1 and x2, can likewise signal multicollinearity.

Principal Components Analysis (PCA):

Principal Component Analysis is used to correct for correlated predictors. It constructs a new set of predictor variables that are combinations of the highly correlated originals. So instead of dropping correlated variables that each play a role in the model, these new variables retain the behavior of the otherwise redundant, correlated variables. It works through feature extraction.

Plot that analyses the Principal Components of a Dataset through Feature Extraction

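A minimal scikit-learn sketch of this idea; the synthetic data and the choice of two components are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: the second column nearly duplicates the first,
# mimicking two highly correlated predictors.
x1 = rng.normal(size=100)
X = np.column_stack([
    x1,
    0.9 * x1 + rng.normal(scale=0.1, size=100),
    rng.normal(size=100),
])

# Replace the three correlated columns with two uncorrelated components.
pca = PCA(n_components=2)
X_new = pca.fit_transform(X)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```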

Linear Discriminant Analysis (LDA):

LDA is used in predictive analytics problems, particularly classification.

Logistic Regressions are Classification Algorithms

It assumes that new inputs will belong to one of the classes in the dataset collected so far. When using logistic regression, certain limitations, such as instability of the model when the classes are well separated, can occur. In such cases we can use LDA instead. This algorithm uses the famous Bayes' Theorem to calculate the probability of each class given the inputs.

P(Y=k | X=x) = (πk * fk(x)) / Σl (πl * fl(x))
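scikit-learn ships this classifier directly; a minimal sketch on a stand-in dataset:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Stand-in dataset; substitute your own X and y.
X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis().fit(X, y)
print("Predicted class:", lda.predict(X[:1]))
print("Class probabilities:", lda.predict_proba(X[:1]).round(3))
```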

Pearson correlation coefficient:

The Pearson coefficient measures the correlation between two variables X and Y. It gives a value between -1 and 1 that describes a negative or positive correlation; if the value is zero, there is no linear correlation.
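A minimal sketch using scipy; the synthetic x and y are illustrative, and with real data you would pass in two columns of your dataset:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(scale=0.5, size=100)  # y is strongly tied to x

r, p_value = pearsonr(x, y)
print(f"Pearson r = {r:.3f}, p-value = {p_value:.3g}")
```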

Autocorrelation & Partial Autocorrelation Tests:

Autocorrelation measures the degree of correlation between an observation and earlier observations of the same variable. The AutoCorrelation Function (ACF) is used to calculate these correlations in a Time Series: each new observation is correlated with the observations already collected, hence the name autocorrelation. The ACF plots the correlation for every lag, where the lag is the number of time steps separating an observation from the earlier observation it is compared with.
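A minimal sketch using statsmodels. The synthetic AR(1) series below is illustrative, chosen because its true autocorrelation at lag k is 0.8^k, so the computed ACF should decay accordingly:

```python
import numpy as np
from statsmodels.tsa.stattools import acf

rng = np.random.default_rng(0)
# An AR(1) series: each value depends on the previous one,
# so the autocorrelation decays gradually as the lag grows.
series = np.zeros(200)
for t in range(1, 200):
    series[t] = 0.8 * series[t - 1] + rng.normal()

print("ACF at lags 0-5:", acf(series, nlags=5).round(2))
```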

About Mate Labs

At Mate Labs we have built Mateverse, a Machine Learning Platform where you can build customized ML models in minutes without writing a single line of code. We make the jobs of Analysts and Data Scientists easier with proprietary technologies such as complex pipelines, Big Data support, Automated Data Preprocessing (Missing Value Imputation using ML models, Outlier Detection, and Formatting), Automated Hyperparameter Optimization, and much more.

To help your business adopt Machine Learning in a way that won't end up wasting your team's time on data cleaning and on creating effective data models, fill out the TypeForm here, and we will reach out to you.

Read more about our product. Feel free to reach out to us at [email protected]


Also to receive helpful articles like this one, follow us here, on Medium, LinkedIn and Twitter.

About the Author

Raven S Daniel

Find her on Medium, here. Look out for more industry topics, and updates on LinkedIn.

