Machine learning is, at its core, an optimisation problem. Researchers use mathematical functions called “losses” to optimise a model for the space defined by a specific use case. A “loss” can be seen as a distance between the true values of the problem and the values predicted by the model: the greater the loss, the larger the errors the model makes on the data. Most performance evaluation metrics such as accuracy, precision, recall, and F1 score are indirect derivations of loss functions. Researchers have implemented many loss functions, such as-
In this article, I will introduce you to a loss function called “hinge loss”, which is discussed in some of the most recommended textbooks on predictive modelling. I hope the explanation, both visual and mathematical, will be lucid enough to help beginner enthusiasts in the machine learning field.
Hinge loss is a function popularly used in support vector machine algorithms to measure the distance of data points from the decision boundary. This helps approximate the possibility of incorrect predictions and evaluate the model's performance.
Some of the other popularly used loss functions in classification algorithms are-
The support vector machine is a supervised machine learning algorithm that is popularly used for predicting the category of labelled data points.
For example-
SVM uses an imaginary plane that can extend across multiple dimensions for its prediction purpose. These imaginary planes are called hyperplanes. Higher dimensions are very difficult to picture, since the human brain can naturally visualize only up to 3 dimensions.
Let’s take a simple example to understand this scenario.
We have a classification problem to predict whether a student will pass or fail the examination. We have the following features as independent variables-
So, these 3 independent variables become 3 dimensions of a space like this-
Let’s consider that our data points look like this where-
The green colour represents the students who passed the examination
The red colour represents the students who failed the examination
Now, SVM will create a hyperplane that travels through these 3 dimensions to differentiate the failed and passed students-
So, technically the model now understands that every data point that falls on one side of the hyperplane belongs to the students who passed the exams, and vice versa. This hyperplane is called the decision boundary or maximum-margin hyperplane. The distance from a data point to the decision boundary indicates the strength of the prediction.
The following image shows better visualization-
Logically,
The following image will give you a better intuition -
Here,
The value of the decision boundary is zero.
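The signed distance described above can be computed directly from the hyperplane's coefficients. A minimal sketch (the weight vector `w`, bias `b`, and the sample point are hypothetical, not taken from the article's figures):

```python
import numpy as np

def signed_distance(w, b, x):
    # Signed distance from point x to the hyperplane w.x + b = 0.
    # The sign tells us which side of the decision boundary x falls on;
    # the magnitude tells us how strong the prediction is.
    return (np.dot(w, x) + b) / np.linalg.norm(w)

# Hypothetical 2-feature example: the hyperplane is the x1-axis (w = [0, 1], b = 0)
print(signed_distance(np.array([0.0, 1.0]), 0.0, np.array([5.0, 2.0])))   # 2.0
print(signed_distance(np.array([0.0, 1.0]), 0.0, np.array([5.0, -3.0])))  # -3.0
```

Points with a positive distance fall on one side of the boundary (e.g. "pass"), points with a negative distance on the other ("fail").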
There are 2 primary scenarios where researchers use hinge loss in SVM-
Scenario 1 (In training data): To optimally build a model in multi-dimensional space which reduces the misclassification and strengthens the decision-making ability.
It also helps build the best-fit decision boundary by selecting, out of many candidates, the decision boundary with the minimum hinge loss via iterative optimisation or hyperparameter tuning (this approach is similar to finding the best-fit line in linear regression during the training process).
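Scenario 1 can be sketched in code. The trainer below is a minimal, hypothetical subgradient-descent routine that minimises the regularised hinge loss to fit a linear decision boundary; the toy data is invented, and in practice you would use a library implementation (e.g. scikit-learn) rather than this sketch:

```python
import numpy as np

def train_linear_svm(X, y, lr=0.1, lam=0.01, epochs=200):
    # Subgradient descent on the regularised hinge loss:
    #   lam/2 * ||w||^2 + mean(max(0, 1 - y_i * (w.x_i + b)))
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) < 1:   # inside margin or misclassified
                w += lr * (yi * xi - lam * w)  # hinge subgradient step
                b += lr * yi
            else:
                w -= lr * lam * w              # only the regulariser acts
    return w, b

# Hypothetical toy data: labels are +1 (pass) or -1 (fail)
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.5], [-1.0, -2.0]])
y = np.array([1, 1, -1, -1])
w, b = train_linear_svm(X, y)
preds = np.sign(X @ w + b)  # all four training points end up correctly classified
```

Each pass nudges the boundary away from points that are misclassified or inside the margin, which is exactly "minimising the hinge loss" during training.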
Scenario 2 (In testing data): To evaluate the performance of the SVM model.
Let us understand the calculation of hinge loss in SVM with respect to scenario 1. The below image is a visual representation of the hinge loss function.
The dotted line marks the value 1 on the X-axis. If a data point is correctly predicted by the model and its distance from the decision boundary is greater than 1, then the loss is zero.
If the data point is placed exactly at the decision boundary, then the hinge loss has a value of 1 (obviously, the distance between the decision boundary and the data point is zero).
Consider a data point whose true class is positive (+1). There are 2 possibilities for its predicted score-
Possibility 1: The predicted score lies on the positive side of the decision boundary, i.e. the prediction is correct (data point 1 in the above image).
Possibility 2: The predicted score lies on the negative side of the decision boundary, i.e. the prediction is incorrect (data point 3 in the above image).
In possibility 1, the hinge loss does not increase, i.e. the loss value will be low.
For example,
Let’s assume that the value of the decision boundary is 0 and the predicted score of the data point is +2.5. Here, the score is on the correct side of the boundary and beyond the margin of 1, so the hinge loss is zero (this is depicted as data point 1 in the above image).
In possibility 2, the hinge loss increases rapidly, i.e. the loss value will be high.
For example,
Let’s assume that the value of the decision boundary is 0 and the predicted score of the data point is -1.5, even though the true class is positive, so the prediction is wrong. Here, the score is 1.5 units on the wrong side of the boundary, so the hinge loss is high (this is depicted as data point 3 in the above image).
Let us calculate the hinge loss for these 2 possibilities-
As we discussed earlier,
For a positive-class data point with predicted score s, the hinge loss is max(0, 1 − s). So:
The hinge loss is 1 when the predicted score is 0 (the point lies exactly on the decision boundary): max(0, 1 − 0) = 1.
According to possibility 1, if the predicted score is +2.5, then the hinge loss of that prediction is max(0, 1 − 2.5) = 0.
According to possibility 2, if the predicted score is -1.5, then the hinge loss of that prediction is max(0, 1 − (−1.5)) = 2.5.
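The three values above follow from the standard hinge-loss formula, loss = max(0, 1 − y·s), where y is the true class (+1 or −1) and s is the predicted score. A quick check in Python:

```python
def hinge_loss(y_true, score):
    # Standard hinge loss: zero once a correct prediction clears the margin of 1,
    # growing linearly as the score moves toward (and past) the wrong side.
    return max(0.0, 1.0 - y_true * score)

print(hinge_loss(1, 0.0))   # 1.0 -> point exactly on the decision boundary
print(hinge_loss(1, 2.5))   # 0.0 -> possibility 1: well past the margin
print(hinge_loss(1, -1.5))  # 2.5 -> possibility 2: on the wrong side
```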
Similarly, the hinge loss can be calculated for every data point of a model. When an SVM model is constructed in a multidimensional plane, we should always try to minimize the hinge loss as much as possible to increase the predictive ability of the model.
Imagine that we have a binary classification problem to predict whether a student will pass/fail the examination based on the following predictor variables-
We trained our model with 1000 records and now we have the following table as the test data-
We evaluate the model using the following test data and make predictions. Our predictions are as follows-
The predicted value will be a number between -1 and +1, with a margin of 0.2. If the value is less than or equal to zero, the predicted class is taken as -1; if the value is greater than zero, the predicted class is taken as +1.
Since the margin is 0.2 and the decision boundary is 0, the hinge loss is non-zero for-
All incorrect predictions
All correct predictions within the range of [-0.2, +0.2]
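With a margin of 0.2 instead of the conventional 1, the same formula becomes max(0, 0.2 − y·s), which is zero exactly when a prediction is correct and its score lies outside [-0.2, +0.2]. A sketch on a few hypothetical test rows (the article's actual test table is not reproduced here):

```python
def hinge_loss_margin(y_true, score, margin=0.2):
    # Hinge loss with a configurable margin: non-zero for every incorrect
    # prediction and for correct predictions inside [-margin, +margin].
    return max(0.0, margin - y_true * score)

# Hypothetical (true class, predicted score) pairs
rows = [(1, 0.9), (1, 0.1), (-1, 0.3), (-1, -0.6)]
losses = [hinge_loss_margin(y, s) for y, s in rows]
# confident correct -> 0; correct but inside margin -> small; wrong -> large
```

The second row (correct but inside the margin) and the third row (incorrect) are the only ones penalised, matching the two cases listed above.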
The Hinge loss is as follows-
Hinge loss, although it might initially look a little complicated, should now make fundamental, intuitive sense after this article. A lot of complex use cases that demand a binary classification algorithm (especially support vector machines) can be solved optimally using this technique. This metric is available as a built-in function in most data-science-oriented programming languages such as Python and R, so it is easy to implement once you understand the theoretical intuition. I have added links to some advanced materials in the references section where you can dive deeper into the detailed calculations if you are interested.