So, you landed up here after scavenging over the internet to understand Linear Regression and I must say that your search ends here.
So without further waiting, let us just jump to it.
“Generalization is the ultimate goal of any machine learning algorithm”
Two major types of problems that machine learning algorithms try to solve are:
Linear Regression, intuitively is a regression algorithm with a Linear approach. We try to predict a continuous value of a given data point by generalizing on the data that we have in hand. The linear part indicates that we are using a linear approach in generalizing over the data.
Examples makes it easy to understand, so suppose you want to predict the price of a house by knowing its size. You have the data which has some house prices and corresponding sizes. Charting the data and fitting a line among them will look something like this:
To generalize, you draw a straight line such that it crosses through the maximum points. Once you get that line, for house of any size you just project that data point over the line which gives you the house price.
That is it. You are done. But wait!
The problem was never finding the house price. The problem was to find the best fit line which generalizes over the data well.
The same old line equation is used: y = mx +c with a little statistical look and some below terminologies are added specific to Linear Regression modeling.
We will go over the elements of the equation one by one:
Seems like we can easily come to best parameter values by hit and trial method and thus finding the best fit line. But things are not that easy when the problem you are solving has more dimensions which is also known as the The Curse of Dimensionality.
Multiple Linear Regression
The Multiple Linear Regression will have n number of features and we have to find a line which fits all the data points for all the dimensions. You have started to realize that it is no longer a hit-and-trial solution.
We can draw infinite number of lines but never be sure that the line is the best fit or not, so here enters our savior: The Cost Function
“Mathematics is the queen of Science”
I have been running from mathematics my whole life, but you see, you just cannot escape from it. It’s really intuitive and gives you the ability to see what’s going on under the hood.
The Cost Function is a mathematical construct which is calculated by adding up the squared error terms
Mean Squared Error
By tweaking the above equation a little we get the below equation:
where J is cost notation and θ is the parameter
Our goal is to reduce the Cost function, which in turns improves the accuracy. We can take the parameter values by hit and trial method and calculate MSE for each combination of parameters, but that is not an efficient technique. Seems like there is a calculus solution to this problem.
Gradient Descent is a simple optimization technique for finding minimum of any function, in this case we want to find the minima of our MSE function.
Gradient Descent
You can visualize the function as a valley, and you are standing on some random point. Your aim is to reach the bottom most point of the valley and Gradient Descent help you in reaching to the bottom most point of the valley.
Here is how you will do it:
You will see and make out where the slope of the valley is going down, then you start taking steps towards the down going slope. Once you see that every contour around you is higher than where you are standing you claim that you have reached the bottom most point.
Here is how Gradient Descent will do it:
It has to know where the slope of the valley is going (and it doesn’t have eyes as you do) so it takes the help of mathematics here.
To know the slope of a function at any point you differentiate that point with respect to its parameters, thus Gradient Descent differentiates the above Cost function and comes to know the slope of that point.
To go to the bottom most point it has to go in the opposite direction of the slope, i.e. where the slope is decreasing.
It has to take small steps to move towards the bottom point and thus learning rate decides the length of step that gradient descent will take.
After every move it validates that the current position is global minima or not. This is validated by the slope of that point, if the slope is zero then the algorithm has reached the bottom most point.
After every step, it updates the parameter (or weights) and by doing the above step repeatedly it reaches to the bottom most point.
Once you have reached the bottom most point of the valley that means you have got the parameters corresponding to the least MSE or Cost function, and these values of parameters will fetch you the highest accuracy.
You can now use your Linear Regression model to predict the dependent variable of any unforeseen data point with very high accuracy.
If you are still here me with then you have got the concepts of Linear Regression and you are ready to apply it to solve real world problems which matter.
In the upcoming posts we will deep dive into the maths part and code a linear regression model using python.
Let me know if I can improve this post in any aspect by giving your feedback in the comments section below or by mailing me at [email protected]