### Supervised Machine Learning — Linear Regression in Python

    Update [17/11/17]: The full implementation of Supervised Linear Regression can be found here.

### Introduction

The concept of machine learning has somewhat become a fad of late, with companies from small start-ups to large enterprises clamouring to be technologically enabled through the, quote unquote, integration of complex automation and predictive analysis.

This has led to an impression that machine learning is highly nebulous, with systems whose integration is beyond the comprehension of the general public. This is far from the truth. Rather, I view machine learning as closer to the field of computational statistics than to some enigmatic black box where someone waves a wand, abracadabra, and is able to conjure up some magical prediction.

### Motivation

_The motivation of this series is to enable anyone who is interested in the field of machine learning to develop, understand and implement their own machine learning algorithms._

This will be the first of a series of articles which I plan to write; I hope you enjoy them.

A few important terminologies before we start:

* **Independent Variables (features):** An independent variable is a variable that is manipulated to determine the value of a dependent variable. Simply put, they are the features which we want to use to predict some given value of Y. An independent variable can also be called an explanatory variable.
* **Dependent Variable (target):** The dependent variable depends on the values of the independent variables. Simply put, it is the feature which we are trying to predict. It can also be commonly known as a response variable.

### Simple and Multiple Linear Regression

_What is your expected income from your years of education? What are your expected final exam results given your previous marks? 
What are your chances of winning over the girl of your dreams? (I’m kidding)_

Nonetheless, linear regression is one of the strongest tools available in statistics and machine learning, and can be used to predict some **value (Y)** given a set of **traits or features (X)**.

Given that it is such a powerful tool, it is a great starting point for individuals who are excited about the field of Data Science and Machine Learning to learn about _‘how machines learn to make predictions’_.

To illustrate how linear regression works, we can examine a common problem students face when attending university: what is my expected final exam mark, given my previous results in the subject? This problem can be mathematically defined as some function between our **independent variables (X)** and the corresponding **final exam mark (Y)**.

![](https://hackernoon.com/hn-images/1*6T7Y8lpS5K-LSoSwAwNM9A.png)

    X (input) = Assignment Results
    Y (output) = Final Exam Mark
    f = function which describes the relationship between X and Y
    e (epsilon) = Random error term (positive or negative) with a mean of zero (there are more assumptions about our residuals, however we won't be covering them)

From experience, you may come to the conclusion that, ‘if my assignment mark is 73%, I generally score about 1.1x that in the final exam, which is approximately 80.3%’. While this may be true, such an approximation method is rather unorthodox and lacks accuracy, as we humans are implicitly biased in our approximations. Furthermore, this becomes increasingly difficult as we add more independent variables.

Computers, on the other hand, are optimised to perform extremely well when provided with a set of logical sequences, as they do not suffer from biases in comparison to humans. 
Moreover, computers are superior in both accuracy and computational speed; we can use them to our advantage in predicting the feature we are interested in understanding.

For our example, we will be using a supervised “training and test” approach to predict a student’s expected final exam result, predicated on their assignment scores. We will achieve this by splitting our data set into a training set and a test set. The purpose of the training set is to enable the machine to learn the relationship between a student’s assignment results and their respective final exam mark. We can then use the learned function to estimate a student’s final exam result, applying it to our unlabelled test set in order to predict each student’s expected final exam score.

![](https://hackernoon.com/hn-images/1*hW9B8_ggbo3GTFIuEn86Eg.png)

    Regression

    Y = f(X) + e, where X = (X1, X2, ..., Xn)

    Training: Machine learns (fits) f from the labelled training set

    Test: Machine predicts Y from the unlabelled test set

    Note: f(x) can be derived through matrices to perform least squares linear regression. However, this is beyond the scope of this tutorial; if you'd like to learn how to derive regression lines, here is a good link. Also, X can be a tensor with any number of dimensions. A 1D tensor is a vector (1 row, many columns), a 2D tensor is a matrix (many rows, many columns), and so on for higher-dimensional tensors.

For simplicity, we will only use one independent variable (assignment) in predicting our estimated final exam score, so our data forms a **2D tensor**.

### Linear Regression (Ordinary Least Squares)

> How to predict the future by drawing a straight line. Yes, this counts as Machine Learning.

The objective of ordinary least squares regression (OLS) is to learn a linear model (line) which we can use to predict **(Y)**, while attempting to reduce the error (error term). 
By reducing our error term, we increase the accuracy of our prediction, thereby improving our learned function.

![](https://hackernoon.com/hn-images/1*DZb0D9Unu22fvNJP1pjlSA.png)

Source: [Wikipedia ’Linear Regression’](https://commons.wikimedia.org/wiki/File:Linear_regression.svg)

Linear regression is a parametric method: it assumes the sample data come from a population that follows a probability distribution based on a fixed set of parameters. As such, various assumptions must be satisfied about the form of the function relating X and Y — see the attached notes for further reading. Our model will be a function that predicts y-hat given a specific x:

![](https://hackernoon.com/hn-images/1*aSvQLKZcCvYQKaI8otCRWQ.png)

This can be interpreted as: for a one-unit increase in X, holding all else constant, Y increases by _β1_.

#### Interpretation

**_β0_** is the y-intercept, i.e. the predicted value of Y when x = 0. When your assignment result is 0, your predicted final exam mark is the y-intercept ( _β0_ ).

**_β1_** is the slope of our line, i.e. how much our final exam mark increases for a one-unit increase in our assignment mark.

As a reminder, our objective is to learn the model parameters which minimise our error term, thereby increasing the accuracy of our model’s predictions.

> To find the best model parameters:

> 1\. Define a cost function, or loss function, that measures how inaccurate our model’s predictions are.

> 2\. Find the parameters that minimise loss, i.e. make our model as accurate as possible.

Graphically, this can be represented in a Cartesian plane, as our model is two-dimensional. 
This would become a plane in three dimensions, and so on…

![](https://hackernoon.com/hn-images/1*8XasZ3p6hHsTbAsK1nSclA.png)

Where Y is our Final Exam Mark, and X is our Assignment Mark

> **Note on dimensionality:** our example is two-dimensional for simplicity. However, this is unrealistic: typically you will have more features (x’s) and coefficients in your model, as generally more than one feature is significant in explaining your dependent variable. In addition, linear regression suffers enormously from the [**curse of dimensionality**](https://stats.stackexchange.com/questions/169156/explain-curse-of-dimensionality-to-a-child): once we deal with high-dimensional spaces, every data point becomes an outlier.

#### Cost Function

The formula for the cost function may look daunting at first. However, it is extremely simple and intuitive to understand.

![](https://hackernoon.com/hn-images/1*dyzQshni7wqRYFb01AoxKg.png)

Cost Function (Error Term) of our linear model

Simply put, the cost function says: take the difference between each real data point **(y)** and our model’s prediction **(ŷ)**, square the differences to avoid negative numbers and to penalise larger differences, then add them up and take the average. Except rather than dividing by n, we divide by 2n. This is because the factor of 2 cancels neatly when we take the derivative of the squared term. Feel free to take this to the mathematics court of justice; for now, just remember that we divide by 2n.

For problems that are two-dimensional, we could simply derive the optimal beta parameters that minimise our loss function. However, as the model grows increasingly complex, computing the beta parameters for each variable is no longer feasible. 
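In fact, for this simple two-variable case the closed form is easy to write down: β1 = cov(X, Y) / var(X) and β0 = mean(Y) − β1 · mean(X). Here is a minimal sketch in Python; the assignment and exam marks are made up for illustration, not drawn from a real dataset:

```python
import numpy as np

# Illustrative data (assumed, not from a real dataset):
# assignment marks (X) and final exam marks (Y) for five students.
X = np.array([55.0, 60.0, 65.0, 73.0, 80.0])
Y = np.array([58.0, 64.0, 71.0, 80.3, 88.0])

# Closed-form OLS estimates for the two-dimensional case:
# beta1 = cov(X, Y) / var(X), beta0 = mean(Y) - beta1 * mean(X)
beta1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
beta0 = Y.mean() - beta1 * X.mean()

def predict(x):
    """Predicted final exam mark (y-hat) for a given assignment mark x."""
    return beta0 + beta1 * x
```

On this toy data the fitted slope comes out a little above 1.2, so a 73% assignment mark predicts a final exam mark of roughly 80%, close to the rule of thumb mentioned earlier.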
As such, a method known as **Gradient Descent** is necessary to allow us to minimise our loss function.

### Gradient Descent: Learning our Parameters

![](https://hackernoon.com/hn-images/1*lbvLg7MSu05pZkmHUAiTsQ.png)

[Source: Youtube](https://www.youtube.com/watch?v=riplXsNf_zs) [Mwamba Capital](https://www.youtube.com/channel/UClR00RTNrUcjr3l36kyTzgA)

> Imagine you’re standing somewhere on a mountain **(point A)**. You want to get as low as possible as fast as possible, so you decide to take the following steps:

> \- You check your current altitude, your altitude a step north, a step south, a step east, and a step west. Using this, you figure out which direction you should step to reduce your altitude as much as possible in this step. 
> 
> \- Repeat until stepping in any direction will cause you to go up again **(point B)**.

> This is Gradient Descent.

Currently, gradient descent is one of the most popular methods used for parameter optimisation. It is often used with Neural Networks, which we will cover later. Nonetheless, it is important to understand _what it does_ and _how it works_.

The goal of gradient descent is to find the minimum point of our model’s cost function by iteratively obtaining a better and better approximation. This is achieved by repeatedly moving in the direction in which the ground slopes down most steeply, until stepping in any direction would cause you to go up again.

![](https://hackernoon.com/hn-images/1*UA1jE11ju6__Jl5bUgcyYw.gif)

[Source: PyData](http://songhuiming.github.io/pages/2017/05/13/gradient-descent-in-solving-linear-regression-and-logistic-regression/)

Taking a look at the loss function we saw in regression:

![](https://hackernoon.com/hn-images/1*dyzQshni7wqRYFb01AoxKg.png)

We see that this is really a function of two variables: _β0_ and _β1_. All the rest of the variables are determined, since X, Y and N are given during training. 
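This iterative descent can be sketched in a few lines of Python. The data, learning rate and iteration count below are all assumptions chosen for illustration; the gradients are the partial derivatives of the cost function with respect to each parameter:

```python
import numpy as np

# Illustrative data (assumed): assignment and final exam marks, scaled to
# [0, 1] so a single learning rate works well for both parameters.
X = np.array([55.0, 60.0, 65.0, 73.0, 80.0]) / 100.0
Y = np.array([58.0, 64.0, 71.0, 80.3, 88.0]) / 100.0

def cost(b0, b1):
    """J(b0, b1) = (1 / 2n) * sum((y_hat - y)^2) -- the squared-error loss."""
    residuals = (b0 + b1 * X) - Y
    return np.sum(residuals ** 2) / (2 * len(X))

b0, b1 = 0.0, 0.0   # initial guess for the parameters
lr = 0.5            # learning rate (step size), an assumed setting
for _ in range(50000):
    y_hat = b0 + b1 * X
    grad_b0 = np.mean(y_hat - Y)        # partial derivative of J w.r.t. b0
    grad_b1 = np.mean((y_hat - Y) * X)  # partial derivative of J w.r.t. b1
    b0 -= lr * grad_b0  # step "downhill" in each direction
    b1 -= lr * grad_b1
```

After the loop, the parameters settle at values where both partial derivatives are essentially zero, and the cost at (b0, b1) sits far below the cost at the initial guess.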
Hence, we want to try to minimise this function.

![](https://hackernoon.com/hn-images/1*wsBakfF2Geh1zgY4HJbwFQ.gif)

Source: [Github Alykhan Tejani](https://alykhantejani.github.io/a-brief-introduction-to-gradient-descent/)

Our loss is a function _z_ of ( _β0_ , _β1_ ). To begin gradient descent, we start by making a guess of the parameters _β0_ and _β1_ that minimise the function.

Next, we take the partial derivative of the loss function with respect to each beta parameter: \[_dz_/d_β0_, _dz_/d_β1_\]. The partial derivative indicates how much the total loss increases or decreases if you increase _β0_ or _β1_ by a very small amount. If the partial derivative _dz_/d_β1_ is negative, then increasing _β1_ is good, as it will reduce our total loss. If it is positive, you want to decrease _β1_. _If it is zero, we don’t change β1, as it means we have reached an optimum._

We do this until we reach the bottom, i.e. the algorithm converges and the loss has been minimised.

### Overfitting

> **Overfitting:** “Sherlock, your explanation of what just happened is too specific to the situation.” **Regularisation:** “Don’t overcomplicate things, Sherlock. I’ll punch you for every extra word.” **Hyperparameter ( _λ_ ):** “Here’s the strength with which I will punch you for every extra word.” — [Credits to Vishal Maini](https://medium.com/@v_maini)

Overfitting occurs when the trained model performs very well on the training data it learned from, but does not generalise well to the test data. The problem of overfitting is not limited to computers; humans are often no better. For instance, say you had a bad experience with XYZ Airline: maybe the service wasn’t good, or the flight was riddled with delays. You might be tempted to say that all flights on XYZ Airline suck. 
This is called **overfitting**: we overgeneralise from something which otherwise might have been us just having a bad day.

![](https://hackernoon.com/hn-images/1*3O3Ib-4DCvEDONHau9A5ZA.png)

Source: [Quora: Luis Argerich](https://www.quora.com/What-is-overfitting)

Overfitting occurs when a model _over-learns_ from the training data, to the point where it starts picking up idiosyncrasies that aren’t representative of patterns in the real world. This becomes especially problematic as you make your model increasingly complex. Underfitting is the related issue where your model is not complex enough to capture the underlying trends in the data.

![](https://hackernoon.com/hn-images/1*zA8baiaM4HIhAa5Kpn94Hw.png)

Source: [Scott Fortmann-Roe](http://scott.fortmann-roe.com/docs/BiasVariance.html)

    Bias-Variance Tradeoff

    Bias: the amount of error introduced by approximating real-world phenomena with a simplified model.

    Variance: how much your model's test error changes based on variation in the training data. It reflects the model's sensitivity to the idiosyncrasies of the data set it was trained on.

    As a model increases in complexity and becomes more flexible, its bias decreases (it does a good job of explaining the training data), but its variance increases (it doesn't generalise as well).

> Ultimately, in order to have a good model, you need one with low bias and low variance.

Remember, the only thing we care about is _how well the model performs on the test data_. You want to predict students’ final exam results before they are marked, not just build a model that is 100% accurate in predicting marks from the training set.

**Two ways to combat overfitting:**

1\. **Use more training data:** The more you have, the harder it is to overfit the data by learning too much from any single training example.

2\. 
**Use regularisation:** Add a penalty to the loss function for building a model that assigns too much explanatory power to any one feature, or allows too many features to be taken into account.

![](https://hackernoon.com/hn-images/1*rFT6mtU45diT0OJhlgDcBg.png)

The first piece of the sum above is our normal cost function. The second piece is a **regularisation term** that adds a penalty for large beta coefficients, which give too much explanatory power to any specific feature.

> With these two elements in place, the cost function now balances between two priorities: explaining the training data and preventing that explanation from becoming overly specific.

The lambda coefficient of the regularisation term in the cost function is a **hyperparameter**: a general setting of your model that can be increased or decreased (i.e. **tuned**) in order to improve performance. A higher lambda value more harshly penalises large beta coefficients that could lead to potential overfitting. To decide on the best value of lambda ( _λ_ ), you’d use a method known as **cross-validation**, which involves holding out a portion of the training data during training, then seeing how well your model explains the held-out portion. We’ll go over this in more depth in future articles.

    For those who are interested, the full implementation of Supervised Learning: Linear Regression can be found here

### Wooh!! You’ve just covered the essentials of Supervised Learning: Linear Regression!

Here’s what we covered in this section:

* How **supervised data** can be used to enable computers to **learn a function** without being explicitly programmed.
* **Linear Regression**, the fundamentals of **parametric** algorithms
* Learning **parameters** with **gradient descent**
* **Overfitting** and **regularisation**

### Further Reading & Practice

For a more comprehensive understanding of linear regression, I recommend ‘**Elements of Statistical Learning**’ by Trevor Hastie. 
It’s a fantastic book which covers the essentials of Statistical Learning using Linear Algebra.

For logistic regression, I recommend ‘**Applied Logistic Regression**’ by David W. Hosmer. It goes into much more detail on the different methods and approaches available for logistic regression.

**_For Practice:_**

For practice, I recommend playing around with datasets used to predict housing prices; [**Boston housing data**](http://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html) is the most popular. Otherwise, consider [**salary prediction**](https://www.kaggle.com/c/job-salary-prediction/data).

For datasets on which to implement supervised learning with linear regression, I recommend going on [**reddit/datasets**](https://www.reddit.com/r/datasets/) or [**Kaggle**](https://www.kaggle.com/kernels) to practice and see how accurate a prediction you can create.
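To get started, here is a minimal end-to-end sketch of the training/test workflow from this article, using synthetic data in place of a real dataset (the numbers, noise level and 150/50 split are arbitrary choices for illustration):

```python
import numpy as np

# Synthetic data (assumed): 200 students whose final exam marks roughly
# follow exam = 1.1 * assignment + noise, echoing the earlier rule of thumb.
rng = np.random.default_rng(0)
assignment = rng.uniform(40.0, 95.0, size=200)
exam = 1.1 * assignment + rng.normal(0.0, 4.0, size=200)

# Split into a labelled training set and a held-out test set.
X_train, Y_train = assignment[:150], exam[:150]
X_test, Y_test = assignment[150:], exam[150:]

# Learn (fit) f from the training set only, via closed-form OLS.
b1 = (np.sum((X_train - X_train.mean()) * (Y_train - Y_train.mean()))
      / np.sum((X_train - X_train.mean()) ** 2))
b0 = Y_train.mean() - b1 * X_train.mean()

# Predict Y on the unseen test set and score the fit with R^2.
pred = b0 + b1 * X_test
r2 = 1.0 - np.sum((Y_test - pred) ** 2) / np.sum((Y_test - Y_test.mean()) ** 2)
```

Because the synthetic relationship is genuinely linear, the recovered slope lands close to 1.1 and the test-set R² is high; on real data you should expect a noisier picture.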