This guide is the second part of a series of tutorials on machine learning using Python and R. You can find part 1 here.
What is linear regression? (from Wikipedia) In statistics, linear regression is a linear approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression. (This term is distinct from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable.)
Basically, it’s this function from middle school: Y = mx + b, where Y is the value on the y-axis, equal to the slope m multiplied by the value of x, plus the constant b.
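As a quick illustration, the line equation can be written as a tiny Python function. The function name `predict_y` and the numbers are made up for this example.

```python
def predict_y(x: float, m: float, b: float) -> float:
    """Return the point on the line Y = m*x + b for a given x."""
    return m * x + b


# Made-up slope and intercept, only to show the arithmetic
print(predict_y(10, m=0.5, b=2.0))  # 0.5 * 10 + 2.0 = 7.0
```

Linear regression is the process of finding the m and b that make this line fit the data as closely as possible.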
Let’s start by creating a model to predict the tip a waiter will get based on his or her client’s age (I will address the issues with this specific model below). To fit our dataset into a linear regression algorithm, X will be age and Y will be tip amount. Later I will demonstrate how to improve our model by using group size instead of age. In that example, X will be group size and Y will be tip amount.
This is just an example to demonstrate how to build a simple model using linear regression.
Age and tip do not have a strong correlation, and there are a number of other variables to take into account, such as the total number of diners, location, type of restaurant, meal type, etc.
The dataset:
This is a random dataset; it isn’t designed to make sense.
Python Example
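A minimal sketch of what such a model could look like, assuming scikit-learn's `LinearRegression` and NumPy. The variable name `regressor` matches the prediction command used later in this guide; the ages and tips are invented placeholder values, not the article's dataset, and the split ratio is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Placeholder data (invented for illustration): client age and tip amount
ages = np.array([22, 25, 31, 35, 40, 44, 52, 58, 63, 70], dtype=float)
tips = np.array([4.0, 7.5, 3.0, 9.0, 5.5, 6.0, 8.0, 2.5, 7.0, 5.0])

# scikit-learn expects X as a 2D array: one row per sample, one column per feature
X = ages.reshape(-1, 1)
y = tips

# Hold out part of the data so the model can be checked on unseen points
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit the line Y = mx + b on the train set
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Slope (m) and intercept (b) of the fitted line
print("slope:", regressor.coef_[0])
print("intercept:", regressor.intercept_)

# Predicted tips for the held-out ages
y_pred = regressor.predict(X_test)
print(y_pred)
```

The train set is used to fit the line, and the test set shows how well that line holds up on points the model has not seen.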
R example
Here’s the graph
This graph is our train dataset
Test dataset graph
**What can we learn from the graph?** As you probably guessed, there is a weak correlation between tip amount and client age. As data-driven people we should always check all the possibilities and try not to be biased. Now let’s check a different dataset: tip vs. number of diners.
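To put a number on "weak" or "strong", one common measure is the Pearson correlation coefficient. The sketch below computes it in plain Python; the function name `pearson_r` and the sample values are made up for illustration and are not the article's dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented placeholder values: group size and tip amount
group_sizes = [1, 2, 3, 4, 5, 6]
tips = [3.0, 5.5, 9.0, 11.0, 16.0, 17.5]
print(pearson_r(group_sizes, tips))
```

Values near 0 suggest a weak linear relationship, while values near 1 or -1 suggest a strong one.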
Using a new dataset: number of diners and tip
Group size vs Tip
Now let’s crunch the numbers using our linear regression model.
We are going to use the same code (see example above), but I’ll show you the new train dataset and test dataset graphs.
How much a group of X will tip (Train set): stronger correlation
How much a group of X will tip (Test set): we can see that our test set fits our train set
So, we see that there is a much stronger correlation between group size and tip amount than between client age and tip amount.
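We can run a query on our model to predict the tip amount for a group of 5 diners using this command: my_test_tip = regressor.predict(np.array([[5]])). Assuming regressor is a scikit-learn model, the input has to be a 2D array (one row per sample, one column per feature), which is why 5 is wrapped in double brackets.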
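Here is a small end-to-end sketch of that query, assuming `regressor` is a scikit-learn `LinearRegression`. The group sizes and tips are invented placeholder values, not the article's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented placeholder data: group size and tip amount
group_sizes = np.array([1, 2, 3, 4, 5, 6, 7, 8]).reshape(-1, 1)
tips = np.array([3.0, 5.5, 9.0, 11.0, 16.0, 17.5, 22.0, 24.0])

regressor = LinearRegression().fit(group_sizes, tips)

# predict() expects a 2D array of shape (n_samples, n_features),
# so a single group of 5 diners is passed as [[5]]
my_test_tip = regressor.predict(np.array([[5]]))
print(my_test_tip[0])
```

The result is an array with one prediction per input row, so `my_test_tip[0]` is the predicted tip for the group of 5.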
**Summary** This is a basic model meant only to demonstrate the basics of linear regression and how to create a model. In real life we use more variables, making our models much more complicated.
I’ve been putting some time into these tutorials, so I hope you find them useful. If you want more, press the clap button to show your support, or buy me a beer by sending some bitcoin love: 1Pg4BbrevSEWroo6zS6Kyvi1EMffpAgLac
Cheers! For more from Doron Segal, check out my site http://segaldoron.com