How do I know which model to choose for my machine learning problem?

Written by thems18 | Published 2018/07/20
Tech Story Tags: data-science | machine-learning | data-scien | dataset | visualization


“Data are becoming the new raw material of business.”

Hello friends, today I am going to show you how, just by looking at a data set, you can tell which kind of model to choose.

So, let's get started!

What is a data set?

A data set (or data-set) is a collection of data. Most commonly a data set corresponds to the contents of a single database table, or a single statistical data matrix, where every column of the table represents a particular variable, and each row corresponds to a given member of the data set in question. Let's look at a data set in the form of a .csv file.

Jupyter Notebook

Assume we have to work on this data set, which has many rows and columns. Your first step is to identify the independent and dependent variables in the data set.

A **dependent variable** is the variable being tested and measured in a scientific experiment. It is often the last column of the data set, although that is only a convention; here it is SalePrice. An **independent variable** is a variable that is changed or controlled in a scientific experiment to test its effect on the dependent variable. Here all the other columns, such as Street, LotShape and SaleCondition, are independent variables.
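As a minimal sketch of this first step, assuming the data is loaded with pandas, the split into independent and dependent variables could look like the block below. The three rows are made-up stand-ins for the real house-price table; only the column names come from the article.

```python
import pandas as pd

# Tiny stand-in for the house-price table; the values are invented for illustration.
df = pd.DataFrame({
    "Street": ["Pave", "Pave", "Grvl"],
    "LotShape": ["Reg", "IR1", "Reg"],
    "SaleCondition": ["Normal", "Abnorml", "Normal"],
    "SalePrice": [208500, 181500, 140000],
})

target_column = "SalePrice"            # the dependent variable

X = df.drop(columns=[target_column])   # independent variables (features)
y = df[target_column]                  # dependent variable (target)

print(list(X.columns))
print(y.name)
```

Naming the target column explicitly is safer than relying on it being the last column, since not every file follows that convention.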

Now we have seen what a data set looks like.

What you need to know next is whether your problem is a regression problem, a classification problem or a clustering problem.

For that, you need to look at the dependent variable, which we have just defined. [Note: If you don't have a dependent variable, it is an unsupervised problem, typically a clustering problem.]

Let's see what a data set looks like without a DV (dependent variable).

This data was collected on our social survey mobile platform Whatsgoodly. We have 300,000 millennial and Gen Z members and have collected 150,000,000 survey responses from this demographic to date.

Now, if your data set contains a dependent variable, you have to check whether it has a continuous outcome or a categorical outcome.

If it is a continuous outcome, then your problem is a regression problem.

And if it's a categorical outcome, then your problem is a classification problem.
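The decision rule above can be sketched as a small function. This is only a rough heuristic, not a standard API: the function name `problem_type` and the `max_classes` cutoff are hypothetical, and a numeric column with few distinct values (such as 0/1) is treated as categorical.

```python
def problem_type(target_values, max_classes=10):
    """Guess the problem type from the dependent variable.

    target_values: list of target values, or None when the data set
                   has no dependent variable at all.
    max_classes:   hypothetical cutoff; a numeric target with at most
                   this many distinct values is treated as categorical.
    """
    if target_values is None:
        return "clustering"          # no dependent variable

    distinct = set(target_values)
    all_numeric = all(isinstance(v, (int, float)) for v in distinct)

    if all_numeric and len(distinct) > max_classes:
        return "regression"          # continuous outcome
    return "classification"          # categorical outcome


sale_prices = [100000 + 2500 * i for i in range(30)]  # many distinct numbers
purchased = [0, 1, 1, 0, 1]                           # binary labels

print(problem_type(sale_prices))
print(problem_type(purchased))
print(problem_type(None))
```

In practice you would still look at the column yourself, because a heuristic like this cannot tell a rating from 1 to 5 apart from a genuine count.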

Let's see what a data set looks like with a DV (dependent variable).

Regression Case:

This is a house prices data set with lots of rows and columns. You have to predict SalePrice, which is the dependent variable; all the other columns are independent variables. SalePrice is a continuous value, so this is a regression problem and we have to use a regression model on it, such as RandomForest or SVR.

Jupyter Notebook
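To make the regression case concrete, here is a hedged sketch that fits the two models named above with scikit-learn. It uses synthetic data from `make_regression` instead of the real house-price file, and the settings (`n_estimators=100`, `C=100.0`, the scaling step for SVR) are illustrative assumptions, not tuned values.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for a table with a continuous target such as SalePrice.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=0),
    # SVR is sensitive to feature scale, so it is wrapped with a scaler here.
    "SVR": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100.0)),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # R^2 on held-out data
    print(name, round(scores[name], 3))
```

Both models are scored on data they have not seen during training, which is the number that matters when you compare them.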

Classification Case:

Now look at this data set, in which you are given User ID, Gender, Age and Estimated Salary, which are all independent variables, and you have to predict whether a new person is going to buy a new SUV or not. [Note: One can easily see that it is a classification problem because the dependent variable, Purchased, has a binary output of 0 or 1 only, where 1 means the person will buy the SUV and 0 means they will not.]

So by now we have a good idea of how, just by looking at the data set, we can classify our problem as regression, classification or clustering.

Now, how would I know which model is the best one? For example, say you are working on home price prediction and you have to predict the price of a house based on several parameters. Which model should you use, and which parameter values should you give it? What you can do is use Grid Search, which tells you which of the parameter values you propose work best for your model.
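A minimal sketch of the classification case follows. The data is generated rather than loaded from the real file, the labelling rule is made up purely to produce 0/1 targets, and only Age and Estimated Salary are used to keep it short. `RandomForestClassifier` is my choice for illustration; the article does not name a classifier.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400
age = rng.integers(18, 60, size=n)
salary = rng.integers(15000, 150000, size=n)

# Invented rule, only to generate labels: 1 = buys the SUV, 0 = does not.
purchased = ((age > 40) | (salary > 90000)).astype(int)

X = np.column_stack([age, salary])   # independent variables
y = purchased                        # dependent variable (binary)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)      # share of correct test predictions
new_person = np.array([[45, 60000]])      # age 45, salary 60000
prediction = clf.predict(new_person)[0]   # 0 or 1

print(round(accuracy, 3), prediction)
```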

What does Grid Search do?

It tries every combination of the candidate parameter values you give it, scores each combination with cross-validation, and reports the best one for your model. All you need to do is import the class from the scikit-learn library.

from sklearn.model_selection import GridSearchCV
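Here is a short sketch of how that class might be used, assuming a random forest regressor and synthetic data. The parameter grid is an arbitrary example, not a recommendation; in a real project you would pick candidate values that make sense for your model.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for a regression data set.
X, y = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=0)

# Candidate values to try; 2 x 2 = 4 combinations in total.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
}

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_grid=param_grid,
    cv=3,            # 3-fold cross-validation for each combination
    scoring="r2",
)
search.fit(X, y)

print(search.best_params_)            # best combination found in the grid
print(round(search.best_score_, 3))   # its mean cross-validated R^2
```

Grid Search can only pick the best among the values you list, so the result is only as good as the grid you give it.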

Nobody in this world can tell you which model will give you the best performance or accuracy just by looking at the data set. All you can do is classify your problem by looking at the data set: whether the data is linear or non-linear, and whether the problem is classification, regression or clustering.

Don't be sad, because there is a cheat sheet that helps you narrow down the model.
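Since the data set alone will not tell you the winner, one common approach is to try a few candidates and compare them with cross-validation. The sketch below does that for a linear and a non-linear classifier on synthetic, deliberately non-linear data (`make_moons`); the two models are my own picks for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Two interleaving half-circles: a classic non-linear toy data set.
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

candidates = {
    "linear (LogisticRegression)": LogisticRegression(),
    "non-linear (SVC, RBF kernel)": SVC(kernel="rbf"),
}

mean_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)   # accuracy on 5 folds
    mean_scores[name] = scores.mean()
    print(name, round(mean_scores[name], 3))
```

On data shaped like this a non-linear model would usually be expected to score higher, but the point is to measure it rather than guess.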

Scikit Learn

If you have any difficulty reading the cheat sheet, go to this link: Cheat Sheet.

I hope you like this article! If you have any problem or query on any topic related to data science, do let me know in the comment section! I'll share more concepts soon in my LinkedIn articles as well as on Medium.

Give some love too!

_Mohit Sharma (themenyouwanttobe&Co.)_


Published by HackerNoon on 2018/07/20