A dataset is one of the most important parts of a machine learning project. Without data, machine learning is just the machine, and the learning is stripped from the title. Which the machine and deep learning people will not stand for, and they will protest for the data 😅.
Data being this important, when we have a dataset, even one containing only a few hundred rows, we must keep some part of it aside for testing purposes. Once the software is ready, we must test it before putting it on the market, where it would make decisions and affect people's lives, so the model is expected to make no errors, or at least as few as possible.
Thus testing becomes an important part of development, and since in machine learning we are working with data, we must keep some data aside for testing purposes as well.
The purpose of the testing data is that, once kept aside, it should never be made available to the machine learning model until training has finished. Then, and only then, should we introduce the testing data to the model that was trained on the rest of the dataset.
Since we have understood the why of testing data, let us see the how: how do we keep aside this data for testing purposes?
We will work with the California housing dataset from Kaggle. We will load the dataset with pandas and then make the split. We can do the splitting in two ways:
Manually, by choosing ranges of indexes
Using a function from Sklearn
import pandas as pd
dataset = pd.read_csv("housing.csv")  # load the CSV file into a DataFrame
dataset.head()  # peek at the first 5 rows
After the data has been loaded, we can use head() to see the first 5 rows and get a glimpse of the dataset. Now that we have the full dataset in the dataset variable, we are ready to make the split. But first, let us check the length of the dataset, so that we can check it again after making the split and verify that the split was successful.
len(dataset)
20640
This simple function len() gives us the number of rows present in the dataset, which is 20640.
Now we will see how to make the split manually using indexing.
In this method, the idea is to specify up to which index we want the training set and from which index we want the testing set, or vice versa.
The syntax is simple and straightforward: dataset_variable[start:end]
train = dataset[:16000]
test = dataset[16000:]
In the above example, we are simply saying the following: put the first 16000 rows (indexes 0 to 15999) into train, and everything from index 16000 to the end into test.
After running these lines of code, if we now check the lengths of train and test, we will get 16000 and 4640.
len(train)
16000
len(test)
4640
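As a quick extra check (my own addition, not part of the original output), the two pieces should add back up to the full dataset:
print(len(train) + len(test) == len(dataset))   # True: 16000 + 4640 == 20640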
The catch here is that we have to know the indexes up to which, or from which, we want the split. Another disadvantage is that we can't shuffle the data rows this way. If we want to shuffle, we have to add another step to the process: before making the split, we manually shuffle the dataset and then make the index-based split, as in the sketch below.
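That extra shuffle step could look something like this minimal sketch, which uses pandas' sample with frac=1 to reorder all the rows before slicing (the random_state value here is just an arbitrary pick of mine):
# Shuffle every row first, then do the same index-based split.
shuffled = dataset.sample(frac=1, random_state=42).reset_index(drop=True)
train = shuffled[:16000]
test = shuffled[16000:]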
When we use sklearn, these steps are just one parameter away, and we don't need to worry about indexes and other bookkeeping.
Now let us see how to perform the same split using Sklearn's train_test_split function.
from sklearn.model_selection import train_test_split
So, this awesome and heavenly function is imported from sklearn's model_selection module and is a piece of cake to use. After importing, all we need to do is pass the dataset and the test set size that we want, and we are done.
The thing to keep in mind is that if we give the dataset in DataFrame form, we get two DataFrames in return: one for training and one for testing.
training, testing = train_test_split(dataset, test_size=0.3, shuffle=True, random_state=32)
We have given the following parameters to this function:
dataset – the whole dataset that we have.
test_size – the fraction of the data that we want for the test set (0.3 means 30%).
shuffle=True – whether we want our dataset to be shuffled before making the split or not. If True, the indexes will be shuffled and then the split will be made.
random_state=any_number – if set, then no matter how many times we make the split, we will get the same split every time.
By using this, we make sure that the test set always remains safely protected from the model.
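As a quick sanity check (a small sketch of my own, with the a_*/b_* variable names being mine), calling the function twice with the same random_state selects exactly the same rows:
# Two calls with the same random_state produce identical splits.
a_train, a_test = train_test_split(dataset, test_size=0.3, shuffle=True, random_state=32)
b_train, b_test = train_test_split(dataset, test_size=0.3, shuffle=True, random_state=32)
print(a_test.index.equals(b_test.index))   # True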
Let us see what is in the training and testing variables that were returned by this function. training contains its part of the dataset, with 14448 rows, and the indexes are shuffled too.
training
14448 rows × 10 columns
In a similar way, testing contains 6192 rows, also with shuffled indexes. If we calculate 30% of this dataset, it comes to 6192. Thus 30% of our data is in the testing set.
testing
6192 rows × 10 columns
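If you want to verify that arithmetic in code, here is a one-line check (my own addition, not from the original walkthrough):
# 30% of 20640 rows is 6192, exactly the size of our test set.
print(int(len(dataset) * 0.3) == len(testing))   # True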
Now let us separate the input and output variables, with x as input and y as output. In this dataset, median_house_value goes into y, as it is the output variable, and the rest of the columns go into x as input to the model.
This x will be the input to the model, on which the model will be trained along with y. In the future, when the model makes predictions, it will predict y based on x. Let us split the dataset into x and y.
x = dataset.drop("median_house_value", axis=1)
Here we are saying: drop median_house_value along axis=1, i.e. the whole column, in the direction of columns.
Note: axis=0 means rows and axis=1 means columns.
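To make that concrete, here is a tiny illustration (the variable names are mine, and the row label 0 is just an example):
without_first_row = dataset.drop(0, axis=0)                  # axis=0: drops the row labeled 0
without_target = dataset.drop("median_house_value", axis=1)  # axis=1: drops the whole column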
Thus we get an x in which there is no median_house_value, and all other columns are present.
x
20640 rows × 9 columns
Which indeed is the case.
Now let us create our y by writing the following code.
y = dataset['median_house_value']
Here we are saying: get the column named median_house_value and put it in y. As simple as it sounds.
y
0    452600.0
1    358500.0
2    352100.0
3    341300.0
4    342200.0
...
Name: median_house_value, Length: 20640, dtype: float64
Thus we have x and y.
Now, if we pass x and y to train_test_split instead of the dataset as a whole, we will get 4 variables back in return: two for the training set (the training inputs and the training outputs), and similarly two for the testing set (the testing inputs and the testing outputs).
The code below shows the use of the function with x and y:
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.3, shuffle=True, random_state=32)
Let us check the len of train_x, train_y, test_x, and test_y.
print(f"The length of Train_x: {len(train_x)} and length of train_y: {len(train_y)}")
The length of Train_x: 14448 and length of train_y: 14448
print(f"The length of Test_x: {len(test_x)} and length of test_y: {len(test_y)}")
The length of Test_x: 6192 and length of test_y: 6192
These are indeed the same lengths as above, when we gave the whole dataset.
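One more reassuring property, shown here as a small check of my own: train_test_split keeps x and y paired, so the shuffled row labels of the inputs and outputs line up exactly.
# The same shuffled row labels appear in the inputs and the matching outputs.
print(train_x.index.equals(train_y.index))   # True
print(test_x.index.equals(test_y.index))     # True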
In this post, we tried to understand the need for splitting the data and how to perform this split in an efficient way. We learned the manual way as well as the modern way with sklearn, which is what most machine learning practitioners use. I hope you understood the process and will now be able to do the same, and that you understand the why and how of it.
That was all, thank you. Please leave a comment and share this post.