Why and How Do We Split the Dataset

Written by shehzensidiq | Published 2022/03/01

TL;DR: When we have a dataset, we must keep some part of it aside for testing purposes: test data should never be shown to the model until training has finished. In this post we work with the California Housing dataset from Kaggle, load it with pandas, and make the split in two ways: manually, by choosing index ranges, and with sklearn's train_test_split function.

The dataset is an essential part of any machine learning project. Without data, machine learning is just the machine, and the learning is stripped from the title, something the machine and deep learning folks will not stand for; they would protest for their data 😅.

Introduction

Data being this important, whether our dataset contains a few hundred rows or a few million, we must keep some part of it aside for testing purposes. Just as software must be tested before it is put on the market, a model that will make decisions and affect people's lives must be tested too, so that it makes as few errors as possible.

Thus testing becomes an important part of development, and since machine learning revolves around data, we must set some data aside for testing as well.

The rule for test data is simple: once kept aside, it should never be made available to the model until training has finished. Then, and only then, do we introduce the test data to the model, which until that point has seen only the training portion of the dataset.

Now that we understand why we need test data, let us see how we keep this data aside for testing purposes.

We will work with the California Housing dataset from Kaggle: we will load the dataset with pandas and then make the split. We can do the splitting in two ways:

  1. Manually, by choosing index ranges

  2. Using a function from sklearn

    import pandas as pd

    # Load the CSV (note the quotes around the filename) and peek at the first rows
    dataset = pd.read_csv("housing.csv")
    dataset.head()

After the data has been loaded, head() shows the first 5 rows, giving us a glimpse of the dataset. Now that we have the full dataset in the dataset variable, we are ready to make the split. But first, let us check the length of the dataset, so that we can compare it against the lengths of the pieces after the split and verify that the split was successful.

len(dataset)

20640

This simple built-in function, len(), gives us the number of rows present in the dataset: 20640.

Now we will see how to make the split manually using indexing.

Using Manual Indexing

In this method, the idea is to specify up to which index we want the training set and from which index we want the testing set, or vice versa.

The syntax is simple and straightforward: dataset_variable[start:end]

train = dataset[:16000]
test = dataset[16000:] 

In the above example, we are simply saying the following:

  1. From the start, i.e. index 0, up to (but not including) index 16000, take the data rows and place them in the train variable, i.e. the training set contains the first 16000 rows.
  2. From index 16000 to the end, put the rows in the testing set.

After running these lines of code, if we now check the lengths of train and test, we get 16000 and 4640.

len(train) 

16000

len(test) 

4640

The catch is that we have to know the indexes up to which, or from which, we want the split. Another disadvantage is that the rows are not shuffled: if we want shuffling, we have to add an extra step and shuffle the dataset manually before making the index-based split, as the sketch below shows.
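Here is a minimal sketch of that extra step, using pandas' sample to shuffle the rows (the random_state value here is an arbitrary choice):

    # Shuffle all rows: frac=1 samples 100% of the data in random order,
    # and reset_index gives the shuffled frame a clean 0..N-1 index
    shuffled = dataset.sample(frac=1, random_state=32).reset_index(drop=True)

    train = shuffled[:16000]  # first 16000 shuffled rows
    test = shuffled[16000:]   # remaining 4640 shuffled rows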

With sklearn, these steps are just one parameter away, and we don't need to worry about indexes at all.

Now let us see how to perform the same split using sklearn's train_test_split function.

train_test_split from Sklearn

from sklearn.model_selection import train_test_split

This awesome and heavenly function is imported from sklearn's model_selection module and is a piece of cake to use. After importing it, all we need to do is pass in the dataset and the test-set size that we want, and we are done.

The thing to keep in mind is that if we pass the dataset as a DataFrame, we get two DataFrames in return: one for training and one for testing.

training, testing = train_test_split(dataset, test_size=0.3, shuffle=True, random_state=32) 

We have given the following arguments to this function:

  1. dataset – the whole dataset that we have.
  2. test_size – the fraction of the data that we want in the test set (0.3 means 30%).
  3. shuffle=True – whether we want the dataset to be shuffled before making the split. If True, the rows are shuffled and then the split is made.
  4. random_state=any_number – if set, it does not matter how many times we make the split; we will get the same split every time.

By fixing random_state, we make sure the same rows land in the test set on every run, so the test set always remains safely out of the model's reach.
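As a quick illustration (the variable names here are just for this demo), two calls with the same random_state select exactly the same rows:

    # Two splits with the same random_state produce identical row selections
    train_a, test_a = train_test_split(dataset, test_size=0.3, shuffle=True, random_state=32)
    train_b, test_b = train_test_split(dataset, test_size=0.3, shuffle=True, random_state=32)

    print(train_a.index.equals(train_b.index))  # True
    print(test_a.index.equals(test_b.index))    # True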

Let us see what is in the training and testing variables returned by this function. training holds its share of the dataset, 14448 rows, and its indexes are shuffled as well.

training 

14448 rows × 10 columns

Similarly, testing contains 6192 rows, also with shuffled indexes. If we calculate 30% of this dataset, 0.3 × 20640 = 6192. Thus exactly 30% of our data is in the testing set.

testing 

6192 rows × 10 columns
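As a quick sanity check, the proportions work out exactly:

    # 14448 / 20640 = 0.7 and 6192 / 20640 = 0.3
    print(len(training) / len(dataset))  # 0.7
    print(len(testing) / len(dataset))   # 0.3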

Separating the Input and Output Variables

Now let us separate the input and output variables, with x as the input and y as the output. In this dataset, y will hold median_house_value, as it is the output variable, and the rest of the columns will be in x as input to the model.

This x will be the input on which the model is trained, along with y. Later, when the model makes predictions, it will predict y based on x. Let us split the dataset into x and y.

x = dataset.drop("median_house_value", axis=1) 

Here we are saying: drop median_house_value along axis=1, i.e. drop the whole column.

Note: axis=0 means rows and axis=1 means columns.

Thus we get an x that contains every column except median_house_value.

x 

20640 rows × 9 columns

Which indeed is the case.

Now let us create our y by writing the following code.

y = dataset['median_house_value']

Here we are saying: take the column named median_house_value and put it in y. As simple as it sounds.

y 

0    452600.0
1    358500.0
2    352100.0
3    341300.0
4    342200.0
...
Name: median_house_value, Length: 20640, dtype: float64

Thus we have x and y.
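A quick shape check confirms the separation:

    # x keeps 9 of the 10 original columns; y is a single column of 20640 values
    print(x.shape)  # (20640, 9)
    print(y.shape)  # (20640,)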

Now if we pass x and y to train_test_split instead of the dataset as a whole, we get 4 variables back in return: two for the training set (the training input and training output) and two for the testing set (the testing input and testing output).

The code below shows the call with x and y:

train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.3, shuffle=True, random_state=32)

Let us check the lengths of train_x, train_y, test_x, and test_y.

print(f"The length of Train_x: {len(train_x)} and length of train_y: {len(train_y)}")

The length of Train_x: 14448 and length of train_y: 14448

print(f"The length of Test_x: {len(test_x)} and length of test_y: {len(test_y)}")

The length of Test_x: 6192 and length of test_y: 6192

These are indeed the same lengths as above, when we passed in the whole dataset.
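Since the same random_state over the same number of rows should produce the same shuffling, the rows selected here should even match the earlier whole-dataset split. One way to check is to compare the indexes:

    # If the shuffling is identical, the x/y split selects the same rows
    # as the earlier whole-dataset split
    print(train_x.index.equals(training.index))  # expected: True
    print(test_x.index.equals(testing.index))    # expected: True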

Conclusion

In this post, we tried to understand the need for splitting the data and how to perform the split in an efficient way.

We learned the manual way as well as the modern, sklearn way that most machine learning practitioners use. I hope you understood the process and will now be able to do the same, knowing the why and the how of it.

That was all, thank you. Please leave a comment and share the post.


Written by shehzensidiq | I am a learner just like you. I am trying to explore the field of AI and its various aspects and dimensions.
Published by HackerNoon on 2022/03/01