A dataset is one of the most important parts of a machine learning project. Without data, machine learning is just the machine, and the learning is stripped from the title, which the machine and deep learning people will not stand for, and they will protest for the data 😅.

## Introduction

Since data is this important, no matter whether our dataset contains a few hundred rows or many thousands, we must keep some part of it aside for testing purposes. Once any software is ready, we test it before putting it on the market, where it will make decisions and affect people's lives; the model is expected to make no errors, or at least as few as possible. Testing is therefore an important part of development, and since in machine learning we work with data, we must set aside some data purely for testing.

The purpose of the testing data is this: once kept aside, it should never be made available to the machine learning model until training has finished. Then, and only then, do we introduce the testing data to the model that was trained on the remaining part of the same dataset.

Now that we understand what testing data is for, let us see how we keep this data aside. We will work with the California housing dataset from Kaggle: we will load it with pandas and then make the split. We can do the splitting in two ways:

- Manually, by choosing ranges of indexes
- Using the `train_test_split` function from Sklearn

```python
import pandas as pd

dataset = pd.read_csv("housing.csv")
dataset.head()
```

After the data has been loaded, `head()` shows the first 5 rows, giving us a glimpse of the dataset. Now that we have the full dataset in the `dataset` variable, we are ready to make the split. But first, let us check the length of the dataset, so that we can check it again after the split and verify that the split was successful.

```python
len(dataset)
```

```
20640
```

This simple `len()` function gives us the number of rows present in the dataset, which is 20640. Now let us see how to make the split manually using indexing.

## Using Manual Indexing

The idea in this method is to specify up to which index we want the training set, and from which index we want the testing set, or vice versa. The syntax is simple and straightforward: `dataset_variable[start:end]`.

```python
train = dataset[:16000]
test = dataset[16000:]
```

In the example above, we are simply saying the following:

- From the start (index 0) up to index 16000, take the data rows and place them in the `train` variable, i.e. the training set contains the first 16000 rows.
- From index 16000 to the end, put the rows in the testing set.

After running this code, if we check the lengths of `train` and `test`, we get 16000 and 4640.

```python
len(train)   # 16000
len(test)    # 4640
```

The catch is that we have to know the indexes up to which, or from which, we want the split. Another disadvantage is that the data rows are not shuffled here. If we want shuffling, we have to add an extra step to the process: before making the split, we manually shuffle the dataset and then do the index-based slicing, as sketched below.
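The post does not show that extra shuffling step, but here is a minimal sketch of what it could look like using pandas' `sample` method (`frac=1` returns all rows in random order; `random_state=32` is just an arbitrary choice so the shuffle is reproducible):

```python
# Shuffle all rows first, then reset the index so the slicing
# below behaves exactly like the unshuffled case.
shuffled = dataset.sample(frac=1, random_state=32).reset_index(drop=True)

# Same index-based split as before, now on shuffled rows.
train = shuffled[:16000]
test = shuffled[16000:]
```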
With sklearn, these steps are just one parameter away, and we don't need to worry about the indexes and other bookkeeping. Let us see how to perform the same split using Sklearn's `train_test_split` function.

## train_test_split from Sklearn

```python
from sklearn.model_selection import train_test_split
```

This awesome and heavenly function is imported from sklearn's `model_selection` module, and using it is a piece of cake. After importing, all we need to do is pass the dataset and the test set size that we want, and we are done. The thing to keep in mind is that if we give the dataset in DataFrame form, we get two DataFrames in return: one for training and one for testing.

```python
training, testing = train_test_split(dataset, test_size=0.3, shuffle=True, random_state=32)
```

We have given the following parameters to this function:

- `dataset` – the whole dataset that we have.
- `test_size` – the fraction of the data that we want for the test set (here 0.3, i.e. 30%).
- `shuffle=True` – whether we want our dataset to be shuffled before making the split. If True, the indexes are shuffled and then the split is made.
- `random_state=any_number` – if set, it does not matter how many times we make the split; we will get the same split every time.

By using this, we make sure that the test set always remains safely protected from the model. Let us see what is in the `training` and `testing` variables that were returned by this function.

```python
training
```

`training` contains its part of the dataset, with 14448 rows × 10 columns, and the indexes are shuffled as well.

```python
testing
```

Similarly, `testing` contains 6192 rows × 10 columns, also with shuffled indexes. If we calculate 30% of this dataset, it comes to 6192; thus 30% of our data is in the testing set.

## Separating the Input and Output Variables

Now let us separate the input and output variables, with x as input and y as output. In this dataset, y will hold `median_house_value`, since it is the output variable, and the rest of the columns will be in x as input to the model. x is what the model will be trained on, along with y; later, when the model makes predictions, it will predict y based on x. Let us split the dataset into x and y.

```python
x = dataset.drop("median_house_value", axis=1)
```

Here we are saying: drop `median_house_value` along `axis=1`, i.e. the whole column, in the direction of columns.

Note: `axis=0` means rows and `axis=1` means columns.

Thus x contains all the other columns and no `median_house_value`:

```python
x
```

x has 20640 rows × 9 columns, which is indeed the case. Now let us create y by writing the following code:

```python
y = dataset["median_house_value"]
```

Here we are saying: take the column named `median_house_value` and put it in y. As simple as it sounds.

```python
y
```

```
0    452600.0
1    358500.0
2    352100.0
3    341300.0
4    342200.0
Name: median_house_value, Length: 20640, dtype: float64
```

Thus we have x and y. Now, if we pass x and y to `train_test_split` instead of the dataset as a whole, we get 4 variables back in return: two for the training set (the input for training and the output for training) and, similarly, two for the testing set (the input for testing and the output for testing). The code below shows the output of the function with x and y:

```python
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.3, shuffle=True, random_state=32)
```

Let us check the lengths of `train_x` and `train_y`, and of `test_x` and `test_y`:

```python
print(f"The length of Train_x: {len(train_x)} and length of train_y: {len(train_y)}")
```

```
The length of Train_x: 14448 and length of train_y: 14448
```

```python
print(f"The length of Test_x: {len(test_x)} and length of test_y: {len(test_y)}")
```

```
The length of Test_x: 6192 and length of test_y: 6192
```

This is indeed the same as above, when we gave the whole dataset.
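As a quick sanity check (a sketch of my own, not from the original walkthrough; the variable names `a_x`, `b_x`, `a_y`, `b_y` are arbitrary), we can repeat the split with the same `random_state` and confirm both that the split is identical and that the shuffled inputs and outputs stay paired row for row:

```python
# Repeat the split with the same random_state; it should come out identical.
a_x, b_x, a_y, b_y = train_test_split(x, y, test_size=0.3, shuffle=True, random_state=32)

# The same rows ended up in the training set both times.
print(train_x.index.equals(a_x.index))          # True

# Inputs and outputs share the same (shuffled) row indexes.
print((train_x.index == train_y.index).all())   # True
```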
## Conclusion

In this post we tried to understand the need for splitting the data, and how to perform this split in an efficient way. We learned the manual way as well as the modern way, sklearn's `train_test_split`, which is what most machine learning practitioners use. I hope you understood the process and will now be able to do the same, knowing both the why and the how of it.

That was all. Thank you! Please leave a comment, and do share the post.