Law and data
There are many great resources for learning data science and machine learning out there, but the one thing that might be missing is a live account of a non-technical individual learning these skills. I use the term “outsider” in the title because I don’t feel like I have the typical background that most people on Kaggle do. I am not a machine learning expert, mathematician, or expert computer programmer. I have experience in finance and law, not computer science or statistics.
My goal in documenting my Kaggle progress is to:
This is by no means a perfect explanation of how to enter and win a Kaggle competition. This is my journey, and hopefully I will get immensely better at both Kaggle competitions and writing.
One of the toughest things about this was just committing to doing it. Learning these skills seems like an overwhelming task when new information is coming out every day. So I began looking for good resources that would give me a framework for where to start.
I have a little experience teaching myself Python over the last few years, and I decided working with real data in a competition would be the best way to learn. To prepare, I completed most of the learning courses on Kaggle and some MOOCs, and I dabbled in a couple of competitions (not live).
I needed a repeatable process I could use when approaching any Kaggle competition. The article How to “farm” Kaggle in the right way by Alex Kruegger was a very good resource for establishing a framework to approach Kaggle competitions. This is a Towards Data Science article written by a Kaggle Competitions Master on how he reached this feat in a short amount of time. He did a wonderful job of breaking down what goals can be accomplished on Kaggle and how to accomplish them.
The main takeaway for me was the iterative process he used in competitions. The article laid out which steps in a competition produce the best model and output. It then recommended going back through the public kernels after your submission and writing down any successful approaches so they can be added to your repertoire for future competitions.
One important point I want to make is the reason for implementing other Kagglers’ best solutions. We do not want to simply copy the best kernel in the competition and submit it to try to move up the leaderboard. We want to read the best Kaggle solutions and try to understand them so we can implement those solutions in future competitions. The point is not to medal in competitions, but to learn how to use these tools.
After reading the article I decided I would try to implement my framework while also learning from the MOOC “How to Win a Data Science Competition” on Coursera. This way I could learn new skills while also completing a real competition.
While working through this MOOC and competition I housed my Kaggle notebooks on GitHub. There are a few reasons for doing this. First, I want to learn how to use Git and GitHub (a few great articles are here and here). Second, I think this is a good way to keep track of my progress and look back on old notebooks at any time.
Predict Future Sales
I began my Kaggle journey by entering the playground contest “Predict Future Sales” because it is the competition that goes with the “How to Win a Data Science Competition” MOOC on Coursera. I did not register for the course; I simply audited it.
The first thing we need to do is get an understanding of the competition. To do this we read the information provided by the competition creators. This should give us an idea of the data and goals of the competition. I recommend competitions with smaller sets of data for beginners, as larger sets will be more difficult to control and possibly require more compute power than you have available. Next, we look at the rules of the competition, which outline how we are to compete. This will include the timeline, whether teams are allowed, and anything else the creators want to specify.
Finally, we look at the evaluation of the competition. This tells us what our final prediction value should look like and how our predictions are scored. Be careful at this point. Our output file is required to list the items sold each month. The training dataset, however, only has the number of items sold per day. So the first thing we need to do is change the items sold per day to items sold per month.
The first step in my Kaggle competition is the Exploratory Data Analysis (EDA). This is exactly what it sounds like: an analysis that explores the data so we can learn more about the datasets we’re using.
Most of my time was spent on learning about the dataset and fixing small mistakes I had made when preprocessing, cleaning, and engineering features. This process is focused on the small details. If you can get those right, you can save yourself significant time. The best way to do this is to explore the data after any changes.
From reading the notebooks of more experienced Kagglers, I have developed a simple system for understanding the data. I start by finding the shape of the data. This data was in tabular form (similar to an Excel spreadsheet), which means it has clearly identifiable rows and columns. Checking the shape of train (the name I am using for the training dataset; you can name your dataset anything you want) gives us an output describing the data.
The output shows the shape of the train and test datasets, with the number of rows on the left and the number of columns on the right. So train has 2,935,849 rows and 6 columns.
Next, I call a command to get an idea of what the values in the dataset look like.
We can view which features (columns) are numerical and which are categorical (words like the shop names), and get a feel for what kind of data we are dealing with.
There are many other functions which I use to understand the data. More of them can be found in the documentation for the pandas library, the Python library used for working with data structures and tables.
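A minimal sketch of these first checks, using a tiny made-up DataFrame in place of the real training file. The column names are modeled on the competition data but should be treated as assumptions, and these are common pandas calls rather than necessarily the exact ones I ran.

```python
import pandas as pd

# Hypothetical stand-in for the training data; the values are made up.
train = pd.DataFrame({
    "date": ["02.01.2013", "03.01.2013", "05.01.2013", "06.02.2013"],
    "date_block_num": [0, 0, 0, 1],
    "shop_id": [59, 25, 25, 25],
    "item_id": [22154, 2552, 2552, 2554],
    "item_price": [999.0, 899.0, 899.0, 1709.05],
    "item_cnt_day": [1.0, 1.0, -1.0, 1.0],
})

# shape is a (rows, columns) tuple.
n_rows, n_cols = train.shape
print(train.shape)

# head() shows the first rows so we can see what the values look like.
print(train.head())

# dtypes tells us which columns are numeric and which are text (object).
print(train.dtypes)

# describe() summarizes the numeric columns (count, mean, min, max, ...).
summary = train.describe()
print(summary)
```

Running these after every change to the data is a cheap way to catch small mistakes early.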
After exploring the data a little, I started to figure out the time series format. The data was given to us in days and we needed to group it into months for the submission. We do this with the sum function and then fill in the missing months with zero, as there were zero sales of the items in those months.
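Here is a rough sketch of that daily-to-monthly step on a toy dataset. The column names (date_block_num as a month index, item_cnt_day, item_cnt_month) are assumptions for illustration, and the real notebook may have built the grid of months differently.

```python
import pandas as pd

# Hypothetical daily sales; date_block_num is a month index (0, 1, 2, ...).
daily = pd.DataFrame({
    "date_block_num": [0, 0, 0, 2, 2],
    "shop_id":        [5, 5, 5, 5, 5],
    "item_id":        [10, 10, 11, 10, 10],
    "item_cnt_day":   [1, 2, 4, 1, 1],
})
keys = ["date_block_num", "shop_id", "item_id"]

# Sum the daily counts within each (month, shop, item) group.
monthly = (
    daily.groupby(keys, as_index=False)["item_cnt_day"]
         .sum()
         .rename(columns={"item_cnt_day": "item_cnt_month"})
)

# Build every (month, shop, item) combination for months 0..2,
# then fill the combinations that had no sales with zero.
full_index = pd.MultiIndex.from_product(
    [range(3), monthly["shop_id"].unique(), monthly["item_id"].unique()],
    names=keys,
)
monthly = (
    monthly.set_index(keys)
           .reindex(full_index, fill_value=0)
           .reset_index()
)
print(monthly)
```

Month 1 has no rows in the daily data, so after the reindex both items show zero sales for that month.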
Visualization is a way to understand and explore the data. For those starting out, I would advise looking at the Data Visualization course on Kaggle. Reading other notebooks is also an excellent way to learn many of the visualization techniques. I usually do a few basic things like making histograms and scatter plots.
These are two types of plots we make in Kaggle. They both depict the number of items sold in each month. Graphs and plots like these allow us to see overarching trends in the data.
B. Univariate and Bivariate Data Analysis
There are two ways to explore the data. The first is the univariate analysis, which uses visualizations and statistics to explore a single feature in the dataset. This is a useful initial evaluation. I explored each feature individually and made a short summary in my notebook. This allowed me to have a basic understanding of each feature and how it might impact the target feature.
Bivariate analysis can be more in-depth and revealing. It analyzes two features at once, so we are able to see how they correlate and which features impact others.
C. Preprocessing and Cleaning
Preprocessing and cleaning the data can be one of the most time-consuming tasks in the competition, but it is an important part of the EDA. We usually start by finding out how many missing values there are in the dataset. Most datasets have missing values, and pandas will usually mark them as NaN. We count how many missing values are in each column.
Once we have found the missing values we can decide what to do with them. We can drop the column if there are too many, or we can fill them with zero or another number such as the median value of that feature.
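A small sketch of counting and filling missing values, assuming a made-up DataFrame with hypothetical column names. Which fill strategy is right depends on the feature, so both options are shown.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps; np.nan is how pandas marks a missing number.
df = pd.DataFrame({
    "item_price":   [999.0, np.nan, 899.0, np.nan],
    "item_cnt_day": [1.0, 2.0, np.nan, 1.0],
    "shop_id":      [5, 5, 6, 6],
})

# Number of missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Option 1: fill with zero (reasonable when "missing" means "nothing sold").
df["item_cnt_day"] = df["item_cnt_day"].fillna(0)

# Option 2: fill with the median of the values that are present.
df["item_price"] = df["item_price"].fillna(df["item_price"].median())
```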
Preprocessing also consists of finding any outliers in the data, such as negative numbers or extremely large numbers. We drop these values because they can have an outsized effect on the model.
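As a sketch, outliers can be dropped with a boolean mask. The thresholds below are illustrative assumptions, not the values I used; in practice I would pick them after looking at a plot of each feature.

```python
import pandas as pd

# Hypothetical sales rows, including a negative price and two extreme values.
sales = pd.DataFrame({
    "item_price":   [999.0, -1.0, 899.0, 307980.0, 1200.0],
    "item_cnt_day": [1, 1, 2, 1, 1500],
})

# Illustrative cut-offs (assumptions for this example only).
MAX_PRICE = 100_000
MAX_COUNT = 1_000

# Keep rows whose price is positive and below the cap,
# and whose daily count is below the cap.
mask = (
    (sales["item_price"] > 0)
    & (sales["item_price"] < MAX_PRICE)
    & (sales["item_cnt_day"] < MAX_COUNT)
)
cleaned = sales[mask].reset_index(drop=True)
print(cleaned)
```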
Even after completing most of my EDA I still struggled to understand how the data was structured and how it could be used to predict the next month’s sales. Time series data is exactly what it sounds like, data which is recorded at different times.
In a classic classification challenge we are given data to train on, such as passengers on the Titanic and whether they lived or died. Then we are given completely different passengers in the test data and must determine whether they lived or died. In a time series competition the same item appears in both the training and testing datasets, but at different periods in time. In this competition, the training data holds items and which store they were sold in. We are then given the same items and stores in the test data, but at a later date, and we must predict how many are sold at this later time.
To try to understand this data I created a pivot table to train on before going in depth on feature engineering. I got the idea from another Kaggler's notebook.
In the pivot table, the first two columns list the shop and item id, and the next thirty-four are the consecutive months from January 2013 to October 2015. Each row lists the number of items sold in each month. I was now able to visualize what the time series data looked like, and I decided to run it through a model.
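A pivot table like this can be sketched with pandas as follows. The data and column names are made up, and only three months are used so the result is easy to read.

```python
import pandas as pd

# Hypothetical monthly sales in "long" format: one row per (month, shop, item).
monthly = pd.DataFrame({
    "date_block_num": [0, 1, 1, 2],
    "shop_id":        [5, 5, 6, 5],
    "item_id":        [10, 10, 11, 10],
    "item_cnt_month": [3, 1, 4, 2],
})

# One row per (shop, item) pair, one column per month.
# Months with no recorded sales are filled with zero.
pivot = monthly.pivot_table(
    index=["shop_id", "item_id"],
    columns="date_block_num",
    values="item_cnt_month",
    aggfunc="sum",
    fill_value=0,
).reset_index()
print(pivot)
```

Each month column can then be treated as a feature, with the final month as the value to predict.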
I decided to train a model on the pivot table to determine the feature importance. Because the months were my features, this showed me which months were most predictive (the most recent ones). Doing this helped me to understand the features and how the models worked.
The columns next to each version show the difference between this version and the last version you committed. The far right column (in red) is the number of lines of code you deleted or changed. The column to its left (in green) shows a positive number, which is the number of lines of code added. For example, in my latest version, I deleted (or changed) two lines of code and added nine lines of code.
After my initial model, I returned to the feature engineering step to add new predictive features. At this point the data is back in its original format and not in the pivot table.
We can add extra features (columns) to the dataset. This process is called feature engineering, and it is when we use existing features in the dataset to create new ones.
After looking at our feature importance and model performance on the pivot table, we were able to determine different months of sales had different predictive value. I decided that making a few moving average features should be a good start. I made a previous month feature, which created a column holding the value of the number of sales in the previous month. I then created a 6-month and 12-month moving average of sales for each item.
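Below is a hedged sketch of a previous-month feature and a moving average. It uses a 3-month window so the toy data stays small; swapping in 6 or 12 gives the windows described above. The column names are assumptions, and the data is assumed to have one row per month for each shop and item.

```python
import pandas as pd

# Hypothetical monthly sales, sorted by month within each (shop, item) series.
monthly = pd.DataFrame({
    "date_block_num": [0, 1, 2, 3, 0, 1, 2, 3],
    "shop_id":        [5] * 8,
    "item_id":        [10, 10, 10, 10, 11, 11, 11, 11],
    "item_cnt_month": [2, 4, 6, 8, 1, 1, 1, 1],
}).sort_values(["shop_id", "item_id", "date_block_num"])
keys = ["shop_id", "item_id"]

# Sales in the previous month (NaN for the first month of each series).
monthly["lag_1"] = monthly.groupby(keys)["item_cnt_month"].shift(1)

# Moving average of the previous 3 months. Shifting first keeps the
# current month's own sales (the target) out of its feature.
WINDOW = 3
monthly["ma_3"] = monthly.groupby(keys)["item_cnt_month"].transform(
    lambda s: s.shift(1).rolling(window=WINDOW, min_periods=1).mean()
)
print(monthly)
```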
Next, I did some simple feature engineering by creating columns with the maximum number of each item sold, and the maximum price the item sold at.
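These maximum features can be sketched with a grouped transform, which repeats each item's maximum on every row for that item. The names item_max_cnt and item_max_price are hypothetical.

```python
import pandas as pd

# Hypothetical monthly rows for two items.
sales = pd.DataFrame({
    "item_id":        [10, 10, 11, 11, 11],
    "item_cnt_month": [2, 5, 1, 3, 2],
    "item_price":     [100.0, 120.0, 50.0, 45.0, 55.0],
})

# Largest monthly count and highest price seen for each item.
sales["item_max_cnt"] = sales.groupby("item_id")["item_cnt_month"].transform("max")
sales["item_max_price"] = sales.groupby("item_id")["item_price"].transform("max")
print(sales)
```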
Finally, we move to a powerful form of feature engineering called mean encoding or target encoding. This was a form of feature engineering the MOOC focused on because it is the simplest way to raise the accuracy of your model. Mean encoding takes a feature, such as item id, and finds the mean (average) target value for that specific item.
In simple terms, if item #10 was sold twice this month and zero times next month then the mean encoded feature for this item would be 1 [(2+0) / 2]. This means the average number of items sold per month is one. If we make a feature with these values, we can produce a more accurate prediction.
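The item #10 example can be reproduced in a few lines. This is a bare-bones sketch with made-up data and a hypothetical column name (item_target_enc), without any of the regularization a real competition notebook would likely need.

```python
import pandas as pd

# Item 10 sold twice in month 0 and zero times in month 1.
train = pd.DataFrame({
    "date_block_num": [0, 1, 0, 1],
    "item_id":        [10, 10, 11, 11],
    "item_cnt_month": [2, 0, 4, 6],
})

# Mean encoding: replace each item id with the average target for that item.
train["item_target_enc"] = (
    train.groupby("item_id")["item_cnt_month"].transform("mean")
)
print(train)
```

Every row for item 10 now carries (2 + 0) / 2 = 1, and every row for item 11 carries (4 + 6) / 2 = 5.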
We cannot forget about our validation strategy
When we build a model we need to validate it. Validation involves splitting the training dataset into two parts. One set will be the data we use to train our model and the other part will be used to test the accuracy of our model. The validation set acts as our test set, and allows us to test our models before running them on our test values.
The key to our validation strategy is when we split the training dataset. We need to split it before certain forms of feature engineering to make sure validation gives us an accurate prediction of how our model will perform. This means we perform our train/validation split before we mean encode, or there will be data leakage.
Data leakage is when values from the validation set leak into our training data and cause our model to perform better than it would in the real world. A simple example of data leakage would be mean encoding all our data including the validation dataset.
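To make the ordering concrete, here is a sketch that splits first and only then mean encodes. Holding out the last month as the validation set is my assumption for a time series problem; the data and column names are made up.

```python
import pandas as pd

data = pd.DataFrame({
    "date_block_num": [0, 1, 2, 0, 1, 2],
    "item_id":        [10, 10, 10, 11, 11, 11],
    "item_cnt_month": [2, 0, 7, 4, 6, 20],
})

# 1. Split first: the most recent month becomes the validation set.
last_month = data["date_block_num"].max()
train = data[data["date_block_num"] < last_month].copy()
valid = data[data["date_block_num"] == last_month].copy()

# 2. Compute the item means from the training rows only.
item_means = train.groupby("item_id")["item_cnt_month"].mean()
global_mean = train["item_cnt_month"].mean()

# 3. Map those means onto both sets. Items never seen in training
#    fall back to the overall training mean.
train["item_target_enc"] = train["item_id"].map(item_means)
valid["item_target_enc"] = valid["item_id"].map(item_means).fillna(global_mean)

# For comparison: encoding on ALL rows lets validation targets leak in.
leaky_means = data.groupby("item_id")["item_cnt_month"].mean()
print(item_means[10], leaky_means[10])
```

The leaky mean for item 10 is pulled toward the validation month's sales, which is exactly the information the model should not have seen.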
After we finish feature engineering we are ready to begin building our models. The modeling portion of data science is probably the most talked about portion. However, in reality this is where you will spend the least amount of your time. And even the time spent here will be mostly spent waiting for a model to train. Most of the work will be done in exploring, cleaning, and processing the data, and feature engineering.
Modeling is still, however, an important part of the Kaggle competition. If you want an overview of the different models I suggest taking some of the Kaggle Learn courses such as Intro to Machine Learning. The MOOC I am taking also has some excellent sections on modeling and how to tune hyperparameters. These skills are important for competitions, but in the beginning a much higher score will be achieved by simply focusing on good data processing habits and creating good features.
The main models I focused on and learned about in the beginning were Decision Trees, Random Forests, Linear/Logistic Regression, K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Naïve Bayes Classifier, and Gradient Boosting models such as XGBoost and CatBoost.
We have our training data split into train and validation, so for each set we need to prepare our data for the model we are going to use. To prepare it I like to separate the train data into two sections: the input (X) and the output (y). These two parts of the data are used to train the model. We take the train data and make the input by dropping the target column.
Then we create the output dataset containing only the target value column. Now, when we train the model, it knows the input will be all of the features we are given, and the output will be the target value we are trying to predict. We do the exact same thing for the validation set we have created.
Once we have done this, we train the model on the training set data (X & y). Then we try to find the target value of the validation set using the model we recently trained. So we input only the X portion of the validation data into the model. We then take the output the model gives us and compare it to the true target value for the validation data (y).
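The whole loop (separate X and y, train, predict on the validation inputs, compare with the true values) can be sketched as below. To keep the example dependency-free I use a plain least-squares fit in NumPy as a stand-in for whatever model you choose, and RMSE as an assumed error measure. The feature and target names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical, already-engineered data. The target is the last column.
train = pd.DataFrame({
    "lag_1":           [1.0, 2.0, 3.0, 4.0],
    "item_target_enc": [2.0, 1.0, 4.0, 3.0],
    "item_cnt_month":  [5.0, 4.0, 11.0, 10.0],
})
valid = pd.DataFrame({
    "lag_1":           [5.0, 6.0],
    "item_target_enc": [5.0, 2.0],
    "item_cnt_month":  [15.0, 10.0],
})
TARGET = "item_cnt_month"

# Input (X) is everything except the target; output (y) is the target alone.
X_train, y_train = train.drop(columns=[TARGET]), train[TARGET]
X_valid, y_valid = valid.drop(columns=[TARGET]), valid[TARGET]

def add_bias(X):
    """Prepend a column of ones so the fit can learn an intercept."""
    return np.column_stack([np.ones(len(X)), X.to_numpy()])

# Stand-in model: ordinary least squares fitted on the training set.
coef, *_ = np.linalg.lstsq(add_bias(X_train), y_train.to_numpy(), rcond=None)

# Predict from the validation inputs only, then compare with the truth.
valid_pred = add_bias(X_valid) @ coef
rmse = float(np.sqrt(np.mean((valid_pred - y_valid.to_numpy()) ** 2)))
print(rmse)
```

With a library model the fit and predict lines change, but the X/y separation and the comparison against y_valid stay the same.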
There is another way to get an even higher score in a competition. It is called stacking models or the ensemble method.
In this method we create another model which makes predictions based on the outputs of the first models we used. The thinking behind this is that each model we used above performs differently. The Random Forest and Linear Regression models create outputs in different ways, and therefore they are each more accurate on different portions of the data. So even though a single model such as random forest may have the best score, we can get an even better score by incorporating all of the models.
For this competition I used a simple Linear Regression model for the ensemble. The idea and process for doing this came from another Kaggler's notebook, in which he creates a simple ensemble model using linear regression on all of the predictions made by his other models.
The picture above is the one used to describe the model in that notebook. I did not create this image.
This seemed like the best way to get introduced to model ensembles. It is relatively simple, although it did not have any meaningful impact on my score. It's not something I am too worried about as a beginner, but it is a very powerful tool that I hope to master in the future.
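The idea of a linear regression ensemble can be sketched with NumPy alone. The two prediction arrays below are made up and stand in for the validation predictions of any two first-level models; this is my simplified illustration, not the notebook's actual code.

```python
import numpy as np

# Hypothetical validation predictions from two first-level models,
# along with the true validation targets.
pred_model_a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
pred_model_b = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
y_valid      = np.array([1.4, 1.6, 3.4, 3.6, 5.4])

# Second-level features: a bias column plus one column per model.
meta_X = np.column_stack([np.ones_like(pred_model_a), pred_model_a, pred_model_b])

# Linear regression on the predictions (least squares fit).
weights, *_ = np.linalg.lstsq(meta_X, y_valid, rcond=None)
stacked_pred = meta_X @ weights

def rmse(pred, truth):
    """Root mean squared error between predictions and true values."""
    return float(np.sqrt(np.mean((pred - truth) ** 2)))

print(rmse(pred_model_a, y_valid), rmse(pred_model_b, y_valid))
print(rmse(stacked_pred, y_valid))
```

Scoring the ensemble on the same rows it was fitted on is optimistic, so in a real notebook the second-level model should be checked on data it has not seen.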
During the commit process I ran into a few problems in some of my notebooks. The main problem was trouble committing the code. It appeared the data was too large for the Kaggle environment. This may not be a problem if you have your own private environment you run your models in, but for a beginner like me who is doing most of his work on Kaggle, this was a problem.
I found a couple of solutions to the problem, which involved downsizing the data. Hopefully someone will correct me if I am wrong, but downsizing is when you change the data type of each column to a smaller one. Large 64-bit data types, which are the default when pandas loads a dataset, take up much more space in a notebook's memory. If the values in a column are not extremely large, you should be able to use a smaller data type such as a 16-bit or 32-bit one. (If you want to learn more about data size and bytes, there are some good resources on Coursera.)
I found a couple of scripts online that solved this problem and downsized the data when it was loaded into the notebook (here and here). This saved me a lot of time running the notebooks and allowed Kaggle to process all of the data. Even if you have not encountered this problem yet, it is definitely important to study. As data continues to grow we will have larger and larger datasets that will require skill in manipulating data types and size. Downsizing should not cause any data loss as long as every value fits within the range (and, for decimals, the precision) of the smaller type.
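I am not reproducing those scripts here. As a rough equivalent, this sketch uses pandas' built-in downcast option to shrink each numeric column to the smallest type that holds its values. The DataFrame is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical data stored with the default 64-bit types.
df = pd.DataFrame({
    "shop_id":        np.arange(1000, dtype="int64") % 60,
    "item_cnt_month": (np.arange(1000) % 20).astype("float64"),
})
before = df.memory_usage(deep=True).sum()

# Downcast each numeric column to the smallest type that fits its values.
for col in df.columns:
    if pd.api.types.is_integer_dtype(df[col]):
        df[col] = pd.to_numeric(df[col], downcast="integer")
    elif pd.api.types.is_float_dtype(df[col]):
        df[col] = pd.to_numeric(df[col], downcast="float")

after = df.memory_usage(deep=True).sum()
print(df.dtypes)
print(before, after)
```

Comparing the memory usage before and after shows how much space the smaller types save.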
I’m not promising that doing anything in this notebook will help you win a Kaggle competition. My submission scores are about average (50%), which isn’t great, but also isn’t horrible for my first real attempt. What I am hoping to do is to share my process in an understandable way. Hopefully this will inspire others to dive in and feed their curiosity.