A Complete Guide to XGBoost Model in Python using scikit-learn

by divyesh.aegis, September 5th, 2019

Too Long; Didn't Read

A Complete Guide to XGBoost Model in Python using scikit-learn. Boosting is a technique that can be used to solve complex, data-driven real-world problems. XGBoost is a more advanced version of the gradient boosting method, and its main aim is to increase speed and computational efficiency. The article walks through adaptive boosting, gradient boosting, and XGBoost, and includes example code using pandas, XGBClassifier, and scikit-learn.


The immeasurable amount of data being generated has created a need for more advanced and sophisticated machine learning techniques. Boosting is one such technique that can be used to solve complex, data-driven real-world problems.

Why is boosting used?

What does boosting mean in machine learning?

How does the boosting algorithm work?

What are the different types of boosting?

  1. Adaptive boosting
  2. Gradient boosting
  3. XGBoost

How can boosting machine learning algorithms be used to improve the accuracy of a model?

Why exactly are we using boosting machine learning techniques?

Let's understand what led to the need for boosting in machine learning. To solve complex and convoluted problems, we require more advanced techniques.

There are three main classes of boosting: Adaptive Boosting, Gradient Boosting, and XGBoost.

Adaptive Boosting is implemented by combining several weak learners into a single strong learner. Adaptive boosting starts by assigning equal weights to all of your data points. You then draw out a decision stump for a single input feature, and the results from this first decision stump are analyzed.

If any observations are misclassified, they are assigned higher weights. A new decision stump is then drawn that treats the observations with higher weights as more significant.

In other words, whichever data points were misclassified are given a higher weight, and in the next step another decision stump is drawn that tries to classify the data points by giving more importance to those with higher weights.

Adaptive Boosting keeps looping in this way until all the observations fall into the right class. The end goal is to make sure that all your data points are classified into the correct classes.
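As a minimal sketch of this idea (the train/test split into x_train, y_train, x_test, y_test is assumed and not shown in the article), adaptive boosting with decision stumps can be run through scikit-learn's AdaBoostClassifier:

# rough sketch: adaptive boosting with scikit-learn (x_train, y_train, x_test, y_test assumed)
from sklearn.ensemble import AdaBoostClassifier

# the default weak learner is a depth-1 decision tree, i.e. a decision stump
ada_model = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=1)
ada_model.fit(x_train, y_train)
print(ada_model.score(x_test, y_test))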

Gradient boosting is also based on a sequential ensemble learning model. The base learners are generated sequentially in such a way that the current base learner is always more effective than the previous one, so the overall model improves with each iteration.

The difference in this type of boosting is that the weights for misclassified outcomes are not incremented. Instead, in gradient boosting you try to optimize the loss function of the previous learner by adding a new model that combines weak learners.

This is done to reduce the loss function. The main idea is to overcome the errors of the previous learner's predictions.

Gradient Boosting has three main components: the loss function that needs to be optimized (i.e. the error to be reduced), the weak learners, and the additive model that keeps adding learners to reduce the loss left by the previous ones. Just like adaptive boosting, gradient boosting can be used for both classification and regression.
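As a rough sketch of this additive process (again assuming the same x_train, y_train, x_test, y_test split exists), scikit-learn's GradientBoostingClassifier fits each new tree to the errors left by the current ensemble:

# rough sketch: gradient boosting with scikit-learn (data split assumed as above)
from sklearn.ensemble import GradientBoostingClassifier

# each new tree is fit to the residual errors of the current ensemble,
# which is how the loss function is reduced step by step
gb_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                      max_depth=3, random_state=1)
gb_model.fit(x_train, y_train)
print(gb_model.score(x_test, y_test))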

XGBoost is also able to handle missing values internally. Basic usage of the model is shown in the following code.

# a basic XGBoost classifier; x_train, y_train, x_test, y_test come from an existing train/test split
import xgboost as xgb

model = xgb.XGBClassifier(random_state=1, learning_rate=0.01)
model.fit(x_train, y_train)
model.score(x_test, y_test)
0.82702702702702702
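To illustrate the point about missing values (this snippet is illustrative and not from the original article), XGBoost can be trained directly on a feature matrix containing NaN entries, without a separate imputation step:

# illustrative only: XGBoost accepts NaN entries directly in the feature matrix
import numpy as np
import xgboost as xgb

x_demo = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 4.0], [5.0, 6.0]])
y_demo = np.array([0, 1, 0, 1])

demo_model = xgb.XGBClassifier(n_estimators=10, random_state=1)
demo_model.fit(x_demo, y_demo)        # no imputation step is required
print(demo_model.predict(x_demo))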

Parameters:

  • nthread
  • eta
  • min_child_weight
  • max_depth
  • max_leaf_nodes
  • gamma
  • subsample
  • colsample_bytree
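As a rough illustration (the parameter values below are arbitrary, not recommendations from the article), most of these can be passed directly to XGBClassifier through its scikit-learn interface:

# illustrative parameter settings for XGBClassifier (values are arbitrary)
import xgboost as xgb

tuned_model = xgb.XGBClassifier(
    n_jobs=4,               # number of parallel threads (nthread in the native API)
    learning_rate=0.1,      # eta in the native API
    min_child_weight=1,
    max_depth=4,
    gamma=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=1,
)
tuned_model.fit(x_train, y_train)   # x_train, y_train assumed from an earlier split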

XGBoost is an advanced version of gradient boosting

It stands for extreme gradient boosting, and it comes out of the distributed machine learning community. XGBoost is a more advanced version of the gradient boosting method, and the main aim of this algorithm is to increase speed and computational efficiency.

Why this model?

The following code is for XGBoost.

# importing required libraries
import pandas as pd
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
 
# read the train and test dataset
train_data = pd.read_csv('train-data.csv')
test_data = pd.read_csv('test-data.csv')
 
# shape of the dataset
print('Shape of training data :',train_data.shape)
print('Shape of testing data :',test_data.shape)
 
# Now, we need to predict the missing target variable in the test data
# target variable - Survived
 
# separate the independent and target variables in the training data
train_x = train_data.drop(columns=['Survived'],axis=1)
train_y = train_data['Survived']
 
# separate the independent and target variables in the testing data
test_x = test_data.drop(columns=['Survived'],axis=1)
test_y = test_data['Survived']

Create the object of the XGBoost model

You can also add other parameters and test your code here. Some settings worth trying are max_depth and n_estimators.

Read: the documentation of xgboost

model = XGBClassifier()
 
# fit the model with the training data
model.fit(train_x,train_y)
 
 
# predict the target on the train dataset
predict_train = model.predict(train_x)
print('\nTarget on train data',predict_train) 
 
# Accuracy Score on train dataset
accuracy_train = accuracy_score(train_y,predict_train)
print('\naccuracy_score on train dataset : ', accuracy_train)
 
# predict the target on the test dataset
predict_test = model.predict(test_x)
print('\nTarget on test data',predict_test) 
 
# Accuracy Score on test dataset
accuracy_test = accuracy_score(test_y,predict_test)
print('\naccuracy_score on test dataset : ', accuracy_test)

XGBoost was introduced because the gradient boosting algorithm computes its output at a slow rate: the data set is analyzed sequentially, which takes a long time.

XGBoost focuses on speed and model efficiency. To achieve this, it has a couple of features: it supports parallelization during tree construction, and it offers distributed computing methods for evaluating large and complex models.
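As a small, hypothetical illustration (these particular settings are not from the article), the scikit-learn wrapper exposes this through parameters such as n_jobs and tree_method:

# illustrative: using all CPU cores and the histogram-based tree builder
import xgboost as xgb

fast_model = xgb.XGBClassifier(
    n_jobs=-1,            # use all available cores for tree construction
    tree_method="hist",   # histogram-based split finding, typically much faster
    n_estimators=100,
    random_state=1,
)
fast_model.fit(train_x, train_y)   # train_x, train_y from the earlier Titanic example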

One question that comes up again and again in my classes is, "Where can I get data?" There are a few answers to this question, but the best solution depends on what you are trying to learn. Data comes in all shapes and sizes.

Remember, some of the best learning comes from playing with the data. Having a question in mind that you are trying to answer with the data is a good start.

Machine learning is built up from a diverse set of tools, languages, and techniques. It's fair to say that no one solution fits most projects.

Back Propagation Algorithm- Robust Mechanism

For a neural network to learn, you have to adjust the weights to get rid of most of the error. This can be done by performing backpropagation of the error. For a simple neuron that uses the sigmoid function as its activation function, the error and the resulting weight update can be demonstrated as shown below.

Consider the general case where the weight is denoted W and the inputs X. With this formulation, the weight adjustment can be generalized, and you can see that it only requires information from the neighbouring neuron levels. This is why it is a robust mechanism for learning, and it is known as the back-propagation algorithm.
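As a minimal sketch of that idea (the variable names, data values, and learning rate below are illustrative, not from the article), the weight update for a single sigmoid neuron trained with a squared-error loss can be written as:

# illustrative back-propagation step for one sigmoid neuron with squared-error loss
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([0.5, 0.8])          # inputs to the neuron
W = np.array([0.1, -0.2])         # current weights
target = 1.0                      # desired output
lr = 0.1                          # learning rate

output = sigmoid(np.dot(W, X))            # forward pass
error = output - target                   # prediction error
grad = error * output * (1 - output) * X  # chain rule: dE/dW for sigmoid + squared error
W = W - lr * grad                         # weight update (gradient descent step)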

Often in practice, examples of some class will be underrepresented in your training data. This is the case, for example, when your classifier has to distinguish between genuine and fraudulent e-commerce transactions: the patterns of genuine sales are much more frequent. If you use an SVM with a soft margin, you can define a cost for misclassified examples. Because noise is always present in the training data, there is a high chance that many instances of genuine transactions will end up on the wrong side of the decision boundary and contribute to the cost.
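One common way to encode such a cost in scikit-learn (an illustrative snippet, not part of the original article) is the class_weight parameter of the soft-margin SVM, which penalizes mistakes on the rare class more heavily:

# illustrative: penalizing errors on the rare (fraud) class more heavily
from sklearn.svm import SVC

# class 1 (fraud) is assumed to be the underrepresented class here
svm_model = SVC(C=1.0, kernel="rbf", class_weight={0: 1, 1: 10})
svm_model.fit(x_train, y_train)   # x_train, y_train assumed to be a labelled transaction set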

Other Methods Without Splitting the Training Data

Instead of splitting the training data across multiple models, we can use another method, boosting, to optimize the best weighting scheme for a training set.

Given a binary classification model like SVMs, decision trees, Naive Bayes classifiers, or others, we can boost the training data to improve the results. Assuming you have a training set similar to the one just described, with 1,000 data points, we usually operate under the premise that all data points are necessary and that they are of equal importance.

Boosting starts with the same assumption, that all data points are equal. But we intuitively know that not all training points are equally important. What if we could optimally weight each input based on how relevant it is? That is what boosting aims to do. Many algorithms can do boosting, but the most popular is XGBoost.
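A small, illustrative way to see this per-example weighting (the weights below are made up purely for demonstration) is the sample_weight argument that XGBoost's scikit-learn interface accepts at fit time; boosting then further re-weights hard examples internally as it adds trees:

# illustrative: supplying per-example weights when fitting an XGBoost classifier
import numpy as np
import xgboost as xgb

# pretend the last portion of the training set is considered more relevant
weights = np.ones(len(train_x))
weights[-100:] = 2.0                     # arbitrary emphasis for demonstration

weighted_model = xgb.XGBClassifier(n_estimators=100, random_state=1)
weighted_model.fit(train_x, train_y, sample_weight=weights)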