Difference Between Boosting Trees: Updates to Classics With CatBoost, XGBoost and LightGBM

Written by teenl0ve | Published 2023/04/17
Tech Story Tags: ai | ml | ml-algorithm | ai-technology | machine-learning-ai | ai-top-story | gradient-boosting | data-science

TL;DR: This publication discusses the differences between popular boosting tree algorithms such as CatBoost, XGBoost, and LightGBM. It covers the historical development of boosting techniques, starting with AdaBoost and moving on to Gradient Boosting Machines (GBM), XGBoost, LightGBM, and CatBoost. Each algorithm has unique features and strengths: CatBoost excels at handling categorical features, XGBoost offers high performance and regularization, and LightGBM focuses on speed and efficiency. The choice of algorithm depends on the problem and dataset, and it is recommended to try all three to find the best fit.

Classical Algorithm Idea

  1. Initialize the ensemble with a first simple constant algorithm b0.

  2. On each iteration, compute the shift vector s = (s1, ..., sl), where si is the antigradient of the loss function with respect to the current ensemble prediction on training object xi.

  3. Fit the next base algorithm bN to this shift vector, so that bN(xi) ≈ si on the training sample.

  4. Finally, add the algorithm bN to the ensemble, scaled by a step size found via line search (or by a fixed learning rate).
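
In formulas, a minimal sketch of these steps in standard gradient-boosting notation (assuming the base learner is fit to the antigradient by least squares; the symbols a_N and γ_N are introduced here only for illustration):

a_N(x) = \sum_{n=0}^{N} \gamma_n b_n(x), \qquad
s_i = -\left.\frac{\partial L(y_i, z)}{\partial z}\right|_{z = a_{N-1}(x_i)},

b_N = \arg\min_{b} \sum_{i=1}^{l} \big(b(x_i) - s_i\big)^2, \qquad
\gamma_N = \arg\min_{\gamma} \sum_{i=1}^{l} L\big(y_i,\; a_{N-1}(x_i) + \gamma\, b_N(x_i)\big).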

Simple Python implementation using decision trees as weak learners:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

class GradientBoosting:
    def __init__(self, n_estimators=100, learning_rate=0.1):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.trees = []
        self.init_prediction = None

    def _mse_gradient(self, y_true, y_pred):
        # Gradient of the squared-error loss with respect to the prediction
        return -(y_true - y_pred)

    def fit(self, X, y):
        n = len(y)
        # Start from a constant prediction: the mean of the targets
        self.init_prediction = np.mean(y)
        F = np.full(n, self.init_prediction)
        for m in range(self.n_estimators):
            # Fit each tree to the negative gradient (the residuals y - F)
            residuals = -self._mse_gradient(y, F)
            tree = DecisionTreeRegressor(max_depth=2)
            tree.fit(X, residuals)
            self.trees.append(tree)
            # Use a fixed learning rate as the step size instead of a line search
            gamma = self.learning_rate
            F += gamma * tree.predict(X)
        return self

    def predict(self, X):
        n = len(X)
        # Reuse the constant initial prediction learned in fit (y is not available here)
        F = np.full(n, self.init_prediction)
        for tree in self.trees:
            F += self.learning_rate * tree.predict(X)
        return F


# Usage example (load_your_data and X_test are placeholders for your own data):
X, y = load_your_data()
gb = GradientBoosting(n_estimators=100, learning_rate=0.1)
gb.fit(X, y)
y_pred = gb.predict(X_test)

This is a basic implementation for demonstration purposes. Something similar, with additional optimizations, can be found in the scikit-learn library as sklearn.ensemble.GradientBoostingRegressor and sklearn.ensemble.GradientBoostingClassifier.
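
For comparison, here is a minimal sketch of the same workflow with scikit-learn's built-in estimator (the synthetic data from make_regression is just a placeholder for your own dataset):

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data as a stand-in for a real dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=2)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)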

Chronology of Development and Updates

AdaBoost (1995):

The concept of boosting was introduced by Robert Schapire in 1990, but it gained widespread attention with the development of the AdaBoost algorithm by Yoav Freund and Robert Schapire in 1995.

AdaBoost, short for Adaptive Boosting, is an ensemble learning method that combines multiple weak classifiers to form a strong classifier.

It works by iteratively training classifiers on weighted training data, where the weights are updated based on the errors made by the previous classifiers.
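
For illustration, here is a minimal AdaBoost sketch using scikit-learn's AdaBoostClassifier (the synthetic dataset is just a placeholder; by default the weak learners are depth-1 decision trees, i.e. stumps):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data as a stand-in for a real dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# The default weak learner is a depth-1 decision tree (a "stump")
model = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=42)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)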

Gradient Boosting Machines (1999):

In 1999, Jerome H. Friedman introduced the Gradient Boosting Machines (GBM) algorithm. GBM is another ensemble learning method that uses gradient descent to optimize the loss function.

Instead of adjusting the weights of training instances, GBM adds new decision trees to the model to correct the residual errors made by the existing trees. This way, GBM combines multiple weak decision trees into a single powerful model.

XGBoost (2014):

While the core idea of boosting remains the same—building an ensemble of weak learners (decision trees) sequentially to minimize the loss function—XGBoost incorporates a number of enhancements for better performance and efficiency:

  • XGBoost, short for "Extreme Gradient Boosting", was developed by Tianqi Chen in 2014.

  • Uses a second-order Taylor expansion of the loss for its approximation, leading to faster convergence and better performance (see the objective sketch after the code example below).

  • It uses a regularized boosting technique, which controls the model complexity and helps prevent overfitting.

  • Offers support for parallel and distributed computing, making it faster than traditional gradient-boosting methods.

  • Has a slightly steeper learning curve compared to CatBoost and LightGBM, as it may require more hyperparameter tuning for optimal performance.

  • Tree structure: CART (Classification And Regression Trees).

  • Code Example:
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# load_boston was removed from scikit-learn; the California housing data is used instead
data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.25)
model = xgb.XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=6)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
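
To make the second-order Taylor expansion and the regularization term concrete, here is a brief sketch of the regularized objective in the notation of the original XGBoost paper (g_i and h_i are the first and second derivatives of the loss at the current prediction, T is the number of leaves, and λ, γ are regularization hyperparameters):

\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \Big[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Big] + \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2,
\qquad
w_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda},

where w_j^{*} is the optimal weight of leaf j and I_j is the set of training instances falling into that leaf.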

LightGBM (2016):

  • LightGBM, short for "Light Gradient Boosting Machine", was developed by Microsoft in 2016.

  • Uses Gradient-based One-Side Sampling (GOSS). It keeps all the instances with large gradients (indicating larger errors) and randomly samples a smaller proportion of instances with small gradients, resulting in faster training without sacrificing accuracy (a minimal sketch of this sampling step follows the code example below).

  • The Exclusive Feature Bundling (EFB) technique bundles mutually exclusive features (features that rarely take nonzero values at the same time) into single features, which reduces the effective number of features.

  • Tree structure: Leaf-wise trees, choosing the leaf with the maximum delta loss to split.

  • Code Example:
import lightgbm as lgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# load_boston was removed from scikit-learn; the California housing data is used instead
data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.25)
train_data = lgb.Dataset(X_train, label=y_train)
params = {"objective": "regression", "metric": "rmse", "learning_rate": 0.1, "num_leaves": 31}
model = lgb.train(params, train_data, num_boost_round=100)
predictions = model.predict(X_test)
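
To illustrate the GOSS idea mentioned above, here is a minimal NumPy sketch of the sampling step; the function name goss_sample and the a/b fractions are illustrative, and LightGBM's internal implementation differs in detail:

import numpy as np

def goss_sample(gradients, a=0.2, b=0.1, rng=None):
    """Sketch of GOSS: keep the top-a fraction of instances by |gradient|,
    randomly sample a b fraction of the rest, and up-weight the sampled
    small-gradient instances to keep the loss estimate roughly unbiased."""
    rng = np.random.default_rng(rng)
    n = len(gradients)
    top_k = int(a * n)
    rand_k = int(b * n)

    order = np.argsort(-np.abs(gradients))   # sort by |gradient|, descending
    top_idx = order[:top_k]                  # large-gradient instances: always kept
    rest_idx = order[top_k:]
    sampled_idx = rng.choice(rest_idx, size=rand_k, replace=False)

    idx = np.concatenate([top_idx, sampled_idx])
    weights = np.ones(len(idx))
    weights[top_k:] = (1.0 - a) / b          # compensate for under-sampling small gradients
    return idx, weights

# Example: sample from 1,000 instances with random gradients
grads = np.random.default_rng(0).normal(size=1000)
idx, w = goss_sample(grads, a=0.2, b=0.1, rng=0)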

CatBoost (2017):

  • CatBoost, short for "Category Boosting", was developed by Yandex in 2017.

  • Handles categorical features natively by transforming them into numerical values with ordered target statistics, so no manual encoding is required (see the example after the code block below).

  • Uses a symmetric tree growth algorithm, where every node at the same depth has the same split condition. This structure reduces the complexity of the model and speeds up the training process, while still allowing for accurate predictions.

  • Ordered Boosting to reduce overfitting. In traditional gradient boosting, the models are trained sequentially, with each model trying to correct the errors made by its predecessors. This process can lead to overfitting, especially when the training dataset is small or noisy.

    To address this issue, ordered boosting takes a different approach. CatBoost generates several random permutations of the training examples and, for each example, computes its residual (and the target statistics used to encode its categorical features) using only a model built from the examples that precede it in the permutation.

    Because no example is ever evaluated with a model that has already seen its own target, this ordered scheme avoids the target leakage (prediction shift) that arises in traditional boosting, giving a more honest estimate of performance on unseen data and thus reducing overfitting.

  • Tree structure: Symmetric trees.

  • Code Example:
import catboost as cb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# load_boston was removed from scikit-learn; the California housing data is used instead
data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.25)
model = cb.CatBoostRegressor(iterations=100, learning_rate=0.1, depth=6)
model.fit(X_train, y_train, verbose=False)
predictions = model.predict(X_test)
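
Because native handling of categorical features is CatBoost's headline strength, here is a minimal sketch of passing raw string categories directly via the cat_features argument; the small synthetic DataFrame is purely illustrative:

import numpy as np
import pandas as pd
import catboost as cb

# Tiny synthetic dataset with one raw categorical column and one numeric column
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "city": rng.choice(["London", "Paris", "Berlin"], size=200),
    "size": rng.normal(100, 20, size=200),
})
y = df["size"] * 0.5 + (df["city"] == "Paris") * 10 + rng.normal(0, 1, size=200)

# cat_features tells CatBoost which columns to encode internally (no manual encoding needed)
model = cb.CatBoostRegressor(iterations=100, learning_rate=0.1, depth=4, verbose=False)
model.fit(df, y, cat_features=["city"])
predictions = model.predict(df)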

Conclusion

In conclusion, the evolution of gradient boosting techniques, from the pioneering AdaBoost to the more recent and sophisticated XGBoost, LightGBM, and CatBoost, has led to significant improvements in both the performance and efficiency of ensemble learning methods.

CatBoost is particularly useful when dealing with categorical features, XGBoost is known for its high performance and regularization capabilities, and LightGBM is designed for efficiency and speed.

Choosing the right library depends on the specific problem you are trying to solve and the characteristics of your dataset. It is generally a good idea to try all three to see which one performs best for your particular use case.

And, as we continue to push the boundaries of what is possible with machine learning, the lessons learned from the development and application of gradient-boosting techniques will no doubt continue to guide and inspire future innovations in the field.


Written by teenl0ve | Data Science expert with desire to help companies advance by applying AI for process improvements.
Published by HackerNoon on 2023/04/17