
The Evolution of Decision Trees: From Shannon Entropy to Modern Applications and Specialties

by Valentine Shkulov, April 14th, 2023

Too Long; Didn't Read

Decision trees are a versatile and interpretable machine learning technique, with roots in the early 1960s. Modern decision tree algorithms have evolved to address limitations like overfitting and handling continuous features. The true potential of decision trees is realized within ensemble methods like Isolation Forests, Random Forests, and Boosted Trees, which improve accuracy, robustness, and generalizability. Understanding these timeless classics is essential for researchers and practitioners in the field.


How it started…

The journey toward today's Kaggle-winning approaches started in the mid-20th century, and decision trees have undergone various refinements and improvements since then:


  1. Early beginnings (1950s-1960s): The groundwork for decision tree algorithms can be traced back to Claude Shannon's development of information theory in the 1940s (the 1948 paper "A Mathematical Theory of Communication"). The entropy measure he introduced, now known as Shannon entropy, became one of the most widely used impurity criteria for node splits. The idea of using a tree-like structure with an information gain function to make decisions was inspired by this foundational work.



Shannon Entropy
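
For reference, for a node whose samples fall into k classes with proportions p_1, ..., p_k, Shannon entropy is commonly written as:

    H = -\sum_{i=1}^{k} p_i \log_2 p_i

A pure node (all samples from one class) has H = 0, while a node with evenly mixed classes has maximal entropy.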




  2. Development of key algorithms (1960s-1980s): From the 1960s through the 1980s, researchers developed several key decision tree algorithms.


    Some of the most notable include:

    • Chi-squared Automatic Interaction Detection (CHAID) - Gordon V. Kass developed the CHAID algorithm in 1980. CHAID uses the chi-squared test to measure the significance of the association between input features and the target variable.


    • Iterative Dichotomiser 3 (ID3) - Ross Quinlan, a computer scientist, introduced the ID3 algorithm in 1986. ID3 uses a greedy top-down approach and selects the best attribute to split the dataset based on information gain.


Information gain criterion used to find the best split at a node
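
For reference, the information gain of splitting a node S on attribute A is commonly defined as the resulting reduction in entropy:

    IG(S, A) = H(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|} H(S_v)

where S_v is the subset of S for which A takes value v; the split that maximizes IG(S, A) is chosen.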




  3. Classification and Regression Trees (CART) - Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone introduced the CART algorithm in 1984. CART uses the Gini impurity index to choose the optimal split, and it can handle both classification and regression tasks.

Gini Impurity Index
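
For a node with class proportions p_1, ..., p_k, the Gini impurity index is commonly written as:

    Gini = 1 - \sum_{i=1}^{k} p_i^2

Like entropy, it is zero for a pure node and largest when the classes are evenly mixed.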



  4. C4.5 - Ross Quinlan improved upon the ID3 algorithm by introducing the C4.5 algorithm in 1993. C4.5 addressed some of the limitations of ID3, such as overfitting and handling continuous features. It prunes the tree to reduce its complexity, uses normalized information gain (the gain ratio) as the splitting criterion, and handles missing values by distributing such instances to all child nodes with weights proportional to the number of instances in each child.
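
For reference, the gain ratio mentioned above is usually defined as the information gain normalized by the split's intrinsic information:

    GainRatio(S, A) = \frac{IG(S, A)}{-\sum_{v \in values(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}}

The denominator penalizes attributes that fragment the data into many small subsets.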


The algorithm

Idea and pseudo-implementation

It works by recursively splitting the dataset into subsets, at each node choosing the split that maximizes information gain, ultimately creating a tree-like structure.


As an example, here is a simplified version of the tree-building algorithm in pseudo-code (shown with binary threshold splits rather than ID3's multiway categorical splits):

def build_tree(data, depth=0, max_depth=None):
    # Create a node t
    t = Node()

    # Check the stopping criterion
    if stopping_criterion(data, depth, max_depth):
        # Assign a predictive model (majority class) to t
        t.predictive_model = majority_class(data)
    else:
        # Find the best binary split: data = data_left + data_right
        feature, threshold, data_left, data_right = best_binary_split(data)

        # Assign the chosen feature and threshold to the node
        t.feature = feature
        t.threshold = threshold

        # Recursively build the left and right subtrees
        t.left = build_tree(data_left, depth + 1, max_depth)
        t.right = build_tree(data_right, depth + 1, max_depth)

    return t


With the following helper functions to be defined according to your specific requirements (a minimal runnable sketch of these helpers is given after the list):


  1. Node(): A class or function to create a tree node object.


  2. stopping_criterion(data, depth, max_depth): A function to determine when to stop splitting the tree. It should take the current data, the current depth, and an optional maximum depth as input, and return a boolean value.


  3. majority_class(data): A function to assign the majority class (or another appropriate model) to a leaf node, given the data.


  4. best_binary_split(data): A function that takes the data as input and returns the best feature, threshold, and the two resulting datasets (left and right) after splitting based on impurity criterion and functionality.
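
As one possible illustration (not from the original article), here is a minimal sketch of these helpers in Python. It assumes the data passed around is an (X, y) tuple of a NumPy feature matrix and a label vector, and it uses Gini impurity as the split criterion; every name besides those appearing in the pseudo-code above is a hypothetical choice for this sketch:

import numpy as np
from collections import Counter


class Node:
    """Minimal tree node holding either a split or a leaf prediction."""
    def __init__(self):
        self.feature = None
        self.threshold = None
        self.left = None
        self.right = None
        self.predictive_model = None


def gini(y):
    # Gini impurity of a vector of class labels
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def stopping_criterion(data, depth, max_depth):
    X, y = data
    # Stop when the node is pure, too small, or the depth limit is reached
    return (len(np.unique(y)) == 1
            or len(y) < 2
            or (max_depth is not None and depth >= max_depth))


def majority_class(data):
    _, y = data
    # The most frequent class becomes the leaf prediction
    return Counter(y).most_common(1)[0][0]


def best_binary_split(data):
    X, y = data
    best, best_impurity = None, float("inf")
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            mask = X[:, feature] <= threshold
            if mask.all() or not mask.any():
                continue  # skip splits that leave one side empty
            # Weighted Gini impurity of the two child nodes
            impurity = mask.mean() * gini(y[mask]) + (~mask).mean() * gini(y[~mask])
            if impurity < best_impurity:
                best_impurity = impurity
                best = (feature, threshold, (X[mask], y[mask]), (X[~mask], y[~mask]))
    return best

With these definitions, build_tree((X, y), max_depth=3) returns the root of a fitted tree; a production implementation would add smarter candidate thresholds (e.g., midpoints between values) and guards for degenerate cases such as duplicate rows with conflicting labels.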

Modern usage and visualization thanks to scikit-learn


from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

clf = DecisionTreeClassifier(random_state=0)
iris = load_iris()

# Estimate generalization accuracy with 10-fold cross-validation
print(cross_val_score(clf, iris.data, iris.target, cv=10))

# Fit the tree on the full dataset before plotting it
clf.fit(iris.data, iris.target)
plt.figure(figsize=(12, 8))
plot_tree(clf, filled=True, feature_names=iris.feature_names, class_names=iris.target_names)
plt.show()


Use Specialties

Decision tree algorithms are particularly well suited to handling structured data and have several specialties that make them attractive for various use cases:


  1. Interpretability: Decision trees are easy to understand and interpret, as they mimic human decision-making processes by breaking down complex decisions into simpler ones. The tree structure is intuitive and can be visualized, making it easier for non-technical stakeholders to comprehend the model's logic.


  2. Handling of mixed data types: Decision trees can handle both categorical and numerical data without the need for preprocessing or transformation. This flexibility makes them convenient for real-world datasets that often contain a mix of data types.


  3. Feature selection: Decision trees perform implicit feature selection during the training process by selecting the most important features to split the data. This helps in reducing the complexity of the model and improving its interpretability.


  4. Non-parametric: Decision trees are non-parametric algorithms, meaning they do not make any assumptions about the underlying data distribution. This makes them more robust to outliers and enables them to model complex relationships between features and the target variable.


  5. Scalability: Decision tree algorithms can be easily scaled to handle large datasets, especially when using optimization techniques like pruning or random forests, which combine multiple trees to improve the overall performance.


  6. Insensitive to monotonic transformations: Decision trees are not affected by monotonic transformations of the input features, such as scaling or normalization (see the short sketch after this list). This simplifies the preprocessing steps and reduces the risk of introducing errors during data preparation.


  7. Easy to validate: Decision trees can be easily validated using statistical tests to assess the quality of the splits, providing an additional level of confidence in the model's performance.


  8. Handling missing values: Decision tree algorithms can handle missing data by using strategies like surrogate splits, which find alternative splitting criteria when data is missing for the primary attribute.
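
As an illustration of point 6 (not from the original article), the following sketch fits the same tree on raw and monotonically transformed features and compares predictions; the log1p transform and the iris dataset are just convenient choices for the demo:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target

# A monotonic (order-preserving) transformation of every feature
X_transformed = np.log1p(X)

tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_log = DecisionTreeClassifier(random_state=0).fit(X_transformed, y)

# The split thresholds differ, but the induced partitions (and hence the
# predictions) are expected to be identical, because only the ordering of
# feature values matters to the tree.
print((tree_raw.predict(X) == tree_log.predict(X_transformed)).all())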


Why trees are so important and where to find them

On its own, the decision tree algorithm is often grouped with simpler baseline models such as linear/logistic regression, KNN, and SVM, owing to its inherent limitations:


  • Prone to overfitting

  • Sensitive to small changes in data

  • May create biased trees if some classes dominate


Yet the true grandeur and might of decision tree foundations, much like those of their real-world counterparts, are revealed within sophisticated ensembles and a few crucial details:


  1. Isolation Forests: Isolation Forests, which are built from Isolation Trees, are an anomaly detection method. They use a unique approach to building decision trees, randomly selecting features and split values, which isolates outliers faster than normal instances (a short usage sketch follows this item).


    The decision tree structure is crucial for this method to efficiently detect anomalies.


    • How they are built
      • Randomly select a feature and split value between the minimum and maximum values of the selected feature.


      • Recursively partition the data based on the split value until all instances are isolated or a predefined tree depth is reached.


      • Repeat the process to build multiple trees, forming an Isolation Forest.


      • Calculate anomaly scores for each instance by averaging the path length (number of splits) across all trees in the forest; shorter average paths indicate likely anomalies, since outliers are easier to isolate.


    • Advantages over a single Decision Tree:
      • More robust to noise: Isolation Forests are less susceptible to noise, as they are built using random splits.


      • Efficient anomaly detection: They can detect anomalies faster than a single decision tree due to their isolation mechanism.


      • Reduced overfitting: The ensemble of trees in an Isolation Forest reduces the risk of overfitting compared to a single tree.
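
As a quick usage illustration (not part of the original article), scikit-learn's IsolationForest follows essentially this recipe; the toy data below is made up for the example:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# Mostly "normal" points around the origin, plus a few obvious outliers
X_normal = 0.3 * rng.randn(100, 2)
X_outliers = rng.uniform(low=-4, high=4, size=(10, 2))
X = np.vstack([X_normal, X_outliers])

forest = IsolationForest(n_estimators=100, random_state=42).fit(X)

# predict() returns +1 for inliers and -1 for anomalies;
# score_samples() returns scores where lower values indicate more anomalous points
print(forest.predict(X[-10:]))
print(forest.score_samples(X[-10:]))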


  2. Random Forests: Random Forests are an ensemble of decision trees that are built by bootstrapping the dataset and selecting random subsets of features at each node. This method reduces overfitting and improves generalization compared to a single decision tree. The decision tree algorithm serves as the building block for each tree in the ensemble (a usage sketch follows this item).


    • How they are built:
      • Create multiple bootstrap samples (with replacement) from the original dataset.

      • For each bootstrap sample, construct a decision tree by selecting a random subset of features at each node and choosing the best split.

      • Combine the trees by averaging their predictions for regression tasks or by taking a majority vote for classification tasks.


    • Advantages over a single Decision Tree:
      • Improved accuracy: Random Forests usually yield better results due to the diversity of trees in the ensemble.


      • Reduced overfitting: The process of bootstrapping and feature selection reduces the risk of overfitting.


      • Better handling of imbalanced data: Random Forests can balance class distribution by using class weights or by oversampling underrepresented classes.

      • Robust to noise: The ensemble approach makes Random Forests more robust to noise and outliers.
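
As a brief illustration (not from the original article), scikit-learn's RandomForestClassifier implements this bootstrap-plus-random-feature-subset recipe; the iris dataset and the hyperparameter values below are arbitrary choices for the demo:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

iris = load_iris()

# n_estimators = number of bootstrapped trees; max_features controls the
# random subset of features considered at each split
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)

print(cross_val_score(forest, iris.data, iris.target, cv=10).mean())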


  3. Boosted Trees: Boosting techniques build an ensemble of decision trees iteratively, with each new tree trained to correct the errors made by the trees before it. The decision tree algorithm is the core learner in these models, and its ability to capture complex relationships and interactions is key to the success of these methods (a minimal sketch follows this item).


    • How they are built:
      • Initialize the model with a constant prediction value.

      • For a predefined number of iterations:

        • Compute the negative gradients (pseudo-residuals) of the loss function with respect to the current model predictions.
        • Fit a decision tree to the pseudo-residuals.
        • Update the model by adding the newly fitted tree, multiplied by a learning rate.
      • Combine the trees by summing their weighted predictions.


    • Advantages over a single decision tree:
      • Higher accuracy: Boosted Trees often achieve better performance due to the iterative process of correcting errors made by previous trees.


      • Reduced overfitting: Regularization techniques, such as shrinkage (learning rate) and early stopping, help prevent overfitting.


      • Better handling of imbalanced data: Boosted Trees can handle class imbalance by adjusting the loss function or by applying instance weighting.


      • Flexibility: Boosted Trees can work with various loss functions, making them suitable for different tasks (e.g., classification, regression, ranking).
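
To make the iterative recipe above concrete, here is a minimal sketch (not from the original article, and not any particular library's implementation) of gradient boosting for regression with squared-error loss, where the negative gradient is simply the residual; the synthetic data and hyperparameter values are arbitrary:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.randn(200)

learning_rate = 0.1
n_rounds = 100

# 1. Initialize the model with a constant prediction (the mean, for squared loss)
prediction = np.full_like(y, y.mean())
trees = []

for _ in range(n_rounds):
    # 2. Negative gradient of the squared-error loss = residuals
    residuals = y - prediction
    # 3. Fit a small tree to the pseudo-residuals
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    # 4. Update the model with the new tree, shrunk by the learning rate
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

def boosted_predict(X_new):
    # Sum of the constant initialization and the weighted tree predictions
    return y.mean() + learning_rate * sum(t.predict(X_new) for t in trees)

print(np.mean((boosted_predict(X) - y) ** 2))  # training MSE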


  4. Feature Importance: The primary goal of feature importance is to rank the variables in the dataset based on their contribution to the prediction accuracy or information gain (a short example follows this item).


    This information can be used for:

    • model interpretation, explaining the rationale behind a model's predictions and shedding light on the underlying data


    • feature selection, identifying the most relevant features in a dataset and enabling the development of simpler and more efficient models
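
As a short illustration (not from the original article), tree-based scikit-learn models expose such a ranking through the feature_importances_ attribute; the iris dataset is used here only for convenience:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(iris.data, iris.target)

# Impurity-based importances, normalized to sum to 1
for name, importance in sorted(zip(iris.feature_names, forest.feature_importances_),
                               key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {importance:.3f}")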

In Conclusion

Decision Trees have come a long way since their inception in the 1960s. Their versatility, interpretability, and ability to handle diverse data types have made them a popular choice for various applications across numerous fields. While standalone decision trees may suffer from limitations such as overfitting and sensitivity to small data changes, their true potential is harnessed within ensemble methods like Isolation Forests, Random Forests, and Boosted Trees. These advanced techniques leverage the power of decision trees to achieve improved accuracy, robustness, and generalizability, and they have become some of the most popular methods in practice, from Kaggle competitions to corporate development. As machine learning continues to evolve, understanding and utilizing these timeless classics will remain invaluable for both researchers and practitioners in the field.