
The Evolution of Decision Trees: From Shannon Entropy to Modern Applications and Specialties

by Valentine Shkulov, April 14th, 2023

Too Long; Didn't Read

Decision trees are a versatile and interpretable machine learning technique, with roots in the early 1960s. Modern decision tree algorithms have evolved to address limitations like overfitting and handling continuous features. The true potential of decision trees is realized within ensemble methods like Isolation Forests, Random Forests, and Boosted Trees, which improve accuracy, robustness, and generalizability. Understanding these timeless classics is essential for researchers and practitioners in the field.


How it started…

The journey toward today's Kaggle-winning approaches started in the mid-20th century, and decision trees have undergone various refinements and improvements since then:


  1. Early beginnings (1950s-1960s): The groundwork for decision tree algorithms can be traced back to Claude Shannon's development of information theory in the 1940s (the 1948 paper "A Mathematical Theory of Communication"). The entropy measure he introduced, now known as Shannon entropy, became one of the most widely used impurity criteria for node splits. The idea of using a tree-like structure with an information gain function to make decisions was inspired by this foundational work.



Shannon Entropy
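
For reference, for a node whose samples fall into k classes with proportions p_1, ..., p_k, Shannon entropy is commonly written as:

    H = -\sum_{i=1}^{k} p_i \log_2 p_i

A pure node (all samples from one class) has H = 0, while a node with evenly mixed classes has maximal entropy.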




  2. Development of key algorithms (1960s-1980s): From the 1960s through the 1980s, researchers developed several key decision tree algorithms.


    Some of the most notable include:

    • Chi-squared Automatic Interaction Detection (CHAID) - Gordon V. Kass developed the CHAID algorithm in 1980. CHAID uses the chi-squared test to measure the significance of the association between input features and the target variable.


    • Iterative Dichotomiser 3 (ID3) - Ross Quinlan, a computer scientist, introduced the ID3 algorithm in 1986. ID3 uses a greedy top-down approach and selects the best attribute to split the dataset based on information gain.


Information gain criterion used to find the best split at a node
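
For reference, the information gain of splitting a node S on attribute A is commonly defined as the resulting reduction in entropy:

    IG(S, A) = H(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|} H(S_v)

where S_v is the subset of S for which A takes value v; the split that maximizes IG(S, A) is chosen.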




  3. Classification and Regression Trees (CART) - Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone introduced the CART algorithm in 1984. CART uses the Gini impurity index to choose the optimal split, and it can handle both classification and regression tasks.

Gini Impurity Index
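
For a node with class proportions p_1, ..., p_k, the Gini impurity index is commonly written as:

    Gini = 1 - \sum_{i=1}^{k} p_i^2

Like entropy, it is zero for a pure node and largest when the classes are evenly mixed.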



  4. C4.5 - Ross Quinlan improved upon the ID3 algorithm by introducing the C4.5 algorithm in 1993. C4.5 addressed some of the limitations of ID3, such as overfitting and handling continuous features. It prunes the tree to reduce its complexity, uses normalized information gain (the gain ratio) as the splitting criterion, and handles missing values by distributing such instances to all child nodes with weights proportional to the number of instances in each child.
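
For reference, the gain ratio mentioned above is usually defined as the information gain normalized by the split's intrinsic information:

    GainRatio(S, A) = \frac{IG(S, A)}{-\sum_{v \in values(A)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}}

The denominator penalizes attributes that fragment the data into many small subsets.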


The algorithm

Idea and pseudo-implementation

It works by recursively splitting the dataset into subsets, at each node choosing the split that maximizes information gain, ultimately creating a tree-like structure.


As an example, here is a simplified version of the tree-building algorithm in pseudo-code (shown with binary threshold splits rather than ID3's multiway categorical splits):

def build_tree(data, depth=0, max_depth=None):
    # Create a node t
    t = Node()

    # Check the stopping criterion
    if stopping_criterion(data, depth, max_depth):
        # Assign a predictive model (majority class) to t
        t.predictive_model = majority_class(data)
    else:
        # Find the best binary split: data = data_left + data_right
        feature, threshold, data_left, data_right = best_binary_split(data)

        # Assign the chosen feature and threshold to the node
        t.feature = feature
        t.threshold = threshold

        # Recursively build the left and right subtrees
        t.left = build_tree(data_left, depth + 1, max_depth)
        t.right = build_tree(data_right, depth + 1, max_depth)

    return t


With the following helper functions to be defined according to your specific requirements (a minimal runnable sketch of these helpers is given after the list):


  1. Node(): A class or function to create a tree node object.


  2. stopping_criterion(data, depth, max_depth): A function to determine when to stop splitting the tree. It should take the current data, the current depth, and an optional maximum depth as input, and return a boolean value.


  3. majority_class(data): A function to assign the majority class (or another appropriate model) to a leaf node, given the data.


  4. best_binary_split(data): A function that takes the data as input and returns the best feature, threshold, and the two resulting datasets (left and right) after splitting based on impurity criterion and functionality.
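
As one possible illustration (not from the original article), here is a minimal sketch of these helpers in Python. It assumes the data passed around is an (X, y) tuple of a NumPy feature matrix and a label vector, and it uses Gini impurity as the split criterion; every name besides those appearing in the pseudo-code above is a hypothetical choice for this sketch:

import numpy as np
from collections import Counter


class Node:
    """Minimal tree node holding either a split or a leaf prediction."""
    def __init__(self):
        self.feature = None
        self.threshold = None
        self.left = None
        self.right = None
        self.predictive_model = None


def gini(y):
    # Gini impurity of a vector of class labels
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def stopping_criterion(data, depth, max_depth):
    X, y = data
    # Stop when the node is pure, too small, or the depth limit is reached
    return (len(np.unique(y)) == 1
            or len(y) < 2
            or (max_depth is not None and depth >= max_depth))


def majority_class(data):
    _, y = data
    # The most frequent class becomes the leaf prediction
    return Counter(y).most_common(1)[0][0]


def best_binary_split(data):
    X, y = data
    best, best_impurity = None, float("inf")
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            mask = X[:, feature] <= threshold
            if mask.all() or not mask.any():
                continue  # skip splits that leave one side empty
            # Weighted Gini impurity of the two child nodes
            impurity = mask.mean() * gini(y[mask]) + (~mask).mean() * gini(y[~mask])
            if impurity < best_impurity:
                best_impurity = impurity
                best = (feature, threshold, (X[mask], y[mask]), (X[~mask], y[~mask]))
    return best

With these definitions, build_tree((X, y), max_depth=3) returns the root of a fitted tree; a production implementation would add smarter candidate thresholds (e.g., midpoints between values) and guards for degenerate cases such as duplicate rows with conflicting labels.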

Modern usage and visualization thanks to scikit-learn


from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

clf = DecisionTreeClassifier(random_state=0)
iris = load_iris()

# Estimate generalization accuracy with 10-fold cross-validation
print(cross_val_score(clf, iris.data, iris.target, cv=10))

# Fit the tree on the full dataset before plotting it
clf.fit(iris.data, iris.target)
plt.figure(figsize=(12, 8))
plot_tree(clf, filled=True, feature_names=iris.feature_names, class_names=iris.target_names)
plt.show()


Use Specialties

Decision tree algorithms are particularly well suited to handling structured data and have several specialties that make them attractive for various use cases:


  1. Interpretability: Decision trees are easy to understand and interpret, as they mimic human decision-making processes by breaking down complex decisions into simpler ones. The tree structure is intuitive and can be visualized, making it easier for non-technical stakeholders to comprehend the model's logic.


  2. Handling of mixed data types: Decision trees can handle both categorical and numerical data without the need for preprocessing or transformation. This flexibility makes them convenient for real-world datasets that often contain a mix of data types.


  3. Feature selection: Decision trees perform implicit feature selection during the training process by selecting the most important features to split the data. This helps in reducing the complexity of the model and improving its interpretability.


  4. Non-parametric: Decision trees are non-parametric algorithms, meaning they do not make any assumptions about the underlying data distribution. This makes them more robust to outliers and enables them to model complex relationships between features and the target variable.


  5. Scalability: Decision tree algorithms can be easily scaled to handle large datasets, especially when using optimization techniques like pruning or random forests, which combine multiple trees to improve the overall performance.


  6. Insensitive to monotonic transformations: Decision trees are not affected by monotonic transformations of the input features, such as scaling or normalization (see the short sketch after this list). This simplifies the preprocessing steps and reduces the risk of introducing errors during data preparation.


  7. Easy to validate: Decision trees can be easily validated using statistical tests to assess the quality of the splits, providing an additional level of confidence in the model's performance.


  8. Handling missing values: Decision tree algorithms can handle missing data by using strategies like surrogate splits, which find alternative splitting criteria when data is missing for the primary attribute.
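
As an illustration of point 6 (not from the original article), the following sketch fits the same tree on raw and monotonically transformed features and compares predictions; the log1p transform and the iris dataset are just convenient choices for the demo:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target

# A monotonic (order-preserving) transformation of every feature
X_transformed = np.log1p(X)

tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_log = DecisionTreeClassifier(random_state=0).fit(X_transformed, y)

# The split thresholds differ, but the induced partitions (and hence the
# predictions) are expected to be identical, because only the ordering of
# feature values matters to the tree.
print((tree_raw.predict(X) == tree_log.predict(X_transformed)).all())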


Why trees are so important and where to find them

On its own, the decision tree algorithm is often grouped with simpler baseline models such as linear/logistic regression, KNN, and SVM, owing to its inherent limitations:


  • Prone to overfitting

  • Sensitive to small changes in data

  • May create biased trees if some classes dominate


Yet the true grandeur and might of decision tree foundations, much like those of their real-world counterparts, are revealed within sophisticated ensembles and a few crucial details:


  1. Isolation Forests: Isolation Forests, which are built from Isolation Trees, are an anomaly detection method. They use a unique approach to building decision trees, randomly selecting features and split values, which isolates outliers faster than normal instances (a short usage sketch follows this item).


    The decision tree structure is crucial for this method to efficiently detect anomalies.


    • How they are built
      • Randomly select a feature and split value between the minimum and maximum values of the selected feature.


      • Recursively partition the data based on the split value until all instances are isolated or a predefined tree depth is reached.


      • Repeat the process to build multiple trees, forming an Isolation Forest.


      • Calculate anomaly scores for each instance by averaging the path length (number of splits) across all trees in the forest; shorter average paths indicate likely anomalies, since outliers are easier to isolate.


    • Advantages over a single Decision Tree:
      • More robust to noise: Isolation Forests are less susceptible to noise, as they are built using random splits.


      • Efficient anomaly detection: They can detect anomalies faster than a single decision tree due to their isolation mechanism.


      • Reduced overfitting: The ensemble of trees in an Isolation Forest reduces the risk of overfitting compared to a single tree.
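
As a quick usage illustration (not part of the original article), scikit-learn's IsolationForest follows essentially this recipe; the toy data below is made up for the example:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# Mostly "normal" points around the origin, plus a few obvious outliers
X_normal = 0.3 * rng.randn(100, 2)
X_outliers = rng.uniform(low=-4, high=4, size=(10, 2))
X = np.vstack([X_normal, X_outliers])

forest = IsolationForest(n_estimators=100, random_state=42).fit(X)

# predict() returns +1 for inliers and -1 for anomalies;
# score_samples() returns scores where lower values indicate more anomalous points
print(forest.predict(X[-10:]))
print(forest.score_samples(X[-10:]))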


  2. Random Forests: Random Forests are an ensemble of decision trees that are built by bootstrapping the dataset and selecting random subsets of features at each node. This method reduces overfitting and improves generalization compared to a single decision tree. The decision tree algorithm serves as the building block for each tree in the ensemble (a usage sketch follows this item).


    • How they are built:
      • Create multiple bootstrap samples (with replacement) from the original dataset.

      • For each bootstrap sample, construct a decision tree by selecting a random subset of features at each node and choosing the best split.

      • Combine the trees by averaging their predictions for regression tasks or by taking a majority vote for classification tasks.


    • Advantages over a single Decision Tree:
      • Improved accuracy: Random Forests usually yield better results due to the diversity of trees in the ensemble.


      • Reduced overfitting: The process of bootstrapping and feature selection reduces the risk of overfitting.


      • Better handling of imbalanced data: Random Forests can balance class distribution by using class weights or by oversampling underrepresented classes.

      • Robust to noise: The ensemble approach makes Random Forests more robust to noise and outliers.
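
As a brief illustration (not from the original article), scikit-learn's RandomForestClassifier implements this bootstrap-plus-random-feature-subset recipe; the iris dataset and the hyperparameter values below are arbitrary choices for the demo:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

iris = load_iris()

# n_estimators = number of bootstrapped trees; max_features controls the
# random subset of features considered at each split
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)

print(cross_val_score(forest, iris.data, iris.target, cv=10).mean())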


  3. Boosted Trees: Boosting techniques build an ensemble of decision trees iteratively, with each new tree trained to correct the errors made by the trees before it. The decision tree algorithm is the core learner in these models, and its ability to capture complex relationships and interactions is key to the success of these methods (a minimal sketch follows this item).


    • How they are built:
      • Initialize the model with a constant prediction value.

      • For a predefined number of iterations:

        • Compute the negative gradients (pseudo-residuals) of the loss function with respect to the current model predictions.
        • Fit a decision tree to the pseudo-residuals.
        • Update the model by adding the newly fitted tree, multiplied by a learning rate.
      • Combine the trees by summing their weighted predictions.


    • Advantages over a single decision tree:
      • Higher accuracy: Boosted Trees often achieve better performance due to the iterative process of correcting errors made by previous trees.


      • Reduced overfitting: Regularization techniques, such as shrinkage (learning rate) and early stopping, help prevent overfitting.


      • Better handling of imbalanced data: Boosted Trees can handle class imbalance by adjusting the loss function or by applying instance weighting.


      • Flexibility: Boosted Trees can work with various loss functions, making them suitable for different tasks (e.g., classification, regression, ranking).
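
To make the iterative recipe above concrete, here is a minimal sketch (not from the original article, and not any particular library's implementation) of gradient boosting for regression with squared-error loss, where the negative gradient is simply the residual; the synthetic data and hyperparameter values are arbitrary:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.randn(200)

learning_rate = 0.1
n_rounds = 100

# 1. Initialize the model with a constant prediction (the mean, for squared loss)
prediction = np.full_like(y, y.mean())
trees = []

for _ in range(n_rounds):
    # 2. Negative gradient of the squared-error loss = residuals
    residuals = y - prediction
    # 3. Fit a small tree to the pseudo-residuals
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    # 4. Update the model with the new tree, shrunk by the learning rate
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

def boosted_predict(X_new):
    # Sum of the constant initialization and the weighted tree predictions
    return y.mean() + learning_rate * sum(t.predict(X_new) for t in trees)

print(np.mean((boosted_predict(X) - y) ** 2))  # training MSE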


  4. Feature Importance: The primary goal of feature importance is to rank the variables in the dataset based on their contribution to the prediction accuracy or information gain (a short example follows this item).


    This information can be used for:

    • model interpretation, explaining the rationale behind a model's predictions and shedding light on the underlying data


    • feature selection, identifying the most relevant features in a dataset and enabling the development of simpler and more efficient models
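
As a short illustration (not from the original article), tree-based scikit-learn models expose such a ranking through the feature_importances_ attribute; the iris dataset is used here only for convenience:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(iris.data, iris.target)

# Impurity-based importances, normalized to sum to 1
for name, importance in sorted(zip(iris.feature_names, forest.feature_importances_),
                               key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {importance:.3f}")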

In Conclusion

Decision Trees have come a long way since their inception in the 1960s. Their versatility, interpretability, and ability to handle diverse data types have made them a popular choice for various applications across numerous fields. While standalone decision trees may suffer from limitations such as overfitting and sensitivity to small data changes, their true potential is harnessed within ensemble methods like Isolation Forests, Random Forests, and Boosted Trees. These advanced techniques leverage the power of decision trees to achieve improved accuracy, robustness, and generalizability, and they have become some of the most popular methods in practice, from Kaggle competitions to corporate development. As machine learning continues to evolve, understanding and utilizing these timeless classics will remain invaluable for both researchers and practitioners in the field.