Decision trees are a method used in machine learning and data analysis for building decision-making models with a tree-shaped hierarchy. In each node of the tree a certain criterion is checked, and the leaves represent final decisions or predicted outcomes. Such trees are highly interpretable and can be used for both classification and regression tasks.
Digging a bit deeper into this theory, it’s pretty easy to find out that:
Nodes can be split not just by one criterion but by several as well.
It is not necessary to limit yourself to the splitting criteria implemented in the commonly used scikit-learn library for branching (a basic scikit-learn tree with one of its built-in criteria is shown right after this list).
A node can be split into more than two branches.
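For reference, here is a minimal example of the regular tree that the rest of the article compares against. It uses scikit-learn’s DecisionTreeClassifier on the Iris dataset; the depth limit and the choice of the “gini” criterion are arbitrary illustration choices, not recommendations. Notice that every printed rule tests a single feature against a threshold.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Load a small toy dataset with four numeric features.
data = load_iris()
X, y = data.data, data.target

# "gini" is one of the built-in impurity criteria; the depth limit of 3
# is only there to keep the printed tree short.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)

# Every rule below tests a single feature against a threshold,
# i.e. the decision boundaries are axis-parallel.
print(export_text(tree, feature_names=data.feature_names))
```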
A logical question arises: isn’t there anything better than regular decision trees? And the answer is: yes, there is, but with its own nuances.
Despite the great popularity of deep learning, researchers are still actively working on improving decision trees. Among these “upgrades” are so-called oblique decision trees. In a regular tree, each node has a splitting rule based on a single feature, while in oblique trees each node contains a rule based on a combination of several features. The simplest example of such a combination is a linear one, so let’s take a closer look at it.
Oblique decision trees allow splitting a node on a linear combination of features instead of a single feature. This can improve the generalization performance of the model, because a single oblique node can capture what would otherwise require several single-feature splits.
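To make the difference concrete, here is a small illustrative sketch (not taken from the paper discussed later): an axis-parallel rule tests one feature against a threshold, while an oblique rule tests a weighted sum of features. The weight vector w and threshold t are hypothetical values that a learning algorithm would have to fit.

```python
import numpy as np

# Purely illustrative: w and t are hypothetical values that a learning
# algorithm (e.g. LDA, as discussed below) would have to fit.

def axis_parallel_split(x, feature_index, threshold):
    """Regular CART-style rule: test one feature against a threshold."""
    return x[feature_index] <= threshold

def oblique_split(x, w, t):
    """Oblique rule: test a weighted sum of features, i.e. a tilted hyperplane."""
    return np.dot(w, x) <= t

x = np.array([0.4, 0.7])  # a sample with two features
print(axis_parallel_split(x, feature_index=0, threshold=0.5))   # True
print(oblique_split(x, w=np.array([1.0, 1.0]), t=1.0))          # 1.1 <= 1.0 -> False
```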
Let us imagine that we have a classification task based on two features, with the training samples distributed as in the image below (different classes are marked with different colors).
Understandably, researchers were eager to identify such situations and divide the space with tilted, or oblique, straight lines that represent a linear combination of features. In regular trees, the splitting rules correspond graphically to horizontal or vertical straight lines. Oblique trees, on the other hand, can create diagonal decision boundaries in feature space, which may be more appropriate for certain datasets. It is because of these oblique division lines that such trees are called oblique.
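A quick way to see this effect in code: on synthetic data separated by the diagonal line x1 + x2 = 1, a regular scikit-learn tree has to approximate the boundary with a staircase of axis-parallel splits, while a single hand-made “oblique” feature (the sum x1 + x2) separates the classes with one split. The dataset and the feature choice here are my own, purely for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Two features, classes separated by the diagonal line x1 + x2 = 1.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(1000, 2))
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# A regular tree approximates the diagonal with a staircase of
# axis-parallel splits and therefore grows deep.
axis_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print("depth on raw features:", axis_tree.get_depth())

# Hand the tree the linear combination x1 + x2 as a single feature
# (a hand-made "oblique" direction) and one split is enough.
X_combined = (X[:, 0] + X[:, 1]).reshape(-1, 1)
oblique_like_tree = DecisionTreeClassifier(random_state=0).fit(X_combined, y)
print("depth on the combined feature:", oblique_like_tree.get_depth())
```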
Oblique decision trees can be used in various fields of machine learning and data analysis, where regular decision trees don’t bring significant results. Here are some examples of such areas:
Complex data spaces. In datasets where decision boundaries are not axis-parallel, oblique trees may be more effective at separating classes or detecting nonlinear dependencies.
High-dimensional data. In situations where you have many attributes and some of them may be correlated, oblique trees can provide more accurate models because they can account for combinations of attributes.
Areas that require interpretability. Oblique trees may be more complex than regular decision trees, but they still retain a tree structure that can be visualized and interpreted.
Classification and regression tasks. Similar to standard decision trees, oblique trees can be used for both classification and regression tasks.
It is worth noting that, as a rule, oblique trees are lower in height, yet they require significantly more training time than regular trees. But if you are willing to spend more time on training in exchange for better-quality predictions, you can experiment with them as well.
Considering random trees (when a split occurs in a tree node, a random subset of features is selected as candidates for the split rather than all available features), we can tell that the methodology is well developed. There are algorithms for building optimal random trees, but in practice, methods such as random forests and, to an even greater extent, boosting outperform individual trees. For a random forest to perform effectively, it is important that its trees are uncorrelated and have good generalization performance. Under these conditions, oblique trees show statistically significantly superior results.
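As a reminder of how that per-split feature subsampling looks in practice, here is a minimal scikit-learn random forest. The dataset is synthetic and the parameter values are arbitrary; the point is only the max_features argument, which controls how many randomly chosen features are considered at each split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data: 20 features, only 5 of them informative.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

# max_features controls how many randomly chosen features are considered
# as split candidates at every node; subsampling them ("sqrt" here) is
# what helps decorrelate the individual trees.
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                random_state=0)
print("cross-validated accuracy:", cross_val_score(forest, X, y, cv=5).mean())
```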
Let’s take a closer look at the method of building DRaF-LDA (rotation-based double random forest with linear discriminant analysis). It differs from a classic random forest built on CART (Classification and Regression Trees) in two ways.
Bootstrapping. In the classic algorithm, bootstrapping is performed once at the initial step; the tree is then built on the resulting dataset, and the sample shrinks at each step down the tree. In DRaF, bootstrapping is performed directly at each individual node, over the data that satisfies the constraints imposed by the previous splits.
Node splitting. In the case of CART, a subset of features is selected for node splitting, and the best possible split among them is chosen. In DRaF-LDA, as is easy to guess, the Linear Discriminant Analysis (LDA) algorithm is used for node partitioning: LDA finds a linear combination of features, which allows for better splits than any single feature. A rough code sketch combining both ideas is given below.
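The following is a rough sketch of those two ideas, not the authors’ implementation: the stopping rules, the median threshold on the LDA projection, and the use of scikit-learn’s LinearDiscriminantAnalysis are my own simplifications. A real implementation would choose the split threshold by an impurity criterion and grow many such trees into a forest.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


def build_draf_lda_node(X, y, depth=0, max_depth=5, rng=None):
    """Grow one tree in the spirit of DRaF-LDA: bootstrap at every node,
    then split on a one-dimensional LDA projection of the features."""
    if rng is None:
        rng = np.random.default_rng(0)

    # Stop on purity, depth, or small node size; leaves store the majority class.
    if depth >= max_depth or len(y) < 10 or len(np.unique(y)) == 1:
        return {"leaf": True, "label": int(np.bincount(y).argmax())}

    # (1) Node-level bootstrap of the samples that reached this node.
    idx = rng.integers(0, len(y), size=len(y))
    Xb, yb = X[idx], y[idx]
    if len(np.unique(yb)) == 1:
        return {"leaf": True, "label": int(np.bincount(y).argmax())}

    # (2) Oblique split: project onto the first LDA direction fitted on the
    #     bootstrap sample. A real implementation would choose the threshold
    #     by an impurity criterion; the median projection is a stand-in here.
    lda = LinearDiscriminantAnalysis(n_components=1).fit(Xb, yb)
    proj = lda.transform(X)[:, 0]
    threshold = float(np.median(lda.transform(Xb)[:, 0]))
    go_left = proj <= threshold
    if go_left.all() or not go_left.any():        # degenerate split -> leaf
        return {"leaf": True, "label": int(np.bincount(y).argmax())}

    return {
        "leaf": False, "lda": lda, "threshold": threshold,
        "left": build_draf_lda_node(X[go_left], y[go_left], depth + 1, max_depth, rng),
        "right": build_draf_lda_node(X[~go_left], y[~go_left], depth + 1, max_depth, rng),
    }


def predict_one(node, x):
    """Route a single sample down the tree along its oblique splits."""
    while not node["leaf"]:
        left = node["lda"].transform(x.reshape(1, -1))[0, 0] <= node["threshold"]
        node = node["left"] if left else node["right"]
    return node["label"]


# Tiny smoke test on synthetic data (a full DRaF-LDA would grow a whole
# forest of such trees and average their votes).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
tree = build_draf_lda_node(X, y)
preds = np.array([predict_one(tree, x) for x in X])
print("training accuracy:", (preds == y).mean())
```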
Using these two “deviations” from the classic tree-building algorithm allows a random forest to generate trees that are both higher in quality and uncorrelated, which translates into a quality gain for the forest as a whole.
There are numerous approaches to forming linear combinations of features for splitting.
To learn more about the existing variants, you can read the article titled “Oblique and rotation double random forest,” written by M.A. Ganaie, M. Tanveer, P.N. Suganthan, and V. Snasel.
In the article, the researchers propose a new variant of the forest (and the trees it is built from) that shows a statistically significant improvement in prediction quality compared to the common implementation.
Research on classic machine learning algorithms is still ongoing and sometimes produces interesting advances. This is why it’s better not to stick to the default libraries, but instead to keep studying the subject area and choosing the most suitable algorithms for your tasks. Perhaps it is oblique trees that will work well in your case.