HDTree: A Customizable and Interactable Decision Tree Written in Python

Introducing a customizable and interactable Decision Tree-Framework written in Python

Fast Track

Repository: https://github.com/Mereep/HDTree
Complementary Notebook: inside
```
examples
```
directory of the repository or directly here (every illustration you see here will be generated in the notebook. You will be able to create them on your own)

What’s inside the story?

This story will introduce yet another implementation of Decision Trees, which I wrote as part of my thesis. The work will be divided into three chapters as follows:

Firstly, I will try to motivate why I have decided to take my time to come up with an own implementation of Decision Trees; I will list some of its features but also will list the disadvantages of the current implementation.

Secondly, I will guide you through the basic usage of HDTree using code snippets and explaining some details along the way.

Lastly, there will be some hints on how to customize and extend the HDTree with your own chunks of ideas.

However, this article will not guide you through all of the basics of Decision Trees. There are really plenty of resources out there [1][2][3][16]. I think there is no need in repeating all of that again. Others have done that. I will not be able to do it better. You don’t need to be an expert in Decision Trees to understand this article. A basic level of understanding should be sufficient to follow up. Some experience in the ML domain is a plus, though.

Motivation & Background

For my work I came along working with Decision Trees. My actual goal is to implement an human-centric ML-model, where HDTree (Human Decision Tree for that matter) is an optional ingredient which is used as part of an actual user interface for that model. While this story solely focuses on HDTree, I might write a follow-up describing the other components in detail.

Features of HDTree & Comparison with scikit learn Decision Trees

Naturally, I stumbled upon the scikit-learn-implementation⁴ of decision trees. I guess many practitioners do. And lets make something clear from the beginning: nothing is wrong with it.

The sckit-learn implementation has a lot of pros:

it’s fast & optimized
The implementation is written in a dialect called Cython⁵. Cython compiles to C-Code (which in turn compiles to native code) while maintaining interoperability with the Python interpreter.
it’s easy and to use and convinient
Many people in the ML-domain know how to work with scikit-learn models. You will easily find help everywhere due to its user base.
it’s battle tested (a lot of people are using it)
It just works
it supports many pre-pruning and post-pruning [6] methods and provides many features (e.g., Minimal Cost-Complexity Pruning³ or sample weights)
it support basic visualization [7]

That said, surely it also has some shortcomings:

t’s not trivial to modify, partly due to the usage of the rather uncommon Cython dialect (see advantages above)
no way to incorporate user knowledge about the domain or to modify the learning process
the visualization is rather minimalistic
no support for categorical attributes / features
no support for missing values
interface for accessing nodes and traversing the tree is cumbersome an not intuitive
no support for missing values
only binary splits (see later)
no multivariate splits (see later)

Features HDTree

HDTree comes with a solution to most of the shortcomings mentioned in the above list, while sacrificing many of the advantages of the scikit-learn implementation. We will come back to those points later, so don’t worry if you do not understand every part of the following list yet:

👍 interact with the learning-behavior
👍 core components are modular and fairly easy to extend (implement an interface)
👍 purely written in Python (more approachable)
👍 rich visualization
👍 support categorical data
👍 support for missing values
👍 support for multivariate splits
👍 easy interface to navigate through the tree structure
👍 supports for n-ary splits (> 2 child nodes)
👍 textual representations of decision paths
👍 encourages explainability by printing human-readable text
-------------------------------------------------------------------------------------------------
👎 slow
👎 not battle-tested (it will have bugs)
👎 mediocre software quality
👎 not so many pruning options (it supports some basic options, though)

⚠️ Although the disadvantages seem to be not too numerous, they are critical. Let us make that clear right away: Do not throw big data at it. You will wait forever. Do not use it in production. It may break unexpectedly. You have been warned!⚠️

Some of these problems may get fixed over time. However, the training speed probably will remain slow (inference is okay, though). You will have to come up with a better solution to fix that. You are very welcome to contribute 😃.

That said, what would be possible use cases?

extract knowledge from your data
test the intuition you have about your data
understand the inner workings of decision trees
explore alternative causal relationships regarding to your learning problem
use it as part of your more complex algorithms
create reports and visualizations
use it for any research-related purposes
have an accessible platform to easily test your idea idea for decision tree algorithms

Structure of Decision Trees

Although this work will not review decision trees in detail, we will recap their basic building blocks. This will provide a basis to understand the examples later and also highlights some of HDTrees features. The following graphic is an actual output of the HDTree (except for the markers)

a (Nodes)

ai: textual description of the test/split rule which was used at this node to separate the data into its children. Displays the relevant attribute(s) and the verbal description of the operation. Those tests are highly configurable and can include any data-separating logic you can come up with. Development of your own custom rules is supported by implementing an interface. More on that in section 3.

aii: the score of the node measures its pureness, i.e., how good the data that passes through the node is separated. The higher the score, the better. High scores are also represented by the nodes’ background color. The more greenish, the higher the score (white means unpure, i.e. equally distributed classes). Those scores direct the induction (building) process of the tree and are a modular and exchangeable component of the HDTree.

aiii: the nodes’ border indicates how many data points are passing through this node. The thicker the border the more data is passing through that node.

aiv: list of the targets / labels (i.e., the prediction goal) which the data points passing that node have. The most common class is marked with a ✔.

av: optionally a visualization can mark the path that an individual data points follows (illustrate the decision that are made when the data point passes the tree). This is marked with a line at the corner of the decision tree.

b (Edges)

bi: the arrow connect each possible outcome of the split (ai) with its child nodes. The more of the data (relatively to the parent) is ‘flowing’ along the edge, the thicker it is drawn.

bii: each edge has a readable textual representation of the splits’ corresponding outcome.

Why different splits / tests?

At this point of reading you might already wonder, what is different from the HDTree to a scikit-learn tree (or any other implementation) and why we might want to have different kinds of splits? Let’s try to make this clear.

Maybe you have an intuitive understanding of the notion of feature space. All the data we are working with lies in some multi-dimensional space which is defined by the amount and the type of the attributes / features your data have.

The task of a classification algorithm now is to partition this space into non-overlapping regions and assigning those region a class / target / outcome, whatever you like to call it. Let's visualize that. Since our brains have troubles fiddling with high dimensionality, we will stick with a 2-dimensional example and a very simple 2-class problem as follows:

You see a very simple data set, which is made of two dimensions (features / attributes) and two classes. The generated data points were normal distributed at the center. A street which is just the linear function

f(x) = y

separates those two classes into Class 1 (South East) and Class 2 (North West). Also, some random noise (blue data points in orange area and vice versa) was added in order to illustrate effects of overfitting later on.

The task of a classification algorithm like HDTree (though, it can also be used for regression tasks) is to learn, which class each data point belongs to. In other words: given some pair of coordinates

(x, y)

(6, 2)

the task is to learn if this coordinate belongs to the orange Class 1 or the blue Class 2. A discriminative model will try to separate the feature space (here the (x,y) -axis) into blue and orange territories respective.

Given that data, the decision (rules) on how data would be separated (classified) seems very easy. A reasonable human being would say (think on your own first):

“It’s Class 1 if x > y, otherwise class 2”.

The perfect separation would created by the function

y=x

which is illustrated as reference as dashed line. Indeed a large margin classifier like a support vector machine [8] would come up with a similar solution. But let’s see what decision trees tempt to learn instead:

The image illustrates the regions in which a standard decision tree with increasing depths would classify a data point as class 1 (orange) or class 2 (blue).

A decision tree will approximate the linear function using a step function

This is due to the type of test / split rule which decision trees use. They are all of the scheme

attribute

 threshold

which will result in axis-parallel hyperplanes. In the 2D-space its just like rectangles getting “cut out”. In 3D-space it would be cuboids… and so on.

Furthermore, the decision tree starts to model the noise within the data when having 8 levels already, i.e., it is overfitting. While doing so it never finds a good approximation of the real linear function. To verify this, I used a typical train test split (2:1) of the data and evaluated the trees' scores, which are 93.84%, 93,03%, 90,81% for the test set and 94,54%, 96,57%, 98,81% for the training set (ordered by the trees' depth 4, 8, 16). While the test accuracy is decreasing the training accuracy is increasing.

Having an increasing training performance and a dropping test performance is the hallmark of overfitting

The resulting decision trees are also quite complex for such a simple function. The most simple of the them (depth 4) visualized by scikit learn looks already like that:

Quite a bummer, I will save you from the more complex ones. The next section will start by solving that problem using the HDTree package. HDTree will enable the user to incorporate knowledge about the data (just like knowing it is linearly separable in the example). Also it will allow you finding playing with alternative solutions to your problem.

Using the HDTree package 🌲

This section will guide you through the basics of HDTree. I will try to touch some parts of it API. Please feel free to ask in the comments section or contact me, if you have any questions on that. I will happily answer and extend the article if needed.

Installing HDTree 💻

This is slightly more complicated than pip install hdtree. Sorry 😅.

Create an empty directory and create an folder named “hdtree” inside there (
```
your_folder/hdtree
```
)
Download or clone the repository into the hdtree directory (not in a sub sub directory)
Install the necessary dependencies / packages (numpy, pandas, graphviz, sklearn, Python ≥ 3.5)
Add
```
your_folder
```
into your python search path. This will include it into Pythons importing mechanism. You can use it as a common python package then

Alternatively you could add hdtree to the site-packages folder of your python installation. I might add an installation file later. At the moment of writing, the code is not available in the pip repository.

All codes that generate the graphics and outputs below (and also the previously seen) are inside the repository or directly hosted here.

Solving this linear problem with a one-level decision tree 📈

Let’s start with some code right away:

Yes, the resulting tree is only one level high and has the ideal solution to that problem. Admittedly, this example is somehow staged to show the effect. Still, I hope it makes the point clear: having an intuition about the data or by just providing the decision tree with various options to partition the feature space, it might come up with an simpler and sometimes even more accurate solution.

Imagine having to interpret the rules from those previous trees presented until here to find some actionable information. Which one you can understand first and which one you would trust more? The complicated one which uses multiple step functions or the small accurate tree? I guess the answer is quite simple.

But let’s dig a bit into the actual code. When initializing the HDTreeClassifier, the most important thing you’ll have to provide are the

allowed_splits

. There you provide a list containing possible tests / split rules which the algorithm tries at construction / training time for every node in order to find a good local partition of the data.

In this case we solely supplied a

SmallerThanSplit

. This split does exactly what you see: it takes two attributes (it will try any combination) and separates the data in the scheme

a_i < a_j

. Which (not too much by random) fits the given data as good as it could be.

This type of split is denoted as a multivariate split. Which means the split is using more than one attribute for the partition-decision. This is unlike the univariate splits that are used in most other trees like the scikit-tree (see above for details) that take exactly one attribute into account. Surely,

HDTree

also has options to achieve a ‘normal split’ like the ones in the scikit-trees (

QuantileSplit

-Family). We will learn more splits along our journey.

The other unfamiliar thing you might see in the code is the models hyper parameter

information_measure

. There, you provide the model with the measurement that is used to estimate the value of a single node or a complete split (a parent node with its children). The chosen option here is based on Entropy [10]. Maybe you have also heard of the Gini-Index, which would be another valid option here. Surely enough you can provide your very own information measures by just implementing the appropriate interface. If you’d like: go ahead and implement the Gini-Index, which you could drop directly into the tree without re-implementing anything else by copying the

EntropyMeasure()

and adopting it.

Lets dig a bit deeper — The Titanic disaster 🚢

I kind of a fan of learning by example. That’s why we I think I show you some more of the trees features by using an concrete example instead of just just some generated data.

The data set

We will go on using the famous 101 machine learning data set: the titanic disaster data set. We do this for convenience. It’s quite an easy data set which is not too big but features multiple different data types and missing values while not being entirely trivial. On top of that it’s understandable by a human and many people already worked with that data set. To bring us all on the same page, the data looks like the following:

You might notice that there all sorts of attributes. Numerical, Categorical, Integer-Types and even missing values (look at the Cabin-column). The task is to predict if a passenger survived the titanic disaster by being provided with the given passenger information. A description of the attribute-meanings is given here.

When you walk through ML-tutorials using that data set, they will do all sorts of preprocessings in order to be able to use it with the common machine learning models, e.g. by removing missing values (

NaN

) by Imputing [12] or dropping rows / columns, one-hot-encoding [13] categorical data (e.g., Embarked & Sex) or binning values in order to have a valid data set which ML-model accepts.

This is technically not necessary for

HDTree

. You can feed the data as is and it will happily accept it. Only change the data if you do some actual feature engineering. That’s an actual simplification to get started.

Train the first HDTree on the titanic data

We will just take the data as is and feed it to the model. The basic code is like the above, however in this example there will be way more splits allowed.

Let’s dive a bit into what we see. We have created a Decision Tree having three levels, which chose to use 3 out of the 4 possible SplitRules. They are marked with S1, S2, S3, respectively. I will shortly explain what they do.

S1: FixedValueSplit. This split works on categorical data and picks one of the possible values. The data then is partitioned into one part having that value set and one part not having it set. e.g PClass=1 and Pclass ≠ 1.
S2: (Twenty)QuantileRangeSplit. Those will work on numerical data. They will divide the relevant value-range of the evaluated attribute into a fixed amount of quantiles [14] and span intervals which range over consecutive subsets thereof (e.g., quantile 1 to quantile 5). Each quantile includes the exact same amount of data points. The start quantile and the end quantile (=size of the interval) is searched to optimize the information measure. The data is split into (i) having the value within that interval or (ii) outside of it. There are splits for different amounts of quantiles available.
S3: (Twenty)QuantileSplit. Similar to the Range-Splits (S2), but will separate the data on a threshold value. This is basically what normal decision trees do, except that they commonly try every possible threshold instead of a fixed number thereof.

You might have noticed, that the SingleCategorySplit was not used. I will still take the chance to explain it since it will pop up later:

S4: SingleCategorySplit. This one will work similar to the
```
FixedValueSplit
```
. However, it will create a child node for every single possible value, e.g.: for the attribute PClass those would be 3 children (each for Class 1, Class 2 and Class 3). Note that the FixedValueSplit is identical to the SingleValueSplit if there are only two possible categories.

The individual splits are somewhat smart in regards to the data types/values they’re accepting. Up to some extend they know, under which circumstances they apply and under which they don’t.

This HDTree was also trained on training data using a 2:1 train-/test split. The performance is at 80.37% train accuracy and 81,69 test accuracy. Not too bad.

Restricting the the Splits

Let’s assume you are for some reason not too happy with the decisions found. Maybe you decide that having the very first split on top of the tree is too trivial (splitting on the attribute sex). HDTree got you covered. The easiest solution would be to disallow the FixedValueSplit (and for that matter the equivalent SingleCategorySplit) from appearing at the top. That is fairly easy. You will have to change the initialization of the Splits as follows:

I will present the resulting HDTree as whole, since we also can observe the missing split (S4) there inside the newly generated tree.

By disallowing the split on the attribute

sex

appearing at the root (due to parameter min_level=1
; hint: surely enough you can also provide a

max_level

), we totally restructured the tree. Its performance now is at 80.37% and 81,69% (train/test). It didn’t change at all, even if we have taken away the supposedly best split at the root node.

Due to the fact that decision trees are building up in a greedy fashion, they will only find a local best split for each node, which isn’t necessarily the globally best option. Actually finding an ideal solution to the Decision Tree problem is proven to be NP-Complete [15] long ago.

So the best we can wish for are heuristics. Back to the example: Notice that we already got a non-trivial data insight? While it is trivial to say that men will only have a low chance of survival, less so might be the finding that being a man in the first or second class (

PClass

) departing from Cherbourg (

Embarked=C

) might increase your odds on survival. Or that even if you’re a man in PClass 3
but you’re under 33 years, your odds also increase? Remember: it’s woman and children first. Its a good exercise to make those findings on your own by interpreting the visualization. Those findings were only possible due to restricting the tree.

Who knows what else might be to uncover by using different restrictions? Try it out!

As a last example of that kind, I want to show you how to restrict splits to specific attributes. This not only can be used to prevent the tree from learning unwanted correlations or force alternative solutions but also narrows down the search space. Especially when using multivariate splits this may drastically decrease the runtime. If you watch back to the previous example, you might spot the node which checks for the attribute

PassengerId

, which might be a think we do not want to model, since it at least should not contribute to the information if a passenger survives. A check on that might be a sign of overfitting again. Let’s change that by using the parameter blacklist_attribute_indices.

You might wonder why

name length

would appear at all. Consider that having long names (double names or (noble) titles) may indicate a wealthy background increasing your survival chances.

Additional hint: You can always add the same
SplitRule
type twice. If you want to blacklist an attribute only for certain level(s) of the
HDTree
, just add the same
SplitRule
without that restriction around that level(s).

Predicting data points
As you might already have noticed, you can basically use the common scikit-learn interface to predict data. That is the

.predict()

.predict_proba()

and also the

.score()

-method. But you can go further. There is an additional

.explain_decision()

-method, which can print out a textual representation of a decision as follows (

hdtree_titanic_3

is supposed to be the last change we made on the tree):

will print:

Query: 
 {'PassengerId': 273, 'Pclass': 2, 'Sex': 'female', 'Age': 41.0, 'SibSp': 0, 'Parch': 1, 'Fare': 19.5, 'Cabin': nan, 'Embarked': 'S', 'Name Length': 41}

Predicted sample as "Survived" because of: 
Explanation 1:
Step 1: Sex doesn't match value male
Step 2: Pclass doesn't match value 3
Step 3: Fare is OUTSIDE range [134.61, ..., 152.31[(19.50 is below range)
Step 4: Leaf. Vote for {'Survived'}

This even works for missing data. Let’s set attribute index 2 (Sex) to missing (None):

Query: 
 {'PassengerId': 273, 'Pclass': 2, 'Sex': None, 'Age': 41.0, 'SibSp': 0, 'Parch': 1, 'Fare': 19.5, 'Cabin': nan, 'Embarked': 'S', 'Name Length': 41}

Predicted sample as "Death" because of: 
Explanation 1:
Step 1: Sex has no value available
Step 2: Age is OUTSIDE range [28.00, ..., 31.00[(41.00 is above range)
Step 3: Age is OUTSIDE range [18.00, ..., 25.00[(41.00 is above range)
Step 4: Leaf. Vote for {'Death'}
---------------------------------
Explanation 2:
Step 1: Sex has no value available
Step 2: Pclass doesn't match value 3
Step 3: Fare is OUTSIDE range [134.61, ..., 152.31[(19.50 is below range)
Step 4: Leaf. Vote for {'Survived'}
---------------------------------

This will print all the decision paths (which are more than one because at some nodes no decision can be made!). The final result will be the most common class of among all leaves.

... other useful things to do

You can go ahead and get a representation of the tree as text just by printing it:

Level 0, ROOT: Node having 596 samples and 2 children with split rule "Split on Sex equals male" (Split Score: 0.251)
-Level 1, Child #1: Node having 390 samples and 2 children with split rule "Age is within range [28.00, ..., 31.00[" (Split Score: 0.342)
--Level 2, Child #1: Node having 117 samples and 2 children with split rule "Name Length is within range [18.80, ..., 20.00[" (Split Score: 0.543)
---Level 3, Child #1: Node having 14 samples and no children with 
- SNIP -

or access all nodes that are clean (have high score)

['Node having 117 samples and 2 children with split rule "Name Length is within range [18.80, ..., 20.00[" (Split Score: 0.543)',
 'Node having 14 samples and no children with split rule "no split rule" (Node Score: 1)',
 'Node having 15 samples and no children with split rule "no split rule" (Node Score: 0.647)',
 'Node having 107 samples and 2 children with split rule "Fare is within range [134.61, ..., 152.31[" (Split Score: 0.822)',
 'Node having 102 samples and no children with split rule "no split rule" (Node Score: 0.861)']

There is a lot more to explore. More than I can write into one story (although I might extend this or that later). Feel free to check out other methods and ask whatever question you might have in the comments. If you’re interested I will write a follow up on more need-greedy details 😃.

3. Extending HDTree 🐍

The most valuable thing, which you may want to add to the system is your custom

SplitRule

. A split rule really can do whatever it wants to do in order to separate the data. You can implement a

SplitRule

by implementing the

AbstractSplitRule

class

, which is quite complicated, since you would have to handle data acceptance, performance evaluation and all of that on your own. For that reasons there are Mixins within the package which you can add into the implementation depending on the type of split you want to create which do most of the hard part for you.

Since the article is quite lengthy already, I will not provide a complete how to within another article, which I hopefully will be able to write within the next weeks to come. If you have questions meanwhile, I will happily respond, though. Just drop me a message! The best for now is just take one existing Split Rule, copy it and adapt it to your demands. If you’re not too fluent in Python or OOP [11], some syntax may be overwhelming. Anyways, you do not need to understand every bit and bolt of it to make it work. Most stuff is quite self-explanatory. Many of the functions are just for displaying-purposes in order to generate human-readable texts.

I hope you enjoyed your time reading this story and maybe even learned something new on the go. If I made some mistakes (I surely did 😉), or you have any questions regarding the article or the software: feel free to drop me a message.

Again, please be aware that the Software will be flawed. Please do not drop small bugs here, but rather in the Repo. Same goes for suggestions or feature requests etc. .

Thanks again for your time 👍 🎉🎉

Bibilography 📖

[1] Wikipedia article on Decision Trees https://en.wikipedia.org/wiki/Decision_tree

[2] Medium 101 article on Decision Trees https://medium.com/machine-learning-101/chapter-3-decision-tree-classifier-coding-ae7df4284e99

[3] Breiman, Leo, Joseph H Friedman, R. A. Olshen and C. J. Stone. “Classification and Regression Trees.” (1983).

[4] scikit-learn documentation: Decision Tree Classifier https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html?highlight=decision%20tree#sklearn.tree.DecisionTreeClassifier

[5] Cython project page https://cython.org

[6] Wikipedia article on pruning: https://en.wikipedia.org/wiki/Decision_tree_pruning

[7] sklearn documentation: plot a Decision Tree https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html

[8] Wikipedia article Support Vector Machine https://de.wikipedia.org/wiki/Support_Vector_Machine

[9] MLExtend Python library http://rasbt.github.io/mlxtend/

[10] Wikipedia Article Entropy in context of Decision Trees https://en.wikipedia.org/wiki/ID3_algorithm#Entropy

[11] Wikepedia Article about OOP https://en.wikipedia.org/wiki/Object-oriented_programming

[12] Wikipedia Article on imputing https://en.wikipedia.org/wiki/Imputation_(statistics)#:~:text=In%20statistics%2C%20imputation%20is%20the,known%20as%20"item%20imputation

[13] Hackernoon article about one-hot-encoding https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f

[14] Wikipedia Article about Quantiles https://en.wikipedia.org/wiki/Quantile

[15] Hyafil, Laurent; Rivest, Ronald L. “Constructing optimal binary decision trees is NP-complete” (1976)

[16] Hackernoon Article on Decision Trees https://hackernoon.com/what-is-a-decision-tree-in-machine-learning-15ce51dc445