Introducing a customizable and interactable Decision Tree-Framework written in Python Fast Track Repository: https://github.com/Mereep/HDTree Complementary Notebook: inside directory of the repository or directly (every you see here will be in the notebook. You will be able to ) examples here illustration generated create them on your own What’s inside the story? This story will introduce yet another implementation of Decision Trees, which I wrote as part of my thesis. The work will be divided into three chapters as follows: , I will try to motivate why I have decided to take my time to come up with an own implementation of Decision Trees; I will list some of its but also will list the of the current implementation. Firstly features disadvantages , I will guide you through the using code snippets and explaining some details along the way. Secondly basic usage of HDTree , there will be some hints on how to customize and extend the with your own chunks of ideas. Lastly HDTree However, this article will guide you through all of the of Decision Trees. There are really plenty of resources out there [1][2][3][16]. I think there is no need in repeating all of that again. Others have done that. I will not be able to do it better. You to be an in Decision Trees to understand this article. A basic level of understanding should be sufficient to follow up. Some experience in the ML domain is a plus, though. not basics don’t need expert Motivation & Background For my work I came along working with Decision Trees. My actual goal is to implement an human-centric ML-model, where (Human Decision Tree for that matter) is an optional ingredient which is used as part of an actual user interface for that model. While this story solely focuses on HDTree, I might write a follow-up describing the other components in detail. HDTree Features of HDTree & Comparison with scikit learn Decision Trees Naturally, I stumbled upon the -implementation⁴ of decision trees. I guess many practitioners do. And lets make something clear from the beginning: nothing is wrong with it. scikit-learn The has a lot of : sckit-learn implementation pros it’s fast & optimized The implementation is written in a dialect called C ⁵. Cython compiles to C-Code (which in turn compiles to native code) while maintaining interoperability with the Python interpreter. ython it’s easy and to use and convinient Many people in the ML-domain know how to work with scikit-learn models. You will easily find help everywhere due to its user base. it’s battle tested (a lot of people are using it) It just works it supports many pre-pruning and post-pruning [6] methods and provides many features (e.g., Minimal Cost-Complexity Pruning³ or sample weights) it support basic visualization [7] That said, surely it also has some : shortcomings t’s not trivial to modify, partly due to the usage of the rather uncommon Cython dialect (see advantages above) no way to incorporate user knowledge about the domain or to modify the learning process the visualization is rather minimalistic no support for categorical attributes / features no support for missing values interface for accessing nodes and traversing the tree is cumbersome an not intuitive no support for missing values only binary splits (see later) no multivariate splits (see later) Features HDTree HDTree comes with a solution to most of the shortcomings mentioned in the above list, while sacrificing many of the advantages of the scikit-learn implementation. We will come back to those points later, so don’t worry if you do not understand every part of the following list yet: 👍 interact with the learning-behavior 👍 core components are modular and fairly easy to extend (implement an interface) 👍 purely written in Python (more approachable) 👍 rich visualization 👍 support 👍 support for 👍 support for 👍 easy interface to navigate through the tree structure 👍 supports for (> 2 child nodes) 👍 textual representations of decision paths 👍 encourages explainability by printing human-readable text ------------------------------------------------------------------------------------------------- 👎 slow 👎 not battle-tested (it have bugs) 👎 mediocre software quality 👎 not so many pruning options (it supports some basic options, though) categorical data missing values multivariate splits n-ary splits will ⚠️ Although the disadvantages seem to be not too numerous, they are critical. Let us make that clear right away: Do not throw big data at it. You will wait forever. Do not use it in production. It may break unexpectedly. You have been warned!⚠️ Some of these problems may get fixed over time. However, the training speed probably will remain slow (inference is okay, though). You will have to come up with a better solution to fix that. You are very welcome to contribute 😃. That said, what would be possible ? use cases extract knowledge from your data test the intuition you have about your data understand the inner workings of decision trees explore alternative causal relationships regarding to your learning problem use it as part of your more complex algorithms create reports and visualizations use it for any research-related purposes have an accessible platform to easily test your idea idea for decision tree algorithms Structure of Decision Trees Although this work will not review decision trees in detail, we will recap their basic building blocks. This will provide a basis to understand the examples later and also highlights some of HDTrees features. The following graphic is an (except for the markers) actual output of the HDTree a (Nodes) textual description of the which was used at this node to separate the data into its children. Displays the relevant attribute(s) and the verbal description of the operation. Those tests are and can include you can come up with. Development of your own custom rules is supported by implementing an interface. More on that in section 3. ai: test/split rule highly configurable any data-separating logic the of the node measures its pureness, i.e., how good the data that passes through the node is separated. The higher the score, the better. High scores are also represented by the nodes’ background color. The more greenish, the higher the score (white means unpure, i.e. equally distributed classes). Those scores direct the induction (building) process of the tree and are a aii: score modular and exchangeable component of the HDTree. : the nodes’ indicates how many data points are passing through this node. The thicker the border the more data is passing through that node. aiii border list of the targets / labels (i.e., the prediction goal) which the data points passing that node have. The most common class is marked with a ✔. aiv: a visualization can mark the path that an individual data points follows (illustrate the decision that are made when the data point passes the tree). This is marked with a line at the corner of the decision tree. av: optionally b (Edges) the arrow connect each possible outcome of the split (ai) with its child nodes. The more of the data (relatively to the parent) is ‘flowing’ along the edge, the thicker it is drawn. bi: each edge has a readable textual representation of the splits’ corresponding outcome. bii: Why different splits / tests? At this point of reading you might already wonder, what is different from the HDTree to a scikit-learn tree (or any other implementation) and why we might want to have different kinds of splits? Let’s try to make this clear. Maybe you have an intuitive understanding of the notion of All the data we are working with lies in some multi-dimensional space which is defined by the amount and the type of the attributes / features your data have. feature space. The task of a now is to this into non-overlapping regions and those region a / target / outcome, whatever you like to call it. Let's visualize that. Since our brains have troubles fiddling with high dimensionality, we will stick with a 2-dimensional example and a very simple 2-class problem as follows: classification algorithm partition space assigning class You see a very simple data set, which is made of two dimensions (features / attributes) and two classes. The generated data points were normal distributed at the center. A street which is just the linear function separates those two classes into (South East) and (North West). Also, some random noise (blue data points in orange area and vice versa) was added in order to illustrate effects of overfitting later on. f(x) = y Class 1 Class 2 The task of a classification algorithm like HDTree (though, it can also be used for ) is to learn, which class each data point belongs to. In other words: given some pair of coordinates like the task is to learn if this coordinate belongs to the orange Class 1 or the blue Class 2. A discriminative model will try to separate the feature space (here the ( ) into blue and orange territories respective. regression tasks (x, y) (6, 2) x,y) -axis Given that data, the decision (rules) on how data would be separated (classified) seems very easy. A reasonable human being would say : (think on your own first) “It’s Class 1 if class 2”. x > y, otherwise The perfect separation would created by the function which is illustrated as reference as dashed line. Indeed a like a would come up with a similar solution. But let’s see what decision trees tempt to learn instead: y=x large margin classifier support vector machine [8] The image illustrates the regions in which a standard decision tree with increasing depths would classify a data point as class 1 (orange) or class 2 (blue). A decision tree will approximate the linear function using a step function This is due to the type of test / split rule which decision trees use. They are all of the scheme which will result in . In the 2D-space its just like rectangles getting “cut out”. In 3D-space it would be cuboids… and so on. attribute < threshold axis-parallel hyperplanes Furthermore, the decision tree starts to within the data when having 8 levels already, i.e., it is . While doing so it never finds a good approximation of the real linear function. To verify this, I used a typical train test split (2:1) of the data and evaluated the trees' scores, which are for the t and for the (ordered by the trees' depth 4, 8, 16). While the the model the noise overfitting 93.84%, 93,03%, 90,81% test se 94,54%, 96,57%, 98,81% training set test accuracy is decreasing training accuracy is increasing. Having an increasing training performance and a dropping test performance is the hallmark of overfitting The resulting decision trees are also quite complex for such a simple function. The most simple of the them (depth 4) visualized by scikit learn looks already like that: Quite a bummer, I will save you from the more complex ones. The next section will start by solving that problem using the HDTree package. will enable the user to about the data (just like knowing it is linearly separable in the example). Also it will allow you finding playing with alternative solutions to your problem. HDTree incorporate knowledge Using the HDTree package 🌲 This section will guide you through the basics of HDTree. I will try to touch some parts of it API. Please feel free to ask in the comments section or contact me, if you have any questions on that. I will happily answer and extend the article if needed. Installing HDTree 💻 This is slightly more complicated than . Sorry 😅. pip install hdtree Create an empty directory and create an folder named “hdtree” inside there ( ) your_folder/hdtree Download or clone the into the hdtree directory (not in a sub sub directory) repository Install the necessary dependencies / packages (numpy, pandas, graphviz, sklearn, Python ≥ 3.5) Add into your . This will include it into Pythons importing mechanism. You can use it as a common python package then your_folder python search path Alternatively you could add to the folder of your python installation. I might add an installation file later. At the moment of writing, the code is not available in the pip repository. hdtree site-packages All codes that generate the graphics and outputs below (and also the previously seen) are inside the repository or directly hosted . here 📈 Solving this linear problem with a one-level decision tree Let’s start with some code right away: Yes, the resulting tree is only one level high and has the ideal solution to that problem. Admittedly, this example is somehow staged to show the effect. Still, I hope it makes the point clear: having an intuition about the data or by just providing the decision tree with various options to partition the feature space, it might come up with an and sometimes even . simpler more accurate solution Imagine having to interpret the rules from those previous trees presented until here to find some actionable information. Which one you can first and which one you would more? The complicated one which uses multiple step functions or the small accurate tree? I guess the answer is quite simple. understand trust But let’s dig a bit into the actual code. When initializing the HDTreeClassifier, the most important thing you’ll have to provide are the There you provide a list containing possible tests / split rules which the algorithm tries at construction / training time for every node in order to find a good local partition of the data. allowed_splits . In this case we solely supplied a This split does exactly what you see: it takes two attributes (it will try any combination) and separates the data in the scheme Which (not too much by random) fits the given data as good as it could be. SmallerThanSplit . a_i < a_j . This type of split is denoted as a Which means the split is using for the partition-decision. This is unlike the that are used in most other trees like the scikit-tree (see above for details) that take exactly into account. Surely, also has options to achieve a ‘normal split’ like the ones in the scikit-trees ( -Family). We will learn more splits along our journey. multivariate split. more than one attribute univariate splits one attribute HDTree QuantileSplit The other unfamiliar thing you might see in the code is the models hyper parameter There, you provide the model with the measurement that is used to estimate the value of a single node or a complete split (a parent node with its children). The chosen option here is based on Entropy [10]. Maybe you have also heard of the , which would be another valid option here. Surely enough you can measures by just the appropriate . If you’d like: go ahead and implement the , which you could drop directly into the tree without re-implementing anything else by copying the and adopting it. information_measure . Gini-Index provide your very own information implementing interface Gini-Index EntropyMeasure() 🚢 Lets dig a bit deeper — The Titanic disaster I kind of a fan of learning by example. That’s why we I think I show you some more of the trees features by using an concrete example instead of just just some generated data. The data set We will go on using the famous 101 machine learning data set: the titanic disaster data set. We do this for convenience. It’s quite an easy data set which is not too big but features multiple different data types and missing values while not being entirely trivial. On top of that it’s understandable by a human and many people already worked with that data set. To bring us all on the same page, the data looks like the following: You might notice that there all sorts of attributes. Numerical, Categorical, Integer-Types and even missing values (look at the Cabin-column). The task is to predict if a passenger survived the titanic disaster by being provided with the given passenger information. A description of the attribute-meanings is given . here When you walk through ML-tutorials using that data set, they will do all sorts of in order to be able to use it with the common machine learning models, e.g. by removing missing values ( ) by dropping rows / columns, [13] categorical data (e.g., & ) or binning values in order to have a valid data set which ML-model accepts. preprocessings NaN Imputing [12] or one-hot-encoding Embarked Sex This is technically not necessary for . and it will happily accept it. Only change the data if you do some actual feature engineering. That’s an actual simplification to get started. HDTree You can feed the data as is Train the first HDTree on the titanic data We will just take the data as is and feed it to the model. The basic code is like the above, however in this example there will be way more splits allowed. Let’s dive a bit into what we see. We have created a Decision Tree having three levels, which chose to use 3 out of the 4 possible They are marked with I will shortly explain what they do. SplitRules. S1, S2, S3, respectively. S1: This split works on categorical data and picks one of the possible values. The data then is partitioned into one part having that value set and one part not having it set. e.g and . FixedValueSplit. PClass=1 Pclass ≠ 1 S2: (Twenty) . Those will work on numerical data. They will the relevant value-range of the evaluated attribute a fixed amount of [14] and span which range over thereof (e.g., quantile 1 to quantile 5). Each quantile includes the exact same amount of data points. The start quantile and the end quantile (=size of the interval) is searched to optimize the . The data is split into (i) having the value within that interval or (ii) outside of it. There are splits for different amounts of quantiles available. QuantileRangeSplit divide into quantiles intervals consecutive subsets information measure S3: (Twenty) . Similar to the Range-Splits (S2), but will separate the data on a threshold value. This is basically what normal decision trees do, except that they commonly try every possible threshold instead of a fixed number thereof. QuantileSplit You might have noticed, that the was not used. I will still take the chance to explain it since it will pop up later: SingleCategorySplit S4: This one will work similar to the . However, it will create a child node for every single possible value, e.g.: for the attribute those would be 3 children (each for ). SingleCategorySplit. FixedValueSplit PClass Class 1, Class 2 and Class 3 Note that the FixedValueSplit is identical to the SingleValueSplit if there are only two possible categories. The individual splits are somewhat in regards to the they’re . Up to some extend they know, under which circumstances they apply and under which they don’t. This HDTree was also trained on training data using a 2:1 train-/test split. The performance is at and . Not too bad. smart data types/values accepting 80.37% train accuracy 81,69 test accuracy Restricting the the Splits Let’s assume you are for some reason not too happy with the decisions found. Maybe you decide that having the very first split on top of the tree is too trivial (splitting on the attribute ). HDTree got you covered. The easiest solution would be to disallow the (and for that matter the equivalent ) from appearing at the top. That is fairly easy. You will have to change the initialization of the Splits as follows: sex FixedValueSplit SingleCategorySplit I will present the resulting HDTree as whole, since we also can observe the missing split (S4) there inside the newly generated tree. By disallowing the split on the attribute appearing at the root (due to parameter ; hint: surely enough you can also provide a ), we totally restructured the tree. Its performance now is at 80.37% and 81,69% (train/test). It didn’t change at all, even if we have taken away the supposedly best split at the root node. sex min_level=1 max_level greedy fashion local globally Due to the fact that decision trees are building up in a , they will only find a best split for each node, which isn’t necessarily the best option. Actually finding an ideal solution to the Decision Tree problem is proven to be NP-Complete [15] long ago. So the best we can wish for are heuristics. Back to the example: Notice that we already got a non-trivial data insight? While it is trivial to say that men will only have a low chance of survival, less so might be the finding that being a man in the first or second class ( ) departing from ( might increase your odds on survival. Or that even if you’re a man in but you’re under 33 years, your odds also increase? Remember: it’s woman children first. Its a good exercise to make those findings on your own by interpreting the visualization. Those findings were only possible due to restricting the tree. PClass Cherbourg Embarked=C ) PClass 3 and Who knows what else might be to uncover by using different restrictions? As a last example of that kind, I want to show you how to . This not only can be used to the tree from learning or force but also down the . Especially when using multivariate splits this may drastically decrease the runtime. If you watch back to the previous example, you might spot the node which checks for the attribute , which might be a think we do not want to model, since it at least not contribute to the information if a passenger survives. A check on that might be a sign of again. Let’s change that by using the parameter . Try it out! restrict splits to specific attributes prevent unwanted correlations alternative solutions narrows search space PassengerId should overfitting blacklist_attribute_indices You might wonder why would appear at all. Consider that having long names (double names or (noble) titles) may indicate a wealthy background increasing your survival chances. name length Additional hint: You can always add the same twice. If you want to blacklist an attribute only for certain level(s) of the , just add the same without that restriction around that level(s). SplitRule type HDTree SplitRule As you might already have noticed, you can basically use the common scikit-learn interface to predict data. That is the and also the -method. But you can go further. There is an additional -method, which can print out a textual representation of a decision as follows ( is supposed to be the last change we made on the tree): Predicting data points .predict() , .predict_proba() .score() .explain_decision() hdtree_titanic_3 will print: Query: {'PassengerId': 273, 'Pclass': 2, 'Sex': 'female', 'Age': 41.0, 'SibSp': 0, 'Parch': 1, 'Fare': 19.5, 'Cabin': nan, 'Embarked': 'S', 'Name Length': 41} Predicted sample as "Survived" because of: Explanation 1: Step 1: Sex doesn't match value male Step 2: Pclass doesn't match value 3 Step 3: Fare is OUTSIDE range [134.61, ..., 152.31[(19.50 is below range) Step 4: Leaf. Vote for {'Survived'} This even works for missing data. Let’s set attribute index 2 ( ) to missing ( ): Sex None Query: {'PassengerId': 273, 'Pclass': 2, 'Sex': None, 'Age': 41.0, 'SibSp': 0, 'Parch': 1, 'Fare': 19.5, 'Cabin': nan, 'Embarked': 'S', 'Name Length': 41} Predicted sample as "Death" because of: Explanation 1: Step 1: Sex has no value available Step 2: Age is OUTSIDE range [28.00, ..., 31.00[(41.00 is above range) Step 3: Age is OUTSIDE range [18.00, ..., 25.00[(41.00 is above range) Step 4: Leaf. Vote for {'Death'} --------------------------------- Explanation 2: Step 1: Sex has no value available Step 2: Pclass doesn't match value 3 Step 3: Fare is OUTSIDE range [134.61, ..., 152.31[(19.50 is below range) Step 4: Leaf. Vote for {'Survived'} --------------------------------- This will print all the decision paths (which are more than one because at some nodes no decision can be made!). The final result will be the most common class of among all leaves. ... other useful things to do You can go ahead and get a representation of the tree as text just by printing it: Level , ROOT: Node having samples children with split rule (Split Score: ) -Level , Child # : Node having samples children with split rule (Split Score: ) --Level , Child # : Node having samples children with split rule (Split Score: ) ---Level , Child # : Node having samples no children with - SNIP - 0 596 and 2 "Split on Sex equals male" 0.251 1 1 390 and 2 "Age is within range [28.00, ..., 31.00[" 0.342 2 1 117 and 2 "Name Length is within range [18.80, ..., 20.00[" 0.543 3 1 14 and or access all nodes that are clean (have high score) [' samples children with split (Split Score: )', ' samples no children with split ( Score: )', ' samples no children with split ( Score: . )', ' samples children with split (Split Score: )', ' samples no children with split ( Score: . )'] Node having 117 and 2 rule "Name Length is within range [18.80, ..., 20.00[" 0.543 Node having 14 and rule "no split rule" Node 1 Node having 15 and rule "no split rule" Node 0 647 Node having 107 and 2 rule "Fare is within range [134.61, ..., 152.31[" 0.822 Node having 102 and rule "no split rule" Node 0 861 There is to explore. More than I can write into one story (although I might extend this or that later). Feel free to check out other methods and ask whatever question you might have in the comments. If you’re interested I will write a follow up on more need-greedy details 😃. a lot more 3. Extending HDTree 🐍 The most valuable thing, which you may want to add to the system is your custom split rule really can do whatever it wants to do in order to . You can implement a by implementing the - , which is quite complicated, since you would have to handle data acceptance, performance evaluation and all of that on your own. For that reasons there are within the package which you can add into the implementation depending on the type of split you want to create which do most of the hard part for you. SplitRule . A separate the data SplitRule AbstractSplitRule class Mixins Since the article is quite lengthy already, I will not provide a complete how to within another article, which I hopefully will be able to write within the next weeks to come. If you have questions meanwhile, I will happily respond, though. Just drop me a message! The best for now is just take one , to your demands. If you’re not too fluent in Python or OOP [11], some syntax may be overwhelming. Anyways, you do not need to understand every bit and bolt of it to make it work. Most stuff is quite self-explanatory. Many of the functions are just for displaying-purposes in order to generate human-readable texts. existing Split Rule copy it and adapt it I hope you enjoyed your time reading this story and maybe even learned something new on the go. If I made some mistakes (I surely did 😉), or you have any questions regarding the article or the software: feel free to drop me a message. Again, please be aware that the Software will be flawed. Please do not drop small bugs here, but rather in the Repo. Same goes for suggestions or feature requests etc. . 👍 🎉🎉 Thanks again for your time Bibilography 📖 [1] Wikipedia article on Decision Trees https://en.wikipedia.org/wiki/Decision_tree [2] Medium 101 article on Decision Trees https://medium.com/machine-learning-101/chapter-3-decision-tree-classifier-coding-ae7df4284e99 [3] Breiman, Leo, Joseph H Friedman, R. A. Olshen and C. J. Stone. “Classification and Regression Trees.” (1983). [4] scikit-learn documentation: Decision Tree Classifier https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html?highlight=decision%20tree#sklearn.tree.DecisionTreeClassifier [5] Cython project page https://cython.org [6] Wikipedia article on pruning: https://en.wikipedia.org/wiki/Decision_tree_pruning [7] sklearn documentation: plot a Decision Tree https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html [8] Wikipedia article Support Vector Machine https://de.wikipedia.org/wiki/Support_Vector_Machine [9] MLExtend Python library http://rasbt.github.io/mlxtend/ [10] Wikipedia Article Entropy in context of Decision Trees https://en.wikipedia.org/wiki/ID3_algorithm#Entropy [11] Wikepedia Article about OOP https://en.wikipedia.org/wiki/Object-oriented_programming [12] Wikipedia Article on imputing "item%20imputation https://en.wikipedia.org/wiki/Imputation_(statistics)#:~:text=In%20statistics%2C%20imputation%20is%20the,known%20as%20 [13] Hackernoon article about one-hot-encoding https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f [14] Wikipedia Article about Quantiles https://en.wikipedia.org/wiki/Quantile [15] Hyafil, Laurent; Rivest, Ronald L. “Constructing optimal binary decision trees is NP-complete” (1976) [16] Hackernoon Article on Decision Trees https://hackernoon.com/what-is-a-decision-tree-in-machine-learning-15ce51dc445