Back # The Complete Guide to Decision Trees

December 10, 2019 Explorium Data Science Team Data Science

In the world of machine learning, developers can create independent environments for projects easily. It only takes a few clicks to set and fit models in order to achieve solid results. Yet, many algorithms can be quite difficult to understand, let alone explain. Decision trees, while performing poorly in their basic form, are easy to understand and when stacked (Random Forest, XGBoost) reach excellent results.

Let’s take a deeper dive.

## The history of decision trees

Here’s a quick look at the history of decision trees:

• 1963: The Department of Statistics at the University of Wisconsin–Madison writes that the first regression tree was invented in 1963 (AID project, Morgan and Sonquist). It had an impurity measure (we’ll get to that soon) and recursively split data into two subsets.
• 1666: The Institute of Computing Science in the Poznań University of Technology states that one of the first publications on decision trees was in 1966 (by Hunt). In psychology, the decision tree methods were used to model the human concept of learning. Exploring the human mind, researchers discovered the algorithm was useful for programming.
• 1972: The first classification tree appeared in the THAID project (by Messenger and Mandell). It worked via splitting data to maximize the sum of cases in the modal category. The predicted class was a mode.
• 1974: Statistics professors Leo Breiman and Charles Stone from Berkeley and Jerome Friedman and Richard Olshen from Stanford started developing the classification and regression tree (CART) algorithm.
• 1977: Breiman, Stone, Friedman, and Olshen invented the first CART version.
• 1984: The official publication with a CART software. It was a revolution in the world of algorithms. Even today, CART is one of the most used methods for data analysis. Main upgrades include truncating unnecessary trees, tunneling, and choosing the best tree version. CART became a world-standard for decision trees, and its development kept progressing.
• 1986: John Ross Quinlan proposed a new concept: trees with multiple answers. Important note: CART and all other algorithms only have two answers for each question (called binary trees). Quinlan invented ID3 (Iterative Dichotomiser 3) using an impurity criterion called gain ratio. It could cover other questions and propose some resulting nodes. ID3 wasn’t ideal, so its author continued to upgrade the algorithm structure. And this is how C4.5 was born. C4.5 addressed the shortcomings of its predecessor, ranking #1 in the Top 10 Algorithms in Data Mining pre-eminent paper (Springer LNCS, 2008). By that time the algorithm had existed for 15 years. ## How decision trees work The picture above illustrates and explains decision trees by using exactly that, a decision tree.

The idea is quite simple and resembles the human mind. If we tried to split data into parts, our first steps would be based on questions. Step by step, data would separate in unique pieces and, finally, we would split samples. This is the essence of the entire concept.

To explain this concept better, we will use some popular terminology:

• Node: Each object in a tree. Nodes contain subsets of data, and excluding leaf nodes, a question splits the subset.
• Parent node: The question that makes a data split.
• Child node: Resulting node. It also can be a parent for its children.
• Leaf node: Final node with no further questions. Only a subset of the data representing answers to preceding questions.
• Branch: Unique line of the questions with answers that flow to a leaf node.
• Root: The top node.

## Regression and classification

It’s possible to use decision trees with regression or classification. The techniques are slightly different, so we will review both starting with classification. Typically, for classification, the model should be learned on training data with a predefined set of labels. It would predict a label (i.e., class) for new samples.

So, we have a dataset with different attributes (features). Each sample has its own combination of the value of the features. How does a decision tree make assumptions for data splitting?

To answer this question we should first understand the goal and the result of it. The goal is to separate samples with a more similar structure and the result is purer subsets. Purity, in this case, responds to the quality of the splitting. There are a few typical criteria for this:

• Gini impurity: a measurement that takes a sum of the ratios for each class instance to entire node instances. After that, it subtracts 1 from it. An ‘Ideal’ node would have a 0 impurity. • Informational entropy: based on the entropy concept. In short, the more evenly arranged things are in a certain space, the greater the entropy. The differences between Gini impurity and informational entropy are not huge. Gini impurity makes a purer separation, but entropy tends to make more balanced subsets. Informational entropy is more expensive from a computational standpoint.

So, basically, the algorithm looks between all features for the highest impurity splitter, where the parent node will have bigger impurity than it’s children. A split’s effectiveness is measured by its impurity and by the amount of data in each child node, aspiring to minimize the size division between the child nodes.

Finally, it will build a tree that will have a bunch of questions with answers. For a prediction, the model just needs to guide through the tree for the concrete leaf. Each leaf has probabilities for each existing class, so the model will choose higher. That’s it. A classification tree is ready.

Now let’s talk about a regression task. A regression is when a target value doesn’t have a predefined set of labels, but a numerical result. The main problem here is predicting new target values based on training data.

The regression technique is not very different from the classification. The main distinction is the measurement for the splitting. Instead of the impurity, regression trees use mean squared error (MSE). Questions gradually reduce MSE until it reaches a minimum. The node result is the average target value from all samples in it. MSE is calculated like this: After fitting, the model guides a new object to the correct branch and assigns the average target value. Simple and powerful. A regression tree is complete.

## Implementation options in modern libraries

Python accounts for a large number of powerful ML libraries. Some of them are widespread and a must-know for everybody, others are rare and very specialized. Here, we will cover searching for useful decision tree implementations.

Of course, you can write the code by yourself. Each famous algorithm has a detailed description, so with basic language knowledge, it shouldn’t be too difficult. But sometimes it can waste precious development time because, instead of the model set, you need to create an implementation of the learning algorithm.

1. Scikit-learn: CART model. Basic and maybe the most popular implementation of decision trees in a Python. It supports multiple optimization options and can be combined with many other sklearn models and pipelines.
2. Imblearn: Special classifier based on the sklearn model but with larger attention to imbalanced datasets (several classes have a big difference in size).
3. Spark ML library: Decision Tree with some useful features like depth, top node accessibility, etc.
4. Decision-tree-id3: Library with ID3 method for a Python.
5. Eli5: The connection between Eli5 and sklearn libraries with a DTs implementation.

For this article, we will use scikit-learn implementation, because it is fully maintained, stable, and very popular.

## Application of decision trees for forest classification with dataset in Python

Let’s put all of this talk into practice. All that you need is Python 3 on your PC, with previously installed libraries: scikit-learn, Pandas, SciPy, and Jupyter Notebook.

There is no more logical data to learn via decision tree classifier, than … tree classifications. With that in mind we will use data from the forest cover type dataset on Kaggle. You can find the full description on the task page. Our goal is to make a model on this data.

This large dataset is taken from the observations in four areas of the Roosevelt National Forest in Colorado. It contains 54 features and more than 500K records. Each record is a 30-meter by 30-meter forest measurement. The dataset contains seven classes:

1. Spruce/Fir
2. Lodgepole Pine
3. Ponderosa Pine
4. Cottonwood/Willow
5. Aspen
6. Douglas-fir
7. Krummholz

It’s important to note that it’s not necessary to use Jupyter. However, we will do this for visibility. All code from this experiment works in simple Python files and any exceptions will be highlighted.

Let’s start. Run Jupyter Notebook and create a new .ipynb file. In the code section paste the following:

```import pandas as pd
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import metrics
from sklearn.tree import export_graphviz
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from scipy.stats import randint
```

Here we have imported all the required methods and libraries for all further processes. Don’t worry, if something is new to you, it will be clarified as we continue on.

Now it’s time to take a look at our dataset. Download it from the task page and save it in the project folder.

```data = pd.read_csv('covtype.csv')
```

Output: Wow! 55 columns, 54 of them are features. What are they?

`data.info()`

Output:

```class 'pandas.core.frame.DataFrame'>
RangeIndex: 581012 entries, 0 to 581011
Data columns (total 55 columns):
Elevation                             581012 non-null int64
Aspect                                581012 non-null int64
Slope                                 581012 non-null int64
Horizontal_Distance_To_Hydrology      581012 non-null int64
Vertical_Distance_To_Hydrology        581012 non-null int64
Horizontal_Distance_To_Fire_Points    581012 non-null int64
Wilderness_Area1                      581012 non-null int64
Wilderness_Area2                      581012 non-null int64
Wilderness_Area3                      581012 non-null int64
Wilderness_Area4                      581012 non-null int64
Soil_Type1                            581012 non-null int64
Soil_Type2                            581012 non-null int64
Soil_Type3                            581012 non-null int64
Soil_Type4                            581012 non-null int64
Soil_Type5                            581012 non-null int64
*****************************************************
Soil_Type40                           581012 non-null int64
Cover_Type                            581012 non-null int64
dtypes: int64(55)
memory usage: 243.8 MB
```

10 features are numerical and clearly explained. The dataset also has 4 ‘Wilderness_AreaN’ and 40 ‘Soil_TypeN’ fields which are boolean, meaning only the areas can be “true” (or 1) for a single observation, all others would be “false” (or 0).

Now we need to follow the golden rule in ML: create a test sample.

```X = data.drop('Cover_Type', axis=1)
y = data['Cover_Type']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
```

All columns except the last one is our “X” data. The last column is considered as “y” data. “X” data are features and “y” data are labels for each record. Now we will use the train_test_split method. “X_test” and “y_test” objects will be left until the end of the experiment. This is important because, in the end, we want to have a system that is good at generalizing and predicting new data.

Now let’s work on class proportions.

```labels = ['Spruce/Fir', 'Lodgepole Pine', 'Ponderosa Pine',
'Cottonwood/Willow', 'Aspen', 'Douglas-fir', 'Krummholz']
y_train.index = [labels[i-1] for i in y_train]
y_train.index.value_counts()
```

Output:

```Lodgepole Pine       198162
Spruce/Fir          	148488
Ponderosa Pine        25000
Krummholz             14294
Douglas-fir           12256
Aspen                  6618
Cottonwood/Willow      1890
dtype: int64
```

Note that classes are highly imbalanced, which may cause a problem because decision trees tend to get bigger classes clearer than smaller. As we have so much data, we can use test samples and validation. This is useful for improving model accuracy before the final test. We will explore the same train_test_split method:

```X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2)
```

So, we have 50% of all data in a training set, 30% is testing, 20% is validating. Now it’s time to create our first decision tree classifier!

```clf = DecisionTreeClassifier()
cross_val_score(clf, X_train, y_train, cv=7)
```

Output:

```array([0.92240341, 0.92175634, 0.92169012, 0.92046212, 0.92304051,
0.92256718, 0.92037093])
```

For creating a tree object, we use DecisionTreeClassifier. Instead of direct learning, we adopt the cross-validation technique. This technique splits the entire training set N times with each iteration responding to one learning process. The main reason for this? We can test our data on training samples and get a general learning result. As you’ll see, the accuracy is really good with each fold reaching 92%. For automating this stage, the sklearn has a cross_val_score method.

Okay, the model is good on a training set. But how easy will it handle validation?

```clf.fit(X_train, y_train)
print(metrics.accuracy_score(y_valid, clf.predict(X_valid)))
```

Output:

```0.9276142706105087
```

Wow, the same results on a validation set!

## Optimizing models to improve predictions

Unfortunately, the results are often worse than in our case. Decision trees tend to overfit the training data, which means absolutely remarkable results in the learning process but very low results in testing. Decision trees without limitations on depth or impurity in split will create a very complex tree, with a leaf for each observation. Again, this means perfect results on the training set but a classic case of overfitting that will fail on unseen data. To avoid this, here are some best practices.

### Feature selection

Sometimes features are not only un-useful but also create false importance. Feature selection could fix this problem. DecisionTreeClassifier object has a really helpful attribute, called feature_importances_, that returns the importance of each feature. Let’s see those attributes in our classifier.

`clf.feature_importances_`

Output:

```array([3.37518529e-01, 2.84656713e-02, 1.76777678e-02, 6.27777911e-02,
4.48479825e-02, 1.49326271e-01, 3.13041542e-02, 3.37686965e-02,
2.29711764e-02, 1.39970429e-01, 8.35700869e-03, 4.33672974e-03,
1.47555544e-02, 1.28823512e-03, 1.18746252e-04, 9.60672295e-03,
1.68313126e-03, 1.19379073e-02, 5.20212884e-04, 8.42807227e-04,
0.00000000e+00, 0.00000000e+00, 9.95456477e-05, 3.50105965e-03,
8.58318195e-04, 8.22011973e-04, 2.56493210e-03, 1.48607565e-04,
2.50799190e-05, 6.66360906e-04, 1.86613628e-03, 0.00000000e+00,
7.94106697e-04, 3.17229491e-03, 4.68208634e-04, 7.45929560e-03,
9.06540846e-03, 5.36007876e-03, 6.21906782e-05, 7.14446581e-05,
6.36737159e-04, 4.53213546e-05, 7.24989689e-03, 2.83808978e-03,
4.43184249e-03, 1.23541381e-02, 4.28257875e-03, 4.17733218e-04,
8.51854676e-04, 2.89520728e-05, 2.09356356e-04, 2.42011853e-03,
3.76673702e-03, 1.38603866e-03])
```

Wait a minute, it is difficult to understand! Let’s combine these values with feature names and sort in descending order.

```def sortSecond(val):
return val
values = clf.feature_importances_
features = list(X)
importances = [(features[i], values[i]) for i in range(len(features))]
importances.sort(reverse=True, key=sortSecond)
importances
```

Output:

```[('Elevation', 0.3375185285673537),
('Horizontal_Distance_To_Fire_Points', 0.13997042859131698),
('Horizontal_Distance_To_Hydrology', 0.06277779109891506),
('Vertical_Distance_To_Hydrology', 0.044847982524894914),
('Aspect', 0.028465671348228104),
('Slope', 0.01767776783558953),
('Wilderness_Area3', 0.014755554433198528),
('Soil_Type32', 0.012354138099619834),
('Soil_Type4', 0.011937907301667192),
('Soil_Type2', 0.009606722953780527),
('Soil_Type23', 0.009065408464270798),
('Wilderness_Area1', 0.00835700868960834),
('Soil_Type22', 0.007459295603405485),
('Soil_Type29', 0.0072498968949893825),
('Soil_Type24', 0.0053600787578572855),
('Soil_Type31', 0.004431842489166107),
('Wilderness_Area2', 0.004336729739051701),
('Soil_Type33', 0.004282578752504762),
('Soil_Type39', 0.0037667370218322925),
('Soil_Type10', 0.003501059648836054),
('Soil_Type20', 0.003172294913549069),
('Soil_Type30', 0.002838089780501163),
('Soil_Type13', 0.0025649320984729483),
('Soil_Type38', 0.0024201185325595865),
('Soil_Type17', 0.0018661362800872908),
('Soil_Type3', 0.0016831312614147462),
('Soil_Type40', 0.0013860386558133934),
('Wilderness_Area4', 0.001288235117597936),
('Soil_Type11', 0.0008583181952767269),
('Soil_Type35', 0.0008518546762626055),
('Soil_Type6', 0.0008428072272247173),
('Soil_Type12', 0.0008220119728596085),
('Soil_Type19', 0.0007941066969573623),
('Soil_Type16', 0.0006663609063786886),
('Soil_Type27', 0.0006367371586997001),
('Soil_Type5', 0.0005202128836162181),
('Soil_Type21', 0.00046820863373015387),
('Soil_Type34', 0.0004177332179436113),
('Soil_Type37', 0.0002093563560947094),
('Soil_Type14', 0.0001486075654973332),
('Soil_Type1', 0.0001187462519933239),
('Soil_Type9', 9.954564766801897e-05),
('Soil_Type26', 7.144465812347822e-05),
('Soil_Type25', 6.21906782106648e-05),
('Soil_Type28', 4.532135460523101e-05),
('Soil_Type36', 2.8952072832505438e-05),
('Soil_Type15', 2.5079919011072498e-05),
('Soil_Type7', 0.0),
('Soil_Type8', 0.0),
('Soil_Type18', 0.0)]
```

Almost all soil type features are not important and three of them are unnecessary. Let’s compare X_train with all features and only the top 15 important features by size.

```print ('All features:', X_train.memory_usage(index=True).sum()/1000000)
print ('Top 15 features:', X_train[[col for col in importances[:15]]].memory_usage(index=True).sum()/1000000)
```

Output:

```All features: 143.16104
Top 15 features: 41.646848
```

Size of memory usage is in Mb. More than 3.5 times! But let’s compare accuracy.

```X_train = X_train[[col for col in importances[:15]]]
X_valid = X_valid[[col for col in importances[:15]]]
cut_clf = DecisionTreeClassifier()
cut_clf.fit(X_train, y_train)
print(metrics.accuracy_score(y_valid, cut_clf.predict(X_valid)))
```

Output:

```0.9151975609156401
```

Almost the same results with 15 against 54 features. But computational complexity is smaller and time/memory consumption is greatly smaller. In large projects, this may be very important.

### Hyperparameters

DecisionTreeClassifier has some parameters that could reduce overfitting. Let’s take a look at them. Just create an empty decision tree classifier and print it.

```DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=None, splitter='best')
```

We are interested in the attributes with ‘min_’ or ‘max_’ prefix.

• max_depth: The amount of the layers. The more this value is, the deeper the tree.
• max_features: The maximum count of the features from the original dataset.
• max_leaf_nodes: Restricting leaf nodes count.
• min_impurity_decrease: The minimum decreasing of the child nodes impurity.
• min_impurity_split: The minimum impurity condition for data splitting.
• min_samples_leaf: The number of samples in the single leaf (no ‘1 object leafs’).
• min_samples_split: A minimum of samples in the node for making a question.
• min_weight_fraction: The same as min_samples_leaf but expressed as a fraction of the total number of weighted instances.

Increasing “min_” and decreasing “max_” parameters could reduce overfitting. But be careful, because the entire model’s learning ability would be lower. Let’s try this with RandomSearchCV to hyper-tune our model.

```params_dist = {
'criterion': ['gini', 'entropy'],
'max_depth': randint(low=4, high=40),
'max_leaf_nodes': randint(low=1000, high=20000),
'min_samples_leaf': randint(low=20, high=100),
'min_samples_split': randint(low=40, high=200)
}
clf_tuned = DecisionTreeClassifier(random_state=42)
random_search = RandomizedSearchCV(clf_tuned, params_dist, cv=7)
random_search.fit(X_train, y_train)
random_search.best_estimator_
```

Output:

```DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=17,
max_features=None, max_leaf_nodes=18918,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=36, min_samples_split=57,
min_weight_fraction_leaf=0.0, presort=False,
random_state=42, splitter='best')
```

The classifier is more parametrized, so let’s see the accuracy results on a validation set.

```best_tuned_clf = random_search.best_estimator_
print(metrics.accuracy_score(y_valid, best.predict(X_valid)))
```

Output:

```0.842615131174547
```

84%! In our case, it has definitely reduced overfitting, but the cost is hurting the model accuracy. The validation set was predicted with lower accuracy, which is not good. So for this dataset, if we’re talking about results, hyper-parameters aren’t important at all. However, the complexity of best_tuned_clf is lower than that cut_clf. Also, it is faster and less sensitive to outliers.

Now we can compare two classifiers on the test set (don’t forget to leave only important features). But instead of a simple accuracy comparing, we can get an entire classification_report.

```print(metrics.classification_report(y_test, cut_clf.predict(X_test[[col for col in importances[:15]]])))
print(metrics.classification_report(y_test, best_tuned_clf.predict(X_test[[col for col in importances[:15]]])))
```

Output: • Precision: How accurate predictions were made. Example: if a model predicts 100 objects for a class ‘1’, but only 85 of them really belong to it, then precision = 85%.
• Recall: A ratio of predicted ‘N’-class objects to the total number that exists. Example: model predicts 50 objects for a class ‘1’, but the entire test set has 100 objects for it. Therefore, recall is 50%.
• F1-score: Harmonic mean between precision and recall. These metrics will be high only in a case where both precision and recall are high. Example: for previous examples, F1-score is 0.629.

Here we can see how good our model is in working with each class and estimate the productivity of the entire model. In the returned reports we can see that non-hypertuned classifier handles even minority classes with relatively high percentages, especially if we compare it with the hypertuned classifier. But remember, the model should generalize all of the data, not only split training. So let’s choose, for example, a model with higher metrics and mark it as working.

### Honorable mention

Of course, optimizing isn’t limited by feature selection and hyper-parameters. Here are some recommendations of popular techniques and tactics for common decision tree issues:

• Over/under- sampling: One of the most popular ways to handle imbalance. You can try to create synthetic objects for a minority class or delete unnecessary samples of the majority class. In Python, it is implemented via imblearn.
• PCA: Reduction of dimensionality. Sometimes, a dataset has too many features and samples and if decomposition is intelligent, it may help with overfitting. Python implementation in sklearn is described here.
• Random forest: Classification/regression algorithm based on decision trees. It randomly chooses the most productive trees from the subsets and learns them. Sklearn implementation is the most wide-used.

## Various visualization options of decision trees

One of the biggest attractions of the decision trees is their open structure. The algorithm is a ‘white box’ type, i.e., you can get an entire tree. Sometimes, it is very useful to visualize the final decision tree classifier model. Let’s apply this!

Python supports various decision tree classifier visualization options, but only two of them are really popular. First is export_graphviz() in sklearn and the second is dtreeviz() in third-package.

Each option has pros and cons, so you should understand what exactly is important for a model. If you want to get additional plots and graphs in each node you should take a look at a dtreeviz, if you prefer a built-in sklearn method with the ability to return impurity data and scale any tree then use export_graphviz.

We will make some visualizations via the sklearn method, and save it in the .dot file, which is standard for the graphs.

```export_graphviz(
cut_clf,
out_file='forest.dot',
feature_names=list(X_train),
class_names=labels,
rounded=True,
filled=True
)
```

This code will create a forest.dot file with an entire tree inside. Be careful, in our case, it may take time to do this because don’t forget, we have close to more than 200K samples. After that, you can convert this file in some visual format.

We can convert it in Python code via pydot package. It will create a graph based on the .dot file. But in our case, with such a giant tree (7.5 Mb in a text file) it would be very long and too big. So we can use dot instrument for the command line. Install graphviz and run this line in your cmd:

```dot -Tsvg forest.dot -o forest.svg
```

Dot allows us to create other file formats, but .png, .pdf, .ps files are remarkably small and tend to scale the image for the screen resolution.

It may be a long time to wait, but the SVG image with the entire tree will finally be created. In our case, the size of the image is 77 Mb. You could try to look at it in a browser: You can spend a few hours exploring this graph. But right now it is clear how the cut_clf works.

## Conclusion

There’s a reason decision trees are as popular as they are. Like we mentioned throughout this article, although they perform poorly in their basic form, they are easy to understand when stacked (Random Forest, XGBoost) and can reach excellent results. When it’s so simple to create independent machine learning environments it’s important to look at the explainability of different algorithms. This is where decision trees excel. 