Table of Contents

    In the world of machine learning, developers can create independent environments for projects easily. It only takes a few clicks to set and fit models in order to achieve solid results. Yet, many algorithms can be quite difficult to understand, let alone explain. Decision trees, while performing poorly in their basic form, are easy to understand and when stacked (Random Forest, XGBoost) reach excellent results. 

    Let’s take a deeper dive into decision tree analysis. 

    The history of decision trees

    Here’s a quick look at decision tree history:

    • 1963: The Department of Statistics at the University of Wisconsin–Madison writes that the first decision tree regression was invented in 1963 (AID project, Morgan and Sonquist). It had an impurity measure (we’ll get to that soon) and recursively split data into two subsets.
    • 1966: The Institute of Computing Science in the Poznań University of Technology states that one of the first publications on the decision tree model was in 1966 (by Hunt). In psychology, the decision tree methods were used to model the human concept of learning. Exploring the human mind, researchers discovered the decision tree algorithm was useful for programming.
    • 1972: The first classification tree appeared in the THAID project (by Messenger and Mandell). It worked via splitting data to maximize the sum of cases in the modal category. The predicted class was a mode.
    • 1974: Statistics professors Leo Breiman and Charles Stone from Berkeley and Jerome Friedman and Richard Olshen from Stanford started developing the classification and regression tree (CART) algorithm. 
    • 1977: Breiman, Stone, Friedman, and Olshen invented the first CART version.
    • 1984: The official publication with a CART decision tree software. It was a revolution in the world of algorithms. Even today, CART is one of the most used methods for decision tree data analytics. Main upgrades include truncating unnecessary trees, tunneling, and choosing the best tree version. CART became a world-standard for decision tree analysis, and its development kept progressing.
    • 1986: John Ross Quinlan proposed a new concept: trees with multiple answers. Important note: CART and all other decision tree classification algorithms only have two answers for each question (called binary trees). Quinlan invented ID3 (Iterative Dichotomiser 3) using an impurity criterion called gain ratio. It could cover other questions and propose some resulting nodes. ID3 wasn’t ideal, so its author continued to upgrade the algorithm structure. And this is how C4.5 was born. C4.5 addressed the shortcomings of its predecessor, ranking #1 in the Top 10 Algorithms in Data Mining pre-eminent paper (Springer LNCS, 2008). By that time the algorithm had existed for 15 years.

    Feature Generation: The Next Frontier of Data Science

    How decision trees work

    Decision Tree Diagram - Decision Tree Example Infographic

    The picture above illustrates and explains decision trees by using exactly that, a decision tree diagram. 

    The idea is quite simple and resembles the human mind. If we tried to split data into parts, our first steps would be based on questions. Step by step, data would separate in unique pieces and, finally, we would split samples. This is the essence of the entire concept.

    To explain this concept better, we will use some popular terminology:

    • Node: Each object in a tree. Decision tree nodes contain subsets of data, and excluding leaf nodes, a question splits the subset.
    • Parent node: The question that makes a data split.
    • Child node: Resulting node. It also can be a parent for its children.
    • Leaf node: Final node with no further questions. Only a subset of the data representing answers to preceding questions.
    • Branch: Unique line of the questions with answers that flow to a leaf node.
    • Root: The top node. 

    Decision Trees Pros and Cons

    The main benefits of decision trees are:

    1. They require less data preparation during preprocessing (no need for one-hot encoding and dummy variables)
    2. They do not require feature scaling (standardization and normalization)
    3. They can automatically handle missing values
    4. They are intuitive and easy to understand and visualize; they look like simple if-else statements, and therefore can be easily interpreted by humans.
    5. The training process is faster compared to some other algorithms (like random forests)

    The main decision tree issues are:

    1. The biggest issue of decision trees in machine learning is overfitting, which can lead to wrong decisions. A decision tree will keep generating new nodes to fit the data. This makes it complex to interpret, and it loses its generalization capabilities. It performs well on the training data, but starts making mistakes on unseen data.
    2. Overfitting causes high variance in outputs, which causes errors in the final decisions and inaccuracy in results.
    3. The addition of new data points leads to the re-generation of the overall tree meaning that nodes need to be recalculated.
    4. They are affected by noise, even a little bit of noise can make a decision tree model unstable.
    5. They are not suitable for large datasets. A large dataset can cause the tree to grow too large and complex, which will lead to overfitting.

    Why Use Decision Trees?

    A decision tree is a great way to help decide between different courses of action; it can visually represent decisions and decision making. Based on the decision tree pros and cons outlined above, it is evident that one of the main benefits is that they are easy to understand and interpret by humans. In fact, a simple decision tree algorithm is said to be one of the easiest machine learning algorithms, since there isn’t much math involved. It is also one of the most widely used methods in machine learning.

    Decision trees provide a framework to quantify the values of outcomes and the probabilities of achieving them. They can be used for both classification and regression problems, and create data models that will predict class labels or values for a decision-making process. The models are built from the training dataset fed to the system (supervised learning). A decision tree visualization helps outline the decisions in a way that is easy to understand, making it a popular data mining technique.

    Why pruning is important in decision tree analysis

    Pruning improves the performance of decision tree machine learning by reducing its size and removing the parts of the tree that do not provide power to classify instances. The process of pruning involves removing the branches that make use of features with low importance. This reduces the complexity of the tree, reduces overfitting, and increases its predictive power. Out of all of the machine learning algorithms, decision trees are the most susceptible to overfitting. Pruning reduces that likelihood. Read about the various decision tree pruning methods here.

    Regression and classification

    It’s possible to use decision trees with regression or classification. The techniques are slightly different, so we will review both starting with classification algorithms. Typically, for decision tree classification, the model should be learned on training data with a predefined set of labels. It would predict a label (i.e., class) for new samples.  

    So, we have a dataset with different attributes (features). Each sample has its own combination of the value of the features. How does a decision tree make assumptions for data splitting?

    To answer this question we should first understand the goal and the result of it. The goal is to separate samples with a more similar structure and the result is purer subsets. Purity, in this case, responds to the quality of the splitting. There are a few typical criteria for this:

    • Gini impurity: a measurement that takes a sum of the ratios for each class instance to entire node instances. After that, it subtracts 1 from it. An ‘Ideal’ node would have a 0 impurity. 

    gini impurity

    • Informational entropy: based on the entropy concept. In short, the more evenly arranged things are in a certain space, the greater the entropy.

    Informational entropy

    The differences between Gini impurity and informational entropy are not huge. Gini impurity makes a purer separation, but entropy tends to make more balanced subsets. Informational entropy is more expensive from a computational standpoint. 

    So, basically, the algorithm looks between all features for the highest impurity splitter, where the parent node will have bigger impurity than it’s children. A split’s effectiveness is measured by its impurity and by the amount of data in each child node, aspiring to minimize the size division between the child nodes.

    Finally, it will build a tree that will have a bunch of questions with answers. For a prediction, the model just needs to guide through the tree for the concrete leaf. Each leaf has probabilities for each existing class, so the model will choose higher. That’s it. A classification tree is ready.

    Now let’s talk about a regression task. A regression is when a target value doesn’t have a predefined set of labels, but a numerical result. The main problem here is predicting new target values based on training data.

    The regression technique is not very different from classification with decision trees. The main distinction is the measurement for the splitting. Instead of impurity, regression trees use mean squared error (MSE). Questions gradually reduce MSE until it reaches a minimum. The node result is the average target value from all samples in it. MSE is calculated like this:

    MSE calculation

    After fitting, the model guides a new object to the correct branch and assigns the average target value. Simple and powerful. A regression tree is complete.

    Implementation options in modern libraries

    Python accounts for a large number of powerful ML libraries. Some of them are widespread and a must-know for everybody, others are rare and very specialized. Here, we will cover searching for useful decision tree implementations.

    Of course, you can write the code by yourself. Each famous algorithm has a detailed description, so with basic language knowledge, it shouldn’t be too difficult. But sometimes it can waste precious development time because, instead of the model set, you need to create an implementation of the learning algorithm.

    1. Scikit-learn: CART model. Basic and maybe the most popular implementation of decision trees is Python. It supports multiple optimization options and can be combined with many other sklearn models and pipelines.
    2. Imblearn: Special classifier based on the sklearn model but with larger attention to imbalanced datasets (several classes have a big difference in size).
    3. Spark ML library: Decision Tree model with some useful features like depth, top node accessibility, etc.
    4. Decision-tree-id3: Library with ID3 method for a Python. 
    5. Eli5: The connection between Eli5 and sklearn libraries with a DTs implementation.

    For this article, we will use scikit-learn implementation, because it is fully maintained, stable, and very popular.

    Application of decision trees for forest classification with dataset in Python

    Let’s put all of this talk into practice. All that you need is Python 3 on your PC, with previously installed libraries: scikit-learn, Pandas, SciPy, and Jupyter Notebook.

    There is no more logical data to learn via decision tree classifier, than … tree classifications. With that in mind we will use data from the forest cover type dataset on Kaggle. You can find the full description on the task page. Our goal is to make a model on this data. 

    This large dataset is taken from the observations in four areas of the Roosevelt National Forest in Colorado. It contains 54 features and more than 500K records. Each record is a 30-meter by 30-meter forest measurement. The dataset contains seven classes:

    1. Spruce/Fir
    2. Lodgepole Pine
    3. Ponderosa Pine
    4. Cottonwood/Willow
    5. Aspen
    6. Douglas-fir
    7. Krummholz

    It’s important to note that it’s not necessary to use Jupyter. However, we will do this for visibility. All code from this experiment works in simple Python files and any exceptions will be highlighted.

    Let’s start. Run Jupyter Notebook and create a new .ipynb file. In the code section paste the following:

    import pandas as pd
    from sklearn import tree
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn import metrics
    from sklearn.tree import export_graphviz
    from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
    from scipy.stats import randint

    Here we have imported all the required methods and libraries for all further processes. Don’t worry, if something is new to you, it will be clarified as we continue on.

    Now it’s time to take a look at our dataset. Download it from the task page and save it in the project folder.

    data = pd.read_csv('covtype.csv')


    decision tree outcome

    Wow! 55 columns, 54 of them are features. What are they?


    class 'pandas.core.frame.DataFrame'>
    RangeIndex: 581012 entries, 0 to 581011
    Data columns (total 55 columns):
    Elevation                             581012 non-null int64
    Aspect                                581012 non-null int64
    Slope                                 581012 non-null int64
    Horizontal_Distance_To_Hydrology      581012 non-null int64
    Vertical_Distance_To_Hydrology        581012 non-null int64
    Horizontal_Distance_To_Roadways       581012 non-null int64
    Hillshade_9am                         581012 non-null int64
    Hillshade_Noon                        581012 non-null int64
    Hillshade_3pm                         581012 non-null int64
    Horizontal_Distance_To_Fire_Points    581012 non-null int64
    Wilderness_Area1                      581012 non-null int64
    Wilderness_Area2                      581012 non-null int64
    Wilderness_Area3                      581012 non-null int64
    Wilderness_Area4                      581012 non-null int64
    Soil_Type1                            581012 non-null int64
    Soil_Type2                            581012 non-null int64
    Soil_Type3                            581012 non-null int64
    Soil_Type4                            581012 non-null int64
    Soil_Type5                            581012 non-null int64
    Soil_Type40                           581012 non-null int64
    Cover_Type                            581012 non-null int64
    dtypes: int64(55)
    memory usage: 243.8 MB

    10 features are numerical and clearly explained. The dataset also has 4 ‘Wilderness_AreaN’ and 40 ‘Soil_TypeN’ fields which are boolean, meaning only the areas can be “true” (or 1) for a single observation, all others would be “false” (or 0). 

    Now we need to follow the golden rule in ML: create a test sample.

    X = data.drop('Cover_Type', axis=1)
    y = data['Cover_Type']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

    All columns except the last one is our “X” data. The last column is considered as “y” data. “X” data are features and “y” data are labels for each record. Now we will use the train_test_split method. “X_test” and “y_test” objects will be left until the end of the experiment. This is important because, in the end, we want to have a system that is good at generalizing and predicting new data. 

    Now let’s work on class proportions.

    labels = ['Spruce/Fir', 'Lodgepole Pine', 'Ponderosa Pine',
         	'Cottonwood/Willow', 'Aspen', 'Douglas-fir', 'Krummholz']
    y_train.index = [labels[i-1] for i in y_train]


    Lodgepole Pine       198162
    Spruce/Fir          	148488
    Ponderosa Pine        25000
    Krummholz             14294
    Douglas-fir           12256
    Aspen                  6618
    Cottonwood/Willow      1890
    dtype: int64

    Note that classes are highly imbalanced, which may cause a problem because decision trees tend to get bigger classes clearer than smaller. As we have so much data, we can use test samples and validation. This is useful for improving model accuracy before the final test. We will explore the same train_test_split method:

    X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.2)

    So, we have 50% of all data in a training set, 30% is testing, 20% is validating. Now it’s time to create our first decision tree classifier!

    clf = DecisionTreeClassifier()
    cross_val_score(clf, X_train, y_train, cv=7)


    array([0.92240341, 0.92175634, 0.92169012, 0.92046212, 0.92304051,
           0.92256718, 0.92037093])

    For creating a tree object, we use DecisionTreeClassifier. Instead of direct learning, we adopt the cross-validation technique. This technique splits the entire training set N times with each iteration responding to one learning process. The main reason for this? We can test our data on training samples and get a general learning result. As you’ll see, the accuracy is really good with each fold reaching 92%. For automating this stage, sklearn has a cross_val_score method.

    Okay, the model is good on a training set. But how easy will it handle validation?, y_train)
    print(metrics.accuracy_score(y_valid, clf.predict(X_valid)))



    Wow, the same results on a validation set! 

    Optimizing models to improve predictions

    Unfortunately, the results are often worse than in our case. Decision trees tend to overfit the training data, which means absolutely remarkable results in the learning process but very low results in testing. Decision tree designed without limitations on depth or impurity in a split will create a very complex tree, with a leaf for each observation. Again, this means perfect results on the training set but a classic case of overfitting that will fail on unseen data. To avoid this, here are some best practices.

    Feature selection

    Sometimes features are not only un-useful but also create false importance. Feature selection could fix this problem. DecisionTreeClassifier object has a really helpful attribute, called feature_importances_, that returns the importance of each feature. Let’s see those attributes in our classifier.



    array([3.37518529e-01, 2.84656713e-02, 1.76777678e-02, 6.27777911e-02,
           4.48479825e-02, 1.49326271e-01, 3.13041542e-02, 3.37686965e-02,
           2.29711764e-02, 1.39970429e-01, 8.35700869e-03, 4.33672974e-03,
           1.47555544e-02, 1.28823512e-03, 1.18746252e-04, 9.60672295e-03,
           1.68313126e-03, 1.19379073e-02, 5.20212884e-04, 8.42807227e-04,
           0.00000000e+00, 0.00000000e+00, 9.95456477e-05, 3.50105965e-03,
           8.58318195e-04, 8.22011973e-04, 2.56493210e-03, 1.48607565e-04,
           2.50799190e-05, 6.66360906e-04, 1.86613628e-03, 0.00000000e+00,
           7.94106697e-04, 3.17229491e-03, 4.68208634e-04, 7.45929560e-03,
           9.06540846e-03, 5.36007876e-03, 6.21906782e-05, 7.14446581e-05,
           6.36737159e-04, 4.53213546e-05, 7.24989689e-03, 2.83808978e-03,
           4.43184249e-03, 1.23541381e-02, 4.28257875e-03, 4.17733218e-04,
           8.51854676e-04, 2.89520728e-05, 2.09356356e-04, 2.42011853e-03,
           3.76673702e-03, 1.38603866e-03])

    Wait a minute, it is difficult to understand! Let’s combine these values with feature names and sort in descending order.

    def sortSecond(val):
    	return val[1]
    values = clf.feature_importances_
    features = list(X)
    importances = [(features[i], values[i]) for i in range(len(features))]
    importances.sort(reverse=True, key=sortSecond)


    [('Elevation', 0.3375185285673537),
     ('Horizontal_Distance_To_Roadways', 0.14932627135520854),
     ('Horizontal_Distance_To_Fire_Points', 0.13997042859131698),
     ('Horizontal_Distance_To_Hydrology', 0.06277779109891506),
     ('Vertical_Distance_To_Hydrology', 0.044847982524894914),
     ('Hillshade_Noon', 0.03376869647871325),
     ('Hillshade_9am', 0.03130415418918785),
     ('Aspect', 0.028465671348228104),
     ('Hillshade_3pm', 0.022971176447821838),
     ('Slope', 0.01767776783558953),
     ('Wilderness_Area3', 0.014755554433198528),
     ('Soil_Type32', 0.012354138099619834),
     ('Soil_Type4', 0.011937907301667192),
     ('Soil_Type2', 0.009606722953780527),
     ('Soil_Type23', 0.009065408464270798),
     ('Wilderness_Area1', 0.00835700868960834),
     ('Soil_Type22', 0.007459295603405485),
     ('Soil_Type29', 0.0072498968949893825),
     ('Soil_Type24', 0.0053600787578572855),
     ('Soil_Type31', 0.004431842489166107),
     ('Wilderness_Area2', 0.004336729739051701),
     ('Soil_Type33', 0.004282578752504762),
     ('Soil_Type39', 0.0037667370218322925),
     ('Soil_Type10', 0.003501059648836054),
     ('Soil_Type20', 0.003172294913549069),
     ('Soil_Type30', 0.002838089780501163),
     ('Soil_Type13', 0.0025649320984729483),
     ('Soil_Type38', 0.0024201185325595865),
     ('Soil_Type17', 0.0018661362800872908),
     ('Soil_Type3', 0.0016831312614147462),
     ('Soil_Type40', 0.0013860386558133934),
     ('Wilderness_Area4', 0.001288235117597936),
     ('Soil_Type11', 0.0008583181952767269),
     ('Soil_Type35', 0.0008518546762626055),
     ('Soil_Type6', 0.0008428072272247173),
     ('Soil_Type12', 0.0008220119728596085),
     ('Soil_Type19', 0.0007941066969573623),
     ('Soil_Type16', 0.0006663609063786886),
     ('Soil_Type27', 0.0006367371586997001),
     ('Soil_Type5', 0.0005202128836162181),
     ('Soil_Type21', 0.00046820863373015387),
     ('Soil_Type34', 0.0004177332179436113),
     ('Soil_Type37', 0.0002093563560947094),
     ('Soil_Type14', 0.0001486075654973332),
     ('Soil_Type1', 0.0001187462519933239),
     ('Soil_Type9', 9.954564766801897e-05),
     ('Soil_Type26', 7.144465812347822e-05),
     ('Soil_Type25', 6.21906782106648e-05),
     ('Soil_Type28', 4.532135460523101e-05),
     ('Soil_Type36', 2.8952072832505438e-05),
     ('Soil_Type15', 2.5079919011072498e-05),
     ('Soil_Type7', 0.0),
     ('Soil_Type8', 0.0),
     ('Soil_Type18', 0.0)]

    Almost all soil type features are not important and three of them are unnecessary. Let’s compare X_train with all features and only the top 15 important features by size.

    print ('All features:', X_train.memory_usage(index=True).sum()/1000000)
    print ('Top 15 features:', X_train[[col[0] for col in importances[:15]]].memory_usage(index=True).sum()/1000000)


    All features: 143.16104
    Top 15 features: 41.646848

    Size of memory usage is in Mb. More than 3.5 times! But let’s compare accuracy.

    X_train = X_train[[col[0] for col in importances[:15]]]
    X_valid = X_valid[[col[0] for col in importances[:15]]]
    cut_clf = DecisionTreeClassifier(), y_train)
    print(metrics.accuracy_score(y_valid, cut_clf.predict(X_valid)))



    Almost the same results with 15 against 54 features. But computational complexity is smaller and time/memory consumption is greatly smaller. In large projects, this may be very important.


    DecisionTreeClassifier has some decision tree classifier parameters that could reduce overfitting. Let’s take a look at them. Just create an empty decision tree classifier and print it.

    DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, presort=False,
                           random_state=None, splitter='best')

    We are interested in the attributes with ‘min_’ or ‘max_’ prefix. 

    • max_depth: The amount of the layers. The more this value is, the deeper the tree.
    • max_features: The maximum count of the features from the original dataset.
    • max_leaf_nodes: Restricting leaf nodes count.
    • min_impurity_decrease: The minimum decreasing of the child nodes impurity.
    • min_impurity_split: The minimum impurity condition for data splitting.
    • min_samples_leaf: The number of samples in the single leaf (no ‘1 object leafs’).
    • min_samples_split: A minimum of samples in the node for making a question.
    • min_weight_fraction: The same as min_samples_leaf but expressed as a fraction of the total number of weighted instances.

    Increasing “min_” and decreasing “max_” parameters could reduce overfitting. But be careful, because the entire model’s learning ability would be lower. Let’s try this with RandomSearchCV to hyper-tune our model.

    params_dist = {
    	'criterion': ['gini', 'entropy'],
    	'max_depth': randint(low=4, high=40),
    	'max_leaf_nodes': randint(low=1000, high=20000),
    	'min_samples_leaf': randint(low=20, high=100),
    	'min_samples_split': randint(low=40, high=200)
    clf_tuned = DecisionTreeClassifier(random_state=42)
    random_search = RandomizedSearchCV(clf_tuned, params_dist, cv=7), y_train)


    DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=17,
                           max_features=None, max_leaf_nodes=18918,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=36, min_samples_split=57,
                           min_weight_fraction_leaf=0.0, presort=False,
                           random_state=42, splitter='best')

    The classifier is more parametrized, so let’s see the accuracy results on a validation set.

    best_tuned_clf = random_search.best_estimator_
    print(metrics.accuracy_score(y_valid, best.predict(X_valid)))



    84%! In our case, it has definitely reduced overfitting, but the cost is hurting the model accuracy. The validation set was predicted with lower accuracy, which is not good. So for this dataset, if we’re talking about results, hyper-parameters aren’t important at all. However, the complexity of best_tuned_clf is lower than that cut_clf. Also, it is faster and less sensitive to outliers.

    Now we can compare two classifiers on the test set (don’t forget to leave only important features). But instead of a simple accuracy comparing, we can get an entire classification_report.

    print(metrics.classification_report(y_test, cut_clf.predict(X_test[[col[0] for col in importances[:15]]])))
    print(metrics.classification_report(y_test, best_tuned_clf.predict(X_test[[col[0] for col in importances[:15]]])))


    classification report

    The classification report returns more advanced metrics about each separate class. 

    • Precision: How accurate predictions were made. Example: if a model predicts 100 objects for a class ‘1’, but only 85 of them really belong to it, then precision = 85%.
    • Recall: A ratio of predicted ‘N’-class objects to the total number that exists. Example: model predicts 50 objects for a class ‘1’, but the entire test set has 100 objects for it. Therefore, recall is 50%.
    • F1-score: Harmonic mean between precision and recall. These metrics will be high only in a case where both precision and recall are high. Example: for previous examples, F1-score is 0.629.

    Here we can see how good our model is in working with each class and estimate the productivity of the entire model. In the returned reports we can see that non-hypertuned classifier handles even minority classes with relatively high percentages, especially if we compare it with the hypertuned classifier. But remember, the model should generalize all of the data, not only split training. So let’s choose, for example, a decision tree model with higher metrics and mark it as working.

    Honorable mention

    Of course, optimizing isn’t limited by feature selection and hyper-parameters. Here are some recommendations of popular techniques and tactics for common decision tree issues:

    • Over/under- sampling: One of the most popular ways to handle imbalance. You can try to create synthetic objects for a minority class or delete unnecessary samples of the majority class. In Python, it is implemented via imblearn.
    • PCA: Reduction of dimensionality. Sometimes, a dataset has too many features and samples and if decomposition is intelligent, it may help with overfitting. Optimal decision tree Python implementation in sklearn is described here.
    • Random forest: Classification/regression algorithm based on decision trees. It randomly chooses the most productive trees from the subsets and learns them. Sklearn implementation is the most wide-used.

    Various decision tree visualization options

    One of the biggest attractions of the decision trees is their open structure. The algorithm is a ‘white box’ type, i.e., you can get an entire tree. Sometimes, it is very useful to visualize the final decision tree classifier model. Let’s apply this!

    Python supports various decision tree classifier visualization options, but only two of them are really popular. First is export_graphviz() in sklearn and the second is dtreeviz() in third-package.

    Each option has pros and cons, so you should understand what exactly is important for a model. If you want to get additional plots and graphs in each node you should take a look at a dtreeviz, if you prefer a built-in sklearn method with the ability to return impurity data and scale any tree then use export_graphviz.

    We will make some visualizations via the sklearn method, and save it in the .dot file, which is standard for the graphs.


    This code will create a file with an entire tree inside. Be careful, in our case, it may take time to do this because don’t forget, we have close to more than 200K samples. After that, you can convert this file in some visual format.

    We can convert it in Python code via pydot package. It will create a graph based on the .dot file. But in our case, with such a giant tree (7.5 Mb in a text file) it would be very long and too big. So we can use dot instruments for the command line. Install graphviz and run this line in your cmd:

    dot -Tsvg -o forest.svg

    Dot allows us to create other file formats, but .png, .pdf, .ps files are remarkably small and tend to scale the image for the screen resolution. 

    It may be a long time to wait, but the SVG image with the entire tree will finally be created. In our case, the size of the image is 77 Mb. You could try to look at it in a browser:

    decision tree

    You can spend a few hours exploring this decision tree infographic. But right now it is clear how the cut_clf works. 


    There’s a reason so many people build a decision tree for machine learning. Like we mentioned throughout this article, although they perform poorly in their basic form, they are easy to understand when stacked (Random Forest, XGBoost) and can reach excellent results. When it’s so simple to create independent machine learning environments it’s important to look at the explainability of different algorithms. This is where decision tree analysis excels.

    If you are looking to improve your predictive decision tree machine learning model accuracy with better data, try Explorium’s External Data Platform for free now! 

    About Explorium

    Explorium provides the first External Data Platform to improve data analytics and machine learning. Explorium enables the automation of data discovery to improve predictive ML model performance. Explorium External Data Management Platform empowers data scientists and analysts to acquire and integrate relevant external data signals efficiently, cost-effectively, and in compliance with regulations. With faster, better insights from their models, organizations across fintech, insurance, consumer goods, retail, and e-commerce can increase revenue, streamline operations and reduce risks. Learn more at

    Feature Generation: The Next Frontier of Data Science