    When choosing which algorithm to deploy in production, the deciding factors go far beyond prediction accuracy. You should be asking yourself: Is the algorithm fast enough to handle the volumes of data we will encounter in production? How much memory does it consume? But probably the biggest trade-off against the accuracy of machine learning models is their explainability: can I explain the model and understand what it learned?

    In this blog we will walk you through the “spectrum of complexity” of machine learning models by exploring the trade-offs between simple, easy-to-interpret models and more complex ones, and by examining the explainability of each algorithm we encounter.

    Although simple, explainable models often fall short in finding complex patterns in the data, the ability to interpret and look inside those “explainable” models can help us:

    • “Debug” the model: make sure it learned actual patterns and that there is no “data leakage” or overfitting. Gaining trust in a model is hard, and inspecting what it learned helps
    • Derive insights
    • Improve the model: Identify weak spots and vulnerabilities in the model in order to improve it and gain additional ideas for high-impact features

    Let’s start by getting our hands dirty with real-life data so we can explore and compare how different models can impact a business decision.

    We’ll start by importing the relevant libraries:

    # Modeling libraries
    from sklearn.model_selection import cross_val_predict, cross_val_score
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import LinearSVC
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
    from sklearn.naive_bayes import GaussianNB
    from xgboost import XGBClassifier
    # Libraries for manipulating data
    import pandas as pd
    import numpy as np
    import warnings

    We’ll go with an e-commerce use case that requires us to build a model that predicts whether or not a website visitor will buy our product.

    We can use a propensity model like this to optimize the funnel and the user experience, for example by showing discounts to customers who are less likely to make a purchase.

    The data sources we will be dealing with are:

    1. Core tagged data: whether the customer is a paying customer or not
    2. CRM data: for this model we will only observe customers that have an existing CRM record with basic information. This information will be used to connect the data to external and third-party data
    3. Website analytics: clicks, scrolls, page views per user
    4. Catalog: connected to the website analytics to get more context on the products the customer is looking at

    The target column we want to predict is ‘paying customer (1/0)’.

    paying_users_df = pd.read_csv('~/blogs/propensity_modeling/conversiton.csv')
    crm_df = pd.read_csv('~/blogs/propensity_modeling/crm.csv')
    website_analytics = pd.read_csv('~/blogs/propensity_modeling/website_analytics.csv')
    catalog = pd.read_csv('~/blogs/propensity_modeling/catalog.csv')
    Out:
       user_id                 paying customer (1/0)
    0  id_6443772949357306687  1
    1  id_565093448583115634   0
    2  id_2864656352317940240  0
    3  id_5728513598666929907  0
    4  id_475202609164704146   1
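
    To make the relationships between the four sources concrete, here is a minimal sketch using tiny made-up frames. The `user_id`, `sku`, and `page_view_url` columns come from the data described above; the `email` and `category` columns and all values are hypothetical, and this is only an illustration of the join keys, not the pipeline used in this post.

```python
import pandas as pd

# Hypothetical miniature versions of the four sources (made-up values).
paying = pd.DataFrame({"user_id": ["a", "b"], "paying customer (1/0)": [1, 0]})
crm = pd.DataFrame({"user_id": ["a", "b"], "email": ["a@x.com", "b@y.com"]})
analytics = pd.DataFrame({
    "user_id": ["a", "a", "b"],
    "page_view_url": ["/p/1", "/p/2", "/p/1"],
    "sku": ["s1", "s2", "s1"],
})
catalog = pd.DataFrame({"sku": ["s1", "s2"], "category": ["shoes", "hats"]})

# Event-level view: each page view enriched with catalog context via `sku`.
events = analytics.merge(catalog, on="sku", how="left")

# User-level view: labels joined to CRM records via `user_id`
# (an inner join keeps only users that have a CRM record).
users = paying.merge(crm, on="user_id", how="inner")

print(events)
print(users)
```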

    To keep our focus on explainability, and in order not to spend most of our time on engineering features just to get something to build our models on, we will be using Explorium’s automated feature discovery capabilities (via the programmatic API) to automate the entire process.

    In short, Explorium will receive the raw, disconnected datasets, enrich them with many more external sources (e.g. professional and academic background, demographics, spending habits by zip code, date-related events, geographical attributes, etc.), automatically generate an enormous number of features (from internal as well as external data), and select the best subset of features for the task of predicting our target column.
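
    For intuition about what a “generated feature” is, here is a hand-rolled sketch of two aggregations over click-stream events using plain pandas. The data, the `timestamp` column (seconds since the visit started), and the output column names are assumptions for illustration; this is not how Explorium computes its features.

```python
import pandas as pd

# Hypothetical click-stream events; timestamps are seconds since the visit started.
events = pd.DataFrame({
    "user_id": ["a", "a", "a", "b", "b"],
    "page_view_url": ["/p/1", "/p/2", "/p/1", "/p/1", "/p/3"],
    "timestamp": [0.0, 4.0, 10.0, 0.0, 30.0],
})

# Seconds elapsed since the same user's previous event (NaN for the first event).
events["seconds_between_events"] = events.groupby("user_id")["timestamp"].diff()

# Collapse the events to one row of features per user.
user_features = events.groupby("user_id").agg(
    distinct_page_views=("page_view_url", "nunique"),
    avg_seconds_between_events=("seconds_between_events", "mean"),
)
print(user_features)
```

    An automated system would generate many such aggregations across all the connected datasets and then keep only the ones that help predict the target.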

    Here is how we use the SDK:

    from explorium_sdk.data_bundle import DataBundle
    from explorium_sdk.features import search_for_features
    data_bundle = DataBundle(
        core_dataset = paying_users_df,
        contexual_datasets = [crm_df, website_analytics, catalog],
        connections = {
            paying_users_df['user_id']: crm_df['user_id'],
            paying_users_df['user_id']: website_analytics['user_id'],
            website_analytics['sku']: catalog['sku']
        }
    )
    features = search_for_features(
        label_column='paying customer (1/0)',
        # ... (remaining arguments not shown)
    )

    Out (transposed: one row per feature, one column per sample):
    feature                                            0              1              2              3              4
    paying customer (1/0)                              1              0              0              0              1
    Spending Score (Apparel) By Zip Code               410.684851     244.235297     100.088075     2.388643       812.205378
    Person Occupation (by email).industry == “Retail”  False          False          False          False          True
    Count Distinct (page_view_url)                     15.0           2.0            2.0            6.0            5.0
    avg(timestamp - website_visit_start_time)          -17808.058824  -17805.000000  -17763.000000  -17774.000000  -17787.000000
    avg(seconds_between_events)                        61.73          95.44          84.59          61.55          85.00
    Popular Gender->F                                  True           True           False          False          True
    avg(Average Number of Reviews)                     21.750000      0.000000       19.000000      162.288000     18.333333
    Popular Gender->M                                  False          False          False          True           False
    Mean Family Income                                 0.0000         115900.6672    0.0000         0.0000         101454.3116
    max(Separated)                                     0.00000        0.01312        0.00000        0.00000        0.02281
    var(Separated)                                     0.00000        0.00000        0.00000        0.00000        0.000173
    min(Percentage of Homes With Some Type of Debt)    0.00000        0.66580        0.00000        0.00000        0.61426
    mean(Mean Monthly Owner Costs)                     0.000000       872.916850     0.000000       0.000000       715.876413
    sum(Divorced)                                      0.00000        0.05977        0.00000        0.00000        0.24723
    max(Percentage of Homes With Some Type of Debt)    0.00000        0.66580        0.00000        0.00000        0.73133
    max(Married)                                       0.00000        0.61735        0.00000        0.00000        0.67378
    var(Population)                                    0.000000e+00   0.000000e+00   0.000000e+00   0.000000e+00   1.359672e+06
    var(Female Population)                             0.000000       0.000000       0.000000       0.000000       311440.333333
    Popular Gender->empty                              0              0              1              0              0

    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score
    # Separate the target column from the features before splitting
    labels = features['paying customer (1/0)']
    features = features.drop('paying customer (1/0)', axis=1)
    X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.33, random_state=42)

    Let’s start with the first kind of model we all learned during our first dive into data science: a linear model (logistic regression here, since our target is binary). From an explainability point of view it’s perfect because it’s linear. From an accuracy point of view, it’s (usually) terrible because it’s linear. There are no interactions between features or complex patterns learned. So although it is very robust to overfitting, it is also at a higher risk of underfitting.

    Let’s train the model on the features we’ve extracted:

    classifier = LogisticRegressionCV().fit(X_train, y_train)
    predictions = classifier.predict_proba(X_test)[:,1]
    print(f"AUC score from linear model: {roc_auc_score(y_true=y_test, y_score=predictions)}")

    AUC score from linear model: 0.6042318566835239

    This is not a good fit. An AUC of 0.604 is only modestly better than the 0.5 that random guessing would score, so the model learned very little. Later on we will demonstrate how more complex models capture more of the patterns in the data (and score a higher AUC).
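
    To calibrate what an AUC in this range means, here is a small sketch on synthetic data (not the data from this post). It compares scores that are unrelated to the label with scores that carry only a weak signal; the sample size and the 0.3 signal strength are arbitrary choices.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=10_000)

# Scores unrelated to the label: the AUC should land near 0.5.
random_scores = rng.random(10_000)
auc_random = roc_auc_score(y_true, random_scores)

# Scores weakly related to the label: a small signal buried in noise.
weak_scores = 0.3 * y_true + rng.normal(size=10_000)
auc_weak = roc_auc_score(y_true, weak_scores)

print(f"random: {auc_random:.3f}, weak signal: {auc_weak:.3f}")
```

    The AUC is the probability that a randomly chosen positive example is ranked above a randomly chosen negative one, which is why pure noise hovers around 0.5.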

    But for now let’s interpret the model to extract insights and basic patterns. Because the model is so simple and the equation is basically a weighted sum of the features, we can plot the essence of what it learned pretty easily:

    %matplotlib inline
    weights = pd.Series(classifier.coef_[0], index=features.columns)
    weights = weights.reindex(weights.abs().sort_values(ascending=False).index)
    weights[:13].plot(kind='barh', color='lightseagreen')



    [Figure: horizontal bar chart of the 13 model weights with the largest absolute value]

    The correlations are pretty straightforward:

    Spending score (geo-based enrichment) is positively correlated with a higher percentage of converting customers, while the number of physical stores around the customer’s location is actually a negative factor (which kind of makes sense, given that the model is built on e-commerce data: the more stores you have in your neighborhood, the less you feel the need to buy on the internet). But linear models can be weak classifiers on problems where the patterns are more complicated and where interactions between features would help uncover predictive factors.
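
    The reason the weights are so easy to read is that a logistic regression’s prediction is just a weighted sum of the features passed through a sigmoid. The sketch below reproduces scikit-learn’s probabilities by hand on synthetic stand-in data (not the features from this post).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the real feature matrix.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
model = LogisticRegression().fit(X, y)

# Weighted sum of the features plus the intercept, squashed by a sigmoid.
z = X @ model.coef_[0] + model.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))

# Compare with the probability of the positive class reported by scikit-learn.
sklearn_proba = model.predict_proba(X)[:, 1]
matches = np.allclose(manual_proba, sklearn_proba)
print(matches)
```

    One caveat when comparing weights: their magnitudes depend on the scale of each feature, so features are usually standardized before the weights are ranked against each other.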

    Let’s move to a model that is a bit more complex in that respect: the decision tree.

    Decision trees help us model interactions between different features and use rules learned from the data by finding optimal splitting points.
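
    To show what “finding an optimal splitting point” means, here is a simplified sketch that scans one feature for the threshold with the lowest weighted Gini impurity. The helper names and the toy data are made up for illustration, and real implementations (including scikit-learn’s) are considerably more optimized.

```python
import numpy as np

def gini(labels: np.ndarray) -> float:
    """Gini impurity of a set of 0/1 labels."""
    if len(labels) == 0:
        return 0.0
    p = labels.mean()
    return 2.0 * p * (1.0 - p)

def best_split(values: np.ndarray, labels: np.ndarray):
    """Return (threshold, weighted impurity) of the best split on one feature."""
    best = (None, gini(labels))
    unique = np.unique(values)
    # Candidate thresholds: midpoints between consecutive distinct values.
    for threshold in (unique[:-1] + unique[1:]) / 2.0:
        left, right = labels[values <= threshold], labels[values > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if impurity < best[1]:
            best = (threshold, impurity)
    return best

# Made-up data: buyers tend to spend longer between events.
seconds = np.array([0.2, 0.5, 0.7, 1.1, 1.4, 2.0])
bought = np.array([0, 0, 0, 1, 1, 1])
threshold, impurity = best_split(seconds, bought)
print(threshold, impurity)
```

    A tree repeats this search over every feature at every node and keeps the split that separates the classes best, which is how it ends up with rules of the form “feature > threshold”.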

    Let’s start by training a simple decision tree:

    classifier = DecisionTreeClassifier(max_depth=10).fit(X_train, y_train)
    predictions = classifier.predict_proba(X_test)[:,1]
    print(f"AUC score for decision tree model: {roc_auc_score(y_true=y_test, y_score=predictions)}")

    AUC score for decision tree model: 0.6222223419617647

    An improvement of about 0.018 AUC (roughly 3% in relative terms) over the linear model isn’t a great one, but it is definitely a move in the right direction. Decision trees are wonderful for extracting rules that are a bit more complicated than a human would come up with on their own (or at least they save a lot of time).

    Let’s visualize the tree:

    from explorium.sdk.utils import visualize_decision_tree
    visualize_decision_tree(classifier, features)

    This is a pretty cool way to extract insights from the model: every node holds a condition. For example, the first condition checks whether the salary of the customer is larger than $5,980.00. If the condition is true for a specific sample, go left; if not, go right.

    The more blue the branch gets, the higher the probability that the visitor will purchase the product. The more red it gets, the smaller the chance the visitor will convert to a paying customer. As we can see, the decision tree classifier learned different things, potentially more complex than what the linear model learned. For example, customers with an estimated payroll larger than $5,980 who spend more than 0.954 seconds between activities on the website (clicks, scrolls, etc.) are much more likely to make a purchase.

    That rule is a very interesting one, as well as an actionable one. Maybe we should change the website specifically for those users? Maybe raise our prices or disable some promotions?
    But the cool thing is that those rules were inferred automatically by the model! No human had to tune rules and thresholds by hand until the patterns worked.
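
    If you want the learned rules as plain text, scikit-learn’s `export_text` prints a fitted tree as nested conditions. The sketch below uses synthetic stand-in data, and the feature names are hypothetical labels chosen to echo the example above.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data; the feature names are hypothetical.
X, y = make_classification(n_samples=300, n_features=3, n_informative=2,
                           n_redundant=0, random_state=0)
feature_names = ["salary", "seconds_between_events", "page_views"]

# A shallow tree keeps the printed rule set short enough to read.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each indented line is a condition; leaves report the predicted class.
rules = export_text(tree, feature_names=feature_names)
print(rules)
```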

    Now we will move on to a more complicated algorithm – RandomForest. It’s way more complicated than a simple decision tree, because, well, it contains multiple decision trees. RandomForest is an ensemble of decision trees made to fuse multiple “weak” learners (decision trees) into one strong ensemble model.

    from sklearn.ensemble import RandomForestClassifier
    classifier = RandomForestClassifier(max_depth=10, n_estimators=500).fit(X_train, y_train)
    predictions = classifier.predict_proba(X_test)[:,1]
    print(f"AUC score for random forest model: {roc_auc_score(y_true=y_test, y_score=predictions)}")

    AUC score for random forest model: 0.6784661821787067

    We see an improvement of about 0.056 AUC over the decision tree and about 0.074 over the linear model. In this case the more complex model is the better one, as it grasps more complex patterns in the data and so makes more accurate predictions. Keep in mind that models with too many parameters can overfit, but that is outside the scope of this blog.

    Although it’s way more accurate, this model is hard to analyze. It is a combination of 500 different decision trees, so uncovering the patterns it discovered in the data would require looking at each and every tree.
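
    The sketch below, on synthetic stand-in data with a smaller forest than the one above, illustrates both points: the forest’s prediction is the average over its individual trees, and the usual shortcut for a global view is the aggregated `feature_importances_` rather than reading every tree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0).fit(X, y)

# The forest's probability is the average of its individual trees' probabilities.
tree_probas = np.mean([t.predict_proba(X) for t in forest.estimators_], axis=0)
same = np.allclose(tree_probas, forest.predict_proba(X))
print(len(forest.estimators_), same)

# A global summary that is available without reading every tree.
importances = forest.feature_importances_
print(importances.round(3))
```

    Feature importances tell us which features the forest relies on, but not in which direction or in which combinations, which is exactly the information the single tree gave us for free.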

    Last but not least, let’s see what happens when we go even further in the spectrum of complexity.

    Let’s do a bit of “stacking.” We will train multiple machine learning models, combine their predictions, and feed them into one final model.

    n_rows = len(features)
    # Base models. Only the random forest's arguments are known; the remaining
    # constructors are assumptions, with class names taken from the printed output.
    classifiers = [
        RandomForestClassifier(max_depth=10, n_estimators=500, n_jobs=-1),
        XGBClassifier(),
        KNeighborsClassifier(),
        GaussianNB(),
        DecisionTreeClassifier(max_depth=10),
        LogisticRegressionCV(),
    ]
    predictions_as_features = []
    for cls in classifiers:
        # Out-of-fold probability of the positive class for every row
        cls_predictions = cross_val_predict(cls, features, labels, method='predict_proba')[:,1]
        print(f'{cls.__class__.__name__} AUC score == {roc_auc_score(y_true=labels, y_score=cls_predictions)}')
        predictions_as_features.append(cls_predictions.reshape(n_rows, -1))
    # Create features from the low-level classifiers
    predictions_as_features = np.concatenate(predictions_as_features, axis=1)
    # Train a random forest model on top
    score = np.mean(cross_val_score(RandomForestClassifier(max_depth=4), predictions_as_features, labels, scoring='roc_auc', cv=5))
    print(f'AUC score of combination of models {score}')

    RandomForestClassifier AUC score == 0.6811876656908529
    XGBClassifier AUC score == 0.6928055019470039
    KNeighborsClassifier AUC score == 0.5560842777695425
    GaussianNB AUC score == 0.5945711663764063
    DecisionTreeClassifier AUC score == 0.6219268004797422
    LogisticRegressionCV AUC score == 0.5747947765285704
    AUC score of combination of models 0.70120148588394

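    That is roughly 0.023 AUC (about 3.4% in relative terms) above the single random forest’s 0.678, and about 0.008 above the best individual model here, XGBoost!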

    Obviously, the model is now much more complex, harder to explain, and more difficult to derive insights from.
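
    As an aside, scikit-learn also ships a ready-made `StackingClassifier` that wires base models and a final model together. The sketch below is an alternative to the hand-rolled loop above, not the code that produced the numbers in this post; it uses synthetic stand-in data, a smaller set of base models, and a logistic regression as the final model, all of which are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("forest", RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)),
        ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
        ("nb", GaussianNB()),
    ],
    final_estimator=LogisticRegression(),
    stack_method="predict_proba",  # feed class probabilities to the final model
    cv=5,                          # out-of-fold predictions train the final model
)

# Evaluate the whole stack with an outer cross-validation.
scores = cross_val_score(stack, X, y, scoring="roc_auc", cv=3)
print(f"stacked AUC: {scores.mean():.3f}")
```

    Because the whole stack is a single estimator, the base models are refit inside every evaluation fold, which keeps the reported score honest at the cost of extra training time.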

    But that’s the message of this blog post: there’s a trade-off between complexity and explainability, or at least there used to be. In the next blog post we’ll show how model explainability tools (like LIME, SHAP, and more) are disrupting this balance by enabling explainability in complex models.