Beginner’s Guide to Python Modeling Using XGBoost Package
Everyone knows that having a large number of loyal customers is the key to the success of a business. This statement is even more significant for banks. It is important for banks to retain their customers and prevent churn — the term used to describe when a customer ends their contract or business with an organization.
Can a bank know beforehand that a customer is going to leave? In the past, the answer to this question was based on a few standard rules. This rule-based decision making was error-prone and did not adapt well to changing circumstances. However, with the advent of Big Data and machine learning, it has become possible to predict customer churn with a high degree of accuracy.
Let’s break down how a machine learning algorithm can help predict customer churn.
But first, a little about what we used for this blog post:
The dataset used to train the machine learning algorithm is freely available from Kaggle. Various machine learning libraries have been developed in different programming languages that can be used for training machine learning algorithms. In this article, the XGBoost algorithm will be used to predict customer churn. The algorithm is part of the XGBoost package, which is freely available. The installation instructions for the package are available at this link. All the code in this article has been written and tested in a Python Jupyter Notebook.
Now without further ado, let’s jump into the code.
Before we can use any library in Python, we first need to import it. For data visualization, we will use the matplotlib and seaborn libraries. To read and manipulate the dataset, we will use the pandas library. The XGBoost library will be used to train the XGBClassifier. In addition, the sklearn library will be used for splitting the data and evaluating the performance of the trained machine learning model. The following script imports the required libraries.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, accuracy_score

sns.set_style("darkgrid")
%matplotlib inline
Importing the dataset
Once the required libraries have been imported, the next step is to import the dataset into the application. To do so, the read_csv() function of the pandas library can be used as shown below:
churn_data = pd.read_csv(r'E:/.../Customer_Data.csv')
Let’s see the number of rows and columns in the dataset using the shape attribute:
churn_data.shape
The output shows that the dataset contains 10,000 rows and 11 columns.
The first few rows of the dataset, along with the column names, can be printed via the head() function.
churn_data.head()
Let’s understand the data first. The first 10 of the 11 columns contain information about the customers, gathered six months before it was found whether or not the corresponding customers left the bank. After six months, if the customer left the bank a “1” was added in the corresponding row of the 11th column (i.e. “Exited”). If the customer stayed after six months, a “0” was entered in the “Exited” column.
Note: The name of the first column is “RowNumber”, but the name has been truncated in the output due to limited space.
To find statistical information about the data, the describe() method is used.
churn_data.describe()
You can see the count, mean, and standard deviation values for all the numerical columns. For instance, the average age of customers is 38.92.
Exploratory data analysis
The next step is to perform exploratory data analysis to see how individual features relate to customer churn. To have a better view, the default size of the plots can be increased with the following script (the 10 x 8 inch size is only an example value):
plt.rcParams["figure.figsize"] = (10, 8)
Let’s first see the number of customers who left the bank versus those who didn’t. We can use the count plot from the seaborn library for this purpose.
sns.countplot(x='Exited', data=churn_data, palette="Set2")
You can see that around 2,000 customers left the bank after six months.
Next, we will study the impact of the geography of the customer on customer churn. Let’s first plot the total number of customers per geographical location.
sns.countplot(x='Geography', data=churn_data, palette="Set2")
The customers belong to three different geographical locations: France, Spain, and Germany. While almost 50% of the total customers are French, the number of Spanish and German customers is around 25% each.
To find the impact of geography on customer churn, we can plot the count of values from the “Exited” column against a customer’s geography using the following script:
sns.countplot(x='Exited', hue='Geography', data=churn_data, palette="Set2")
The output shows that among the customers who left the bank, the number of French and German customers is about equal, even though the total number of German customers is almost half the total number of French customers. This suggests that German customers are more likely to leave the bank than French and Spanish customers.
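The comparison between raw counts and churn likelihood can also be checked numerically. The following is a minimal sketch on a made-up toy table (the rows are hypothetical and are not taken from the Kaggle dataset); it shows how equal exit counts can still mean different churn rates when group sizes differ.

```python
import pandas as pd

# Hypothetical toy data: France has twice as many customers as Germany or Spain.
toy = pd.DataFrame({
    "Geography": ["France"] * 4 + ["Germany"] * 2 + ["Spain"] * 2,
    "Exited":    [1, 0, 0, 0,      1, 0,           0, 0],
})

# Per-location customer count, number of exits, and churn rate.
# The mean of a 0/1 column is the fraction of ones.
summary = toy.groupby("Geography")["Exited"].agg(
    customers="count", exited="sum", churn_rate="mean"
)
print(summary)
```

In this toy table France and Germany each have one exit, but the German churn rate is twice the French one. The same groupby call could be applied to churn_data to put numbers on the plot.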
Next, the collective impact of age and geography on customer churn can be plotted with the box plot as shown below:
sns.boxplot(x='Exited', y= "Age", hue="Geography", data=churn_data, palette="Set2")
The above plot looks tricky but it is actually very simple. A box plot shows the quartile information for a column: each box spans the range between the first and third quartiles, with a line at the median. On the left side, you can see three boxes (one for each geographical location) that contain quartile information of age for the customers who didn’t leave the bank. In all three locations, most of the customers who stayed are roughly between 30 and 40 years old. Similarly, the three boxes on the right show that most of the customers who left the bank are roughly between 40 and 50 years old.
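The quartiles that a box plot draws can also be computed directly. Here is a small sketch with invented ages (hypothetical values, chosen only to illustrate the idea), using the pandas quantile() method per group.

```python
import pandas as pd

# Hypothetical ages for five customers who stayed (0) and five who left (1).
toy = pd.DataFrame({
    "Exited": [0] * 5 + [1] * 5,
    "Age":    [28, 32, 35, 38, 41,   40, 44, 47, 50, 55],
})

# First quartile, median, and third quartile of age for each group.
# unstack() turns the quantile level into columns for easier reading.
age_quartiles = (
    toy.groupby("Exited")["Age"]
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
print(age_quartiles)
```

Each row of the result corresponds to one box: the 0.25 and 0.75 columns are the box edges and the 0.5 column is the median line.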
In the same way, the relationship between age, gender, and customer churn can be plotted using the box plot as shown below:
sns.boxplot(x='Exited', y= "Age", hue="Gender", data=churn_data, palette="Set2")
Here again, irrespective of gender, customers in the 30-40 age group are likely to stay, while those in the 40-50 age group are likely to leave the bank.
Let’s now study the impact of balance on customer churn. We can do so via the bar plot.
sns.barplot(x='Exited', y= "Balance", data=churn_data, palette="Set2")
The output shows that the average balance of the customers who left the bank is slightly greater than the average balance of those who stayed.
In the same way, the relation between estimated salary and the customer churn can be plotted:
sns.barplot(x='Exited', y= "EstimatedSalary", data=churn_data, palette="Set2")
It seems that the average salary of the customers who left the bank and those who didn’t is almost the same.
Finally, let’s plot the relationship between the activeness of a customer and customer churn. A customer’s activeness is determined based on the number of transactions in a certain period, the number of logins on the web portal of a bank, etc. The “IsActiveMember” column contains information regarding a customer’s activeness.
sns.countplot(x='IsActiveMember', hue= "Exited", data=churn_data, palette="Set2")
You can see that the ratio of customers who exited the bank (orange bar) is greater for inactive customers as compared to the active ones.
We saw the impact of some of the features on customer churn. Let’s now train our machine learning model on the dataset we have.
Before we can train our algorithm, we need to select the features that we want to use for training. The “RowNumber”, “CustomerId”, and “Surname” columns are identifiers that carry no information about customer behavior and have no impact on customer churn. Nobody leaves or stays with a bank because they have a certain customer id or a specific surname. Let’s remove these columns from the dataset.
churn_data = churn_data.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)
Machine learning algorithms expect data in the form of features and labels. The features are used to train the model against the ground truth labels. Our feature set consists of all the columns except the “Exited” column, which contains the labels or the outputs. The following script divides the data into feature and label sets.
features = churn_data.drop(['Exited'], axis=1)
labels = churn_data['Exited']
Next, we need to convert the columns that contain categorical information such as “Geography” and “Gender” into numeric ones since machine learning algorithms work with numbers. They are not, for example, capable of adding or subtracting “France” to a value in the balance column. One-Hot encoding is a technique used to convert categorical columns to numeric ones. In One-Hot encoding, a numeric column containing “1” or “0” is generated for each unique value in the original categorical column.
In our dataset, the “Geography” and “Gender” columns have categorical features. The following script performs One-Hot encoding for these categorical columns and replaces the features with their numeric counterparts.
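# Keep the numeric columns as they are
temp_data = features.drop(['Geography', 'Gender'], axis=1)
# One-Hot encode each categorical column; iloc[:, 1:] drops the first dummy column
Geography = pd.get_dummies(features.Geography).iloc[:, 1:]
Gender = pd.get_dummies(features.Gender).iloc[:, 1:]
# Join the numeric columns with the encoded ones
final_feature_set = pd.concat([temp_data, Geography, Gender], axis=1)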
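The same encoding can be written more compactly. The sketch below is an alternative, not the approach used in this article: it relies on the columns and drop_first parameters of pd.get_dummies() and runs on a made-up three-row table (hypothetical values). Dropping the first dummy column of each feature loses no information, because a row with all zeros can only mean the dropped category.

```python
import pandas as pd

# Hypothetical toy feature table with two categorical columns.
toy_features = pd.DataFrame({
    "Age":       [42, 41, 39],
    "Geography": ["France", "Spain", "Germany"],
    "Gender":    ["Female", "Male", "Female"],
})

# Encode both categorical columns in one call.
# drop_first=True removes the first category (alphabetically) of each column,
# which mirrors the iloc[:, 1:] slicing; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(
    toy_features,
    columns=["Geography", "Gender"],
    drop_first=True,
    dtype=int,
)
print(encoded)
```

Note that the generated column names carry a prefix here (for example “Geography_Germany”), which differs from the unprefixed names produced by the script above.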
As the final pre-processing step, we need to divide the dataset into training and test sets. The training set is used to train the machine learning algorithm while the test set is used to evaluate the performance of the trained algorithm.
X_train, X_test, y_train, y_test = train_test_split(
    final_feature_set, labels, test_size=0.25, random_state=42)
The dataset has been pre-processed and is ready for training the algorithm.
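Because far fewer customers leave than stay, a purely random split can give the training and test sets slightly different churn ratios. As an optional variation (not used in this article), train_test_split() accepts a stratify argument that preserves the class ratio in both sets. A minimal sketch on hypothetical toy data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 customers, 20 of whom left (label 1).
toy_X = pd.DataFrame({"Age": range(100)})
toy_y = pd.Series([1] * 20 + [0] * 80, name="Exited")

# stratify=toy_y keeps the 20% churn ratio in both the training and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=42, stratify=toy_y
)

print(len(X_tr), len(X_te))          # sizes of the two sets
print(y_tr.mean(), y_te.mean())      # churn ratio in each set
```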
Algorithm training and testing
As mentioned earlier, we will be using the XGBoost algorithm for training our machine learning model. The following script creates an object of the XGBClassifier class (i.e. “model”) and trains it using the training set.
model = XGBClassifier(learning_rate=0.1, n_estimators=100, random_state=42)
model.fit(X_train, y_train)
Yes, it takes two lines to train the model. The training features and labels are passed to the fit() method of the model object.
To make predictions on a new dataset, the predict() method is used as shown below:
y_pred = model.predict(X_test)
The “y_pred” variable contains the predictions.
To evaluate the performance of a classification algorithm, accuracy and F1 measure are two of the most commonly used metrics. The sklearn library contains built-in functions for these metrics. The following script finds the accuracy and F1 measure values for our algorithm.
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
Here is the output:
              precision    recall  f1-score   support

           0       0.88      0.97      0.92      2003
           1       0.77      0.47      0.59       497

   micro avg       0.87      0.87      0.87      2500
   macro avg       0.83      0.72      0.75      2500
weighted avg       0.86      0.87      0.85      2500

0.8676
The output shows an accuracy of 86.76%. This means that our model assigns the correct label (stayed or left) to 86.76% of the 2,500 customers in the test set. Keep in mind that the recall for class “1” is 0.47, so the model catches fewer than half of the customers who actually left. We achieve this accuracy without fine-tuning our algorithm. Not bad for a first attempt, is it?
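To put the accuracy figure in context, it helps to compare it with a trivial baseline and to see how accuracy and recall can diverge on imbalanced data. In the sketch below, the support counts (2003 and 497) come from the report above, while the ten toy labels are hypothetical and exist only to show the arithmetic.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Support counts from the classification report: customers who stayed / left.
stayed, left = 2003, 497

# Accuracy of a "model" that always predicts 0 (nobody leaves).
majority_baseline = stayed / (stayed + left)
print(round(majority_baseline, 4))  # 0.8012

# Hypothetical toy labels: 8 customers stayed, 2 left, and one leaver is missed.
y_true = [0] * 8 + [1] * 2
y_hat  = [0] * 8 + [1, 0]

acc = accuracy_score(y_true, y_hat)     # 9 of 10 correct
prec = precision_score(y_true, y_hat)   # every predicted leaver really left
rec = recall_score(y_true, y_hat)       # only 1 of the 2 leavers was found
print(acc, prec, rec)
```

The baseline shows that always predicting “stayed” would already score about 80% on this test set, which is why the recall for class “1” is worth watching alongside accuracy.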
Finding the best features
As the last step, we will find the best indicators of customer churn according to our trained model. To do so, we can use the feature_importances_ attribute of our model and plot the relative importance of the top N features. Let’s find the five most important features.
# Pair each importance score with its column name, then plot the top five
feat_importances = pd.Series(model.feature_importances_, index=final_feature_set.columns)
feat_importances.nlargest(5).plot(kind='barh')
The output shows that according to our algorithm, age is the most important feature to predict customer churn followed by the number of products and the balance of the customer.
Retaining customers is vital for the survival of a business. It is important for organizations to know which customers are going to leave in the near future so that they can take measures to prevent them from leaving. Given relevant data, machine learning algorithms can help make the process of understanding customer churn easier.