
Understanding and Handling Data and Concept Drift

February 24, 2020 · Juan De Dios Santos · Data Science

I know you’ve heard this a million times, but I’ll say it one more time: nothing lasts forever. Youth is not eternal, your phone gets slower, and machine learning models deteriorate over time. As the second law of thermodynamics says, over time, things tend towards disorder. In the world of machine learning, this translates into a model’s predictive power worsening over time.

If your model were running in a static environment, using static data — that is, data whose statistical properties do not change — it should not lose any of its performance, because the data it predicts on comes from the same distribution as the data used for training. However, chances are that your model exists in a dynamic and continually changing environment, involving many variables, including some we can’t control. In that case, the performance of the model will change too.


Over time, a machine learning model starts to lose its predictive power, a phenomenon known as model drift. Drift creeps into our models as time goes by, and if we don’t detect it in time, it can have detrimental effects on our pipelines or services. In this article, we’ll take an in-depth look at model drift and explore two of its most significant causes: concept drift and data drift. Moreover, we’ll shed some light on why they happen, their implications, how we can detect them, and, ultimately, how to overcome their effects.

Concept drift

According to the literature, concept drift is the phenomenon where the statistical properties of the class variable — in other words, the target we want to predict — change over time. When a model is trained, it learns a function that maps the independent variables, or predictors, to the target variable. In a static and perfect environment where neither the predictors nor the target evolves, the model should perform as it did on day one because nothing has changed.

However, in a dynamic setting, not only do the statistical properties of the target variable change but so does its meaning. When this change happens, the mapping the model learned is no longer suitable for the new environment. Let’s illustrate with an example.

Suppose that you work for a social app and are responsible for maintaining an anti-spam service. On this platform, you have a model that uses several features to predict whether a user is a spammer. It is very accurate and keeps the platform free of pesky spammers. At the end of the quarter, you sit down with your team leader and notice that, over time, the outcome of the predictions has drastically changed. In the best-case scenario, the change is because the spammers gave up. In the worst case, it is because the concept of a spammer has evolved.

Back when you trained the model, you had a particular idea of what a spammer was and which features were significant. For example, at some point during the lifetime of the app, you may have decided that a user who sends ten messages in one minute is a spammer, so you trained a model using that feature. Then, as the app grew and became more popular, you realized that people were chatting and messaging more.

So, the original idea you and the model had about what it means to be a spammer has changed, and sending ten messages in a minute is now normal, not something that only spammers do. In other words, the concept of a spammer has drifted. Consequently, since you haven’t updated your model, it will, unfortunately, predict these non-spammers as spammers (false positives).

Data drift

While concept drift is about the target variable, there’s another phenomenon, named data drift, that describes a change in the properties of the independent variables. In this case, it is not the definition of a spammer that changes, but the values of the features we use to define one.

For instance, suppose that as a result of the previous concept drift (and the app’s popularity), there’s an update that increases the limit of messages per minute from 30 to 50. Because of this change, both spammers and non-spammers get very chatty and send a higher number of messages. When we trained the model on data from the previous app version, it learned that a user who sends more than 10 messages in a minute is a spammer. Now it will go haywire and classify everybody as a spammer (a nightmare scenario) because the feature’s distribution has changed, or drifted.
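
To make this concrete, here is a minimal sketch of the scenario. The threshold rule, the Poisson rates, and all numbers are hypothetical; the point is simply how a fixed decision rule breaks once the feature’s distribution shifts:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical rule learned from the old app version:
# "more than 10 messages per minute => spammer"
def predict_spammer(messages_per_minute):
    return messages_per_minute > 10

# Non-spammers before the update: most send well under 10 messages/min
old_users = rng.poisson(lam=4, size=10_000)
# Non-spammers after the update: everyone chats more
new_users = rng.poisson(lam=15, size=10_000)

# False-positive rate = fraction of genuine users flagged as spammers
fp_before = predict_spammer(old_users).mean()
fp_after = predict_spammer(new_users).mean()

print(f"False-positive rate before the update: {fp_before:.1%}")
print(f"False-positive rate after the update:  {fp_after:.1%}")
```

The rule itself never changed; only the distribution of the feature did, which is exactly what makes data drift easy to miss if you only monitor the model’s code.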

Data drift can also be caused by unexpected events we can’t control. For example, suppose that your model is doing so well and catching so many spammers that, at some point, they change their spamming behavior to try to fool the model. In this case, instead of false positives, we’ll see more false negatives because the model doesn’t know the new conditions.

How can we detect these drifts?

Since both drifts involve a statistical change in the data, the best approach to detect them is to monitor the data’s statistical properties, the model’s predictions, and their correlation with other factors.

For example, you could deploy dashboards that plot the statistical properties to see how they change over time. Going back to our “messages per minute” and app update example, a plot of the average messages sent by spammers and non-spammers, before and after the update, could look like this:

[Figure: average messages per minute for spammers and non-spammers, before and after the update]
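
Such a dashboard boils down to computing per-group averages over time. A minimal sketch of that statistic, on entirely made-up samples (the Poisson rates are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical samples of messages per minute for each group and period
groups = {
    "non-spammers (before)": rng.poisson(4, 1000),
    "spammers (before)":     rng.poisson(20, 1000),
    "non-spammers (after)":  rng.poisson(12, 1000),
    "spammers (after)":      rng.poisson(35, 1000),
}

# A dashboard would plot these averages over time; here we just print them
for name, sample in groups.items():
    print(f"{name}: mean = {sample.mean():.1f} messages/min")
```

The jump in the non-spammers’ average after the update is the visual signature of data drift you would look for on the chart.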

Another thing we could monitor is the outcome of the predictions alongside other data, such as its correlation with the number of active users. For example, if the number of spammers increases or decreases at a rate very different from that of the active users, there might be something going on. Note that an issue like this doesn’t necessarily mean drift. Other phenomena, like spam waves or seasonal changes (spammers celebrate holidays, too), could cause such variation in the data.
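
One simple way to implement this kind of check is to track the ratio of detected spammers to active users and flag days where it departs sharply from a baseline. This is a sketch with invented numbers (the 2% detection rate, the simulated drift, and the 2-sigma threshold are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical daily counts: active users grow steadily, and detected
# spammers normally track them at roughly 2%
active_users = np.linspace(10_000, 20_000, 30).astype(int)
spammers = (active_users * 0.02 + rng.normal(0, 20, 30)).astype(int)
# Simulate a drift: in the last 5 days the detection rate triples
spammers[-5:] = (active_users[-5:] * 0.06).astype(int)

ratio = spammers / active_users
baseline_mean = ratio[:20].mean()
baseline_std = ratio[:20].std()

# Flag days where the spammer rate departs sharply from the baseline
alerts = np.where(np.abs(ratio - baseline_mean) > 2 * baseline_std)[0]
print("Days flagged for review:", alerts)
```

As the text notes, a flagged day is only a prompt for investigation; a spam wave or a holiday could trip the same alarm without any drift being involved.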

Nonetheless, when I refer to monitors, I don’t necessarily mean literal dashboards (they are cool, though). Instead, you could calculate these values directly in your production system and raise alerts if unexpected behavior arises, using either a custom implementation adapted to your data or an alerting tool like Prometheus. However, if you are looking for a specialized tool, there’s the scikit-multiflow library for Python.

The scikit-multiflow package can detect data drift using an algorithm known as adaptive windowing (ADWIN), which detects drift over a stream of data. ADWIN works by tracking several statistical properties of the data within an adaptive window that automatically grows and shrinks. Let’s look at an example.

import numpy as np
from skmultiflow.drift_detection.adwin import ADWIN

adwin = ADWIN()

# Simulate a data stream as a random sequence of 0s and 1s
data_stream = np.random.randint(2, size=2000)

# Artificially drift the data from index 999 onward
# by replacing each value with a larger one
for i in range(999, 2000):
    data_stream[i] = np.random.randint(5, high=10)

previous_variance = 0
# Feed the stream elements to ADWIN and check if drift has been detected
for i in range(2000):
    adwin.add_element(data_stream[i])
    if adwin.detected_change():
        print("Change detected in value {}, at index {}".format(data_stream[i], i))
        print("Current variance: {}. Previous variance: {}".format(adwin.variance, previous_variance))
    previous_variance = adwin.variance

In the code (inspired by an example from the library’s official documentation), we simulate a data stream as a random sequence of 0s and 1s. Then, in a loop, we replace the values in the upper half of the array with a random integer between 5 and 9. After that, in another loop, we feed the stream elements to ADWIN and check at each iteration whether drift was detected. If so, we print the responsible value and index, along with the change in the variance.
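
If you don’t need a streaming detector, a common batch-mode alternative (not from scikit-multiflow, but a standard statistical check) is a two-sample Kolmogorov–Smirnov test comparing a training-time sample of a feature against a recent production sample. A sketch using SciPy, with made-up distributions:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)

# Hypothetical feature samples: training data vs. recent production data,
# where the production mean has drifted upward
training_sample = rng.normal(loc=5, scale=2, size=1000)
production_sample = rng.normal(loc=8, scale=2, size=1000)

# Two-sample Kolmogorov-Smirnov test: a small p-value means the samples
# are unlikely to come from the same distribution
statistic, p_value = ks_2samp(training_sample, production_sample)
if p_value < 0.01:
    print(f"Drift suspected (KS statistic = {statistic:.2f}, p = {p_value:.1e})")
```

Run periodically (say, daily) over each feature, a check like this can feed the same alerting pipeline described above.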

A non-technical recommendation for detecting drift preventively is improving communication between the teams that, in some way, interact with the prediction model. As the app-update example showed, some cases of data drift can be attributed to changes the organization introduces to the product. With a good line of communication across teams and a simple “Hey, tomorrow we’ll deploy this,” we could prepare the system to handle the upcoming changes in the data.

On a similar note, in the paper Learning under Concept Drift: an Overview, Indre Zliobaite suggests that we should consider future assumptions about the behavior of the data and use models that support some sort of adaptive learning mechanism.

How to overcome these drifts?

When a model is fitted, it learns a function that maps the inputs to a label. But when it experiences concept drift, the meaning behind this label changes. In a way, we could think of this as if the model’s decision boundary (what it learned) had also changed. For instance, the next image shows a trained model with its decision boundary.

[Figure: a trained model’s decision boundary separating the class “spam” (red) from “no-spam” (blue)]

Looking good, right? On one side of the line, we have the class “spam” (red), and on the other, the class “no-spam” (blue). Now, suppose that after some time t, the model starts to show signs of concept drift. Let’s see the following image.

[Figure: the same model after time t, with the apparent drifted decision boundary shown as a dotted line]

Here is the same model after some time t, with its “new” and “apparent” decision boundary. I say “apparent” because we haven’t updated the model, so the line is still in the same position as before. However, since we are experiencing concept drift, the predictions the model produces are equivalent to those of a model whose decision boundary is the dotted line. So, if we keep the model the way it is, it will classify as spam users who merely behave the way spammers did back when we trained it.

So, to conclude: how can we overcome these drifts? Essentially, with training. As part of your pipeline, you could implement a system that periodically retrains your models after some time t, or once it detects drift using some of the aforementioned methods. Alternatively, you could refresh a model’s weights by extending its training with new data. If retraining is not an option, we could try adaptive ensemble methods based on SVMs or Gaussian mixture models, as described in the paper by Zliobaite. Another alternative is streaming models that update their weights as new data arrives; one model in this category is Spark’s streaming linear regression, an implementation of the well-known linear regression model that continually updates its trained parameters.
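
Staying in Python, scikit-learn offers a similar incremental option through partial_fit, which updates a model’s weights batch by batch instead of retraining from scratch. A sketch of adapting the spam classifier to drifted data (the feature, rates, and batch sizes are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(3)

# Hypothetical single feature: messages per minute.
# Label 0 = regular user, label 1 = spammer.
def make_batch(spam_mean, ham_mean, n=500):
    X = np.concatenate([rng.normal(ham_mean, 2, n), rng.normal(spam_mean, 2, n)])
    y = np.concatenate([np.zeros(n), np.ones(n)])
    return X.reshape(-1, 1), y

model = SGDClassifier(random_state=0)

# Initial fit on old data: spammers average 20 msg/min, regular users 5
X, y = make_batch(spam_mean=20, ham_mean=5)
model.partial_fit(X, y, classes=[0, 1])

# After the update everyone chats more; keep updating on fresh batches
for _ in range(10):
    X, y = make_batch(spam_mean=40, ham_mean=15)
    model.partial_fit(X, y)

# The decision threshold has moved along with the data
print("Prediction for 40 msg/min:", int(model.predict([[40]])[0]))
print("Prediction for 5 msg/min:", int(model.predict([[5]])[0]))
```

The appeal of this pattern is that adaptation happens continuously in production, so a gradual drift is absorbed batch by batch rather than accumulating until a full retrain.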

Recap

Over time, most things deteriorate; mangoes go bad, the planet gets warmer, and machine learning models lose their predictive power — a phenomenon known as model drift. In this article, we introduced model drift and two of its leading causes, concept drift and data drift, which involve changes in the statistical properties of the target variable and the predictors, respectively.

We discussed several techniques that help detect these events, including the ADWIN algorithm with a worked example. Then we illustrated the shift using a decision boundary, and showed how it might look in a dashboard. Lastly, we concluded that the appropriate way to overcome this effect is through retraining.

Nonetheless, apart from all these recommendations and techniques, I believe the most important thing is knowing the data. Every use case, model, and organization is different, and sometimes we just can’t patch problems by simply applying method X or Y. Instead, we should try to anticipate what could happen and build safety measures to mitigate future accidents.
