    Close your eyes and imagine a machine learning production pipeline (come on, you can do it). What do you see? Oh yes, data sources; a data warehouse, a streaming queue, you know them. Now continue. What else do you see? Maybe an ETL service that is regularly cleaning, munging, and merging this conglomerate of bits and bobs that are running across the data sources. Then what? Is that a feature extractor? Of course, it is! And it’s gathering all the essential and secret features that your hungry machine learning model needs to detect, for example, whether a drink is a piña colada or not. And lastly, next to the extractor, is the story’s protagonist: the model. There it is, working hard, scrutinizing all these thirst-quenching images. 

    Is there anything else? What’s happening after your machine learning prediction? What’s the next step in your pipeline after determining this is indeed a piña colada? Or is this the end of the pipeline? If your answer is “yes,” it doesn’t have to be.

    During my time in the data field, I have noticed that many of the data pipelines I see or read about end right after the machine learning prediction has been made. And that kind of makes sense, right? The whole point of a machine learning model is to infer something and produce an output. However, if we wish to take things further, we could start seeing that prediction as a new piece of information, as more data. And what do we do with data? We use it! And I mean we use it now. Live, in production. But to do so, we first need to find a suitable use case for this new data point. Then, we extend the pipeline to include it. Let’s have a look at some use cases I have personally applied, and others I would consider appropriate.


    I’ll start with a quintessential example: logging. Usually, if someone wants to know which labels the system is predicting, the simplest approach is to store each value in a dataset and examine it later. And this works, as long as you keep running queries to summarize these values. But I believe there’s a better and more direct way to assess these numbers, and that’s with metrics. Usually, when you hear the word “metrics,” you think of scores that measure things like latency, system liveness, or the number of requests per second, but not machine learning-related values.

    Now, let’s suppose we start logging all sorts of information about our prediction output. This can include the distribution of the predictions, the ratio between labels, the average confidence value, predictions broken down by gender or age, and so on. You get the idea. Having these metrics, preferably in a dashboard, would allow us to assess our models faster and better at a single glance. What’s more, with these metrics we could also recognize unexpected behavior, which brings me to my next topic: anomaly detection.
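    To make the idea concrete, here is a minimal sketch of how a batch of predictions could be condensed into a few dashboard-friendly numbers. The function name summarize_predictions, the (label, confidence) tuple format, and the sample values are all assumptions made up for illustration; a real setup would likely push these values to whatever metrics system you already use.

```python
from collections import Counter
from statistics import mean


def summarize_predictions(predictions):
    """Condense (label, confidence) pairs into simple summary metrics."""
    if not predictions:
        return {"count": 0, "label_ratio": {}, "avg_confidence": None}

    total = len(predictions)
    label_counts = Counter(label for label, _ in predictions)
    return {
        "count": total,
        # Share of each label among all predictions in the batch.
        "label_ratio": {label: n / total for label, n in label_counts.items()},
        # Mean of the model's confidence values.
        "avg_confidence": mean(conf for _, conf in predictions),
    }


# Hypothetical batch of model outputs.
batch = [
    ("pina_colada", 0.75),
    ("pina_colada", 0.5),
    ("other", 0.25),
    ("pina_colada", 1.0),
]
metrics = summarize_predictions(batch)
print(metrics)
```

    Computing the summary per batch (or per time window) keeps the numbers small enough to plot, and the same dictionary can feed the anomaly checks discussed next.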

    [Figure: drink detection model dashboard. Fake dashboard I quickly did on Photoshop. I’m sure yours will be better.]

    Anomaly detection

    Let’s admit it. Things go wrong. As fancy and efficient as your model is, eventually something unexpected and unusual will occur. Fortunately for us, since we’re already logging and collecting our prediction outputs (right?), detecting these anomalies is just a matter of adding an extra component to our ever-growing pipeline. But what exactly should we be looking for? Where could these anomalies unfold? Well, that depends on your predictions. But for starters, we could check the prediction’s output distribution against the three-sigma rule, that is, make sure that each value we monitor (say, the share of a given label) stays within three standard deviations of its historical mean.

    Let me give you a concrete example. Suppose that 90% of the predictions our drink detection model makes are piña colada. Then, for some unknown reason, in the last 15 minutes, that number suddenly dropped to 30%. This behavior is, of course, not normal and might indicate that something wrong is about to happen. 
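    A rough sketch of that check, assuming we already store the share of piña colada predictions for each past 15-minute window. The function name is_anomalous and the historical values are hypothetical; the three-standard-deviation cutoff is the three-sigma rule mentioned above.

```python
from statistics import mean, stdev


def is_anomalous(history, current, n_sigmas=3.0):
    """Return True if `current` lies more than n_sigmas sample standard
    deviations away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate the spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all is a deviation.
        return current != mu
    return abs(current - mu) > n_sigmas * sigma


# Made-up share of "piña colada" predictions in past 15-minute windows.
past_ratios = [0.90, 0.91, 0.89, 0.92, 0.88, 0.90]

print(is_anomalous(past_ratios, 0.30))  # the sudden drop from the example
print(is_anomalous(past_ratios, 0.91))  # an ordinary window
```

    This is deliberately naive: it ignores seasonality and assumes the historical windows are representative. It is meant as a starting point, not a full anomaly detection system.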

    A more severe example would be if your company’s fraud detection model goes abruptly haywire and starts labeling each transaction as “fraud.” Here, either your model is suffering some unwanted effect, or the World Scammers Conference has just started. So, to avoid such surprises, and to prevent a bigger problem, I’d highly recommend implementing an anomaly detection system.

    [Figure: drink detection model dashboard. Look at that drop!]

    Double-check the prediction

    If there’s one takeaway I want to make sure you start using right away it’s this one: double-check the prediction (I’ll bold it so you can see how serious I am). 

    Machine learning is not perfect. In fact, I like to say that machine learning is as accurate as Michael Jordan’s free throws. Meaning, it’s pretty good, but every now and then it’s going to fail. Here, failing means giving a result that doesn’t make any sense. Thus, we need measures and fallback options to bring some sense back. So, I propose building an extra component in your pipeline with the purpose of consuming the prediction’s output to validate it against a set of basic rules (you’ll probably need a bit of domain knowledge here) before making a decision.

    The goal of these rules is to accept or reject the prediction. They can be as elementary or complicated as you want. To start simple, I’d suggest implementing a “confidence checker,” which is a module that receives the prediction’s probability value and compares it against a threshold defined by a human. The threshold should be interpreted as follows: if the value is above the threshold, we act on the prediction and continue the normal flow. If not, we don’t act on it, and call the fallback option instead.
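    A confidence checker along those lines could be as small as the sketch below. The names confidence_checker, on_accept, and on_fallback, as well as the 0.8 threshold, are assumptions chosen for the example; the two callbacks stand in for whatever your normal flow and fallback option happen to be.

```python
def confidence_checker(label, confidence, threshold, on_accept, on_fallback):
    """Run the normal flow only when the confidence is above the threshold;
    otherwise call the fallback option."""
    if confidence > threshold:
        return on_accept(label)
    return on_fallback(label)


# Hypothetical handlers for the two branches.
def serve_drink(label):
    return f"serving {label}"


def ask_a_human(label):
    return f"not sure about {label}, asking a human"


THRESHOLD = 0.8  # human-defined; the value here is made up

print(confidence_checker("pina_colada", 0.95, THRESHOLD, serve_drink, ask_a_human))
print(confidence_checker("pina_colada", 0.55, THRESHOLD, serve_drink, ask_a_human))
```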

    We can even take this a bit further and add an extra step. Imagine you have a drink multiclass classifier that predicts either piña colada or coconut shake (quite a fun problem). After a while, you learned the following:

    • Usually, if the prediction confidence is ≥ X%, the drink is most likely a piña colada.
    • If the confidence is < Y% the drink is most likely a coconut shake.
    • Everything between the two thresholds, that is, Y% ≤ p(x) < X%, is most likely a false positive.

    In a case such as the last one, I suggest you follow the “predict less” principle explained in this talk by Vincent Warmerdam. In it, he states that in cases where the prediction’s outcome is uncertain (for example, the confidence lies close to the thresholds) we can ignore the prediction to avoid unexpected side effects.
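    Here is one way the two thresholds and the “predict less” idea might fit together, assuming Y is lower than X. The function name predict_less and the values 0.8 and 0.2 are hypothetical stand-ins for X and Y; returning None is just one possible way to signal that the prediction is being ignored.

```python
from typing import Optional


def predict_less(confidence: float, upper: float, lower: float) -> Optional[str]:
    """Map a confidence value to a label, or abstain (None) when it falls
    in the uncertain band between `lower` (Y) and `upper` (X)."""
    if lower >= upper:
        raise ValueError("lower threshold must be below upper threshold")
    if confidence >= upper:
        return "pina_colada"
    if confidence < lower:
        return "coconut_shake"
    return None  # uncertain region: ignore the prediction


X, Y = 0.8, 0.2  # made-up thresholds

for p in (0.93, 0.05, 0.5):
    print(p, predict_less(p, upper=X, lower=Y))
```

    Downstream code then has to handle the None case explicitly, which is the point: the pipeline decides what happens when the model is unsure instead of silently acting on a shaky prediction.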

    On the other hand, there are situations in which we, the people behind the model, know the domain we’re predicting in well enough to develop more complex rules that disregard a prediction when it doesn’t make any sense.

    For example, suppose that you have a model that recommends a drink based on a series of ingredients given as input. Then, one day the system suggests a piña colada even though there’s no coconut or pineapple in your list of ingredients. This result is wrong. Without those two ingredients, there can’t be a piña colada. Because you’re a drink expert, you know this. So, you go to the pipeline and add a new block that checks for and fixes false-positive cases like this one. This might be a function that checks whether every “must-have” ingredient of the suggested drink, such as pineapple, appears in the input list. If they all do, then we proceed. Otherwise, we don’t.
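    Such a rule could look roughly like the following. The MUST_HAVE table, the function name is_plausible, and the ingredient lists are invented for the example; in practice the rules would come from your own domain knowledge.

```python
# Hypothetical domain rules: ingredients a drink cannot exist without.
MUST_HAVE = {
    "pina_colada": {"pineapple", "coconut"},
}


def is_plausible(drink, ingredients):
    """Accept the recommendation only if every must-have ingredient
    of the suggested drink is present in the input ingredients."""
    available = {item.lower() for item in ingredients}
    required = MUST_HAVE.get(drink, set())  # no rule means nothing to check
    return required.issubset(available)


print(is_plausible("pina_colada", ["rum", "pineapple", "coconut", "ice"]))
print(is_plausible("pina_colada", ["rum", "lime", "mint"]))
```

    Drinks without an entry in the table pass the check by default; whether that is the right default depends on how much you trust the model for cases you haven’t written rules for.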

    Recap and conclusion

    The ultimate goal of a machine learning system is the prediction. However, this doesn’t have to be the end. The algorithm’s response, like every other data point from your collection, should be considered and analyzed because it might be hiding facts that could be beneficial to your system. In this article, I introduced the idea of extending a machine learning prediction pipeline with new processes and presented three ways in which we might use this data point. 

    The first case is logging, which suggests that we should record our prediction result (and other metrics related to the model) to have a clear overview of its performance. Then we talked about anomaly detection, a technique that could help us find irregular patterns within our model’s responses. Lastly, I explained several measures (human-defined rules and the “predict less” principle) that ultimately decide whether our prediction should go through or not.

    The concepts I’ve presented here are neither an exhaustive nor a perfect list of things that you should implement in your system. They are merely examples of things I’ve personally applied and that yielded a positive result. That said, I fully understand that every system and enterprise is unique and has different requirements, so some of the things I’ve described here might not work for you. Still, I’d like you to consider some of these ideas and, even better, evolve them into something that could fit your model (even if it’s not about piña coladas).