    Machine learning is expanding rapidly, with no end in sight. Every other day, it seems, the ML community publishes an exciting new paper presenting a state-of-the-art algorithm or tool for a use case whose best solution, two weeks earlier, was not quite so state-of-the-art.

    These advances are great! Thanks to them, many facets of developing a machine learning system have been simplified. For example, building a model today is often just a matter of calling .fit() and .predict() and enjoying the results.

    However, one area that gets far less attention is real-time machine learning models in production. As more enterprises adopt ML in their practices, production ML is becoming as important as the advances mentioned above.

    The lifecycle of a production model is not straightforward. Such systems involve multiple parts, such as deployment processes, training scripts, monitoring, and data source management. This article is about the last of these. We’ll discuss the challenges that arise while maintaining a real-time production model that draws on multiple data sources.

    Feature Generation: The Next Frontier of Data Science

    By itself, an ML model is simply a black box that eats data and produces an output: a prediction or a decision. That’s it. Usually, such models reside inside a .py file, are mentioned in some PowerPoint presentations, and are used for menial tasks such as detecting coconuts. However, what happens if the whole world is suddenly interested in detecting coconuts with ML? How can we evolve and scale this model to delight society with our groundbreaking tool? Enter: a production system.

    In layman’s terms, a production machine learning system is one that is deployed and in continual use. Examples include Netflix’s recommendation system and your new coconut detector.

    The tricky part is that production machine learning systems are typically meant to be used in real time; after all, we want to detect coconuts before they fall on your head. Furthermore, the actual ML code is only a small fraction of the whole ecosystem, which also includes parts such as feature extractors and analysis tools. Let’s not forget the data sources, either. They are crucial to the system, yet feeding multiple data sources into a real-time model is a challenging task.

    Consistency

    The first difficulty with feeding multiple data sources into a real-time model is the consistency of the data. When several inputs carry the same kind of data, it’s essential to confirm that they share a format. For example, in operations that combine timestamps from different sources, the timestamps should have the same precision (e.g., milliseconds or nanoseconds) and ideally the same format (e.g., ISO 8601). The problem is even worse with structured formats like Protobufs, where we have to ensure that each source is updated with the latest definition of the structures.
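
    To make the timestamp example concrete, here is a minimal sketch of a normalization step placed in front of the model. The function name to_utc_millis, the choice of UTC milliseconds as the common representation, and the rule that timestamps without a time zone are treated as UTC are all assumptions made for illustration, not part of any particular library.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
# Nanoseconds per supported integer unit.
NS_PER_UNIT = {"s": 1_000_000_000, "ms": 1_000_000, "us": 1_000, "ns": 1}


def to_utc_millis(value, unit="ms"):
    """Normalize a timestamp to integer milliseconds since the Unix epoch (UTC).

    value: an ISO 8601 string, or an integer epoch count.
    unit:  for integers only, one of "s", "ms", "us", "ns".
    """
    if isinstance(value, str):
        # Older Python versions reject a trailing "Z", so spell out the offset.
        if value.endswith("Z"):
            value = value[:-1] + "+00:00"
        dt = datetime.fromisoformat(value)
        if dt.tzinfo is None:
            # Assumption: timestamps without an offset are already in UTC.
            dt = dt.replace(tzinfo=timezone.utc)
        # Integer division of timedeltas avoids floating-point rounding.
        return (dt - EPOCH) // timedelta(milliseconds=1)
    # Integer input: scale to nanoseconds, then truncate to milliseconds.
    return value * NS_PER_UNIT[unit] // 1_000_000


# Three hypothetical sources reporting the same instant in different shapes.
from_api = to_utc_millis("2021-03-01T12:00:00Z")
from_mobile = to_utc_millis("2021-03-01T13:00:00+01:00")
from_sensor = to_utc_millis(1614600000000000000, unit="ns")
```

    With a step like this, all three sources yield the same integer for the same instant, so downstream features never compare values of mixed precision.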

    Out-of-order data

    Another problem is out-of-order data, which is more common in real-time streaming data sources. Here, you might encounter a case in which one stream lags behind the others. Under these circumstances, a real-time model that requires ordered or sequential events (e.g., a recurrent neural network that uses a sequence of credit card transactions to predict whether the next one is fraudulent) could behave in unintended ways.
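
    One common way to cope with lagging streams is to hold events briefly and release them in timestamp order once they are unlikely to be overtaken. The sketch below illustrates the idea; the class ReorderBuffer, the max_delay_ms parameter, and the policy of setting aside events that arrive too late are hypothetical choices, not the method of any specific streaming framework.

```python
import heapq
from itertools import count


class ReorderBuffer:
    """Release events in timestamp order, tolerating a bounded delay."""

    def __init__(self, max_delay_ms):
        self.max_delay_ms = max_delay_ms
        self.late = []             # events that arrived after we moved past them
        self._heap = []            # min-heap ordered by timestamp
        self._tie = count()        # tie-breaker so payloads are never compared
        self._max_seen = None      # newest timestamp observed so far
        self._last_emitted = None  # timestamp of the last released event

    def push(self, ts_ms, payload):
        """Add one event and return the events that are now safe to emit."""
        if self._last_emitted is not None and ts_ms < self._last_emitted:
            # Too late to be placed in order; keep it aside for separate handling.
            self.late.append((ts_ms, payload))
            return []
        heapq.heappush(self._heap, (ts_ms, next(self._tie), payload))
        if self._max_seen is None or ts_ms > self._max_seen:
            self._max_seen = ts_ms
        # Anything at or before the watermark is assumed to be complete.
        watermark = self._max_seen - self.max_delay_ms
        ready = []
        while self._heap and self._heap[0][0] <= watermark:
            ts, _, item = heapq.heappop(self._heap)
            ready.append((ts, item))
            self._last_emitted = ts
        return ready

    def flush(self):
        """Emit everything still buffered, in order (e.g. at shutdown)."""
        ready = []
        while self._heap:
            ts, _, item = heapq.heappop(self._heap)
            ready.append((ts, item))
            self._last_emitted = ts
        return ready
```

    The trade-off is latency: a larger max_delay_ms tolerates slower streams but postpones every prediction by the same amount.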

    Scalability

    A general issue to watch for is the scalability of the platform, a vital concept in production systems. Let’s say that your coconut detection app goes viral in the middle of the World’s Best Coconut Championship, and suddenly you have 120,000 participants downloading the app and triggering detections. In this situation, you need to ensure that your backend and the data sources receiving images of the Cocos nucifera fruit can handle the sudden load and gracefully pipe the photos to the model.
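
    As one illustration of handling a sudden load gracefully, the sketch below puts a bounded queue in front of the model, rejects requests once the queue is full, and hands work to the model in small batches. The class BoundedIntake and its parameters are hypothetical, and rejecting excess requests is only one possible policy; a real deployment might instead scale out or ask clients to retry.

```python
import queue


class BoundedIntake:
    """Bounded request queue that sheds load instead of growing without limit."""

    def __init__(self, max_pending):
        self._queue = queue.Queue(maxsize=max_pending)
        self.rejected = 0  # count of requests turned away while full

    def submit(self, request):
        """Try to enqueue a request; return False if the queue is full."""
        try:
            self._queue.put_nowait(request)
            return True
        except queue.Full:
            self.rejected += 1
            return False

    def next_batch(self, max_batch):
        """Take up to max_batch pending requests for one model call."""
        batch = []
        while len(batch) < max_batch:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch


# Five photos arrive at once, but only three may wait at a time.
intake = BoundedIntake(max_pending=3)
accepted = [intake.submit(f"photo-{i}") for i in range(5)]
first_batch = intake.next_batch(max_batch=2)
```

    Tracking the rejected counter gives monitoring a direct signal that the backend is saturated.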

    Old data

    The more data sources we have, the higher the risk of consuming old data. Take, for example, delayed data, which can be detrimental to a model. Let’s say that you manage a predictive model that relies on data that’s not updated often. You retrain the model with new data from your warehouse and deploy it, and you end up with a new model whose live prediction inputs are outdated.
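
    A simple guard against this is to track when each source was last refreshed and flag the ones that exceed a freshness budget before predicting. The following is a minimal sketch under that assumption; the function stale_sources, the source names, and the one-hour budget are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def stale_sources(last_updated, max_age, now=None):
    """Return the names of sources whose latest update is older than max_age.

    last_updated: dict mapping source name to a timezone-aware datetime.
    max_age:      timedelta describing the freshness budget.
    now:          injectable clock value, mainly so the check is testable.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return sorted(
        name for name, updated_at in last_updated.items()
        if now - updated_at > max_age
    )


# Hypothetical snapshot of two sources feeding the same model.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
snapshot = {
    "click_stream": now - timedelta(minutes=5),
    "warehouse_export": now - timedelta(days=3),
}
outdated = stale_sources(snapshot, max_age=timedelta(hours=1), now=now)
```

    In this snapshot only the warehouse export would be flagged, which could trigger an alert or a fallback before the model consumes it.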

    A similar situation can occur when a model is trained with data from a particular version of your product and the product is then updated. For example, in the coconuts app, users can “like” a maximum of 500 cocos per day, and you use this data for predicting churn. Then, in the next update, the product owner lowers this limit to 200, and you haven’t updated your churn model. The outcome is a model that might produce an undesirable number of false positives (and an angry boss).
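
    One way to catch this kind of drift early is to store the product assumptions a model was trained under next to the model itself and compare them with the live configuration at deployment or serving time. The sketch below assumes such metadata exists; the function find_assumption_mismatches and the key max_likes_per_day are hypothetical names.

```python
def find_assumption_mismatches(training_assumptions, live_config):
    """Compare the assumptions recorded at training time with the live product.

    Returns a dict mapping each mismatched key to (trained_value, live_value).
    A key missing from the live configuration is reported with a live value
    of None.
    """
    mismatches = {}
    for key, trained_value in training_assumptions.items():
        live_value = live_config.get(key)
        if live_value != trained_value:
            mismatches[key] = (trained_value, live_value)
    return mismatches


# The churn model was trained when users could like up to 500 cocos per day.
training_assumptions = {"max_likes_per_day": 500}
# The product owner has since lowered the limit.
live_config = {"max_likes_per_day": 200}

mismatches = find_assumption_mismatches(training_assumptions, live_config)
if mismatches:
    print("Model may need retraining:", mismatches)
```

    A non-empty result does not prove the model is wrong, but it is a cheap signal that the training data no longer describes the product.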

    This last phenomenon is similar to what Google calls “entanglement” in a paper by Sculley et al. The concept describes the fact that modern ML systems are so complex and so connected to multiple sources (hence the term entanglement) that any single change in the distribution of the input data can cause unwanted changes in the importance and weights of the model’s features.

    Conclusion

    Owning and maintaining an ML system is no longer a matter of having a model sitting somewhere producing predictions; it means running an ecosystem with many parts. With this come several challenges, such as inconsistent data, out-of-order events, stale data, and entanglement. As the world of machine learning continues to evolve, it’s vital to remember that code is only a small fraction of the whole ecosystem. Data sources are crucial to the system, but if they are not maintained and managed properly, they can significantly harm the performance of the model.
