Why automating data science will kill the BI industry
Machine learning models are mathematical models that leverage historical data to uncover patterns, which can help predict the future with a certain degree of accuracy. And when it comes to running a business, the ability to predict and make data-driven decisions based on those models (from identifying customer churn before it happens, to optimizing promotions and predicting loan defaults) is definitely a useful weapon to have at your side.
But predictions are just one way to leverage data, and some tasks are simpler than that – sometimes you just need to extract meaningful patterns or derive actionable insights from data to make high-level decisions. For those tasks, you'll probably reach for standard BI tools and data visualization dashboards to deduce patterns through observation and human interpretation.
But the patterns humans search for, aren’t they the same patterns that machine learning models are searching for as well?
The answer is yes. Well, sort of. BI may help us visualize historical data, but we do so through a pair of human eyes, giving us a shallow understanding of the data, given the cognitive biases we all have and the limited number of dimensions we're capable of looking at. Machine learning models, on the other hand, can infer more complex rules, deeper patterns, and interactions between different dimensions and different variables.
So, what if we analyze the model itself and not the data that it learns from?
Deriving insights from machine learning models
Let’s look at the following analogy:
It’s your first day as a top-level executive in a brand-new industry you’re still not quite familiar with. You have a huge decision to make and only a few hours to make it in. You’d like to get more information, so you head to the library where they refer you to 20 different books to read in order to get the information you need.
Now, most of the books aren’t hyper-focused on your topic so the actual information you need is spread across various chapters and pages. Obviously, getting all the information you’re looking for without missing a single piece of the puzzle is pretty much impossible. You simply cannot read all those books, let alone cross-check the information between them to derive the necessary information needed to help you make that decision, especially in the few hours you have.
Those books are a metaphor for all the data sources out there that could help you gain insights into your business and answer the questions you may have.
When you're looking at what drives your monthly revenue, do you look at your CRM to identify clusters of good and loyal customers? Or do you check whether certain holidays and events cause peaks in sales? Or maybe it's those marketing campaigns you're running; maybe certain promotions are making a bigger impact than others.
Let’s go back to the books analogy for a second.
What if you had some help extracting the relevant information from those books? An expert who has studied these books for years and years and could point you to the specific pages you need to read from every one of those 20 books?
This expert will take it a step further. They’ll even show you the relationship between the books, how they complement each other, helping you uncover even more knowledge and insights about the terms and concepts generated from the combination of pages from all the books.
That expert? That expert is a metaphor for a machine learning model.
And this model solves the mathematical task of predicting accurately; it learns the patterns that are most likely to explain a certain phenomenon – what caused it. It's basically a machine that can look at all the different visualizations at once, all the pieces of information together, in order to answer a given question as accurately as possible.
When we opt for using complex models that are capable of discovering patterns on a deeper level than any human can via BI visualization, we’re actually capable of automatically understanding our data better.
While techniques for analyzing machine learning models are out of scope for this blog post, let's look at a simple example: say we're trying to understand which customers are likely to take out a loan in the next month. By analyzing a decision tree, we can get the specific rules and patterns derived from the data. This would probably take us much longer to come up with on our own if we were reliant on our day-to-day BI tools.
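As a minimal sketch of the idea, here's how you might extract human-readable rules from a decision tree using scikit-learn (my choice of library for illustration; the column names and data are entirely synthetic):

```python
# Sketch: fit a decision tree on synthetic "loan" data and print its rules.
# Feature names (age, account_balance, visits_last_month) are hypothetical.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.integers(18, 70, n),       # age
    rng.uniform(0, 100_000, n),    # account_balance
    rng.integers(0, 20, n),        # visits_last_month
])
# Synthetic label: high-balance, frequently visiting customers take loans
y = ((X[:, 1] > 50_000) & (X[:, 2] > 10)).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
rules = export_text(
    tree, feature_names=["age", "account_balance", "visits_last_month"]
)
print(rules)
```

The printed output is an indented list of if/else thresholds – exactly the kind of "rules and patterns" a human would otherwise have to hunt for across dashboards.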
It’s kind of like magic, right?
So why are people still using BI more than the modeling alternatives?
When machine learning is overkill for deriving insights
Building a predictive model is not a simple and cheap task.
First, you’ll have to hire a data scientist and pay them tons of money.
Then, you’ll have to play the waiting game until they actually build the machine learning model they’ll use to analyze your data.
Because building a machine learning model that actually learns something, one that is accurate, robust, and fed with the right data, is very, very complicated.
Machine learning models are spoiled creatures. You can’t just take an algorithm, point it to the relevant data table and say to it: “Go do your magic thing with the patterns now, you smart machine you.”
You'll have to spend time preprocessing the data (e.g., scaling numerical columns, converting categorical variables to numerical ones), choosing the right algorithm, tuning it, and completing many more tasks that are a must in order to build a useful machine learning model.
Yes, even “just” for the task of deriving insights.
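To make the preprocessing step concrete, here's a small sketch of scaling numeric columns and one-hot encoding a categorical one, using scikit-learn and pandas as illustrative tools (the column names are hypothetical):

```python
# Sketch: scale numeric columns, one-hot encode categoricals.
# DataFrame columns (age, income, segment) are hypothetical examples.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

df = pd.DataFrame({
    "age": [23, 45, 31, 52],
    "income": [30_000, 80_000, 52_000, 95_000],
    "segment": ["retail", "smb", "retail", "enterprise"],
})

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["segment"]),
])
X = preprocess.fit_transform(df)
print(X.shape)  # 4 rows; 2 scaled numeric + 3 one-hot columns = 5 features
```

Even this toy version shows why the work adds up: every column type needs its own treatment before any algorithm can learn from it.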
With that being said, given a simple dataset, an experienced data scientist might not spend too much time on it.
Especially by using some predefined scripts or modules to automate the modeling work.
BUT, and that’s a big but, what if you want to get insights from multiple sources of data?
What if you want to uncover interactions between holidays AND your Facebook campaigns? What if you want to find out how different attributes from your CRM, such as age, gender, etc., interact with the customer activity from your product logs, in the context of predicting a customer's probability to churn? What if you're wondering what impact different promotions have in different store locations?
That's way more than just tuning a model on a simple dataset. Now, the data scientist will have to fetch multiple sources of data, find ways to match and connect them correctly, extract meaningful variables by aggregating different columns (should we extract the slope of transaction amounts or the number of peaks? Aggregate by month or by year?), and spend more time preprocessing much more data.
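A stripped-down sketch of that matching-and-aggregating work, using pandas for illustration (both tables and every column name are invented for the example):

```python
# Sketch: join a CRM table with a transaction log and derive per-customer
# aggregate features. All data and column names are hypothetical.
import pandas as pd

crm = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "age": [25, 41, 33],
    "gender": ["F", "M", "F"],
})
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "amount": [20.0, 35.0, 100.0, 5.0, 7.5, 12.0],
    "month": ["2023-01", "2023-02", "2023-01", "2023-01", "2023-02", "2023-03"],
})

# Aggregate the raw log into per-customer features...
features = transactions.groupby("customer_id")["amount"].agg(
    total_spend="sum", n_transactions="count", max_transaction="max"
).reset_index()

# ...then join them back onto the CRM view.
dataset = crm.merge(features, on="customer_id", how="left")
print(dataset)
```

Two toy tables take a few lines; dozens of real sources, each with their own keys, granularities, and candidate aggregations, are where the cost explodes.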
That’s where the task of building a “model-just-for-insights” becomes infeasible, or at least not cost effective.
This is when you’ll probably choose to rely on your good old BI tools to analyze the small portions of data you can get your hands on, instead of spending time, money, and resources on a data scientist who will build a model that won’t even be used to make predictions.
The mature, commoditized landscape of BI tools is probably a better fit for those tasks because those tools require less expertise and can help you accomplish basic tasks in less time. Heck, some of those tools even allow you to build dashboards on top of multiple sources of data.
Automating data science will automate insights as well
While most of the tools and platforms built for data scientists in the past couple of years have focused on the algorithmic layer (the selection and tuning of the model), the new category of automated data discovery and feature generation aims to disrupt and automate the more challenging task of finding, combining, and utilizing multiple sources of data: automatically generating variables and tuning the whole pipeline, from data to model to insights, in a streamlined way.
Letting a machine explore thousands of columns, and millions of possible interactions between those columns, while taking care of joining the datasets together, means a new era for discovering insights across siloed data sources.
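The combinatorial search such platforms perform can be glimpsed in miniature: below, a scikit-learn transformer (my illustrative stand-in, not any specific product's API) mechanically generates every pairwise product of a handful of columns.

```python
# Sketch: mechanically generate all pairwise interaction features.
# Real automated-discovery platforms do this across thousands of columns.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
interactions = PolynomialFeatures(
    degree=2, interaction_only=True, include_bias=False
).fit_transform(X)
print(interactions.shape)  # 3 original columns + 3 pairwise products = 6
```

With 3 columns there are only 3 pairs; with thousands of columns the pair count runs into the millions, which is exactly why the search has to be automated.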
In a split second, you might discover insights such as:
- Customers between the age of 17-24 (CRM) who are college educated (professional profile enrichment), and have seen the video-ad (online marketing data) are a key driver to your revenue
- Small businesses in the beauty industry (e.g. hair salons) that have a lot of competition in the surrounding area (location-based enrichment) and low ratings on online business review websites like Yelp (business web-based data enrichments) are at a higher risk of not repaying a loan.
- Stores running the 1+1 promotion (historical promotion data) located in areas with a higher percentage of young families are selling more products and are at a higher risk of running out of stock.
Notice that all of the above insights could be derived from BI visualizations. They could, but it would take a lot of time, and you'd probably miss many of them. Automated data science could mean breaking through the limits of human thinking. Nowadays, when we use our BI tools, we're basically searching only where we already know we'll find answers.
By combining multiple sources and surfacing deeper patterns, derived by feeding thousands of automatically generated features into a machine learning model that captures their interactions, we will see a disruption to the very mature and commoditized BI landscape, marking a tectonic shift from BI to AI.