A few years ago, a friend of mine was traveling solo through Kazakhstan when he got chatting to a friendly local who described himself as an emcee. Now, my friend is a huge hip-hop fan, so he was keen to find out more about the Almaty rap scene. “You wanna see some of my stuff?” asked the excited man. “Absolutely!” replied my friend… And that’s how he ended up spending three hours watching Kazakh wedding videos on a stranger’s phone in subzero temperatures. Because it turns out that, in Kazakhstan, if someone says they’re an M.C., they mean a Master of Ceremonies at traditional events.
No matter how familiar a job title sounds, it’s important to be sure that you really are on the same page, especially when you’re talking about something as fast-evolving and buzzword-heavy as data. For example, many people use terms like data scientist and data analyst interchangeably. Given how similar they sound, the mistake is understandable, but in reality they mean very different things.
So what’s the difference between the two? How do you derive the most useful benefits from each one, to make the most of the data you have? And in the data science v data analytics battle, which do you really need? Let’s dive in and find out.
Data science is a broad term that covers a variety of disciplines and activities that, collectively, allow you to derive predictive insights from data. This includes data analytics, but it also covers data collection, preparation, cleansing, and, importantly, building machine learning models that help you make those predictions. Ultimately, it goes beyond questions that seek to answer what you don’t know. You’re also trying to figure out what you don’t yet know you don’t know.
To do data science right, you need to figure out how best to use and enrich your data, combining the data you have internally with external datasets and contextual understanding. This allows you to build sophisticated machine learning models that explore the data, identifying new connections rather than finding specific answers to straightforward questions.
Data analytics refers to a set of skills used to draw out trends and metrics from huge raw datasets. You take massive amounts of data and comb through for relatively straightforward answers and insights, using statistics, mathematics, and related disciplines.
Typically, data analytics is used to gain a better understanding of the data you already have. It helps you get to grips with what took place in the past, so you can figure out how well you are doing and decide whether to continue in the same direction.
For example, you might ask questions like:
In other words, you have a particular question in mind and you want to find the answer within that data. You might extend this into some simple predictive analytics, for example by extrapolating the patterns you observe to forecast future performance or problems. But this is as far as it goes.
However, data analytics is still an enormously valuable component of your overall data science strategy. By gaining visibility into what is happening inside your company, and what did and didn’t work in the past, you can start to formulate more complex questions, especially questions that predict future performance. This provides the groundwork for your data science projects.
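To make the analytics side concrete, here is a minimal sketch of the two things described above: summarizing what already happened, and extending a simple trend one step forward. The sales figures and the function names (`describe`, `linear_trend_forecast`) are made up for illustration; a real analytics stack would more likely do this in SQL or a BI tool.

```python
from statistics import mean

# Hypothetical monthly sales figures (illustrative numbers, not real data).
monthly_sales = [100.0, 110.0, 120.0, 130.0, 140.0, 150.0]


def describe(values):
    """Descriptive analytics: summarize what has already happened."""
    return {
        "total": sum(values),
        "average": mean(values),
        "best_month": values.index(max(values)) + 1,  # 1-based month number
    }


def linear_trend_forecast(values, steps_ahead=1):
    """Fit a least-squares straight line through past values and extend it."""
    n = len(values)
    xs = list(range(n))
    x_bar, y_bar = mean(xs), mean(values)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return intercept + slope * (n - 1 + steps_ahead)


summary = describe(monthly_sales)
next_month = linear_trend_forecast(monthly_sales)
print(summary, next_month)
```

Notice that the forecast only ever looks at the sales column itself. That is the limit of this kind of prediction: it assumes the future will look like a straight-line continuation of the past.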
Data science is primarily concerned with making predictions based on clues and patterns in the data you already have access to. This happens in many different ways, some of which are less obviously predictive than others (at least, to someone who isn’t a data scientist). With that in mind, here are the four broad categories of questions that data science can answer:
This is the most straightforwardly predictive type of question you could approach with data science. For example, you might ask, “What will our New York branch sales figures be in Q3?”, “How many orders for this product will we get in September?” or “How many subscribers will we gain next month?”.
You may be thinking, “but what’s the difference between this and predictive data analytics?”. After all, data analytics often involves forecasting future performance based on past results, too. But the big shift here is the level of complexity.
When you’re using a machine-learning algorithm to calculate your likely sales figures for Q3, you aren’t just looking at the past few years’ sales figures, and extrapolating from there. Instead, you’re drawing on multiple internal and external datasets and (if you’re doing it right) using augmented data discovery to fill in any gaps in knowledge. For example, you might incorporate broader sector sales data to see how the rest of the market is performing. You may take into account other contextual or even meteorological data — considering, for example, whether unseasonable weather patterns will impact demand for your product this time around.
What’s more, you might be modeling the impact of a change in strategy, rather than assuming that everything you do will stay the same. Your question might be more along the lines of “what impact will it have on our Q3 sales figures if we increase prices by 5%?” Or “How many product orders will we get in September if we cut ad spending by 25%?”
Data science is about looking beyond the immediate numbers and insights to appreciate the broader context; to understand how these various factors interplay.
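As a rough sketch of that shift, the example below fits a model on several factors at once (price, ad spend, and average temperature) and then asks the “what if” questions from above. Everything here is an assumption for illustration: the dataset is synthetic and deliberately exactly linear, the feature names are invented, and the model is plain least-squares regression rather than anything a production team would necessarily choose. In practice you would likely reach for an established machine learning library instead of hand-rolled linear algebra.

```python
def fit_linear_model(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    X = [[1.0, *row] for row in rows]  # leading 1.0 gives the model an intercept
    k = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(weights, row):
    """Apply the fitted weights (intercept first) to one feature row."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], row))


# Synthetic history: (price, ad spend in thousands, average temperature).
# Internal data (price, ad spend) is combined with external data (weather).
history = [
    (20.0, 10.0, 15.0),
    (20.0, 12.0, 18.0),
    (22.0, 10.0, 25.0),
    (22.0, 14.0, 20.0),
    (24.0, 12.0, 10.0),
    (24.0, 16.0, 22.0),
    (21.0, 11.0, 30.0),
]
orders = [260.0, 272.0, 260.0, 262.0, 216.0, 252.0, 283.0]

weights = fit_linear_model(history, orders)

# "What if" scenarios against a hypothetical baseline month.
price, ad_spend, temperature = 22.0, 12.0, 20.0
baseline = predict(weights, (price, ad_spend, temperature))
price_up_5 = predict(weights, (price * 1.05, ad_spend, temperature))
ads_cut_25 = predict(weights, (price, ad_spend * 0.75, temperature))
print(baseline, price_up_5, ads_cut_25)
```

The point is less the arithmetic than the shape of the question: because the model has learned how each factor relates to orders, you can change one input and see the predicted effect, which a straight-line extrapolation of past sales cannot do.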
When you buy a book on Amazon and it suggests another title for you, it’s using a data science-driven recommendation engine to do so. The model is asking “what does a person who likes X also like?” and tailoring its “X” to your past purchases, browsing history, and possibly other clues about your likes and dislikes, drawn from beyond this interaction. From here, the model predicts what other books you will like and recommends them to you.
This is far more precise than a data analytics-backed recommendation, which would likely be restricted to something like “here are our top sellers in this category” or “here are some more novels by the same author / with similar tags”, or even, “this person seems to fit the demographic profile of a buyer who likes adventure novels, so here are some more of those”.
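To illustrate the “what does a person who likes X also like?” idea, here is a toy item-to-item recommender based on co-purchase counts. It is a simplified sketch, not a description of how Amazon’s engine works: the shoppers and their purchase histories are invented, and real systems weigh far more signals than which titles appear together.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical purchase histories, keyed by made-up shopper names.
purchase_histories = {
    "ana": {"Dune", "Neuromancer", "Foundation"},
    "ben": {"Dune", "Foundation", "Hyperion"},
    "cho": {"Dune", "Hyperion"},
    "dev": {"Emma", "Persuasion"},
}


def build_co_purchase_counts(histories):
    """Count how often each pair of titles was bought by the same person."""
    counts = defaultdict(Counter)
    for basket in histories.values():
        for a, b in combinations(sorted(basket), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(owned, counts, top_n=2):
    """Score unowned titles by how often they co-occur with owned ones."""
    scores = Counter()
    for title in owned:
        for other, n in counts[title].items():
            if other not in owned:
                scores[other] += n
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [title for title, _ in ranked[:top_n]]


counts = build_co_purchase_counts(purchase_histories)
print(recommend({"Dune"}, counts))
print(recommend({"Emma"}, counts))
```

Even this tiny version tailors its answer to the individual shopper’s history rather than serving everyone the same bestseller list.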
Machine learning models are excellent at solving classification problems. That can be anything from “Does this picture contain nudity?” to “Will this customer make a purchase?”
Again, the key here is the way that the model answers the question. It’s making a prediction, based on all the other information it’s been fed in the past and has used to train its decision-making. The better the design and testing process, and the more relevant and complete the training data, the more accurately the model will be able to determine which category to assign when it’s fed real-life data.
Again, this is completely different from the way data analytics works. You could use data analytics to say “Last week we flagged 356 photos containing nudity on the site” or “85% of new customers are women aged 40-50”, for example. But you certainly couldn’t use data analytics or business intelligence (BI) tools to automatically tell whether a photo broke your site rules, or to determine, within a second, whether a site visitor is highly likely to make a purchase.
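Here is a small sketch of the “Will this customer make a purchase?” question, using a k-nearest-neighbors classifier: the model labels a new visitor by looking at the most similar labeled sessions it has seen before. The training sessions, the three features (pages viewed, minutes on site, items in cart), and the labels are all invented for illustration, and k-nearest-neighbors is just one simple technique chosen to keep the example short.

```python
from collections import Counter
from math import dist

# Hypothetical labeled sessions: (pages viewed, minutes on site, items in cart).
training_sessions = [
    ((2, 1.0, 0), "no_purchase"),
    ((3, 2.0, 0), "no_purchase"),
    ((1, 0.5, 0), "no_purchase"),
    ((8, 12.0, 2), "purchase"),
    ((10, 15.0, 3), "purchase"),
    ((7, 9.0, 1), "purchase"),
]


def classify(features, labeled_examples, k=3):
    """Predict a label by majority vote among the k closest past examples."""
    nearest = sorted(labeled_examples, key=lambda ex: dist(ex[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(classify((9, 11.0, 2), training_sessions))
print(classify((2, 1.5, 0), training_sessions))
```

With only six training rows this is obviously a toy, which is the point made above: the quality of the prediction depends heavily on how relevant and complete the training data is.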
In a similar vein, machine learning models are adept at spotting anomalies, outliers, and other weird or unexpected patterns of behavior. This means that they can be used in real-time to identify things like attempted fraud, cyberattacks, or signs that a piece of equipment is about to break down.
This could also mean asking questions like “is this combination of purchases very different to this customer’s usual behavior?” or “is there something strange about these site visitors?” Real-time data analytics can partially answer questions like these — for example, you can tell very quickly if there has been a sudden change in visitor numbers and demographics or the value of individual purchases. However, it’s typically up to you to figure out whether these patterns are weird and what they mean. Data science automates a large part of the process.
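As a minimal sketch of the “is this purchase very different from this customer’s usual behavior?” check, the function below flags a new purchase when it sits more than a chosen number of standard deviations away from the customer’s past average. The purchase history, the function name `is_anomalous`, and the threshold of 3 are assumptions for illustration; production fraud and fault detection typically relies on richer models than a single z-score.

```python
from statistics import mean, stdev


def is_anomalous(history, new_value, threshold=3.0):
    """Flag new_value if it lies more than `threshold` standard deviations
    from the mean of the customer's past values (a simple z-score test)."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All past values identical: anything different counts as unusual.
        return new_value != mu
    return abs(new_value - mu) / sigma > threshold


# Hypothetical past purchase amounts for one customer.
past_purchases = [42.0, 38.5, 45.0, 40.0, 39.5, 44.0]

print(is_anomalous(past_purchases, 43.0))
print(is_anomalous(past_purchases, 400.0))
```

The automation lies in the decision itself: instead of someone eyeballing a dashboard and judging whether a spike looks odd, the rule (or, in practice, a trained model) makes that call for every transaction as it arrives.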
Data science and data analytics are two very different approaches, used to answer different types of questions or to address different steps in the same process. While it’s important not to mix them up, the real issue isn’t about which strategy wins in the data science v data analytics battle. Rather, you need to think carefully about the question you are trying to answer and the technology and techniques that will help you to get there.
After all, the problem isn’t “what’s better: a Kazakh M.C. or a rap emcee?” Both have their (clearly defined) place. It only becomes an issue if you mix them up and end up with the wrong one for your event… or politely watching hours of wedding videos on a frozen park bench.