Here’s a question: does your organization have all the information it could ever need to succeed? I’ve yet to meet a business leader whose answer to that is “Yeah, we’re good — no more knowledge or insight required for us, thank you”.
The more you know and understand about your business, your industry, your market, your economic context, and your customers, the better-informed your decisions will be. The same holds in data science: feeding increasingly nuanced, complete and connected data into your models leads to more accurate, more insightful predictions about the future.
That’s why more and more organizations are looking for high-quality external data to augment their existing data, adding depth and detail to what they already know.
But if you aren’t collecting this data in-house, then where does it come from? Who out there has the answers you seek? What are the most useful, accurate, up-to-the-minute sources? And once you’ve identified them, how do you get that data into your own organization (and into your own data science platforms) swiftly and seamlessly, in a format you can use, ironing out any issues and merging it with your existing data sources?
In a nutshell, data acquisition means finding and bringing in data from outside your organization: any data that you need to “acquire”, rather than data you already have in-house.
Broadly speaking, the term covers the processes of collecting, organizing and cleaning data before you can store it in a data warehouse or feed it directly into a platform for BI, predictive analytics and/or machine learning (ML).
Exactly how you manage this from a technical perspective depends on the 4 Vs of the data involved (Velocity, Volume, Value and Variety). But at a foundational level, you need to think carefully about three things: how this data will intersect with your existing data, how you’ll make sure it’s compatible with your current datasets so that you can quickly bring all the information together into a single version of the truth, and how you’ll then prepare it and feed it into your ML models.
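To make the compatibility question concrete, here is a minimal sketch that normalizes a shared key and measures how many internal records an external dataset would actually match before you commit to a merge. The field names (`postcode`, `median_income`), the sample records and the normalization rule are all assumptions invented for illustration; a real pipeline would likely use a data platform or a library such as pandas instead of plain dictionaries.

```python
def normalize_key(value):
    """Normalize a join key: remove all whitespace and uppercase the result."""
    return "".join(str(value).split()).upper()


def key_coverage(internal_rows, external_rows, key):
    """Share of internal rows whose (normalized) key also exists in the external data."""
    if not internal_rows:
        return 0.0
    external_keys = {normalize_key(row[key]) for row in external_rows}
    matched = sum(
        1 for row in internal_rows if normalize_key(row[key]) in external_keys
    )
    return matched / len(internal_rows)


def merge_on_key(internal_rows, external_rows, key):
    """Left-join external fields onto internal rows; unmatched rows stay as they are."""
    lookup = {normalize_key(row[key]): row for row in external_rows}
    merged = []
    for row in internal_rows:
        extra = lookup.get(normalize_key(row[key]), {})
        # Internal values win if both sides share a field name.
        merged.append({**extra, **row})
    return merged


# Hypothetical sample data: internal customers and an external demographic feed.
internal = [
    {"customer_id": 1, "postcode": "ab1 2cd"},
    {"customer_id": 2, "postcode": "EF3 4GH"},
    {"customer_id": 3, "postcode": "zz9 9zz"},
]
external = [
    {"postcode": "AB12CD", "median_income": 41000},
    {"postcode": "EF34GH", "median_income": 52000},
]

coverage = key_coverage(internal, external, "postcode")
merged = merge_on_key(internal, external, "postcode")
print(f"External data covers {coverage:.0%} of internal records")
```

Checking coverage first is a cheap way to spot a dataset that looks relevant on paper but barely overlaps with the records you already hold.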
And that, of course, takes planning.
Machine learning and advanced analytics require a lot of data. Not just any data, though. The right data — insight-rich, relevant and up-to-date.
Wading through the oceans of data created every second of every day to find the most valuable information for your models would be hard enough if you only had to do it once. But this isn’t a one-time thing: you need to keep overseeing your data sources and pipelines constantly, examining your models to figure out what you’re missing and how they can be improved. This makes data acquisition an ongoing process.
If you treat data acquisition as an ad-hoc task every time you do it, you’ll get unnecessary delays and inefficiencies every time, too. You will waste time looking around and figuring out connections, whilst potentially overspending on datasets of limited value just to get to the sliver of genuinely valuable data each of them contains. That’s not sustainable. It doesn’t make economic sense.
This is why you need a carefully developed, tried-and-tested strategy to get that process right.
The key is to start with the problems you are ultimately trying to solve.
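This is likely to fall into one of two categories. Either you are trying to answer a new question or solve a big new problem, or you are trying to optimize an existing system.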
Each of these priorities demands a different approach. Let’s start with the first one.
Say you have ambitious plans to answer a new question or solve a grand problem. With this kind of challenge, the chances are that the data you have internally doesn’t bring you close to answering the question. You need to think outside the box for this: what alternative sources of data could contain those big answers?
For example, let’s say you’re trying to figure out how a completely new product you plan to launch will fare in a particular market. You don’t have any relevant sales figures of your own to work with, because the product is new. So where can you source useful data to build a predictive model? You might want to look at market trends and industry-wide sales patterns of similar products. You could track and analyze social media mentions to see if a buzz is being generated. You could look at the financial performance of your competitors, for example by studying their business filings and financial reports.
On the other hand, if you are trying to optimize an existing system, your focus isn’t so much on starting from scratch and finding the holy grail of alternative data sources. Rather, you’re making incremental improvements to your datasets or ML models. You’re thinking carefully about where the gaps in your knowledge are and what external sources could fill in the final piece of the puzzle. You’re augmenting your existing datasets, not replacing them.
That could mean accessing seasonal, weather or footfall data to add context to your own transaction or sales data. It could mean adding more sources of demographic data to your churn models to help you figure out which customers you’re losing and why. The intended result might be only a modest lift in sales, revenues or customer lifetime value, but one that adds up to a considerable boost to your bottom line.
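As a small illustration of this kind of augmentation, the sketch below attaches daily rainfall from an external source to internal daily revenue and compares average revenue on wet and dry days. All figures are made up, and the `rain_mm` field and the 1 mm threshold are assumptions chosen for the example rather than recommended values.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical internal data: revenue per day.
daily_sales = {
    date(2024, 1, 1): 1200,
    date(2024, 1, 2): 900,
    date(2024, 1, 3): 1100,
    date(2024, 1, 4): 800,
    date(2024, 1, 5): 1000,
}

# Hypothetical external data: rainfall per day (one day is missing).
daily_weather = {
    date(2024, 1, 1): {"rain_mm": 0.0},
    date(2024, 1, 2): {"rain_mm": 5.2},
    date(2024, 1, 3): {"rain_mm": 0.4},
    date(2024, 1, 4): {"rain_mm": 7.0},
}


def add_weather_context(sales, weather):
    """Attach rainfall to each sales day; days without weather data get None."""
    rows = []
    for day, revenue in sorted(sales.items()):
        conditions = weather.get(day)
        rain = conditions["rain_mm"] if conditions else None
        rows.append({"date": day, "revenue": revenue, "rain_mm": rain})
    return rows


def average_revenue_by_rain(rows, threshold_mm=1.0):
    """Average revenue for wet and dry days, skipping days with no weather data."""
    groups = defaultdict(list)
    for row in rows:
        if row["rain_mm"] is None:
            continue
        label = "wet" if row["rain_mm"] >= threshold_mm else "dry"
        groups[label].append(row["revenue"])
    return {label: mean(values) for label, values in groups.items()}


enriched = add_weather_context(daily_sales, daily_weather)
averages = average_revenue_by_rain(enriched)
print(averages)
```

The existing sales data stays intact here. The external source only adds a column, which matches the idea of augmenting what you already have instead of replacing it.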
Data acquisition is hard. Accessing the right data sources, making sure you don’t make dud purchases, feature engineering, setting up the right production pipelines, auditing… all of these tasks are complex to manage and easy to mess up, which can be frustrating when you’re on a tight turnaround to get insights and results.
That’s why it’s vital to factor the tools and platforms you’ll use for all this into your planning from the beginning. How can you automate data discovery? How can you make connections to external datasets easier? How can you streamline cleaning and formatting tasks? How can you ensure that you’re paying just for the data points you need within a dataset? A top-end data science platform takes care of a lot of this for you, leaving you with the time and headspace to focus on the important part: strategy.