Finding the right vendor for alternative data is hard. Each vendor has its own processes and schemas, down to the format datasets are provided in and the mode of delivery. Even if you’re buying access to a single data source, you may struggle to differentiate between the versions offered by different data providers — and getting it wrong can be an expensive mistake. If you’re looking for a data provider to be a one-stop shop for all your data needs, the last thing you want is to get locked into a vendor you can’t rely on.
Ultimately, though, the best way to judge a data provider is on the data they provide. This data dictates the future success of your machine learning models. If you use data that’s subpar or unfit for purpose, it will put you in technical debt. You’ll create problems in your models that, inevitably, will have to be fixed. The further along the project you are, the harder that will be. That’s why it’s so important that you get the right data, right from the beginning.
It’s also vital that you view both the data you already have and the data you acquire in terms of ROI, rather than simply as an unavoidable cost. Data should be a highly valuable resource that leads to more effective (and ultimately profitable) decision-making. That means you need to think carefully about whether your data provider can offer data that meets the standard you need, in a format you can use, delivered in such a way that it can feed straight into your data pipelines. You aren’t simply comparing data providers on the price of an individual dataset, but on what that dataset means for the success and efficiency of your machine learning project, the value of your predictive insights, and your business in the long run.
In short, comparing data providers starts with comparing the data they offer. When deciding if they’re worth it, it helps to focus on three key areas: coverage, quality and relevance.
The first question to ask is: does this dataset match my question?
A dataset may seem perfect at first glance, but make sure it really does tell you what you need to know. For example, if you’re looking at data on house prices, does it actually cover the geographic region you’re examining? If you’re looking to bring economic data into your models for context, does it cover the same time periods as the other datasets you’re using?
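Checks like these can be automated before you commit to a purchase. Below is a minimal sketch of a coverage screen on a sample extract, assuming a pandas DataFrame with hypothetical `region` and `date` columns — the names and the house-price example are illustrative, not tied to any specific vendor:

```python
import pandas as pd

def check_coverage(df, required_regions, start, end):
    """Report which required regions a dataset misses and whether it spans the period."""
    missing_regions = set(required_regions) - set(df["region"].unique())
    covered_start, covered_end = df["date"].min(), df["date"].max()
    covers_period = (covered_start <= pd.Timestamp(start)
                     and covered_end >= pd.Timestamp(end))
    return {"missing_regions": sorted(missing_regions),
            "covers_period": covers_period}

# A house-price sample that skips one region you need and starts a month late
sample = pd.DataFrame({
    "region": ["London", "Manchester", "London"],
    "date": pd.to_datetime(["2020-01-31", "2020-06-30", "2020-12-31"]),
    "avg_price": [480_000, 210_000, 495_000],
})
report = check_coverage(sample, ["London", "Manchester", "Leeds"],
                        "2020-01-01", "2020-12-31")
```

Running this on even a small sample extract from the vendor tells you immediately whether the dataset matches your geographic and temporal scope, before any money changes hands.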
Next question: how accurate is this data?
The main thing you’re looking for is gaps and errors. Are there a lot of empty fields? Or glaring omissions of key data points you’d need to make this data useful?
Also, is it already out of date? Sometimes the answer is obvious; if the dataset still lists Steve Jobs as CEO of Apple, obviously don’t use it. But it’s particularly important to pay close attention to this issue right now, as things are changing so fast and you could easily get caught out. Major brands that were financially sound a year ago have filed for bankruptcy amid the havoc of COVID-19. Political upheaval has seen unexpected changes in government leadership. Market trends that seemed established in January 2020 bore no relation to facts on the ground in January 2021.
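Both gaps and staleness can be screened in a few lines on a sample extract. A minimal sketch, again assuming a pandas DataFrame — the column names, the 5% missing-value tolerance and the 90-day freshness threshold are assumptions for illustration, not universal standards:

```python
import pandas as pd

def quality_screen(df, date_col, max_age_days=90, max_missing_rate=0.05):
    """Flag columns with too many empty fields and check how fresh the data is."""
    missing_rates = df.isna().mean()  # fraction of empty cells per column
    gappy_columns = missing_rates[missing_rates > max_missing_rate].index.tolist()
    age_days = (pd.Timestamp.now() - df[date_col].max()).days
    return {"gappy_columns": gappy_columns, "stale": age_days > max_age_days}

# A sample where a key field is mostly empty and nothing is recent
sample = pd.DataFrame({
    "as_of": pd.to_datetime(["2020-01-31", "2020-02-29", "2020-03-31"]),
    "ceo": ["Tim Cook", None, None],     # two of three rows empty
    "revenue_bn": [91.8, 58.0, 59.7],
})
report = quality_screen(sample, "as_of")
```

The thresholds that count as “too gappy” or “too stale” depend entirely on your use case — the point is to make them explicit and test every candidate dataset against the same bar.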
Finally, ask yourself: does this data help to answer my question?
Even if a dataset has excellent coverage, the actual data points it contains may not be at all relevant to the question you are trying to answer. Let’s say you’re trying to figure out where to locate physical branches of your store and you want to know where your target customers spend their time. The dataset you’re looking at might contain fantastic, comprehensive data on customer footfall and sales trends in the locations you’re considering, but if it’s missing information on the demographics of those shoppers, and that’s the piece of the puzzle you’re missing, the dataset is not relevant to your question.
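One concrete way to catch this early is to write down the fields your question actually requires and diff them against the vendor’s schema. A minimal sketch — the field names are hypothetical, chosen to mirror the store-location example above:

```python
# Fields the store-location question needs vs. what the dataset's schema offers
required_fields = {"location", "footfall", "sales_trend", "shopper_demographics"}
offered_fields = {"location", "footfall", "sales_trend", "dwell_time"}

missing = sorted(required_fields - offered_fields)
# The dataset only answers the question if nothing required is missing
relevant = not missing
```

A set difference is trivial, but doing it deliberately — against your question rather than against the dataset’s marketing sheet — is what stops a comprehensive-looking dataset from quietly failing to answer the question you bought it for.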
Note that external data rarely takes the place of internal datasets entirely. Usually, you’re looking to augment or extend the data you have with additional information and insights: filling in gaps, providing additional context and generating new features that increase its value and remove whatever is holding back your machine learning project. This means the data you acquire has to be highly relevant to your existing data and the details it’s currently missing. You’re looking for granular, specific data points that slot into and enrich your existing datasets.
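In practice, “slotting in” usually means the external data joins cleanly onto your internal keys. A minimal sketch using a pandas left join — the key and column names are illustrative:

```python
import pandas as pd

# Internal sales data, missing demographic context
internal = pd.DataFrame({
    "store_id": ["S1", "S2", "S3"],
    "monthly_sales": [120_000, 95_000, 143_000],
})

# External dataset keyed the same way, supplying the missing feature
external = pd.DataFrame({
    "store_id": ["S1", "S2", "S3"],
    "median_shopper_age": [34, 41, 29],
})

# A left join preserves every internal row while adding the new feature
enriched = internal.merge(external, on="store_id", how="left")
# Rows the external dataset could not enrich show up as NaN
unmatched = int(enriched["median_shopper_age"].isna().sum())
```

The unmatched count is worth checking on any trial extract: an external dataset that can’t join onto most of your internal records adds little value, however rich its contents.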
Note that it’s much easier to do this if your data provider offers a platform that automates both your connections to external data sources and the process of searching for relevant data points. Otherwise, it's kind of a needle-in-a-haystack scenario.
If the past year has proved anything to data scientists, it’s that economic, market, social and political realities can change utterly in the blink of an eye, making all your historical data and trends redundant. This means you need a way to tap into new signals and vast amounts of up-to-the-minute historical data without delay, to make sense of these changing realities and predict what’s going to happen next. Unless your data provider has an easy way to connect to continually updating datasets — and to automate your search for new, relevant data sources — you simply won’t be able to keep up in a crisis.
Unfortunately, this makes choosing a data provider even more difficult. Not only are you judging them on the quality, relevance and coverage of the data they currently provide, you’re also making a call on whether they’re set up to continually obtain and provide the best data going forward. Don’t leave this part of your assessment up to chance. Talk to each potential data provider about how they navigate this challenge and how they’ll help you update your models with the right data during times of upheaval.
Read how Melio Payments fuel hypergrowth with better alternative data in this case study.