    Congratulations! You’ve embraced machine learning and data science and your organization is well on its way to building a system that helps you deploy predictive analytics for greater insights. You’ve already done a lot of the heavy lifting, from finding the right platforms and tools to help you build your models to understanding the steps required to empower machine learning in your organization. The good news is that you’re on the right track to building a data science infrastructure. 

    The better news is that now you’re ready to go beyond your own data and into the wide world of external data. Running machine learning models exclusively on your internal data is a great start, and it can give you some powerful insights. However, the results will always be somewhat limited, and your models may even fail outright when presented with data and conditions radically different from the historical data they were trained on.

    The answer? Feed your models external datasets that give them greater context, everything from shopping trends to social media activity to census and demographic data, and they’ll offer you deeper insights in return. But where do you acquire this data? And how do you best use it?

    You could certainly collect terabytes of raw data from open databases, which are readily available across the web for free or for a fee. That may seem like the easiest path to the data you need, but you would quickly run into a variety of issues, from an excess of irrelevant data to the sheer difficulty of cleaning such a massive haul into something usable. There’s a better way: you just have to find the right machine learning data catalogs for your needs.

    External databases: great idea, not-so-great execution

    If you choose the external database path, however, you would soon find some not-so-great news. All of those massive sources chock full of data? They might not be as useful as you thought. The problem isn’t that they don’t have enough data — quite the opposite. These giant databases might have too much data, making it hard to determine what’s relevant, and how much work you need to do to make it usable by your models. 

    Databases aren’t always concerned with doing the legwork to clean the data they hold; they’re simply repositories for thousands of data points. That isn’t a problem when you have a team of data scientists with unlimited time to clean, parse, and harmonize the data. However, if you’re just starting your data science journey, such a massive undertaking simply isn’t feasible.

    The problem is that while more data is generally a good thing, a database that isn’t properly labeled, or that is full of irrelevant or poorly formatted data, can significantly hamper your ML models’ accuracy and the insights they produce. The issue is compounded when you add more databases to the mix without taking the time to properly prepare them.

    Imagine you’re planning a new promotion, and you want to identify the best locations to launch your products to maximize revenue. You find a nice, big database full of previous sales numbers and other marketing metrics. It looks perfect, but almost instantly you run into problems.

    For one, the table mixes columns from multiple sellers, each with its own labeling system. Worse, some sellers don’t even collect the same data, so you’re stuck with columns full of values you can’t interpret, and others that are empty except for entries like “NaN”, “Don’t know”, and “Other: unspecified”. Sure, you can eventually sort it all out and extract some great nuggets, but you’ll likely spend a month just figuring out what you’re looking at, and by the time you’re ready to run your model, it will probably need an overhaul. The question, then, is how you can make all this data work for you now.
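
    To make the pain concrete, here’s a minimal sketch of the kind of cleanup such a table forces on you, assuming the merged seller export loads into a pandas DataFrame. The file name, column mappings, and sentinel values are all hypothetical stand-ins:

```python
import numpy as np
import pandas as pd

# Hypothetical merged export: each seller used its own column names
# and its own way of saying "no data".
df = pd.read_csv("merged_seller_sales.csv")

# Step 1: harmonize column names across sellers. The mapping itself is
# guesswork you'd have to reverse-engineer, seller by seller.
df = df.rename(columns={
    "rev": "revenue",
    "region_code": "region",
    "qty": "units_sold",
})

# Step 2: collapse the many spellings of "missing" into real NaNs,
# then force the revenue column back to numbers.
sentinels = ["NaN", "Don't know", "Other: unspecified", ""]
df = df.replace(sentinels, np.nan)
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")

# Step 3: drop columns that are mostly empty, then drop rows missing
# the fields the model actually needs.
df = df.dropna(axis="columns", thresh=int(0.5 * len(df)))
df = df.dropna(subset=["revenue", "region"])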

    Data catalogs: the easier way to find external data 

    Let’s go back to that massive, garbled database one more time, and imagine that instead of the jumbled mess of columns, labels, and null values you found on some random website, you’re offered a better way. Every data point has been harmonized, the numbers match up, and even the columns are logically sorted. Much easier to work with, right?

    If databases give you an often confusing mess of data points, data catalogs give you the opposite. When you’re looking for external data, what you really need is a source you can quickly plug into your machine learning models for training, without worrying that bad inputs will skew your predictions. Data catalogs differ from your standard database in one crucial way: they’re curated and ready to use.

    Let’s say you run analytics for a manufacturing company, and you need to understand consumer trends in order to plan your production schedule for the next few months. A database may have all the answers you need, if you can spare the month it takes to trawl through thousands of data points to find the few that matter.

    A data catalog, on the other hand, will include a fully curated collection of datasets that have already been cleaned, vetted, and prepared for training and testing. Instead of wondering about how users are interacting with your products, you can find demographic data, social media sentiment analysis, and even external trends data that add context to your existing models. 
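
    As a rough sketch of how much shorter that path gets, imagine the catalog hands you a tidy demographics table keyed by the same region codes as your internal sales history. The file names, columns, and join key are hypothetical, and the model is just a placeholder scikit-learn regressor:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Internal sales history: region, month (1-12), units_sold.
sales = pd.read_csv("internal_sales.csv")
# Curated external dataset from a catalog: region, median_income, population.
demographics = pd.read_csv("catalog_demographics.csv")

# Because the catalog data is already harmonized, enrichment is a
# single join on the shared key instead of weeks of cleanup.
data = sales.merge(demographics, on="region", how="left")

features = ["month", "median_income", "population"]
X_train, X_test, y_train, y_test = train_test_split(
    data[features], data["units_sold"], test_size=0.2, random_state=42
)

model = GradientBoostingRegressor().fit(X_train, y_train)
print(f"Holdout R^2: {model.score(X_test, y_test):.2f}")
```

    The point isn’t the particular model; it’s that the join and the training run fit in twenty lines because the cleaning already happened upstream.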

    In this way, data catalogs solve two of your biggest problems: having the right data to feed your models, and having it exactly when you need it, not three weeks from now. If you want to move your machine learning models into production quickly, you need to know that the data feeding them is solid, and that you’ll be getting gold whenever you connect to an external data source, not mountains of lead.

    Quality over quantity, especially for your ML data

    It’s tempting to think that since ML models require training, you should feed them as much data as possible, but there’s a limit to that approach’s effectiveness. Much like a pile of jumbled study notes before a big test, a mountain of useless data may hide some great insights, but the cost of digging them out, in time, resources, and money, is simply not worth it.

    On the other hand, finding the right data catalog, one where the data has already been cleaned and filtered down to what’s relevant to you, can get you much better results faster and at a fraction of the cost. Instead of hunting for a needle in a haystack, simply buy a catalog full of needles, and watch how quickly your machine learning models go from good to great.
