Data Scientists and Augmented Data Discovery: A Match Made in Heaven

Instead of deploying their talents and knowledge, most data scientists today are still forced to slog through the data discovery and acquisition process, which can take anywhere from a few days to a few months.

What if you could cut that down to a few minutes instead? Data science automation has historically focused on hyperparameter tuning and model optimization, but now it’s time to see how new tools can empower data scientists to use more and better data.

This whitepaper covers:

  • Why augmented data discovery is a major benefit, not a major drawback
  • How innovations in machine learning are focused on simplifying the data discovery and acquisition process
  • How augmenting data discovery enriches your machine learning models and gives you better results

Augmented Data Discovery

Imagine there’s a huge catalog of all of the data in the world and you’re able to automatically join your training data with any relevant item in this catalog. For example, if your data science team has “company name” as one of the columns in your training data, this tool can automatically join it with financial data from Bloomberg. Or, if your team has a product SKU in the training data, the tool can automatically join it with Amazon customer reviews.
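To make the idea of an automatic join concrete, here is a minimal sketch in plain Python. It assumes the catalog entry has already been found and loaded as a list of rows; the function name `left_join`, the column names, and the sample values are all hypothetical and stand in for whatever a real tool would provide.

```python
def left_join(training_rows, catalog_rows, key):
    """Attach catalog columns to each training row that shares the same key value.

    Training rows without a match are kept, with the new columns set to None.
    """
    # Index the catalog by key so each lookup is a dictionary access.
    lookup = {row[key]: row for row in catalog_rows}
    # Every catalog column except the join key becomes a new feature column.
    extra_columns = sorted({col for row in catalog_rows for col in row} - {key})

    joined = []
    for row in training_rows:
        match = lookup.get(row[key], {})
        new_features = {col: match.get(col) for col in extra_columns}
        joined.append({**row, **new_features})
    return joined


# Hypothetical training data with a "company_name" column.
training = [
    {"company_name": "Acme", "churned": 1},
    {"company_name": "Globex", "churned": 0},
]
# Hypothetical external financial data keyed on the same column.
financials = [
    {"company_name": "Acme", "revenue_musd": 12.5},
]

enriched = left_join(training, financials, "company_name")
```

In this sketch every training row survives the join, so the enriched table can be passed to the same training code as before, with the unmatched rows carrying missing values in the new column.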

Using this new capability, we could expand the definition of an instance in the search space to include not only machine learning algorithms and hyperparameters but also joined datasets. We would then be able to search this vast new space for the best combination of data, algorithms, and hyperparameters.
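One way to picture the expanded search space is as a product of three choices. The sketch below is an illustration only: it assumes each instance uses a single joined dataset (a real tool could combine several), and the dataset, algorithm, and hyperparameter names are hypothetical.

```python
from itertools import product
from math import prod


def search_space(datasets, algorithms, hyperparams):
    """Yield every (dataset, algorithm, hyperparameters) instance in the space."""
    names = list(hyperparams)
    for dataset, algorithm in product(datasets, algorithms):
        for values in product(*(hyperparams[name] for name in names)):
            yield {
                "dataset": dataset,
                "algorithm": algorithm,
                "params": dict(zip(names, values)),
            }


def space_size(datasets, algorithms, hyperparams):
    """Count the instances without enumerating them."""
    return len(datasets) * len(algorithms) * prod(len(v) for v in hyperparams.values())


# Hypothetical choices: "training_only" means no external dataset is joined.
datasets = ["training_only", "financials", "reviews"]
algorithms = ["logistic_regression", "gradient_boosting"]
hyperparams = {"learning_rate": [0.01, 0.1], "max_depth": [3, 5, 7]}

total = space_size(datasets, algorithms, hyperparams)  # 3 * 2 * (2 * 3) = 36
instances = list(search_space(datasets, algorithms, hyperparams))
```

Because the size is a product, every additional candidate dataset multiplies the number of instances, which is why a catalog with thousands of entries cannot be searched exhaustively.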

There are two main building blocks missing to actually implement such a search:

  1. A catalog: We currently do not have a catalog of all of the datasets in the world, and building one is a monumental challenge.
  2. Efficient search algorithms: The new search space is enormous! To search it, we will need new algorithms that efficiently choose which instances in the space to evaluate. Just as random and grid search in the hyperparameter tuning case evolved into Bayesian optimization and Tree-structured Parzen Estimators (TPE), we will need to come up with new methods to run an efficient search.
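For reference, random search, the simplest baseline mentioned above, could be extended to the expanded space roughly as follows. This is a sketch under stated assumptions, not a description of any particular tool: `toy_evaluate` is a hypothetical stand-in for training a model and measuring validation accuracy, and its numbers are made up for illustration.

```python
import random


def random_search(datasets, algorithms, hyperparams, evaluate, n_trials, seed=0):
    """Sample random instances and return (best_score, best_candidate)."""
    rng = random.Random(seed)
    best_score, best_candidate = None, None
    for _ in range(n_trials):
        candidate = {
            "dataset": rng.choice(datasets),
            "algorithm": rng.choice(algorithms),
            "params": {name: rng.choice(values) for name, values in hyperparams.items()},
        }
        score = evaluate(candidate)
        if best_score is None or score > best_score:
            best_score, best_candidate = score, candidate
    return best_score, best_candidate


def toy_evaluate(candidate):
    """Hypothetical scoring function; a real one would train and validate a model."""
    algorithm_base = {"logistic_regression": 0.70, "gradient_boosting": 0.75}
    dataset_gain = {"training_only": 0.0, "financials": 0.05, "reviews": 0.02}
    depth_bonus = 0.01 if candidate["params"]["max_depth"] == 5 else 0.0
    return algorithm_base[candidate["algorithm"]] + dataset_gain[candidate["dataset"]] + depth_bonus


datasets = ["training_only", "financials", "reviews"]
algorithms = ["logistic_regression", "gradient_boosting"]
hyperparams = {"max_depth": [3, 5, 7]}

best_score, best_candidate = random_search(
    datasets, algorithms, hyperparams, toy_evaluate, n_trials=10
)
```

Random search spends the same effort on unpromising datasets as on promising ones. Smarter methods would use the scores observed so far to decide which instance to evaluate next.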

Luckily, tools are being developed to handle both of these challenges. One of the biggest innovations driving machine learning today is the ability to scour thousands of datasets, both private and public, without having to individually scan each one for matches or relevance. This new searchability means that machine learning models are less reliant on internal data and can perform better in both testing and deployment because they draw on a much broader foundation of data.