Instead of deploying their talents and knowledge, most data scientists today are still forced to slog through the data discovery and acquisition process, which can take anywhere from a few days to a few months.
What if you could cut that down to a few minutes instead? Data science automation has historically focused on hyperparameter tuning and model optimization; now it's time to see how new tools can empower data scientists to use more, and better, data.
This whitepaper covers:
Imagine there's a huge catalog of all the data in the world, and you can automatically join your training data with any relevant item in that catalog. For example, if your data science team has "company name" as one of the columns in your training data, this tool can automatically join it with financial data from Bloomberg. Or, if your team has a product SKU in the training data, the tool can automatically join it with Amazon customer reviews.
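At its core, this kind of enrichment is a join on a shared key. The sketch below illustrates the idea with pandas; the tables, column names, and values are entirely illustrative (not a real Bloomberg schema), assuming the catalog tool has already surfaced a matching external dataset:

```python
import pandas as pd

# Hypothetical training data keyed by company name.
train = pd.DataFrame({
    "company_name": ["Acme Corp", "Globex", "Initech"],
    "churned": [0, 1, 0],
})

# Stand-in for an external financial dataset discovered in the catalog.
financials = pd.DataFrame({
    "company_name": ["Acme Corp", "Globex"],
    "revenue_musd": [120.5, 87.3],
})

# A left join keeps every training row and enriches the matches
# with new candidate features; unmatched rows get NaN.
enriched = train.merge(financials, on="company_name", how="left")
print(enriched.shape)  # (3, 3)
```

In practice the hard part is exactly what this sketch assumes away: discovering which external table to join and on which key.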
With this capability, we could expand the definition of an instance in the search space to include not only machine learning algorithms and hyperparameters but also joined datasets. We could then run a search through this new, vastly larger space to find the best combination of data, algorithms, and hyperparameters.
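Such a combined search can be sketched as a grid over (dataset variant, model, hyperparameter) triples, scored by cross-validation. The data below is synthetic and the candidate lists are placeholders; this is a minimal illustration of the expanded search space, not a production AutoML loop:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_base = rng.normal(size=(200, 3))
y = (X_base[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Candidate feature sets: the base table alone, and the base table
# augmented with extra (here synthetic) columns standing in for joined data.
X_joined = np.hstack([X_base, rng.normal(size=(200, 2))])
datasets = {"base": X_base, "base+external": X_joined}

# Candidate models and hyperparameters share the same search space.
models = {
    "logreg": (LogisticRegression, {"C": [0.1, 1.0]}),
    "forest": (RandomForestClassifier, {"n_estimators": [50]}),
}

best = None
for ds_name, X in datasets.items():
    for model_name, (cls, grid) in models.items():
        for param, values in grid.items():
            for v in values:
                clf = cls(**{param: v})
                score = cross_val_score(clf, X, y, cv=3).mean()
                if best is None or score > best[0]:
                    best = (score, ds_name, model_name, {param: v})

print(best)  # best (score, dataset, model, hyperparameters) found
```

The point is that the dataset choice is searched over exactly like a hyperparameter, which is what makes automated data discovery a natural extension of existing AutoML.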
There are two main building blocks missing to actually implement such a search:
Luckily, tools are being developed to handle both of these challenges. One of the biggest innovations driving machine learning today is the ability to search thousands of datasets, both private and public, without having to scan each one individually for matches or relevance. This new searchability means that machine learning models are less reliant on internal data alone, and can perform better in both testing and deployment thanks to a much broader data foundation.
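One simple way to rank catalog datasets for relevance, without scanning each one by hand, is to score them by how many of your training join keys they cover. The catalog names and keys below are purely illustrative:

```python
# Toy relevance ranking: score each catalog dataset by the fraction
# of our training join keys it can match (all names are illustrative).
train_keys = {"ACME-001", "GLB-002", "INI-003", "UMB-004"}

catalog = {
    "amazon_reviews": {"ACME-001", "GLB-002", "XYZ-009"},
    "financials":     {"ACME-001", "GLB-002", "INI-003"},
    "weather":        {"ZZZ-111"},
}

# Rank by key coverage; drop datasets with no overlap at all.
ranked = sorted(
    ((name, len(keys & train_keys) / len(train_keys))
     for name, keys in catalog.items()
     if keys & train_keys),
    key=lambda item: item[1],
    reverse=True,
)
print(ranked)  # [('financials', 0.75), ('amazon_reviews', 0.5)]
```

Real catalog tools use far richer signals (schema matching, value distributions, metadata), but key overlap captures the basic idea of automated relevance scoring.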