So, you’ve built a dataset you’re happy with, and your machine learning (ML) model is ready to start making predictions left and right. Easy as pie, right? Well, it depends. Once you have a dataset that’s worth using to train your models, you need to build a feature set that will actually get you the predictions you want. The real question is, how do you go about finding the right features?
Normally, you’d look through your data, finding the columns you think will help predict your unknown variable. This includes tapping into your (or your organization’s) domain knowledge, your general expertise in data science, and other factors such as previous features that yielded positive results for similar questions. This method might even give you good results — eventually.
However, long before you see a positive yield, you’ll run into more than a few difficulties. For one, the number of features you can come up with manually is fairly limited, even if you spend weeks thinking up new ones. This isn’t a knock on data scientists’ abilities, but rather the simple fact that we’re often constrained by what we know and perhaps more so by what we don’t. We’re primed to see certain connections based on experience and expertise but tend to ignore potential links we’re not actively looking for.
The other difficulty is that even if you do manage to make an impressive list, you still need to test each individual feature before knowing if it’s valid, useful, and relevant. If you’re building a small model, or using a relatively small dataset, this isn’t a major issue. But what happens when you need to deploy models that can scale? Or when you’re working with massive datasets? This is when you need more — you need augmented feature engineering.
Automated feature engineering is the process of taking your dataset and automatically extracting the best predictive variables — or features — for your ML model’s questions using AI-powered tools. Unlike manually doing this, automated feature engineering and generation means that you can cover more ground and look for connections in your data that may not be readily apparent to the human eye.
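To make the idea concrete, here is a small, generic sketch of automatic feature generation — not Explorium’s implementation, and the data and column names are invented. Instead of hand-picking a few features, it crosses every numeric column with every aggregation function to enumerate candidates mechanically:

```python
import pandas as pd

# Hypothetical customer activity data; columns are illustrative only.
txns = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [20.0, 35.0, 15.0, 50.0, 5.0],
    "session_seconds": [120, 300, 60, 240, 90],
})

# Cross every numeric column with every aggregation, instead of
# manually deciding which combinations might be predictive.
aggs = ["mean", "max", "min", "sum", "count"]
features = txns.groupby("customer_id").agg(
    {col: aggs for col in ["amount", "session_seconds"]}
)
features.columns = ["_".join(c) for c in features.columns]

# 2 columns x 5 aggregations = 10 candidate features per customer
print(features.columns.tolist())
```

Even this toy version produces ten candidates from two columns; with hundreds of columns and richer transformations, the candidate space quickly grows far beyond what anyone would enumerate by hand.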
In today’s fast-paced world — where relevance is measured in minutes and hours, not days and weeks — your data science pipeline needs to be streamlined. Manually looking for the right features might work in an academic setting, but not if you need to demonstrate a real ROI quickly. Explorium’s AI-powered augmented feature generation gives you hundreds of possible features in seconds. More than just producing a pile of possible features, though, Explorium takes the process a step further by optimizing your feature set to give you the most relevant features for your predictive question. This includes those that are most likely to give your models a significant uplift, based on your dataset.
Let’s take a deeper look at how exactly Explorium generates your features with an example.
Imagine you run an eCommerce website looking to capitalize on a major holiday sales event coming up. This is a major day for you for two reasons — it’s an opportunity to score major revenues, and it’s also a great day to capture new customers and create repeat ones. However, it’s hard to pick out repeat customers based on the limited data you’ve collected (things like clickstream, purchase history, and a few other touchpoints on your website). This is where Explorium comes in.
The first step is to enrich your dataset using Explorium’s augmented data discovery. Once you upload your internal data into the platform, Explorium will scour our thousands of external data sources to find the most relevant data points and columns to bolster your dataset.
The process takes a few minutes, but the end result is a much broader perspective on your potential customers. Your website’s analytics are now enriched with datasets including demographic information, US rental statistics, individual media consumption, and even the weather. With that in place, it’s time to start choosing features.
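Conceptually, this kind of enrichment is a join between your internal records and external sources on shared keys such as ZIP code. A minimal sketch with invented data (the external table and its columns are hypothetical stand-ins):

```python
import pandas as pd

# Internal customer data (illustrative).
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "zip_code": ["10001", "94103", "60601"],
})

# A hypothetical external source keyed on ZIP code,
# e.g., rental statistics and local weather.
zip_stats = pd.DataFrame({
    "zip_code": ["10001", "94103", "60601"],
    "median_rent": [3500, 3100, 2100],
    "avg_temp_f": [55.0, 60.0, 48.0],
})

# Left join so every internal record is kept even without a match.
enriched = customers.merge(zip_stats, on="zip_code", how="left")
```

A left join is the natural choice here: you never want enrichment to drop rows from your own data just because an external source has no match.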
From here, our AI-powered feature engineering tools take over, starting with various data matching methods. Instead of running each of these individually and one at a time, however, Explorium quickly runs all of them in parallel to build a potential feature set that has thousands of entries.
Some of the matching techniques the platform uses include text analysis and natural language processing (NLP) to find possible matches in plain text, social media data, and search queries; time series analysis to extract features on seasonality, trend changes, and the impact of specific dates such as holidays (or, in this case, major sales events); and geospatial data matching using latitude and longitude, ZIP codes, local property attributes, and footfall data.
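As a generic illustration of the time-series side of this (again, not Explorium’s actual pipeline — the dates and the sale-event list are made up), date columns alone can yield several seasonality and special-date features:

```python
import pandas as pd

orders = pd.DataFrame({
    "order_date": pd.to_datetime(["2023-11-24", "2023-12-25", "2024-01-03"]),
})

# Simple date-derived features: seasonality signals and event flags.
orders["day_of_week"] = orders["order_date"].dt.dayofweek  # Monday = 0
orders["month"] = orders["order_date"].dt.month
orders["is_weekend"] = orders["day_of_week"] >= 5

# Hypothetical list of major sale-event dates (e.g., Black Friday 2023).
sale_events = {pd.Timestamp("2023-11-24")}
orders["is_sale_event"] = orders["order_date"].isin(sale_events)
```

Each derived column is itself a candidate feature, which is part of why the candidate pool grows so quickly once date, text, and location matching all run in parallel.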
However, it’s not enough to simply generate the features, so Explorium will now filter, test, and rank them. Let’s say Explorium scanned your enriched dataset and came up with over 1,300 new potential features. These will range from the highly relevant (such as the average time between activities) to those with only a tangential impact on your models (like the type of lender used for a first home loan).
Obviously, not every single feature will be equally relevant to your predictive question. So the platform now optimizes your feature list based both on your dataset and on which features offer the biggest uplift, so that out of an initial 1,300, you might end up with the 50 or 100 features that are most impactful.
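A common open-source way to approximate this kind of filtering is to score every candidate feature against the target — for instance with mutual information — and keep only the top scorers. This is a generic sketch on synthetic data, not Explorium’s ranking method:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic stand-in for an enriched dataset: 100 candidate features,
# of which only 5 are actually informative.
X, y = make_classification(
    n_samples=500, n_features=100, n_informative=5,
    n_redundant=0, random_state=0,
)

# Score each feature's dependence on the target, then keep the top 10.
scores = mutual_info_classif(X, y, random_state=0)
top_k = np.argsort(scores)[::-1][:10]
print(top_k)
```

Filter methods like this are cheap enough to run over thousands of candidates; more expensive wrapper methods (training models on feature subsets) can then refine the shortlist.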
For example, after enriching your customers’ ZIP codes, you might find that marital status, or even the minimum temperature over a two-day window (derived from the purchase date combined with the customer’s ZIP code), could be an important predictor. Even better, Explorium lets you scan through the features yourself and choose different combinations to test and build individual lists.
More importantly, you can write your own code — known as “creatives” in our platform — which lets you add your domain knowledge to the feature generation process. This way, you can steer the platform to specific features and ideas that you think could have a big impact on your models. This means that you can dig even deeper into the Explorium Enrichment Catalog to find the best features and data for your predictive questions.
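A domain-knowledge feature of the kind you might contribute could look something like the following — this is purely illustrative, and Explorium’s actual “creatives” interface may differ; the data and column names are invented:

```python
import pandas as pd

orders = pd.DataFrame({
    "num_purchases": [3, 1, 8],
    "days_since_signup": [30, 5, 400],
})

# Domain insight: purchase *frequency* often predicts repeat behavior
# better than raw purchase counts, which favor long-tenured customers.
orders["purchases_per_week"] = (
    orders["num_purchases"] / (orders["days_since_signup"] / 7)
)
```

The point is that a single line of code can encode an insight no automated search would prioritize on its own, which is why combining generated and hand-written features tends to beat either alone.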
Using your top features, you can now start training a variety of different ML models to find the one that gives you the biggest uplift and best results based on your predictive question. Instead of simply trying everything to see what sticks, Explorium will use your unique selected feature sets (or even the complete list if you want) to train a variety of different models. You’ll be able to see how each performs using the features you selected, including the AUC, precision, and accuracy scores.
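Outside the platform, the same compare-several-models-on-one-feature-set loop looks roughly like this with scikit-learn (synthetic data; an illustrative sketch, not Explorium’s training pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for your selected feature set and target.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
}

# Train each model on the same features and compare the metrics
# mentioned above: AUC, precision, and accuracy.
results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    proba = model.predict_proba(X_te)[:, 1]
    results[name] = {
        "auc": roc_auc_score(y_te, proba),
        "precision": precision_score(y_te, pred),
        "accuracy": accuracy_score(y_te, pred),
    }
```

Holding the feature set fixed while varying the model is what makes the comparison meaningful: any difference in scores is attributable to the model, not the inputs.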
Let’s go back to your eCommerce website. After selecting the top features, you run a few different models and find, after reviewing the highest-scoring one, that the top features include the average time between activities on your site, whether the customer is a 21-year-old woman, and whether they’re a heavy Twitter user. Using this, combined with your existing data, you can start making better predictions, tailoring your advertising and even dynamic website content to each customer, and building a much larger repeat customer base.
Finding the right features for your ML models is crucial, but not when it comes at the expense of your scalability and ability to deploy when you need to. Augmented feature generation and engineering with Explorium gives you the best of both worlds — thousands of potential features, delivered in seconds and ranked by relevance.
Even better, instead of spending weeks testing each individually, you can know which features will deliver the biggest impact to your models immediately. And, if you have specific domain knowledge, you can build even better features with Explorium’s creatives. If you need to build scalable, impactful models, automated feature engineering is no longer optional.