Demystifying Feature Selection: Filter vs Wrapper Methods

August 11, 2019 · Aviv Nutovitz · Data Science

As data sources around us multiply exponentially (both in volume and variety), data science teams have the potential to generate more and more features for their organizations.

For example, building a stock prediction system a couple of years ago would probably have meant relying primarily on revenue-related features. But now we can grab large amounts of accurate online information – employee growth rates, online reviews, web traffic, search, C-level profiles, and more.

So, as more and more features become available for machine learning projects, feature selection algorithms are growing in significance. In this article, we will cover (and compare) two popular feature selection methodologies – Filter and Wrapper.

The Problem

At Explorium, we have teams working around the clock to find the best possible data to improve our customers’ prediction models. But with the number of accessible features for the models growing exponentially, this has become a difficult task. My team is responsible for the feature selection algorithms and strategies, and I would like to share some of my experience.

In order to examine the two feature selection methodologies, let’s take a look at a small sample of our Melbourne housing prices dataset. After running the Kaggle dataset on the Explorium platform, we can categorize all table columns into 3 groups. Let’s look at a small subset of them. The orange-colored columns are features from the original dataset.

The Data

The blue-colored columns are enriched features generated automatically from our thousands of data sources, and the green column is the original target. Let’s focus on 3 features and rename them for easier understanding, using the following naming scheme:

  • “Rooms” -> R
  • “median_monthly_mortgage_repayments” -> M
  • “average_school_ratings_in_the_area” -> N
  • “_TARGET_” -> Y

At this point, all the generated features are cleaned and normalized before being passed into the feature selection phase. Explorium’s platform then runs several machine learning models on the selected features and outputs the best one. The feature selection process runs several strategies, each with parameters that are tested constantly.

In the following sections, these parameters will be referred to as algo_config for simplicity. The algo_config is also used to mimic statistical tests at several confidence levels.

This blog post will focus on the feature selection phase, which will include demonstrations along with a brief overview of the two strategies.

The Filter Methodology

The Filter methodology uses a selected metric to identify irrelevant attributes and filter out redundant columns from your models. It gives you the option of isolating the measures that actually enrich your model. The columns are ranked according to their feature scores.

By choosing and implementing the right features, you can potentially improve the accuracy and efficiency of your classification models.

The Fast Correlation Based Filter (FCBF) algorithm, which we will be using in this demonstration, is a Filter method that uses the idea of “predominant correlation”. It selects features with high correlation with the target variable, but little correlation to other features.

Notably, the correlation used here is not the typical Pearson, Kendall, or Spearman correlation you may be used to. It is known as Symmetrical Uncertainty (SU), which is based on information theory and draws on the concepts of Shannon Entropy and Information Gain.
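Concretely, for two discrete variables X and Y, SU(X, Y) = 2 * IG(X | Y) / (H(X) + H(Y)), where H is the Shannon entropy and IG(X | Y) = H(X) - H(X | Y) is the information gain; the measure is symmetric and normalized to [0, 1]. As a rough sketch (the helper names below are ours, not part of any particular library), SU could be computed like this:

```python
import numpy as np

def entropy(values):
    """Shannon entropy H(X) of a discrete variable, in bits."""
    _, counts = np.unique(values, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

def conditional_entropy(x, y):
    """H(X | Y): the entropy left in x after observing y."""
    y_values, y_counts = np.unique(y, return_counts=True)
    y_probs = y_counts / y_counts.sum()
    return sum(p * entropy(x[y == v]) for v, p in zip(y_values, y_probs))

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * IG(X | Y) / (H(X) + H(Y)), normalized to [0, 1]."""
    x, y = np.asarray(x), np.asarray(y)
    information_gain = entropy(x) - conditional_entropy(x, y)
    denominator = entropy(x) + entropy(y)
    return 2.0 * information_gain / denominator if denominator > 0 else 0.0
```

Continuous features such as M or N would typically be discretized (binned) before SU is computed.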

The algorithm initially selects the features whose SU with the class variable is above a given threshold. After this initial filtering, it detects predominant correlations of features with the class. The definition is that for a predominant feature, for example R, no other feature is more correlated to R than R is correlated to Y.

At Explorium, we have to deal with 2 main challenges. Firstly, we can’t test the correlation for every pair of features, since the number of features is growing constantly. Secondly, the original paper relies on a predefined threshold, which simply can’t be used for all use cases.

To solve this, we first group the features based on selected characteristics and, at each iteration, define the feature most correlated to Y as the group benchmark. We then add an algo_config parameter (multiplied by the group benchmark) to test different correlation levels.

In the feature selection phase, the Explorium platform runs the FCBF selector (among other strategies) several times, each with different algo_config parameters. This lets us run several group-correlation tests in parallel and helps us find the best features at scale for our customers’ use cases.

Methodology

  1. Create groups of features according to different criteria.
  2. Create a benchmark for each group.
  3. Test the correlation of the features inside each group against the predetermined group benchmark.
  4. Keep only the features that are less correlated to each other than to the group benchmark.

Let’s see a simplified version of the FCBF selector in code:
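The sketch below builds on the symmetrical_uncertainty helper defined earlier. It is an illustration of the idea only, not Explorium’s production implementation; the function name is ours, and the threshold argument stands in for the algo_config parameter.

```python
def fcbf_select(features, target, threshold=0.0):
    """Illustrative FCBF-style selection.

    features  : DataFrame of discrete (or pre-binned) candidate features
    target    : Series holding the class variable Y
    threshold : minimum SU with the target required to keep a feature
                (plays the role of the algo_config parameter)
    """
    y = target.to_numpy()

    # 1. Score every feature by its SU with the target and drop the weak ones.
    su_with_target = {
        col: symmetrical_uncertainty(features[col].to_numpy(), y)
        for col in features.columns
    }
    candidates = sorted(
        (col for col, su in su_with_target.items() if su > threshold),
        key=lambda col: su_with_target[col],
        reverse=True,
    )

    # 2. Walk the ranked list and keep only predominant features: a candidate
    #    is dropped if it is more correlated to an already-selected feature
    #    than it is to the target.
    selected = []
    for col in candidates:
        x = features[col].to_numpy()
        redundant = any(
            symmetrical_uncertainty(x, features[kept].to_numpy()) >= su_with_target[col]
            for kept in selected
        )
        if not redundant:
            selected.append(col)
    return selected
```

For the renamed features above, a call could look like fcbf_select(df[["R", "M", "N"]], df["Y"], threshold=0.1), where df and the threshold value are purely hypothetical.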

The Wrapper Methodology

The Wrapper methodology treats the selection of feature sets as a search problem, where different combinations are prepared, evaluated, and compared to one another. A predictive model is used to evaluate each combination of features and assign it a performance score.

For this demonstration, I’ve chosen to implement the Boruta algorithm with XGBoost as our wrapper classifier, a combination we found to be better on both the performance and efficiency fronts. The algorithm tries to capture all the important, interesting features that may be hiding in your dataset with respect to an outcome variable.

Methodology

  1. Create duplicate features and shuffle the values within each column. These copies are called shadow features.
  2. Train a classifier (XGBoost) several times on the combined dataset and calculate the importance of every feature at each iteration.
  3. Build a benchmark for the shadow features based on their mean importance and the algo_config parameter.
  4. Check, for each of your real features, whether its importance is greater than the benchmark, and keep only those that are.

Let’s see a simplified version of the Boruta selector in code:
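The sketch below follows the four steps listed above, using XGBoost as the wrapper classifier. Again, this is an illustration rather than the production implementation: the function and parameter names are ours, benchmark_factor plays the role of the algo_config parameter, and the target is assumed to be label-encoded.

```python
import numpy as np
import pandas as pd
from xgboost import XGBClassifier

def boruta_select(features, target, n_iterations=20, benchmark_factor=1.0, seed=42):
    """Illustrative Boruta-style selection with an XGBoost wrapper."""
    rng = np.random.default_rng(seed)
    n_features = features.shape[1]
    real_importances = np.zeros((n_iterations, n_features))
    shadow_benchmarks = np.zeros(n_iterations)

    for i in range(n_iterations):
        # 1. Shadow features: a shuffled copy of every real column.
        shadows = features.copy()
        for col in shadows.columns:
            shadows[col] = rng.permutation(shadows[col].to_numpy())
        shadows.columns = [f"shadow_{col}" for col in features.columns]
        combined = pd.concat([features, shadows], axis=1)

        # 2. Train the wrapper classifier and read off the feature importances.
        model = XGBClassifier(n_estimators=100, verbosity=0)
        model.fit(combined, target)
        importances = model.feature_importances_
        real_importances[i] = importances[:n_features]

        # 3. Benchmark: mean shadow importance, scaled by the algo_config-like factor.
        shadow_benchmarks[i] = importances[n_features:].mean() * benchmark_factor

    # 4. Keep only the real features whose average importance beats the benchmark.
    keep = real_importances.mean(axis=0) > shadow_benchmarks.mean()
    return list(features.columns[keep])
```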

As part of the Explorium feature selection phase, our platform runs the Boruta selector (among other strategies) several times, each with a different algo_config. This lets us run several statistical tests in parallel and find the best features at scale for our customers’ use cases.

The Comparison

Please refer to the table below to assess the differences between the two methodologies, each of which offers different advantages.

                           Filter                     Wrapper
  Speed                    Fast                       Slow
  How it works             Correlation with target    Evaluation with another model
  Evaluation method        Statistical                Model feature importance
  Overfitting              More prone                 Less prone
  Scaling capabilities     Better                     Worse
  Finding best features    Worse                      Better

It’s important to clarify that there is no clear winner in this duel, since the two methods serve different use cases. No matter which method you eventually opt for, you should be able to get better results, assuming your modelling is done right and you are working with the right data sources.
