Back

Demystifying Feature Selection: Filter vs Wrapper Methods

August 11, 2019 Aviv Nutovitz Data Science

As data sources around us multiply exponentially (both in volume and variety), data science teams have the potential to generate more and more features for their organizations.

For example, building a stock prediction system a couple of years ago would probably mean relying primarily on revenue related features. But now we can grab large amounts of accurate online information – employee growth rates, online reviews, web traffic, search, C-level profile, and more.

So, given the fact that more and more features are becoming available for machine learning projects, feature selection algorithms are increasingly growing in significance. In this article, we will cover (and compare) two popular feature selection methodologies – Filter and Wrapper.

The Problem

At Explorium, we have teams working around the clock to find the best data possible to improve our customers’ prediction models. But with the number of possible accessible features for the models growing exponentially, this has become a difficult task. My team is responsible for locating algorithms and feature selection strategies. I would like to share some of my experiences.

In order to examine the two feature selection methodologies, let’s take a look at a small sample of our Melbourne prices dataset. After running the Kaggle data set on the Explorium platform, we can categorize all table columns into 3 groups. Let’s lock into a small subset of them. The orange colored columns are features from the original dataset.

The Data

The blue color columns are enriched features generated automatically from our thousands of data sources and the green column is the original target. Let’s focus on 3 features and rename them for easier understanding. Let’s use the following feature naming scheme:

  • “Rooms” -> R
  • “median_monthly_mortgage_repayments” -> M
  • “average_school_ratings_in_the_area” -> N and _TARGET_ -> Y

At this point, all the generated features will be clean and normalized, before being thrown into the feature selection phase. Explorium’s platform will then run several machine learning models and output the best one of the selected features. The feature selection process runs several strategies and selected parameters that will be tested constantly.

In the following sections, these parameters will be referred to as algo_config for simplicity reasons. Furthermore, the algo_config is used to mimic statistical tests in several confidence levels.

This blog post will focus on the feature selection phase, which will include demonstrations along with a brief overview of the two strategies.

The Filter Methodology

The Filter methodology uses the selected metric to identify irrelevant attributes and also filter out redundant columns from your models. It gives you the option of isolating selected measures that enrich your model. The columns are ranked following the calculation of the feature scores.

By choosing and implementing the right features, you can potentially improve the accuracy and efficiency of your classification models.

The Fast Correlation Based Filter (FCBF) algorithm, which we will be using in this demonstration, is a Filter method that uses the idea of “predominant correlation”. It selects features with high correlation with the target variable, but little correlation to other features.

Notably, the correlation used here is not the typical Pearson, Kendall, or Spearman you may be using commonly. It’s known as Symmetrical Uncertainty (SU), which is based on information theory, drawing from the concepts of Shannon Entropy and Information Gain.

The algorithm initially selects features correlated above a given threshold by SU with the class variable. After this initial filtering, it detects predominant correlations of features with the class. The definition is that for a predominant feature, for example, R, no other feature is more correlated to R than R is correlated to Y.

At Explorium, we have to deal with 2 main challenges. Firstly, we can’t test the correlation for every 2 features, since the number of features is growing constantly. Secondly, the original paper relies on a predefined threshold, which simply can’t be used for all use cases.

To solve that we first create groups for the features based on selected characters and define the feature with the highest score correlated to Y as the group benchmark at each iteration. We then add an algo_config parameter (multiplied by the group benchmark) to test different correlation levels.

In the feature selection phase, the Explorium platform runs the FCBF selector (with other strategies) several times, each with different algo_config parameters. This lets us run several tests of groups correlations in parallel. This helps us find the best features at scale for our customer’s use case.

Methodology

  1. Creates groups of the features as per different criteria.
  2. Creates a benchmark for each group.
  3. Tests the correlation of features inside the group, compared to the predetermined group benchmark.
  4. Keeps only the features that are less correlated to each other than to the group benchmark.

Let’s see a simplified version of FCBF selector in code:

class FCBFSelector():

    def fcbf_loop(self, features_groups):
        """
        this function is the main loop of the algorithm
        :param features_groups: list of groups of features
        :return: list of selected features
        """
        selected_features = []

        for group in features_groups:

            while len(group):
                # find primary feature best on score attribute and the rest of the group
                primary_feature, rest_of_the_features = self.find_primary(group)

                # perform selection: set the group to be only the remaining features
                group = self.search_and_drop_against_primary_feature(primary_feature, rest_of_the_features)

                # append primary to selected features
                selected_features.append(primary_feature)

        return selected_features

    def search_and_drop_against_primary_feature(self, primary_feature, rest_of_the_features, algo_config):
        """
        this function takes a group of features and a primary feature and drop all features in the group
        that are highly correlated to the primary feature
        :param primary_feature: the primary feature to compare to
        :param rest_of_the_features: all other features
        :param algo_config: predefined constant
        :return: rest of features to keep
        """

        features_to_keep = []
        scores = self.calc_su(rest_of_the_features, primary_feature)
        
        # keep the feature that his score is smaller then the group benchmark
        for feature, score in zip(rest_of_the_features, scores):
            if score * algo_config < primary_feature.score:
                features_to_keep.append(feature)

        return features_to_keep

    def calc_su(self, rest_of_the_features, primary_feature):
        # use built in entropy function for the calculation 
        y_entropy = entropy(primary_feature)
        x_entropies = [entropy(column) for column in rest_of_the_features]
        
        return np.array([2 * (score) / (x_entropy + y_entropy) for x_entropy, score in zip(x_entropies, rest_of_the_features.scores)])

The Wrapper Methodology

The Wrapper methodology considers the selection of feature sets as a search problem, where different combinations are prepared, evaluated and compared to other combinations. A predictive model is used to evaluate a combination of features and assign model performance scores.

For this demonstration, I’ve chosen to implement the Boruta algorithm, with XGBoost as our wrapper classifier. By doing so, we found it to be better in the performance and efficiency fronts. The classifier tries to capture all important and interesting features that may be hiding in your data set with respect to an outcome variable.

Methodology

  1. Creating duplicate features and shuffle their values in each column. These features are called shadow features.
  2. Trains a classifier (XGBoost) several times, on the Dataset and calculate the all feature importance at all iterations.
  3. For all the shadow features we create a benchmark based on the mean importance and algo-config parameter.
  4. Then, the algorithm checks for each of your real features if they have higher importance. That is, whether the feature has an importance greater than the benchmark. and keeps only the ones greater.

Let’s see a simplified version of Burota selector in code:

import pandas as pd
import xgboost as xgb

class BorutaSelector():

    def find_best_features(self, feature_matrix, labels, algo_config):
        """
        :param feature_matrix: Input dataframe - X
        :param labels: Target variable
        :param algo_config: predefined constant
        """
        return_features_df = pd.DataFrame(columns=['feature', 'importance'])
        # run several iterations of XGBoost to get the average feature importance of several runs
        for iteration in range(1, 11):

            # Create the shadow variables and run the model to obtain importance
            new_feature_matrix, shadow_names = self.create_shadow_features(feature_matrix)

            # DMatrix is a internal data structure that used by XGBoost which
            # is optimized for both memory efficiency and training speed
            dtrain = xgb.DMatrix(new_feature_matrix, label=labels)

            # train xgb
            xgb_selector = xgb.train(dtrain)

            # get sorted feature importance from the xgb by values of their importance
            importance = sorted(xgb_selector.get_score().items(), reverse=True)

            # marge back the df
            iteration_features_df = pd.DataFrame(importance, columns=['feature', 'importance' + str(iteration)])
            return_features_df = pd.merge(return_features_df, iteration_features_df, on='feature', how='outer')

        # Mean - is the mean importance score over 10 runs
        return_features_df['Mean'] = return_features_df.mean(axis=1)

        real_vars = return_features_df[~return_features_df['feature'].isin(shadow_names)]
        shadow_vars = return_features_df[return_features_df['feature'].isin(shadow_names)]

        # return only the features that their importance mean score is higher then the mean_shadow
        mean_shadow = shadow_vars['Mean'] * algo_config
        return_features = real_vars[(real_vars.Mean > mean_shadow)]

        return return_features

    def create_shadow_features(self, feature_matrix_train):
        """
        Take all feature_matrix variables, creating copies and randomly shuffling them
        :param feature_matrix_train: the dataframe to create shadow features on
        :return: dataframe 2x width and the names of the shadows for removing later
        """
        feature_matrix_shadow = feature_matrix_train.copy()
        for column_name in feature_matrix_shadow.columns:
            np.random.shuffle(feature_matrix_shadow[column_name].values)
        # rename the shadow
        shadow_names = [str(i + 1)+ "_shadow" for i in range(feature_matrix_train.shape[1])]
        feature_matrix_shadow.columns = shadow_names
        # Combine to make one new dataframe
        new_feature_matrix = pd.concat([feature_matrix_train, feature_matrix_shadow], axis=1)

        return new_feature_matrix, shadow_names

As part of the Explorium feature selection phase, our platform runs the Burota selector (amongst other strategies) several times, each with a different algo_config. This operation helps us run several statistics tests in parallel. We are then able to find at scale the best features for our customer use case.

The Comparison

Please refer to the table below to access the differences between the two methodologies, which offer different types of advantages.

FilterWrapper
SpeedFastSlow
How it worksCorrelation with targetWith other model
Evaluation methodsStatisticalModel feature importance
OverfittingMoreLess
Scaling capabilitiesBetterWorse
Finding best featuresWorseBetter

It’s important to clarify that there is no clear winner in this dual, since they both serve different use-cases. No matter which method you opt for eventually, you should be able to get better results, assuming your modelling is done right and you are working with the right data sources.

Subscribe Today! Get the latest updates with our newsletter.
We promise you'll love it.

Follow us

Share

Just announced! Explorium Announces $31M in Series B Funding to Accelerate Growth Read more