How feature selection could actually harm your machine learning models when used incorrectly

Overfitting is probably one of the first things you’re taught to avoid as a data scientist. When you overfit, you’re essentially building a model that memorizes the training data instead of learning patterns that generalize beyond it.

The most common way to find out whether your model is overfitting is to test it on unseen data, a test set. The idea is simple: if the model generalizes well to new data, it will perform almost as well on the test set as it does on the training data.
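To make that check concrete, here’s a minimal sketch. The toy dataset and the deliberately unconstrained decision tree below are just an illustration, not part of this article’s data:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# toy data, used here purely for illustration
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# an unconstrained tree memorizes the training set almost perfectly...
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(f'train accuracy: {model.score(X_train, y_train):.3f}')  # ~1.0
print(f'test accuracy:  {model.score(X_test, y_test):.3f}')    # noticeably lower, i.e. overfitting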

In this article I’ll tell you how I fell for one of the most basic pitfalls in training a model while thinking everything was performing well.

I work for a company called Explorium, which automates the process of enriching data to create superior machine learning models and derive more meaningful insights from external data. A year ago, we incorporated a new website enrichment into our platform, which basically gives us a chunk of text about every small business in the States. I was given the task of playing around with the new data source and trying different techniques to turn it into new features that help customers build better, more robust ML models.

I wanted to guide my work through real machine learning use cases, so I asked to help a client with his model while using the new enrichment. The objective: given an SMB in the States, predict whether or not it’s a real business.

To make it more practical, let’s open a Jupyter notebook and play with some synthetic data:

import pandas as pd
import joblib
df = joblib.load('/Users/maorshlomo/blogs/text_dataset.pkl')[['text', 'label']].sample(1000 * 5)
df.head()
Out [62]:
       text                                               label
15304  Home – Red Box …                                       0
18685  Forside MENU …                                         0
37008                                                         1
20580  T&R Precision Engineering Ltd | Precision Eng…         0
15965  Chilfen Joinery | Quality Joinery Solutions …          0

What we have here is a label column and a text field: a simple text classification problem. Instead of diving headfirst into complex models (BART, LSTMs, etc.), I wanted to first try a simpler one: Logistic Regression.

Let’s start with a TfidfVectorizer to convert the text values into machine-readable features:

from sklearn.feature_extraction.text import TfidfVectorizer
extractor = TfidfVectorizer(stop_words='english')
features = extractor.fit_transform(df.text.values)
print(features.shape)

(5000, 103611)

Oops. That’s a huge feature matrix we have here: 5,000 rows and over 100,000 columns. We might want to reduce the dimensionality later on, but for now let’s try to train the simple model we talked about.
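As an aside, one simple way to shrink that matrix is at the vectorizer level; the thresholds below are arbitrary and only illustrate the idea:

# drop rare terms and cap the vocabulary size (the numbers here are arbitrary)
extractor_small = TfidfVectorizer(stop_words='english', min_df=5, max_features=20000)
features_small = extractor_small.fit_transform(df.text.values)
print(features_small.shape)  # at most 20,000 columns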

As mentioned above, the way we’ll check our model’s performance is by measuring its score on unseen test data. To make sure there’s no bias from one specific test set, we’ll actually use multiple training and test sets; in other words, cross-validation.

import numpy as np
import warnings
warnings.filterwarnings("ignore")
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
labels = df['label'].values
scores = cross_val_score(LogisticRegression(), features, labels, scoring='roc_auc', cv=5)
print(f'average KFold AUC score is: {np.mean(scores)}')

average KFold AUC score is: 0.6967367722573435

Nice initial results. Now let’s see where the mistake I made took place. The idea that guided me was simple: there might be too many features. Let’s try to pick the n-grams most informative for the problem at hand. I had two choices:

  • Unsupervised dimensionality reduction techniques (e.g. PCA; a quick sketch of this option follows the list)
  • Correlation-based feature selection (F-score threshold)
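For completeness, here’s roughly what the first option would look like. PCA doesn’t work directly on sparse matrices, so TruncatedSVD (a.k.a. LSA) is the usual stand-in for TF-IDF features; the component count below is arbitrary:

from sklearn.decomposition import TruncatedSVD

# unsupervised reduction: no labels involved, so nothing can leak from them
svd = TruncatedSVD(n_components=300, random_state=42)
reduced = svd.fit_transform(features)
print(reduced.shape)  # (5000, 300)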

I wanted to try feature selection first; since it’s guided by the labels, it seemed better suited to picking the informative words out of the rest of the vocabulary.

So I looked at sklearn’s SelectKBest feature selector:

from sklearn.feature_selection import SelectKBest, f_classif

for K_features in [100, 200, 1000, 2000, 4000, 10000, 15000, 20000, 40000, 50000, 100000, features.shape[1]]:
    selector = SelectKBest(f_classif, k=K_features)
    selected_features = selector.fit_transform(features, labels)
    # train the model on the new features:
    scores = cross_val_score(LogisticRegression(), selected_features, labels, scoring='roc_auc', cv=5)
    print(f'selected {K_features} features. average KFold AUC score is {round(np.mean(scores), 3)}')

selected 100 features. average KFold AUC score is 0.683
selected 200 features. average KFold AUC score is 0.689
selected 1000 features. average KFold AUC score is 0.738
selected 2000 features. average KFold AUC score is 0.751
selected 4000 features. average KFold AUC score is 0.765
selected 10000 features. average KFold AUC score is 0.774
selected 15000 features. average KFold AUC score is 0.774
selected 20000 features. average KFold AUC score is 0.774
selected 40000 features. average KFold AUC score is 0.774
selected 50000 features. average KFold AUC score is 0.771
selected 100000 features. average KFold AUC score is 0.715
selected 103611 features. average KFold AUC score is 0.697

Wow! The best configuration (10K features) yields roughly an 8-point AUC lift! That’s nice, don’t you think? Not exactly.

We might think that by using cross-validation we’re getting the right results, that is, a reliable measure of how well the model fits the data and generalizes to new instances. Let’s manually hold out a test set and see what happens.

First, let’s split the data into training and test data:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)

Let’s run our feature selector with the K_features we’ve found to be most effective:

K_features = 20000
selector = SelectKBest(f_classif, k=K_features)
X_train = selector.fit_transform(X_train, y_train)
X_test = selector.transform(X_test)
print(f"X_train shape: {X_train.shape}\nX_test shape: {X_test.shape}")

X_train shape: (4000, 20000)
X_test shape: (1000, 20000)

Let’s train the model on the training set:

cls = LogisticRegression()
cls.fit(X_train, y_train)

Out [72]: LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='warn',
n_jobs=None, penalty='l2', random_state=None, solver='warn',
tol=0.0001, verbose=0, warm_start=False)

Now, let’s test the AUC score on the newly generated data:

from sklearn.metrics import roc_auc_score
roc_auc_score(y_score=cls.predict_proba(X_test)[:,1], y_true=y_test)

Out[73]: 0.7037234534910395

Hey! What happened? Why did the AUC score drop like that? That’s because we applied feature selection in the wrong way.

When we tested our feature selection algorithm using selector.fit_transform(features, labels), we actually selected the features based on the whole dataset. That was the mistake.

Why? When we “trained” the feature selector on the complete dataset, we made the model overfit the test set, not just the training set. Let me explain.

Imagine there’s a word that appears only in the test set and is correlated with the positive label. Let’s call that word’s feature vector X_word. When the feature selector runs on the whole dataset, it will select that word (represented by a feature) because it’s correlated with the label, even though it never appears in the training set, which means the model shouldn’t be able to learn anything about that word from the training set.

But a biased model might actually use that word without ever seeing it in the training set. That’s data leakage.

If the linear model’s equation is f(x) = −1 + a·X_1 + b·X_2 + … + 10·X_word + … + z·X_n, then even without seeing X_word in the training set, the model uses information from the test set, thus optimizing for and overfitting the test set. The score we get out of cross-validation is simply wrong, and might lead us to deploy an even worse model into production.
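If that explanation feels abstract, here’s a quick sanity check you can run yourself; the experiment below is my own illustration on pure noise, not the article’s dataset. Since the labels are random, any honest evaluation should hover around 0.5 AUC, yet selecting features on the full dataset inflates the cross-validated score anyway:

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X_noise = rng.randn(200, 5000)          # 200 samples, 5,000 meaningless features
y_noise = rng.randint(0, 2, size=200)   # completely random labels

# honest baseline: no selection, so the AUC stays around 0.5 as it should
print(np.mean(cross_val_score(LogisticRegression(), X_noise, y_noise, scoring='roc_auc', cv=5)))

# leaky version: pick the 50 "best" features using ALL rows, then cross-validate;
# the score jumps well above 0.5 even though there is no signal at all
X_leaky = SelectKBest(f_classif, k=50).fit_transform(X_noise, y_noise)
print(np.mean(cross_val_score(LogisticRegression(), X_leaky, y_noise, scoring='roc_auc', cv=5)))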

Here’s how we should have done it:

Use sklearn’s built-in Pipeline, which makes sure the whole pipeline (feature selector and model) is trained and optimized only on the training set:

from sklearn.pipeline import Pipeline

def create_pipeline(k_features):
    pipeline = Pipeline(
        [('feature_selector', SelectKBest(f_classif, k=k_features)),
         ('classifier', LogisticRegression())])
    return pipeline

for K_features in [100, 200, 1000, 2000, 4000, 10000, 15000, 20000, 40000, 50000, 100000, features.shape[1]]:
    pipeline = create_pipeline(K_features)
    scores = cross_val_score(pipeline, features, labels, scoring='roc_auc', cv=5)
    print(f'selected {K_features} features. average KFold AUC score is {round(np.mean(scores), 3)}')

selected 100 features. average KFold AUC score is 0.657
selected 200 features. average KFold AUC score is 0.671
selected 1000 features. average KFold AUC score is 0.688
selected 2000 features. average KFold AUC score is 0.687
selected 4000 features. average KFold AUC score is 0.689
selected 10000 features. average KFold AUC score is 0.695
selected 15000 features. average KFold AUC score is 0.696
selected 20000 features. average KFold AUC score is 0.696
selected 40000 features. average KFold AUC score is 0.695
selected 50000 features. average KFold AUC score is 0.695
selected 100000 features. average KFold AUC score is 0.697
selected 103611 features. average KFold AUC score is 0.697

As we can see, there’s no actual lift from correlation-based feature selection; the AUC score stays around 0.697. Which makes sense.

Linear models are quite simple (well, they’re linear), so overfitting is not that common, unless there’s label leakage.