One of the best ways to improve your production machine learning models is to improve the data they are trained on. Simple, right? Not when you try to do it at scale, where the curse of dimensionality (anomalies that appear in high-dimensional settings but rarely in low-dimensional ones) starts haunting you.
The go-to move for every data scientist is applying feature selection strategies before training their models with the existing data. But, what are the best ways to evaluate feature selection strategies so that, over a period of time, you’ll have high performance and high interpretability?
As it turns out, some feature selection strategies may perform well when created but tend to break when tested after a while, which means that some of the features that get selected are unstable and might perform badly on new data.
As a data scientist, whenever you take the long road of adding external sources or expanding your internal data with new features, a feature selection process is a must. It not only reduces training time and model complexity, it also helps prevent overfitting. You should therefore also watch for potential pitfalls by checking the stability of the selected features, which increases the chances your model will keep performing well over a long period of time.
The regular flow of feature selection follows the above diagram from left to right. First, you take several data sources and generate the base feature table, then you apply several feature selection strategies to get several feature sets. Then, on every feature set you train and tune several machine learning models. When you move to the production phase, you select the flow that generates the best model score (in orange above).
What is the best way to ensure feature stability?
In short, you should apply feature selection but be sure to select the features that produce the optimal score (performance-wise) and have high feature stability. But before diving into techniques for calculating a feature stability score, let’s do a quick recap on feature selection to get you started.
Filter methods look at the properties of the features and measure their relevance via univariate statistical tests, selecting features independently of any model. A good example is sklearn's F-test scoring (f_regression and f_classif) for regression and classification problems.
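As a minimal sketch of a filter method, here is sklearn's SelectKBest with the univariate F-test; the synthetic dataset and the choice of k=2 are purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Illustrative dataset: 100 samples, 5 features, 3 of them informative
X, y = make_classification(n_samples=100, n_features=5,
                           n_informative=3, n_redundant=0, random_state=0)

# Filter method: score each feature with a univariate F-test,
# independent of any downstream model, then keep the top 2
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)        # (100, 2)
print(selector.get_support())  # boolean mask over the 5 features
```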
Wrapper methods measure how useful a feature is with regard to a specific classifier, based on its performance, and select features by feature importance. A good example is the Boruta algorithm.
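Boruta itself needs an external package, so as a sketch of the same wrapper idea, here is sklearn's recursive feature elimination (RFE), which repeatedly retrains the classifier and drops the weakest feature (the dataset and the target of 3 features are assumptions for the example):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=100, n_features=5,
                           n_informative=3, n_redundant=0, random_state=0)

# Wrapper method: repeatedly fit the classifier and drop the least
# important feature (by feature_importances_) until 3 remain
estimator = RandomForestClassifier(n_estimators=50, random_state=0)
rfe = RFE(estimator, n_features_to_select=3)
rfe.fit(X, y)

print(rfe.support_)  # boolean mask of the 3 kept features
print(rfe.ranking_)  # 1 = selected, higher = eliminated earlier
```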
Embedded methods are quite similar to wrapper methods in that both train a classifier and weigh features by the classifier's feature importance. The difference from wrapper methods, however, is that embedded methods use an intrinsic metric during the learning process itself. A good example of this is Lasso regression, whose regularization shrinks uninformative coefficients toward zero during training.
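A minimal embedded-method sketch, using sklearn's SelectFromModel around a Lasso regressor (the dataset, alpha value, and threshold are illustrative assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=100, n_features=5,
                       n_informative=3, noise=1.0, random_state=0)

# Embedded method: the L1 penalty drives unhelpful coefficients to zero
# while the model trains; alpha controls the regularization strength
lasso = Lasso(alpha=1.0)
selector = SelectFromModel(lasso, threshold=1e-5)
selector.fit(X, y)

print(selector.get_support())  # features with non-negligible coefficients
```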
Feature stability, according to "Stability of feature selection algorithm: A review," "indicates the reproducibility power of the feature selection method."
We can say that a feature has high stability if multiple feature selection processes "agree" that it is important, and low stability if they "disagree."
Let’s walk through some examples. Say we have a base of five features, [f1, f2, f3, f4, f5], and three different feature selection (fs) strategies, [fs1, fs2, fs3]. The "agreement" can be expressed in one of several ways:
1. The feature existence/index (in the final list) assigns 1 (selected) or 0 (not selected). So, for the sake of this example, let’s say that after the feature selection phase the results for our five features are:
[1, 1, 0, 0, 1] for fs1
[1, 0, 1, 0, 1] for fs2
[1, 1, 0, 1, 0] for fs3
In this case, f1 exists in all feature selection strategies so it is the most stable.
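With indicator vectors like these, a quick way to see which feature is most stable is to sum the votes per feature across strategies:

```python
import numpy as np

# 1 = feature selected, 0 = not selected, for fs1..fs3 above
fs1 = [1, 1, 0, 0, 1]
fs2 = [1, 0, 1, 0, 1]
fs3 = [1, 1, 0, 1, 0]

votes = np.sum([fs1, fs2, fs3], axis=0)
print(votes)  # [3 2 1 1 2] -> f1 is selected by all three strategies

most_stable = int(np.argmax(votes))  # index 0 -> f1
```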
2. The feature rank (position in the list) ranks features from left (highest rank) to right (lowest rank). So, after the feature selection phase the results of our five features, for the sake of this example, are:
[f1, f2, f5, f4, f3] for fs1
[f1, f5, f4, f2, f3] for fs2
[f1, f4, f2, f5, f3] for fs3
In this case, f1 is in the highest position in all of the feature selection strategies so it is most stable.
3. The feature weight/score ranks features from high values (important) to low values (unimportant). So, again in our example, after the feature selection phase, the results of our five features are:
[10, 0.3, 0.7, 5, 2] for fs1
[11, 5, 4, 1, 0.1] for fs2
[12, 2, 6, 4, 5] for fs3
In this case, f1 has the most consistent score across all feature selection strategies, so it is the most stable.
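One illustrative way to quantify that consistency (not the only one) is the coefficient of variation of each feature's score across strategies, where a lower value means a more stable score:

```python
import numpy as np

# Rows = feature selection strategies fs1..fs3, columns = features f1..f5
scores = np.array([[10, 0.3, 0.7, 5, 2],
                   [11, 5,   4,   1, 0.1],
                   [12, 2,   6,   4, 5]])

# Coefficient of variation per feature: spread relative to the mean score.
# Lower = more consistent across strategies
cv = scores.std(axis=0) / scores.mean(axis=0)
most_stable = int(np.argmin(cv))
print(most_stable)  # 0 -> f1 has the most consistent score
```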
The diagram above runs through the feature selection process, only this time with feature stability included in the flow. From left to right, the process starts the same, but when going into the production phase you select the flow that generates the best model score and feature stability score (orange). Sometimes, there may need to be a tradeoff between the two, which we will elaborate on later.
The way to select the flow that generates the best model score and feature stability score is to define stability measurements and measure each feature subset (S) produced by a given feature selection algorithm. Then, add the subset's measurement score to the final optimization process. Let’s give an example:
If all our features are [f1, f2, f3, f4, f5] and we need to choose the top three features, then the feature selection algorithm fs1 might give the output S1 = [f5, f2, f3] and the feature selection algorithm fs2 might give the output S2 = [f2, f5, f1].
Then, you’ll run two processes over the selected subsets [S1, S2]. First, you’ll train and tune several machine learning models (with cross-validation) and get the best model performance [M1(S1), M2(S2)] for every subset (M can be AUC/R2/F1…). In parallel, you’ll compute the stability score based on Stability Measurements (SM) on each feature subset and get [SM(S1), SM(S2)]. Finally, you’ll perform the optimization process that accounts for both model performance and stability.
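As a sketch of that final optimization step, you might combine the two scores with a weighted average; the scores, the 0-1 scale, and the tradeoff weight alpha below are all hypothetical:

```python
# Illustrative only: M = model performance per subset, SM = stability score,
# both assumed to be on a 0-1 scale; alpha is a hypothetical tradeoff weight
subsets = {
    "S1": {"model_score": 0.85, "stability_score": 0.60},
    "S2": {"model_score": 0.82, "stability_score": 0.90},
}

def combined_score(scores, alpha=0.5):
    # Weighted average of performance and stability
    return alpha * scores["model_score"] + (1 - alpha) * scores["stability_score"]

best = max(subsets, key=lambda name: combined_score(subsets[name]))
print(best)  # S2: slightly worse performance, but much better stability
```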
Let’s review some stability measurements:
1. Stability measurement by feature existence/index
```python
# first_feature_list: from the first feature selection strategy
# second_feature_list: from the second feature selection strategy
# You can use sklearn (on binary indicator vectors, e.g. [1, 1, 0, 0, 1]):
from sklearn.metrics import jaccard_score

jaccard_score(first_feature_list, second_feature_list)

# Or compute it directly on lists of selected feature names:
def jaccard_similarity(first_feature_list, second_feature_list):
    features_intersection = len(set(first_feature_list).intersection(second_feature_list))
    features_union = len(first_feature_list) + len(second_feature_list) - features_intersection
    return float(features_intersection) / features_union
```
```python
# first_feature_list / second_feature_list: discrete indicator or label vectors
# from the first and second feature selection strategies
# You can use sklearn and scipy:
from collections import Counter

from scipy.stats import entropy
from sklearn.metrics import mutual_info_score

def symmetrical_uncertainty(first_feature_list, second_feature_list):
    # entropy() expects a distribution, so convert each list to value counts
    first_entropy = entropy(list(Counter(first_feature_list).values()))
    second_entropy = entropy(list(Counter(second_feature_list).values()))
    mutual_information = mutual_info_score(first_feature_list, second_feature_list)
    return 2.0 * mutual_information / (first_entropy + second_entropy)
```
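With more than two strategies, a common aggregate (a sketch, not a prescribed formula from the text) is to average the pairwise Jaccard similarity over every pair of selected subsets, using the running example's selections:

```python
from itertools import combinations

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Selected feature names per strategy, matching the indicator vectors above
selections = [["f1", "f2", "f5"],   # fs1
              ["f1", "f3", "f5"],   # fs2
              ["f1", "f2", "f4"]]   # fs3

# Overall stability: average similarity over every pair of strategies
pairs = list(combinations(selections, 2))
stability = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
print(round(stability, 3))  # 0.4
```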
2. Stability measurement by feature rank
```python
# order_first_feature_list: numeric rank vector from the first feature
# selection strategy (one rank per feature, ordered by score)
# order_second_feature_list: same, from the second strategy
from scipy.stats import spearmanr

def spearman_correlation(order_first_feature_list, order_second_feature_list):
    correlation_score, p_value = spearmanr(order_first_feature_list, order_second_feature_list)
    return correlation_score
```
```python
# order_first_feature_list / order_second_feature_list: numeric rank vectors
# from the first and second feature selection strategies (ordered by score)
# You can use scipy:
from scipy.spatial import distance

distance.canberra(order_first_feature_list, order_second_feature_list)
```
3. Stability measurement by feature weight/score
```python
# weighted_first_feature_list: weights/scores per feature from the first
# feature selection strategy
# weighted_second_feature_list: same, from the second strategy
from scipy.stats import pearsonr

def pearson_correlation(weighted_first_feature_list, weighted_second_feature_list):
    correlation_score, p_value = pearsonr(weighted_first_feature_list, weighted_second_feature_list)
    return correlation_score
```
It should come as no surprise that there is no clear winner in this contest. Moreover, if you try to compute several stability measures you might find the results unstable. This means you might get different values of stability scores, which may actually confuse you more than help you decide the best feature set to eventually select. So, like any other machine learning problem, it takes some experience, judgement, and common sense.
You will often see a graph like the following when plotting different feature selection strategies (X-axis) against model performance versus stability scores (Y-axis).
In this case, when you try to make the final decision before going to production with your models try to involve more people from the business side in the conversation to understand the implications of this tradeoff.
Doing feature selection right is no easy task in itself, and adding the dimension of feature stability on top doesn’t make it any easier. However, it is worth paying attention not only to the strategies that produce the best model score, but also to the features with a strong stability score. I’ve seen more than one project that monitors post-prediction processes to evaluate how the selected features perform over a long period of time. Adding stability measurements over time can shed more light on the big picture and give you a better feedback loop for the next feature selection iteration.