Feature Selection

Feature selection is a crucial step in any Machine Learning task. In this lesson, we will look at some common methods for feature selection.

What is feature selection?

In Machine Learning, feature selection is used to select a subset of relevant features (variables, predictors, etc.) for use in model construction. It is an important step in any Machine Learning project and is also part of feature engineering. It matters for the following reasons:

  • To reduce training time: training time grows with the size of the feature space.
  • To avoid the curse of dimensionality.
  • To make the model simpler.
  • To improve generalization and reduce overfitting.
  • To reduce collinearity and enhance interpretability.

When you get a dataset (table-like data), every column is a feature, but not all columns are useful or relevant. It is better to spend some time on feature selection. The central premise when using a feature selection technique is that the data contains some features that are either redundant or irrelevant and can thus be removed without incurring much loss of information.

There are many ways to do feature selection. sklearn provides many functions for it, which we will cover in the following sections.

Remove features with low variance

What does it mean for the variance of a feature to be zero? It means that the feature takes only one value: all instances share the same value for it. In other words, this feature carries no information and contributes nothing to the prediction of the target. Similarly, features with low variance carry little information about the target, so we can remove them without reducing the performance of the model.
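To make this concrete, here is a minimal sketch (assuming NumPy, with made-up toy data) of what zero variance looks like: a constant column has a variance of exactly zero, while a column that varies does not.

import numpy as np

# Toy feature matrix: the first column is constant, the second is not
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 0.0],
              [1.0, 1.0]])

# Per-column variance: the constant column yields 0.0, the other 0.25
print(np.var(X, axis=0))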


sklearn provides VarianceThreshold to remove low-variance features. The threshold parameter controls the variance cutoff: features with a variance below it are removed.

import sklearn.feature_selection as fs

# X is your feature matrix
var = fs.VarianceThreshold(threshold=0.2)
var.fit(X)
X_trans = var.transform(X)
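After fitting, you can also inspect the result. The lines below are a small addition that continues the snippet above: they print the variance computed for each original feature and a boolean mask indicating which columns were kept.

# Variance computed for each original feature during fit
print(var.variances_)
# Boolean mask over the original columns: True means the column was kept
print(var.get_support())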

You can try the code example below. In the first feature, only one instance has a different value, so its variance falls below the threshold and the first column is removed.

import sklearn.feature_selection as fs
import numpy as np
X = np.array([[0, 0, 1], [0, 1, 0],
              [1, 0, 0], [0, 1, 1],
              [0, 1, 0], [0, 1, 1]])
var = fs.VarianceThreshold(threshold=0.2)
var.fit(X)
X_trans = var.transform(X)
print("The original data")
print(X)
print("The processed data by variance threshold")
print(X_trans)
  • line 3 creates a matrix with six rows and three columns.

  • A variance threshold object is created at line 6 from VarianceThreshold with the parameter threshold=0.2, which means that columns with a variance less than 0.2 will be removed.

  • You can compare the original matrix with the new one at line 12.
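If you want to verify why only the first column is dropped, you can compute the per-column variances yourself. A quick check, assuming the same X (and the numpy import) from the example above:

# Population variance of each column: approximately [0.14, 0.22, 0.25]
print(np.var(X, axis=0))
# Only the first value falls below the 0.2 threshold, so column 0 is removed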

Select K-best features

sklearn provides a general-purpose selector, SelectKBest, which can select the k best features based on some metric; you only need to provide a score function to define that metric. Luckily, sklearn provides some predefined callable score functions:

  • f_classif: ANOVA F-value between label/feature for classification tasks.
  • mutual_info_classif: Mutual information for a discrete target.
  • chi2: Chi-squared stats of non-negative features for classification tasks.
  • f_regression: F-value between label/feature for regression tasks.
  • mutual_info_regression: Mutual information for a continuous target.
  • SelectFpr (a related selector rather than a score function): selects features based on a false positive rate test.

The core idea here is to compute a score between the target and each feature, sort the features by that score, and then keep the K best ones.
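Conceptually, this is equivalent to the hand-rolled sketch below. It is a simplified illustration of the idea rather than the library's actual implementation, and the toy data is made up for the example; note that SelectKBest keeps the selected columns in their original order, while this sketch orders them by score.

import numpy as np
import sklearn.feature_selection as fs

# A tiny toy dataset: four samples, three features, binary target
X = np.array([[1.0, 0.2, 5.0],
              [2.0, 0.1, 5.1],
              [8.0, 0.3, 4.9],
              [9.0, 0.2, 5.2]])
y = np.array([0, 0, 1, 1])

# One F-value per feature (the second return value holds the p-values)
scores, p_values = fs.f_classif(X, y)

# Indices of the k features with the largest scores
k = 2
top_k = np.argsort(scores)[-k:]
X_trans = X[:, top_k]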

In the example below, we choose f_classif as the metric and set K to three.

import sklearn.datasets as datasets
import sklearn.feature_selection as fs

X, y = datasets.make_classification(n_samples=300, n_features=10, n_informative=4)
# choose f_classif as the metric and set k to 3
bk = fs.SelectKBest(fs.f_classif, k=3)
bk.fit(X, y)
X_trans = bk.transform(X)
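After fitting, the selector exposes the per-feature scores and a mask of the chosen columns, which is handy for inspection. Continuing the snippet above:

# The F-value computed for each of the 10 features
print(bk.scores_)
# Boolean mask: True for the 3 selected columns
print(bk.get_support())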

An important question is how the model's performance is affected by reducing the number of features. In the example below, let's compare the performance of logistic regression models trained on different numbers of K best features.

As you can see from the image below, the metric won’t change too much if only a few features are removed.

You can experiment with the code below: change the number of features when creating the dataset, or change K.

import sklearn.feature_selection as fs
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import sklearn.metrics as metrics
import matplotlib.pyplot as plt

X, y = datasets.make_classification(n_samples=500,
                                    n_features=20,
                                    n_informative=8,
                                    random_state=42)

f1_list = []
for k in range(1, 15):
    bk = fs.SelectKBest(fs.f_classif, k=k)
    bk.fit(X, y)
    X_trans = bk.transform(X)
    train_x, test_x, train_y, test_y = train_test_split(X_trans,
                                                        y,
                                                        test_size=0.2,
                                                        random_state=42)
    lr = LogisticRegression()
    lr.fit(train_x, train_y)
    y_pred = lr.predict(test_x)
    f1 = metrics.f1_score(test_y, y_pred)
    f1_list.append(f1)

fig, axe = plt.subplots(dpi=300)
axe.plot(range(1, 15), f1_list)
axe.set_xlabel("best k features")
axe.set_ylabel("F1-score")
fig.savefig("output/img.png")
plt.close(fig)
  • First, we create a classification dataset at line 8 using make_classification.

  • line 14 to line 26 is a loop for k in range(1, 15). In each iteration, a different value of K is passed to SelectKBest; we want to see how different values of K affect the performance of the model. A logistic regression model is built, fit, and evaluated in each iteration (from line 22 to line 25) using the K selected features, and the resulting metric is stored in the list f1_list. In this demo, we use the f1-score as our metric. (An alternative arrangement that performs the selection inside a Pipeline is sketched after this list.)

  • From line 28 to line 33, we plot each value of K against its corresponding f1-score.
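One practical note about the loop above: the features are selected on the full dataset before the train/test split. If you prefer the selection itself to be learned only from the training data, you can chain the selector and the classifier in a Pipeline. The sketch below illustrates that arrangement; the parameter values are only an example.

import sklearn.datasets as datasets
import sklearn.feature_selection as fs
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

X, y = datasets.make_classification(n_samples=500, n_features=20,
                                    n_informative=8, random_state=42)
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# The selector learns its scores from the training split only
pipe = Pipeline([
    ("select", fs.SelectKBest(fs.f_classif, k=8)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(train_x, train_y)
print(pipe.score(test_x, test_y))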

Select features using another model

SelectFromModel is a meta-transformer that can be used along with any estimator that has a coef_ or feature_importances_ attribute after fitting. Here, we focus on tree-based models. You may remember that a tree splits on a single feature at each node according to some metric; based on that metric, the importance of each feature can be measured. This is a built-in property of tree models, so a fitted tree model tells us how much each feature contributes to the predictions.

Note: The model (GBDT, gradient-boosted decision trees) mentioned here will be discussed in subsequent lessons.

sklearn provides SelectFromModel to perform this kind of feature selection. In the code below, notice the first parameter, gb: it is a fitted GBDT model whose feature_importances_ attribute is used to select features. Tree models are great for feature selection.

import sklearn.feature_selection as fs

model = fs.SelectFromModel(gb, prefit=True)
# X is your feature matrix, X_trans is the new feature matrix.
X_trans = model.transform(X)
import sklearn.feature_selection as fs
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
import sklearn.metrics as metrics

X, y = datasets.make_classification(n_samples=500,
                                    n_features=20,
                                    n_informative=6,
                                    random_state=21)

gb = GradientBoostingClassifier(n_estimators=20)
gb.fit(X, y)
print("The feature importances of GBDT")
print(gb.feature_importances_)

model = fs.SelectFromModel(gb, prefit=True)
X_trans = model.transform(X)
print("The shape of original data is {}".format(X.shape))
print("The shape of transformed data is {}".format(X_trans.shape))
  • A dataset is created at line 7.

  • Then a GBDT object is created at line 12 from GradientBoostingClassifier and fit at line 13.

  • The output of line 15 shows the importance of different features; the larger the number, the higher the importance.

  • line 17 shows how to use another model to select features with SelectFromModel. All you have to do is pass the fitted GBDT object; prefit=True means that the model has already been fit. You can also pass an explicit threshold, as sketched below.
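If you want more control over how many features survive, SelectFromModel also accepts an explicit threshold. The lines below continue the example above (gb and X as before); keeping features whose importance is at least the median importance is just one option.

# Keep features whose importance is at least the median importance
model = fs.SelectFromModel(gb, prefit=True, threshold="median")
X_trans = model.transform(X)
# Boolean mask over the original columns: True means the column was kept
print(model.get_support())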

We recommend you launch the widget to open the Jupyter file below, which contains more content and interactive operations.
