Feature Selection
Feature selection is a crucial step in a Machine Learning project. In this lesson, we will look at some common methods for feature selection.
What is feature selection?
In Machine Learning, feature selection is used to select a subset of relevant features (variables, predictors, etc.) for use in model construction. It is an important step in a Machine Learning project and is also part of feature engineering. Feature selection matters for the following reasons:
- To reduce the training time. Training time and the size of the feature space are positively correlated.
- To avoid the curse of dimensionality.
- To make the model simpler.
- To improve generalization and reduce overfitting.
- To reduce collinearity and enhance interpretability.
When you get a dataset (table-like data), every column is a feature, but not all columns are useful or relevant, so it is worth spending some time on feature selection. The central premise of any feature selection technique is that the data contains some features that are either redundant or irrelevant and can thus be removed without incurring much loss of information.
There are many ways to do feature selection. sklearn provides many functions for it, which we will cover in the following sections.
Remove features with low variance
What does it mean for the variance of a feature to be zero? It means that this feature has only one value: all instances share the same value on this feature. In other words, this feature does not carry any information and contributes nothing to the prediction of the target. Similarly, features with low variance carry little information about the target, so we can remove them without reducing the performance of the model.
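For intuition, you can compute the per-column variance yourself. Here is a minimal sketch (the matrix below is made up for illustration) in which the first column is constant and therefore has zero variance:

import numpy as np

# The first column takes the same value in every row, so its variance is 0
# and it cannot help distinguish the instances.
X = np.array([[1, 0, 3],
              [1, 2, 1],
              [1, 1, 2]])
print(np.var(X, axis=0))  # [0.         0.66666667 0.66666667]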
sklearn provides VarianceThreshold to remove low-variance features, and its threshold parameter allows you to control the variance cutoff.
import sklearn.feature_selection as fs
# X is your feature matrix
var = fs.VarianceThreshold(threshold=0.2)
var.fit(X)
X_trans = var.transform(X)
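After fitting, you can inspect which columns survived the threshold; the selector's get_support method returns a boolean mask over the original columns (or their indices with indices=True):

# var is the fitted VarianceThreshold from above
print(var.get_support())              # boolean mask of the retained columns
print(var.get_support(indices=True))  # indices of the retained columns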
You can try the code example below. As you can see, only one instance has a different value in the first feature, so its variance falls below the threshold and the first column is removed.
import sklearn.feature_selection as fs
import numpy as np
X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]])

# remove columns whose variance is below 0.2
var = fs.VarianceThreshold(threshold=0.2)
var.fit(X)
X_trans = var.transform(X)

print("The original data")
print(X)
print("The processed data by variance threshold")
print(X_trans)
- Line 3 creates a matrix with six rows and three columns.
- A variance threshold object is created at line 6 from VarianceThreshold with the parameter threshold=0.2, which means that columns with a variance less than 0.2 will be removed.
- You can compare the original matrix with the new one in the output of line 12.
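The right threshold depends on the scale of your features. For boolean (0/1) features, a common rule of thumb from the sklearn documentation is to drop features that take the same value in more than, say, 80% of the samples; since a Bernoulli variable has variance p(1 - p), the threshold becomes:

import sklearn.feature_selection as fs

# drop boolean features that are identical in more than 80% of the samples:
# Var[X] = p * (1 - p) = 0.8 * (1 - 0.8) = 0.16
var_bool = fs.VarianceThreshold(threshold=0.8 * (1 - 0.8))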
Select K-best features
sklearn provides a universal selector, SelectKBest, which can select the k best features based on some metric; you only need to provide a score function to define that metric. Luckily, sklearn comes with several predefined callable score functions:
- f_classif: ANOVA F-value between label/feature for classification tasks.
- mutual_info_classif: Mutual information for a discrete target.
- chi2: Chi-squared stats of non-negative features for classification tasks.
- f_regression: F-value between label/feature for regression tasks.
- mutual_info_regression: Mutual information for a continuous target.

There is also SelectFpr, a transformer that selects features based on a false positive rate test.
The core idea here is to calculate some metric between the target and each feature, sort the features by that metric, and then select the K best ones.
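To make this concrete, here is a minimal sketch (not part of the lesson's code) that computes the f_classif scores directly and picks the top three columns by hand; SelectKBest does essentially this for you:

import numpy as np
import sklearn.feature_selection as fs
import sklearn.datasets as datasets

X, y = datasets.make_classification(n_samples=300, n_features=10, n_informative=4)
scores, p_values = fs.f_classif(X, y)  # one F-value and p-value per feature
top_3 = np.argsort(scores)[-3:]        # indices of the 3 highest-scoring features
X_top = X[:, top_3]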
In the example below, we choose f_classif as the metric, and K is three.
import sklearn.feature_selection as fs
import sklearn.datasets as datasets

X, y = datasets.make_classification(n_samples=300, n_features=10, n_informative=4)
# choose f_classif as the metric and K is 3
bk = fs.SelectKBest(fs.f_classif, k=3)
bk.fit(X, y)
X_trans = bk.transform(X)
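After fitting, the selector stores the computed scores, which is handy for checking which features were chosen:

print(bk.scores_)                    # F-value of every original feature
print(bk.get_support(indices=True))  # indices of the 3 selected features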
An important question is how the performance of the model is affected by reducing the number of features. In the example below, let's compare the performance of logistic regression models trained on different numbers of K best features.
As you can see from the plot produced below, the metric doesn't change much when only a few features are removed.
You can have a try with the code below. Just change the number of features when creating the dataset, or change K.
import sklearn.feature_selection as fs
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import sklearn.metrics as metrics
import matplotlib.pyplot as plt

X, y = datasets.make_classification(n_samples=500,
                                    n_features=20,
                                    n_informative=8,
                                    random_state=42)

f1_list = []
for k in range(1, 15):
    bk = fs.SelectKBest(fs.f_classif, k=k)
    bk.fit(X, y)
    X_trans = bk.transform(X)
    train_x, test_x, train_y, test_y = train_test_split(X_trans,
                                                        y,
                                                        test_size=0.2,
                                                        random_state=42)
    lr = LogisticRegression()
    lr.fit(train_x, train_y)
    y_pred = lr.predict(test_x)
    f1 = metrics.f1_score(test_y, y_pred)
    f1_list.append(f1)

fig, axe = plt.subplots(dpi=300)
axe.plot(range(1, 15), f1_list)
axe.set_xlabel("best k features")
axe.set_ylabel("F1-score")
fig.savefig("output/img.png")
plt.close(fig)
- First, we create a classification dataset at line 8 using make_classification.
- Line 14 to line 26 is a loop, for k in range(1, 15). In each iteration, a different value of k is passed to SelectKBest; we want to see how different values of k affect the performance of the model. A logistic regression model is built, fit, and evaluated in each iteration (from line 22 to line 25) using the k selected features. The metric is stored in a list, f1_list. In this demo, we use the F1-score as our metric.
- From line 28 to line 33, we plot those values of k and their corresponding F1-scores.
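One caveat: the loop above fits SelectKBest on the full dataset before splitting, so a little information from the test split leaks into the feature selection. A more careful variant (a sketch, not the lesson's original code) wraps the selector and the classifier in a Pipeline and scores it with cross-validation, so the K best features are re-selected on each training fold only:

import sklearn.feature_selection as fs
import sklearn.datasets as datasets
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = datasets.make_classification(n_samples=500, n_features=20,
                                    n_informative=8, random_state=42)

pipe = Pipeline([
    ("select", fs.SelectKBest(fs.f_classif, k=8)),
    ("clf", LogisticRegression()),
])
# 5-fold cross-validated F1-score; feature selection happens inside each fold
print(cross_val_score(pipe, X, y, scoring="f1", cv=5).mean())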
Select features with another model
SelectFromModel is a meta-transformer that can be used along with any estimator that has a coef_ or feature_importances_ attribute after fitting. Here, however, we focus on tree-based models. You may remember that a tree is split on a single feature according to some metric; based on this metric, you can measure the importance of each feature. This is a property of tree models, so through a tree model we can learn how much each feature contributes to the model.
Notice: The model (GBDT) mentioned here will be discussed in subsequent lectures.
sklearn provides SelectFromModel to do the feature selection. In the code below, notice the first parameter, gb. It's a GBDT model that is used to select features through its feature_importances_ attribute. Tree models are great for feature selection.
import sklearn.feature_selection as fs

# gb is a fitted GBDT (GradientBoostingClassifier) model
model = fs.SelectFromModel(gb, prefit=True)
# X is your feature matrix, X_trans is the new feature matrix.
X_trans = model.transform(X)
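The threshold parameter of SelectFromModel controls the cutoff on the importances; besides a float, it accepts strings such as "mean" or "median". For example, to keep only the features whose importance is above the median importance:

import sklearn.feature_selection as fs

# keep only the features whose importance is above the median importance
model = fs.SelectFromModel(gb, prefit=True, threshold="median")
X_trans = model.transform(X)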
import sklearn.feature_selection as fs
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
import sklearn.metrics as metrics

X, y = datasets.make_classification(n_samples=500,
                                    n_features=20,
                                    n_informative=6,
                                    random_state=21)

gb = GradientBoostingClassifier(n_estimators=20)
gb.fit(X, y)
print("The feature importances of GBDT")
print(gb.feature_importances_)

model = fs.SelectFromModel(gb, prefit=True)
X_trans = model.transform(X)
print("The shape of original data is {}".format(X.shape))
print("The shape of transformed data is {}".format(X_trans.shape))
- A dataset is created at line 7.
- Then a GBDT object is created at line 12 from GradientBoostingClassifier and fit at line 13.
- The output of line 15 shows the importance of each feature; the larger the number, the higher the importance.
- Line 17 shows how to use another model to select features with SelectFromModel. All you have to do is pass the GBDT object. prefit=True means that this model has already been fit.
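SelectFromModel is not limited to tree models. As a quick sketch (not covered further in this lesson), any estimator that exposes coef_ works as well; for example, an L1-penalized logistic regression drives the coefficients of uninformative features toward zero, and SelectFromModel then keeps only the features with non-negligible coefficients:

import sklearn.feature_selection as fs
import sklearn.datasets as datasets
from sklearn.linear_model import LogisticRegression

X, y = datasets.make_classification(n_samples=500, n_features=20,
                                    n_informative=6, random_state=21)

# the L1 penalty produces sparse coefficients, so coef_ can be used for selection
lasso_lr = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso_lr.fit(X, y)

model = fs.SelectFromModel(lasso_lr, prefit=True)
X_trans = model.transform(X)
print("The shape of transformed data is {}".format(X_trans.shape))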
We recommend you launch the widget to open the Jupyter file below, which contains more content and interactive operations.