Feature Selection
Feature selection is a crucial step in a Machine Learning project. In this lesson, we will look at some common methods for feature selection.
What is feature selection?
In Machine Learning, feature selection is used to select a subset of relevant features (variables, predictors, etc.) for use in model construction. It is an important step in a Machine Learning project and is also part of feature engineering. Feature selection matters for the following reasons:
- To reduce the training time. Training time and the size of the feature space are positively correlated.
- To avoid the curse of dimensionality.
- To make the model simpler.
- To improve generalization and reduce overfitting.
- To reduce collinearity and enhance interpretability.
When you get a dataset (table-like data), every column is a feature, but not all columns are useful or relevant, so it is worth spending some time on feature selection. The central premise of any feature selection technique is that the data contains some features that are either redundant or irrelevant and can thus be removed without incurring much loss of information.
There are many ways to do feature selection. sklearn provides many functions for it, which we will cover in the following sections.
Remove features with low variance
What does it mean for the variance of a feature to be zero? It means that this feature has only one value: all instances share the same value on this feature. In other words, this feature does not carry any information and contributes nothing to the prediction of the target. Similarly, features with low variance carry little information about the target, so we can remove them without reducing the performance of the model.
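For intuition, you can compute the per-column variance yourself. Here is a minimal sketch (the matrix below is made up for illustration) in which the first column is constant and therefore has zero variance:

import numpy as np

# The first column takes the same value in every row, so its variance is 0
# and it cannot help distinguish the instances.
X = np.array([[1, 0, 3],
              [1, 2, 1],
              [1, 1, 2]])
print(np.var(X, axis=0))  # [0.         0.66666667 0.66666667]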
sklearn provides VarianceThreshold to remove low-variance features, and its threshold parameter allows you to control the variance cutoff.
import sklearn.feature_selection as fs
# X is your feature matrix
var = fs.VarianceThreshold(threshold=0.2)
var.fit(X)
X_trans = var.transform(X)
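After fitting, you can inspect which columns survived the threshold; the selector's get_support method returns a boolean mask over the original columns (or their indices with indices=True):

# var is the fitted VarianceThreshold from above
print(var.get_support())              # boolean mask of the retained columns
print(var.get_support(indices=True))  # indices of the retained columns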
You can try the code example below. As you can see, only one instance has a different value in the first feature, so its variance falls below the threshold and the first column is removed.
import sklearn.feature_selection as fs
import numpy as np
X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]])

# remove columns whose variance is below 0.2
var = fs.VarianceThreshold(threshold=0.2)
var.fit(X)
X_trans = var.transform(X)

print("The original data")
print(X)
print("The processed data by variance threshold")
print(X_trans)
- Line 3 creates a matrix with six rows and three columns.
- A variance threshold object is created at line 6 from VarianceThreshold with the parameter threshold=0.2, which means that columns with a variance less than 0.2 will be removed.
- You can compare the original matrix with the new one in the output of line 12.
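The right threshold depends on the scale of your features. For boolean (0/1) features, a common rule of thumb from the sklearn documentation is to drop features that take the same value in more than, say, 80% of the samples; since a Bernoulli variable has variance p(1 - p), the threshold becomes:

import sklearn.feature_selection as fs

# drop boolean features that are identical in more than 80% of the samples:
# Var[X] = p * (1 - p) = 0.8 * (1 - 0.8) = 0.16
var_bool = fs.VarianceThreshold(threshold=0.8 * (1 - 0.8))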
Select K-best features
sklearn provides a universal selector, SelectKBest, which can select the k best features based on some metric; you only need to provide a score function to define that metric. Luckily, sklearn comes with several predefined callable score functions:
- f_classif: ANOVA F-value between label/feature for classification tasks.
- mutual_info_classif: Mutual information for a discrete target.
- chi2: Chi-squared stats of non-negative features for classification tasks.
- f_regression: F-value between label/feature for regression tasks.
- mutual_info_regression: Mutual information for a continuous target.

There is also SelectFpr, a transformer that selects features based on a false positive rate test.
The core idea here is to calculate some metric between the target and each feature, sort the features by that metric, and then select the K best ones.
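To make this concrete, here is a minimal sketch (not part of the lesson's code) that computes the f_classif scores directly and picks the top three columns by hand; SelectKBest does essentially this for you:

import numpy as np
import sklearn.feature_selection as fs
import sklearn.datasets as datasets

X, y = datasets.make_classification(n_samples=300, n_features=10, n_informative=4)
scores, p_values = fs.f_classif(X, y)  # one F-value and p-value per feature
top_3 = np.argsort(scores)[-3:]        # indices of the 3 highest-scoring features
X_top = X[:, top_3]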
In the example below, we choose f_classif as the metric, and K is three.
import sklearn.feature_selection as fs
import sklearn.datasets as datasets

X, y = datasets.make_classification(n_samples=300, n_features=10, n_informative=4)
# choose f_classif as the metric and K is 3
bk = fs.SelectKBest(fs.f_classif, k=3)
bk.fit(X, y)
X_trans = bk.transform(X)
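After fitting, the selector stores the computed scores, which is handy for checking which features were chosen:

print(bk.scores_)                    # F-value of every original feature
print(bk.get_support(indices=True))  # indices of the 3 selected features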
An important question is how the performance of the model is affected by reducing the number of features. In the example below, let's compare the performance of logistic regression models trained on different numbers of K best features.
As you can see from the plot produced below, the metric doesn't change much when only a few features are removed.
You can have a try with the code below. Just change the number of features when creating the dataset, or change K.
import sklearn.feature_selection as fs
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import sklearn.metrics as metrics
import matplotlib.pyplot as plt

X, y = datasets.make_classification(n_samples=500,
                                    n_features=20,
                                    n_informative=8,
                                    random_state=42)

f1_list = []
for k in range(1, 15):
    bk = fs.SelectKBest(fs.f_classif, k=k)
    bk.fit(X, y)
    X_trans = bk.transform(X)
    train_x, test_x, train_y, test_y = train_test_split(X_trans,
                                                        y,
                                                        test_size=0.2,
                                                        random_state=42)
    lr = LogisticRegression()
    lr.fit(train_x, train_y)
    y_pred = lr.predict(test_x)
    f1 = metrics.f1_score(test_y, y_pred)
    f1_list.append(f1)

fig, axe = plt.subplots(dpi=300)
axe.plot(range(1, 15), f1_list)
axe.set_xlabel("best k features")
axe.set_ylabel("F1-score")
fig.savefig("output/img.png")
plt.close(fig)
- First, we create a classification dataset at line 8 using make_classification.
- Line 14 to line 26 is a loop, for k in range(1, 15). In each iteration, a different value of k is passed to SelectKBest; we want to see how different values of k affect the performance of the model. A logistic regression model is built, fit, and evaluated in each iteration (from line 22 to line 25) using the k selected features. The metric is stored in a list, f1_list. In this demo, we use the F1-score as our metric.
- From line 28 to line 33, we plot those values of k and their corresponding F1-scores.
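One caveat: the loop above fits SelectKBest on the full dataset before splitting, so a little information from the test split leaks into the feature selection. A more careful variant (a sketch, not the lesson's original code) wraps the selector and the classifier in a Pipeline and scores it with cross-validation, so the K best features are re-selected on each training fold only:

import sklearn.feature_selection as fs
import sklearn.datasets as datasets
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = datasets.make_classification(n_samples=500, n_features=20,
                                    n_informative=8, random_state=42)

pipe = Pipeline([
    ("select", fs.SelectKBest(fs.f_classif, k=8)),
    ("clf", LogisticRegression()),
])
# 5-fold cross-validated F1-score; feature selection happens inside each fold
print(cross_val_score(pipe, X, y, scoring="f1", cv=5).mean())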
Select features with another model
SelectFromModel is a meta-transformer that can be used along with any estimator that has a coef_ or feature_importances_ attribute after fitting. Here, however, we focus on tree-based models. You may remember that a tree is split on a single feature according to some metric; based on this metric, you can measure the importance of each feature. This is a property of tree models, so through a tree model we can learn how much each feature contributes to the model.
Notice: The model (GBDT) mentioned here will be discussed in subsequent lectures.
sklearn provides SelectFromModel to do the feature selection. In the code below, notice the first parameter, gb. It's a GBDT model that is used to select features through its feature_importances_ attribute. Tree models are great for feature selection.
import sklearn.feature_selection as fs

# gb is a fitted GBDT (GradientBoostingClassifier) model
model = fs.SelectFromModel(gb, prefit=True)
# X is your feature matrix, X_trans is the new feature matrix.
X_trans = model.transform(X)
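The threshold parameter of SelectFromModel controls the cutoff on the importances; besides a float, it accepts strings such as "mean" or "median". For example, to keep only the features whose importance is above the median importance:

import sklearn.feature_selection as fs

# keep only the features whose importance is above the median importance
model = fs.SelectFromModel(gb, prefit=True, threshold="median")
X_trans = model.transform(X)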
import sklearn.feature_selection as fs
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
import sklearn.metrics as metrics

X, y = datasets.make_classification(n_samples=500,
                                    n_features=20,
                                    n_informative=6,
                                    random_state=21)

gb = GradientBoostingClassifier(n_estimators=20)
gb.fit(X, y)
print("The feature importances of GBDT")
print(gb.feature_importances_)

model = fs.SelectFromModel(gb, prefit=True)
X_trans = model.transform(X)
print("The shape of original data is {}".format(X.shape))
print("The shape of transformed data is {}".format(X_trans.shape))
- A dataset is created at line 7.
- Then a GBDT object is created at line 12 from GradientBoostingClassifier and fit at line 13.
- The output of line 15 shows the importance of each feature; the larger the number, the higher the importance.
- Line 17 shows how to use another model to select features with SelectFromModel. All you have to do is pass the GBDT object. prefit=True means that this model has already been fit.
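SelectFromModel is not limited to tree models. As a quick sketch (not covered further in this lesson), any estimator that exposes coef_ works as well; for example, an L1-penalized logistic regression drives the coefficients of uninformative features toward zero, and SelectFromModel then keeps only the features with non-negligible coefficients:

import sklearn.feature_selection as fs
import sklearn.datasets as datasets
from sklearn.linear_model import LogisticRegression

X, y = datasets.make_classification(n_samples=500, n_features=20,
                                    n_informative=6, random_state=21)

# the L1 penalty produces sparse coefficients, so coef_ can be used for selection
lasso_lr = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso_lr.fit(X, y)

model = fs.SelectFromModel(lasso_lr, prefit=True)
X_trans = model.transform(X)
print("The shape of transformed data is {}".format(X_trans.shape))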
We recommend you launch the widget to open the Jupyter file below, which contains more content and interactive operations.