Feature Selection (Filter Methods)

In this lesson, you will learn about Feature Selection, the process of choosing the most relevant features for building a model.

Feature Selection

Feature (or Variable) Selection refers to the process of selecting the features used to predict the target or output. Its purpose is to select the features that contribute the most to output prediction. The following line from the abstract of an article in a machine learning journal sums up the purpose of Feature Selection.

The objective of variable selection is three-fold: improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data.

The following benefits of Feature Selection are usually cited:

  • Reduces overfitting: Overfitting was explained in previous lessons. If a model is overfitting, reducing the number of features is one way to mitigate it.

  • Improves accuracy: A model that overfits less performs better on unseen data, so feature selection can ultimately improve the model's accuracy.

  • Reduces training time: Fewer features mean less data to process, so training is faster.

Feature Selection methods fall into several categories; this lesson covers Filter Methods.

Filter Methods

Filter Methods select features based on their statistical scores with respect to the output column. The selection is independent of any Machine Learning algorithm. The following rules of thumb apply:

  • The more strongly a feature is correlated with the output column (the column to be predicted), the more it tends to improve the model's performance.

  • Features should be minimally correlated with each other. When some input features are correlated with other input features, the situation is known as multicollinearity. Removing such redundancy is recommended for better model performance.
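The two rules of thumb above can be sketched with a correlation check. This is a minimal illustration using pandas with a small hypothetical dataset (the column names `f1`, `f2`, `f3`, `target` and the 0.95 cutoff are assumptions for the example, not fixed values):

```python
import pandas as pd

# Hypothetical toy dataset: f2 is almost an exact multiple of f1,
# so the pair (f1, f2) is multicollinear.
df = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [2.1, 4.2, 6.1, 8.3, 10.2],
    "f3": [5.0, 3.0, 4.0, 2.0, 1.0],
    "target": [1.2, 2.1, 3.2, 3.9, 5.1],
})

# Rule 1: correlation of each feature with the target.
# A higher |r| suggests the feature is more useful for prediction.
target_corr = df.corr()["target"].drop("target").abs().sort_values(ascending=False)
print(target_corr)

# Rule 2: pairwise feature correlations. |r| close to 1 between two
# features signals multicollinearity; one of the pair can be dropped.
feature_corr = df.drop(columns="target").corr().abs()
high_pairs = [
    (a, b)
    for i, a in enumerate(feature_corr.columns)
    for b in feature_corr.columns[i + 1:]
    if feature_corr.loc[a, b] > 0.95
]
print(high_pairs)  # [('f1', 'f2')]
```

In practice the threshold for "too correlated" is a judgment call; 0.95 here is only a placeholder.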

Removing features with low variance

In Scikit-learn, VarianceThreshold is a simple baseline approach to Feature Selection. It removes all features whose variance does not meet a given threshold. By default, this removes features that have the same value ...
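A minimal sketch of this baseline, using a toy array in which the first column is constant (zero variance) and therefore carries no information:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data: column 0 is constant, so its variance is zero.
X = np.array([
    [0, 2.0, 0.1],
    [0, 1.0, 0.4],
    [0, 3.0, 0.2],
    [0, 2.5, 0.3],
])

# The default threshold of 0.0 drops only zero-variance features.
selector = VarianceThreshold(threshold=0.0)
X_reduced = selector.fit_transform(X)

print(X_reduced.shape)         # (4, 2) -- the constant column is gone
print(selector.get_support())  # [False  True  True]
```

Note that VarianceThreshold looks only at the features themselves, never at the target, which is what makes it a filter method.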