Feature selection in Python

Feature selection is the process of determining the relevant features that affect the target variable and need to be included in the model. A feature can be as simple as a column in the dataset. It plays a role in describing the entries in a dataset.

Not every feature in the dataset affects the target variable, and keeping irrelevant features can adversely affect the model.

A common method of feature selection in data science is manual filtering, which is mostly applied to numeric features.

Manual filtering

As the name suggests, features that do not affect the target variable are filtered out.

Irrelevant features are identified using a correlation matrix, which holds the pairwise correlation coefficient for every pair of columns in the dataset.

The correlations can be displayed in a heatmap.

  • A value close to +1 or -1 implies a strong positive or a strong negative correlation, respectively.

  • A value closer to 0 implies a very weak correlation.
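To make these ranges concrete, here is a minimal sketch on hypothetical toy data (the column names `x`, `pos`, `neg`, and `weak` are made up for illustration). It assumes pandas and NumPy are available and uses the default Pearson correlation of `DataFrame.corr()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: each column is built to have a known relationship with x
x = np.arange(10)
toy = pd.DataFrame({
    "x": x,
    "pos": 2 * x + 1,          # increases linearly with x
    "neg": -x,                 # decreases linearly with x
    "weak": [1, -1] * 5,       # alternates, almost unrelated to x
})

cor = toy.corr()               # pairwise Pearson correlation matrix
print(cor["x"])                # correlation of every column with x
```

In this sketch, `pos` and `neg` sit at the two extremes of the scale, while `weak` stays close to 0.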

The correlation heatmap can be plotted as shown below:

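import matplotlib.pyplot as plt
import seaborn as sns
# df is assumed to be an existing pandas DataFrame with numeric columns
cor = df.corr()  # pairwise correlation matrix of the columns
sns.heatmap(cor, annot=True)  # annot=True writes each coefficient in its cell
plt.show()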

A threshold, t, has to be chosen for the correlation value. If the absolute value of the correlation between a feature and the target, |v|, is less than t, that feature is filtered out.
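The thresholding step can be sketched as follows. This is only an illustration: the helper `select_by_target_correlation`, the toy columns, and the value t = 0.5 are assumptions, not part of the original method, and a suitable threshold depends on the dataset.

```python
import numpy as np
import pandas as pd

def select_by_target_correlation(df, target, t):
    """Return the features whose |correlation with target| is at least t."""
    cor = df.corr()
    # Absolute correlation of each feature with the target (target itself excluded)
    target_cor = cor[target].drop(target).abs()
    return target_cor[target_cor >= t].index.tolist()

# Hypothetical toy data: y depends on x1 only
x1 = np.arange(10)
toy = pd.DataFrame({
    "x1": x1,
    "x2": [1, -1] * 5,   # alternating values, weakly related to y
    "y": 2 * x1 + 1,
})

selected = select_by_target_correlation(toy, "y", t=0.5)
print(selected)
```

Here `x2` falls below the threshold and is filtered out, so only `x1` remains.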

You should also keep an eye on features that are highly correlated with each other. Such features carry largely redundant information, so only one of them should stay in the selected features.
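One possible way to handle this is a greedy pass over the features, sketched below. The helper `drop_redundant_features`, the pairwise threshold `t_pair=0.9`, and the rule of keeping the feature that correlates more strongly with the target are all assumptions made for illustration; the source does not prescribe which feature of a correlated pair to keep.

```python
import numpy as np
import pandas as pd

def drop_redundant_features(df, features, target, t_pair=0.9):
    """Keep features in order of |correlation with target|, skipping any
    feature whose |correlation| with an already kept feature is >= t_pair."""
    cor = df[features + [target]].corr().abs()
    # Rank features from most to least correlated with the target
    ranked = cor[target].drop(target).sort_values(ascending=False).index
    kept = []
    for f in ranked:
        if all(cor.loc[f, k] < t_pair for k in kept):
            kept.append(f)
    return kept

# Hypothetical toy data: "a" and "b" are near duplicates, "c" is distinct
a = np.arange(10)
c = np.array([1, 0, 2, 1, 0, 2, 1, 0, 2, 1])
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * np.array([1, -1] * 5),
    "c": c,
    "y": 2 * a + c,
})

kept = drop_redundant_features(toy, ["a", "b", "c"], "y")
print(kept)
```

In this sketch only one of the near-duplicate columns `a` and `b` survives, while `c` is kept because it is not strongly correlated with either of them.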

The obtained features can now be used to build the model.
