Feature Selection

Go through the feature selection process, considering different methods.

Before we move forward, it might be a good idea to drop all the error columns from our DataFrame, since they may not add much value to the analysis. Rather than listing them by hand, we can check each column name for the word “error” in a for loop and keep only the columns that pass.

cols_ = []
for item in df.columns:
    if 'error' not in item:
        cols_.append(item)
print(cols_)
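The same filter can also be written in a single line with a list comprehension. Here is a minimal sketch against a hypothetical toy frame (the column names below are made up for illustration, not taken from the actual dataset):

```python
import pandas as pd

# Hypothetical toy frame with a couple of "error" columns for illustration
df = pd.DataFrame([[1.0, 0.1, 2.0, 0.2]],
                  columns=['radius_mean', 'radius_error',
                           'texture_mean', 'texture_error'])

# One-line equivalent of the for loop above
cols_ = [c for c in df.columns if 'error' not in c]
print(cols_)  # ['radius_mean', 'texture_mean']
```

Both versions build the same list; the comprehension is simply the more idiomatic pandas/Python style for this kind of filter.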

So, we have a list of columns, cols_, that excludes the error columns. We can now subset the DataFrame to keep only these columns.

df = df[cols_]  # keeping only the columns we want to work with
print(df)

Let’s explore this data a bit more. Feature selection becomes especially important when a dataset contains many features.

Feature selection

Let’s introduce some common ways to select features based on statistical measures. Analysis of variance (ANOVA) and the chi-square (chi2) test are recommended for feature selection in classification problems. Before we move on, we need to import SelectKBest(), which returns the requested number of top features ranked by a scoring function, such as the chi2 statistic or the ANOVA F-value.

# Let's import the required modules first
from sklearn.feature_selection import SelectKBest, chi2, f_classif
# Separate features and target; we could also use the X and y notation
# Notice a different way of getting this done: one line of code!
features, target = df.drop(['ID', 'diagnosis', 'target'], axis=1), df['target']
print('Shape of features:', features.shape)
print('Shape of target:', target.shape)
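As a sketch of how SelectKBest can then be applied, here is a self-contained example using scikit-learn's built-in breast-cancer dataset as a stand-in for the features/target split above (the choice of k=5 is arbitrary, for illustration only):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

# Built-in dataset as a stand-in for the features/target split above
data = load_breast_cancer()
X, y = data.data, data.target

# Keep the 5 features with the highest ANOVA F-scores
selector = SelectKBest(score_func=f_classif, k=5)
X_top = selector.fit_transform(X, y)

print('Shape before:', X.shape)     # (569, 30)
print('Shape after:', X_top.shape)  # (569, 5)
# get_support() returns a boolean mask over the original columns
print('Selected:', list(data.feature_names[selector.get_support()]))
```

Swapping f_classif for chi2 works the same way, with one caveat: chi2 requires non-negative feature values, which happens to hold for all columns of this dataset.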

In scikit-learn, chi2 computes ...
