Feature Selection
Go through the feature selection process, considering different methods.
Before we move forward, it might be a good idea to drop all the error columns from our DataFrame, since they may not be beneficial. Rather than listing them by hand, we can check each column name for the word “error” using a for loop and build our selection that way.
cols_ = []  # will hold the column names we want to keep
for item in df.columns:
    if 'error' not in item:  # skip any column whose name contains 'error'
        cols_.append(item)
print(cols_)
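The same filter also fits in a single list comprehension, which is a more idiomatic way to express it:

cols_ = [item for item in df.columns if 'error' not in item]  # one-line equivalent of the loop above
print(cols_)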
So, we have a list of columns, cols_, that does not include the error columns. We can separate them now.
df = df[cols_]  # keeping only the columns we want to work with
print(df)
Let’s explore this data a bit more. Feature selection matters especially when working with many features, because irrelevant columns add noise and computational cost without improving the model.
Feature selection
Let’s introduce some common ways to select features based on statistical measures. Analysis of variance (ANOVA) and chi-square (chi2) are recommended for feature selection in classification problems. Before we move on, we need to import SelectKBest(), which returns the requested number of top features scored by a chosen statistic, such as chi2 or the ANOVA F-value.
# Let's import the required modules first
from sklearn.feature_selection import SelectKBest, chi2, f_classif

# Separate features and target; we could use the X and y notation as well.
# Notice a different way of getting this done: one line of code.
features, target = df.drop(['ID', 'diagnosis', 'target'], axis=1), df['target']
print('Shape of features:', features.shape)
print('Shape of target:', target.shape)
In scikit-learn, chi2 computes the chi-squared statistic between each non-negative feature and the class labels, so features with higher scores are more strongly associated with the target.
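As a minimal sketch of how these pieces fit together (assuming the features are non-negative, which chi2 requires, and reusing the features and target variables from above; the choice of k=5 is arbitrary), we can ask SelectKBest() for the top five features and inspect which ones it picked:

# Score each feature against the target with chi2 and keep the top five
selector = SelectKBest(score_func=chi2, k=5)
features_selected = selector.fit_transform(features, target)
print('Shape after selection:', features_selected.shape)

# get_support() returns a boolean mask marking the selected columns
selected_names = features.columns[selector.get_support()]
print('Selected features:', list(selected_names))

Swapping score_func=chi2 for score_func=f_classif would rank the features by the ANOVA F-value instead.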