Exercise: F-test and Univariate Feature Selection
Learn how to perform univariate feature selection using the F-test.
Univariate feature selection using F-test
In this exercise, we’ll use the F-test to examine the relationship between the features and response variable. We will use this method to do what is called univariate feature selection: the practice of testing features one by one against the response variable, to see which ones have predictive power. Perform the following steps to complete the exercise:
- Our first step in doing the ANOVA F-test is to separate out the features and response as NumPy arrays, taking advantage of the list we created, as well as integer indexing in pandas:
    X = df[features_response].iloc[:,:-1].values
    y = df[features_response].iloc[:,-1].values
    print(X.shape, y.shape)
The output should show the shapes of the features and response:
    (26664, 17) (26664,)
There are 17 features, and both the features and response arrays have the same number of samples as expected.
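The slicing pattern in this step can be tried on a small stand-in DataFrame. The following is a minimal sketch, assuming a hypothetical three-column frame; the names toy_df, toy_columns, feature_a, feature_b, and response are illustrative and not part of the exercise data:

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only
toy_df = pd.DataFrame({
    'feature_a': [1.0, 2.0, 3.0, 4.0],
    'feature_b': [10, 20, 30, 40],
    'response': [0, 1, 0, 1],
})

# Column list with the response last, mirroring features_response
toy_columns = ['feature_a', 'feature_b', 'response']

# All rows, every column except the last -> 2D feature array
X_toy = toy_df[toy_columns].iloc[:, :-1].values
# All rows, last column only -> 1D response array
y_toy = toy_df[toy_columns].iloc[:, -1].values

print(X_toy.shape, y_toy.shape)  # (4, 2) (4,)
```

Because the response is the last entry in the column list, the same two slices work no matter how many feature columns precede it.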
- Import the f_classif function and feed in the features and response:
    from sklearn.feature_selection import f_classif
    [f_stat, f_p_value] = f_classif(X, y)
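For intuition about what f_classif computes, it can help to compare it with a one-way ANOVA run separately on each feature. The sketch below assumes synthetic data (not the exercise dataset) and uses scipy.stats.f_oneway for the per-feature test; the names X_demo, y_demo, f_stat_demo, and p_demo are made up for illustration:

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)  # fixed seed; synthetic data only

# 200 samples, 3 features, binary response
y_demo = rng.integers(0, 2, size=200)
X_demo = rng.normal(size=(200, 3))
# Shift the first feature for the positive class so it carries signal
X_demo[:, 0] += 1.5 * y_demo

# One call returns an F-statistic and a p-value per feature
f_stat_demo, p_demo = f_classif(X_demo, y_demo)

# Repeat the test one feature at a time, grouping samples by class
for j in range(X_demo.shape[1]):
    f_j, p_j = f_oneway(X_demo[y_demo == 0, j], X_demo[y_demo == 1, j])
    assert np.isclose(f_j, f_stat_demo[j])
    assert np.isclose(p_j, p_demo[j])

print(f_stat_demo)
print(p_demo)
```

In this sketch only the first feature differs between the classes, so it should receive by far the largest F-statistic and the smallest p-value.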
There are two outputs from f_classif: the F-statistic and the p-value for the comparison of each feature to the response variable. Let’s create a new DataFrame containing the feature names and these outputs to facilitate our inspection. One way to specify a new DataFrame is by using a dictionary, with key/value pairs of column names and the data to be contained in each column. We show the DataFrame sorted (ascending) on p-value.
- Use this code to create a DataFrame of feature names, F-statistics, and p-values, and show it sorted on p-value:
    f_test_df = pd.DataFrame({'Feature': features_response[:-1],
                              'F statistic': f_stat,
                              'p value': f_p_value})
    f_test_df.sort_values('p value')
The output should be a table with one row per feature, showing its F statistic and p value, with the smallest p values at the top.
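Once features are ranked this way, selecting the top few is a short extra step. The following is a minimal sketch on synthetic data, assuming scikit-learn's SelectKBest as the selection tool; the exercise above does not use it, and the names X_syn, y_syn, names_syn, table_syn, and the choice k=2 are hypothetical:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the exercise data (assumption, not the real dataset)
X_syn, y_syn = make_classification(
    n_samples=500, n_features=6, n_informative=2, n_redundant=0,
    random_state=1)
names_syn = [f'feature_{i}' for i in range(X_syn.shape[1])]

# Same table as in the exercise: one row per feature, sorted by p-value
f_syn, p_syn = f_classif(X_syn, y_syn)
table_syn = pd.DataFrame(
    {'Feature': names_syn, 'F statistic': f_syn, 'p value': p_syn}
).sort_values('p value')
print(table_syn)

# Keep the k features with the highest F-statistics
selector = SelectKBest(score_func=f_classif, k=2).fit(X_syn, y_syn)
selected = [name for name, keep in zip(names_syn, selector.get_support())
            if keep]
print(selected)
```

Since every feature is tested with the same number of samples and classes, a larger F-statistic corresponds to a smaller p-value, so the selected features should match the first rows of the sorted table.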