Exercise: F-test and Univariate Feature Selection

Learn how to use the F-test for univariate feature selection.

Univariate feature selection using F-test

In this exercise, we’ll use the F-test to examine the relationship between the features and response variable. We will use this method to do what is called univariate feature selection: the practice of testing features one by one against the response variable, to see which ones have predictive power. Perform the following steps to complete the exercise:

  1. Our first step in doing the ANOVA F-test is to separate out the features and response as NumPy arrays, taking advantage of the list we created, as well as integer indexing in pandas:

    X = df[features_response].iloc[:,:-1].values
    y = df[features_response].iloc[:,-1].values
    print(X.shape, y.shape)
    

    The output should show the shapes of the features and response:

    # (26664, 17) (26664,)
    

    There are 17 features, and both the features and response arrays have the same number of samples as expected.

  2. Import the f_classif function and feed in the features and response:

    from sklearn.feature_selection import f_classif
    f_stat, f_p_value = f_classif(X, y)
    

    There are two outputs from f_classif: the F-statistic and the p-value, for the comparison of each feature to the response variable. Let’s create a new DataFrame containing the feature names and these outputs, to facilitate our inspection. One way to specify a new DataFrame is by using a dictionary, with key/value pairs of column names and the data to be contained in each column. We show the DataFrame sorted (ascending) on p-value.

  3. Use this code to create a DataFrame of feature names, F-statistics, and p-values, and show it sorted on p-value:

    f_test_df = pd.DataFrame({'Feature': features_response[:-1],
                              'F statistic': f_stat,
                              'p value': f_p_value})
    f_test_df.sort_values('p value')
    

    The output should look like this:

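For readers who don't have the exercise's `df` at hand, the same workflow can be sketched on synthetic data. This is only an illustration under stated assumptions: the data, the feature names (`noise_a`, `informative`, `noise_b`), and the sample size are made up and are not the case-study dataset. The last lines cross-check `f_classif` against SciPy's one-way ANOVA, `scipy.stats.f_oneway`, applied to each feature split by class.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway
from sklearn.feature_selection import f_classif

# Synthetic stand-in for the exercise data (assumption: NOT the real dataset).
rng = np.random.default_rng(0)
n_samples = 1000
y = rng.integers(0, 2, size=n_samples)                # binary response
informative = rng.normal(size=n_samples) + 1.5 * y    # mean shifts with the class
noise_a = rng.normal(size=n_samples)                  # unrelated to y
noise_b = rng.normal(size=n_samples)                  # unrelated to y

X = np.column_stack([noise_a, informative, noise_b])
feature_names = ['noise_a', 'informative', 'noise_b']  # hypothetical names

# One F-statistic and one p-value per feature (column of X).
f_stat, f_p_value = f_classif(X, y)

# Collect the results in a DataFrame and sort ascending on p-value.
f_test_df = pd.DataFrame({'Feature': feature_names,
                          'F statistic': f_stat,
                          'p value': f_p_value})
f_test_sorted = f_test_df.sort_values('p value')
print(f_test_sorted)

# Cross-check: test each feature on its own by comparing its values
# in the two classes with a one-way ANOVA.
scipy_f = np.array([f_oneway(X[y == 0, j], X[y == 1, j]).statistic
                    for j in range(X.shape[1])])
print(np.allclose(f_stat, scipy_f))
```

In this sketch, the feature whose mean differs between the two classes should end up at the top of the sorted table with the smallest p-value, which is the pattern to look for when reading the table for the real data.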