Univariate Feature Selection

Learn about univariate feature selection, a technique for testing features one by one against the response variable.

What it does and doesn’t do

In this chapter, we have learned techniques for going through features one by one to see whether they have predictive power. This is a good first step, and if you already have features that are very predictive of the outcome variable, you may not need to spend much more time considering features before modeling. However, there are drawbacks to univariate feature selection. In particular, it does not consider the interactions between features. For example, what if the credit default rate is very high specifically for people with both a certain education level and a certain range of credit limit?
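
To make the interaction problem concrete, here is a minimal sketch on made-up toy data (not the credit dataset). It uses Pearson correlation as a stand-in for a linear univariate test, and the names x1, x2, and interaction are hypothetical. This is an extreme case, chosen so that each feature looks useless on its own while their combination determines the response completely:

```python
import numpy as np

# Hypothetical toy data: every combination of two binary features, repeated.
x1 = np.array([0, 0, 1, 1] * 25)
x2 = np.array([0, 1, 0, 1] * 25)

# The response is 1 only when exactly one of the two features is 1.
y = (x1 != x2).astype(int)

# Tested one at a time, each feature has zero linear correlation with y.
corr_x1 = np.corrcoef(x1, y)[0, 1]
corr_x2 = np.corrcoef(x2, y)[0, 1]

# An engineered interaction feature: the product of the features after
# recoding them from {0, 1} to {-1, +1}.
interaction = (2 * x1 - 1) * (2 * x2 - 1)
corr_interaction = np.corrcoef(interaction, y)[0, 1]

print(corr_x1, corr_x2, corr_interaction)
```

In this sketch, neither feature would pass a univariate screen, yet the interaction feature is perfectly (negatively) correlated with the response.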

Also, the methods we used here capture only the linear effects of features. If a feature is more predictive after some type of transformation, such as a polynomial or logarithmic transformation, or binning (discretization), linear techniques of univariate feature selection may not be effective. Interactions and transformations are examples of feature engineering: creating new features, in these cases from existing ones. The shortcomings of linear feature selection methods can be remedied by non-linear modeling techniques, including decision trees and methods based on them, which we will examine later. Still, looking for simple relationships with linear methods of univariate feature selection is valuable and quick to do.
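
The effect of a transformation can be sketched in the same spirit. The data below is invented for illustration, and Pearson correlation again stands in for a linear univariate test. The response depends only on how far the feature is from zero, so the raw feature shows no linear relationship, while a polynomial transformation (squaring) reveals a strong one:

```python
import numpy as np

# Hypothetical feature: evenly spaced values, symmetric around zero.
x = np.arange(-30, 31) / 10

# The response is 1 when the feature is far from zero in either direction.
y = (np.abs(x) > 1.5).astype(int)

# Linear correlation of the raw feature with the response: close to zero.
corr_raw = np.corrcoef(x, y)[0, 1]

# Linear correlation after squaring the feature: strongly positive.
corr_squared = np.corrcoef(x ** 2, y)[0, 1]

print(corr_raw, corr_squared)
```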

Understanding logistic regression and the sigmoid function

In this section, we will open the “black box” of logistic regression all the way: we will gain a comprehensive understanding of how it works. We’ll start off by introducing a new programming concept: functions. At the same time, we’ll learn about a mathematical function, the sigmoid function, which plays a key role in logistic regression.

Python functions

In the most basic sense, a function in computer programming is a piece of code that takes inputs and produces outputs. You have been using functions throughout the course: functions that were written by someone else. Any time that you use syntax such as output = do_something_to(input), you have used a function. For example, NumPy has a function you can use to calculate the mean of the input:

import numpy as np
np.mean([1, 2, 3, 4, 5])
# 3.0

Functions abstract away the operations being performed so that, in our example, you don’t need to see all the lines of code it takes to calculate a mean every time you need one. For many common mathematical functions, there are already pre-defined versions available in packages such as NumPy. You do not need to “reinvent the wheel.” The implementations in popular packages are likely popular for a reason: people have spent time thinking about how to create them in the most efficient way. So, it would be wise to use them. However, because all the packages we are using are open source, you can look at the code within any of them if you are interested in seeing how their functions are implemented.

Now, for the sake of illustration, let’s learn Python function syntax by writing our own function for the arithmetic mean. Function syntax in Python is similar to for or if blocks, in that the body of a function is indented and the declaration of the function is followed by a colon. Here is the code for a function to compute the mean:

def my_mean(input_argument):
    output = sum(input_argument) / len(input_argument)
    return output

After you execute the code cell with this definition, the function is available to you in other code cells in the notebook. Take the following example:

my_mean([1, 2, 3, 4, 5])
# 3.0

The first part of defining a function, as shown here, is to start a line of code with def, followed by a space, followed by the name you’d like to call the function. After this come parentheses, inside which the names of the parameters of the function are specified. Parameters are names of the input variables, where these names are internal to the body of the function: the variable names defined as parameters are available within the function when it is called (used), but not outside the function. There can be more than one parameter; they would be comma-separated. After the parentheses comes a colon.

The body of the function is indented and can contain any code that operates on the inputs. Once these operations are done, the last line should start with return and contain the output variable(s), comma-separated if there is more than one. We are leaving out many fine points in this very simple introduction to functions, but those are the essential parts you need to get started.
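
As a small sketch of the points above about multiple parameters and multiple outputs, consider the following function. The name mean_and_range and its parameters are made up for this illustration and are not part of any library:

```python
def mean_and_range(values, decimals):
    # Two parameters, separated by a comma, are available inside the body.
    mean_value = round(sum(values) / len(values), decimals)
    value_range = max(values) - min(values)
    # Two outputs, separated by a comma, are returned together.
    return mean_value, value_range

# The two outputs can be assigned to two variables in one statement.
average, spread = mean_and_range([1, 2, 3, 4, 5], 1)
print(average, spread)
# 3.0 4
```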

The power of a function comes when you use it. Notice how after we define the function, in a separate code block we can call it by the name we’ve given it, and it operates on whatever inputs we pass it. It’s as if we’ve copied and pasted all the code to this new location. But it looks much nicer than actually doing that. And if you are going to use the same code many times, a function can greatly reduce the overall length of your code.

As a brief additional note, you can optionally specify the inputs using the parameter names explicitly, which can be clearer when there are many inputs:

my_mean(input_argument=[1, 2, 3])
# 2.0

The sigmoid function

Now that we’re familiar with the basics of Python functions, we are going to consider a mathematical function that’s important to logistic regression, called sigmoid. This function may also be called the logistic function. The definition of sigmoid is as follows:

f(X) = \text{sigmoid}(X) = \frac{1}{1+e^{-X}}
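
Combining the two ideas from this lesson, the definition above can be written as a Python function. This is one possible sketch, assuming NumPy's np.exp for the exponential so that the function works on arrays as well as single numbers; the variable name X_values is only for illustration:

```python
import numpy as np

def sigmoid(X):
    # Direct translation of the definition: 1 / (1 + e^(-X)).
    return 1 / (1 + np.exp(-X))

# An input of 0 gives exactly 0.5, the midpoint of the function's range.
print(sigmoid(0))
# 0.5

# Large negative inputs approach 0 and large positive inputs approach 1.
X_values = np.array([-4, 0, 4])
print(sigmoid(X_values))
```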
