Assumptions of Logistic Regression

Because it is a classical statistical model, similar to the F-test and Pearson correlation we already examined, logistic regression makes certain assumptions about the data. While it’s not necessary to satisfy every one of these assumptions in the strictest possible sense, it’s good to be aware of them. That way, if a logistic regression model is not performing well, you can investigate why, using your knowledge of the ideal situation that logistic regression is designed for. Different resources may list the specific assumptions slightly differently, but those given here are widely accepted.

The four assumptions of logistic regression

Each of these four assumptions is described below.

Features are linear in the log odds

Logistic regression is a linear model, so it will only work well as long as the features effectively describe a linear trend in the log odds of the response. In particular, logistic regression won’t capture interactions, polynomial terms, or discretized versions of features on its own. You can, however, supply any of these as new features, even though they are engineered from existing ones.
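As a minimal sketch of this idea, here is how interaction, polynomial, and discretized terms can be added as new columns before fitting. The column names and data are hypothetical stand-ins, not the case study data:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical raw features and labels; real data would come from the dataset.
X = pd.DataFrame({
    'LIMIT_BAL': [20000, 120000, 90000, 50000, 60000, 200000],
    'AGE': [24, 26, 34, 57, 37, 29],
})
y = [1, 1, 0, 0, 1, 0]

# Engineer nonlinear terms as new columns, since logistic regression
# will not discover them on its own.
X['LIMIT_BAL_x_AGE'] = X['LIMIT_BAL'] * X['AGE']          # interaction
X['AGE_SQUARED'] = X['AGE'] ** 2                          # polynomial term
X['AGE_BINNED'] = pd.cut(X['AGE'], bins=3, labels=False)  # discretization

# The model is still linear in the log odds, but now in the
# engineered feature space.
lr = LogisticRegression(solver='liblinear')
lr.fit(X, y)
```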

Remember from the previous section that PAY_1, the most important feature identified during univariate feature exploration, was not found to be linear in the log odds.
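One way to check this assumption is to compute the empirical log odds of the response at each value of a feature and plot them. Here is a minimal sketch using synthetic stand-in data; the binary target column name 'default' is an assumption, and in practice you would load the case study data instead:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Synthetic stand-in for the case study data: PAY_1 takes small integer
# values, and values of 1 or more jump to a much higher default rate,
# mimicking a nonlinear pattern like the one observed for PAY_1.
pay_1 = rng.integers(-1, 6, size=5000)
prob_default = np.where(pay_1 >= 1, 0.55, 0.15)
df = pd.DataFrame({'PAY_1': pay_1,
                   'default': rng.binomial(1, prob_default)})

# Empirical default rate p at each value of PAY_1, converted to
# log odds: log(p / (1 - p)). Clipping avoids division by zero.
p = df.groupby('PAY_1')['default'].mean().clip(1e-4, 1 - 1e-4)
log_odds = np.log(p / (1 - p))

# If the assumption held, these points would fall close to a straight line.
plt.plot(log_odds.index, log_odds.values, marker='o')
plt.xlabel('PAY_1')
plt.ylabel('Empirical log odds of default')
plt.show()
```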

No multicollinearity of features

Multicollinearity means that features are correlated with each other. The worst violation of this assumption occurs when features are perfectly correlated, for example, when one feature is identical to another, or when one feature equals another multiplied by a constant. We can investigate the correlation of features using the correlation plot that we’re already familiar with from univariate feature selection, as shown in the previous section.
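For reference, here is a minimal sketch of how such a correlation plot can be produced with pandas and matplotlib. The features are synthetic, with one deliberately perfect correlation built in, and the actual plotting code from the previous section may differ:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Synthetic features, including one deliberate multicollinearity:
# FEATURE_C is just FEATURE_A multiplied by a constant.
X = pd.DataFrame({'FEATURE_A': rng.normal(size=200),
                  'FEATURE_B': rng.normal(size=200)})
X['FEATURE_C'] = 2.0 * X['FEATURE_A']

corr = X.corr()  # pairwise Pearson correlations among the features

# Display the correlation matrix as a heatmap; a value of 1 or -1
# off the diagonal indicates perfectly correlated features.
fig, ax = plt.subplots()
im = ax.imshow(corr, vmin=-1, vmax=1, cmap='coolwarm')
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax, label='Pearson correlation')
plt.tight_layout()
plt.show()
```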
