Logistic Regression

In this lesson, we will use logistic regression to do the classification task.

What is logistic regression?

Logistic regression is a Machine Learning classification algorithm that is used to predict the probability of certain classes based on some dependent variables. In short, the logistic regression model computes a sum of the input feature (in most cases, there is a bias term), and calculates the logistic of the result.

The output of a logistic regression is always between (0, 1), which is suitable for a binary classification task. The higher the value, the higher the probability that the current sample is classified as class=1, and vice versa.

hθ(x)=11+eθxh_{\theta}(x) = \frac{1}{1+e^{-\theta x}}

As the formula above shows, θ\theta is the parameter we want to learn or train or optimize and xx is the input data. The output is the prediction value when the value is closer to 1, which means the instance is more likely to be a positive sample(y=1). If the value is closer to 0, this means the instance is more likely to be a negative sample(y=0).

To optimize our task, we need to define a loss function(cost or objective function) for this task. In logistic regression, we use the log-likelihood loss function.

J(θ)=1mi=1m(yilog(pi)+(1yi)log(1pi))J(\theta)= - \frac{1}{m} \sum_{i=1}^{m}(y^{i}*log(p^{i})+(1-y^{i})log(1-p^{i}))

mm is the number of samples in the training data. yiy^{i} is the label of the i-th sample, pip^{i} is the prediction value of the i-th sample. When the current sample’s label is 1, then the second term of the formula is 0. We hope the larger the first term, the better, and vice versa. Finally, we add the loss of all samples, take the average, and add a negative sign. Our goal is to minimize the J(θ)J(\theta). When J(θ)J(\theta) is smaller, it means that the model fits better on the data set.

There is no closed-form method to find θ\theta. To achieve this goal, we need to use some optimization algorithms, such as gradient descent. Since J(θ)J(\theta) is a convex function, the gradient descent is guaranteed to find a global minimum.

Let’s start coding.

Get hands-on with 1300+ tech skills courses.