Logistic regression is a supervised classification algorithm whose output takes only discrete values. The model it builds is regression-based and predicts the probability that a given data entry belongs to the category labeled 1 rather than 0. These labels can stand for any of the categories used to classify the data.
Imagine a situation where you have to decide whether an email is spam (1) or not (0).
Logistic regression introduces a decision threshold. The threshold matters because it directly determines how the model's output is turned into a class: the predicted probability always lies strictly between 0 and 1, and a data entry is assigned to class 1 when its probability reaches the threshold and to class 0 otherwise.
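To make the role of the threshold concrete, here is a minimal sketch in Python. The function name classify, the default threshold of 0.5, and the probabilities are illustrative assumptions, not part of any particular library.

```python
def classify(probability, threshold=0.5):
    """Return 1 (e.g. spam) if the probability reaches the threshold, else 0."""
    return 1 if probability >= threshold else 0


# Hypothetical predicted spam probabilities for four emails (made-up values).
probabilities = [0.10, 0.45, 0.55, 0.90]

# The same probabilities lead to different classes under different thresholds.
labels_default = [classify(p) for p in probabilities]
labels_strict = [classify(p, threshold=0.8) for p in probabilities]

print(labels_default)
print(labels_strict)
```

Raising the threshold makes the model more reluctant to label an email as spam, so fewer emails end up in class 1.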
Logistic regression is used when the target variable is categorical.
Returning to the spam example, a linear model is a poor fit because its output is unbounded: it can fall below 0 or rise above 1, so it cannot be read as a probability and no meaningful classification threshold can be set on it. Therefore, we use logistic regression.
Linear regression assumes that the data follows a linear function, while logistic regression passes a linear combination of the inputs through the sigmoid function, which maps any real number to a value between 0 and 1.
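The contrast between the two models can be sketched in a few lines of Python. The weight and bias values below are arbitrary assumptions chosen for illustration; in practice they would be learned from training data.

```python
import math


def linear_score(x, weight=2.0, bias=-1.0):
    """Unbounded linear output: can be any real number."""
    return weight * x + bias


def sigmoid(z):
    """Standard sigmoid, 1 / (1 + e^(-z)), which lies strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(x, weight=2.0, bias=-1.0):
    """Logistic regression output: the sigmoid applied to the linear score."""
    return sigmoid(linear_score(x, weight, bias))


for x in (-10.0, 0.0, 0.5, 10.0):
    print(x, linear_score(x), predict_probability(x))
```

The linear score grows without limit as x moves away from zero, while the sigmoid output stays between 0 and 1 and can therefore be compared against a threshold.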
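The choice of threshold depends mainly on two metrics, precision and recall. Precision is the fraction of entries predicted as positive that really are positive, and recall is the fraction of actual positives that the model identifies. Ideally, we want both precision and recall to equal 1. However, this is rarely achievable in practice, so we settle on a trade-off between them.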
Let’s discuss how we can decide their values:
Low Precision/High Recall Example: We never want a patient who has a disease to be classified as “not affected,” even if that means some healthy patients are wrongly flagged and need further examination. Therefore, when we want to reduce the number of false negatives without necessarily reducing the number of false positives, we choose a decision threshold that gives low precision and high recall.
High Precision/Low Recall Example: A company has released a new advertisement and is classifying whether or not people will react positively to it. In this case, the company needs to be confident that the people it predicts will react positively actually do. Therefore, we want to reduce the number of false positives without necessarily reducing the number of false negatives, so we choose a decision threshold that gives high precision and low recall.
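The trade-off in the two examples above can be reproduced on a toy data set. This is a minimal sketch: the helper precision_recall, the labels, and the predicted probabilities are all made up for illustration.

```python
def precision_recall(y_true, probabilities, threshold):
    """Compute (precision, recall) for class 1 at a given decision threshold."""
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    pairs = list(zip(y_true, predictions))
    tp = sum(1 for y, pred in pairs if y == 1 and pred == 1)  # true positives
    fp = sum(1 for y, pred in pairs if y == 0 and pred == 1)  # false positives
    fn = sum(1 for y, pred in pairs if y == 1 and pred == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical true labels and predicted probabilities for eight entries.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
probabilities = [0.95, 0.80, 0.60, 0.55, 0.40, 0.35, 0.20, 0.05]

low_threshold = precision_recall(y_true, probabilities, 0.3)
high_threshold = precision_recall(y_true, probabilities, 0.7)

print("threshold 0.3:", low_threshold)
print("threshold 0.7:", high_threshold)
```

On this toy data, the lower threshold catches every positive entry at the cost of some false positives (lower precision, higher recall), while the higher threshold produces no false positives but misses half of the positives (higher precision, lower recall).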
Logistic regression can be classified into 3 categories:
Binomial: The target variable has only 2 possible types, 0 or 1. These types may represent “win” vs. “loss,” “pass” vs. “fail,” etc.
Multinomial: The target variable can have 3 or more possible types that are not ordered, since the types have no quantitative significance (e.g., “disease A” vs. “disease B” vs. “disease C”).
Ordinal: The target variable has ordered categories. For example, a test score can be categorized as “very poor,” “poor,” “good,” or “very good.” Each category is then given a score that reflects its position in the order.
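As a small illustration of the ordinal case, the test-score categories can be mapped to scores that preserve their order. The specific numbers 0 to 3 and the names ORDINAL_SCORES and encode_ordinal are assumptions made for this sketch; any increasing sequence would express the same ordering.

```python
# Hypothetical order-preserving scores for the test-score categories.
ORDINAL_SCORES = {"very poor": 0, "poor": 1, "good": 2, "very good": 3}


def encode_ordinal(labels):
    """Replace each category name with its score; raises KeyError on unknown labels."""
    return [ORDINAL_SCORES[label] for label in labels]


encoded = encode_ordinal(["good", "very poor", "very good"])
print(encoded)
```

For a multinomial target such as “disease A” vs. “disease B” vs. “disease C,” a mapping like this would be misleading, because it would impose an order that the categories do not have.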