Linear regression is used to predict an output that is in the form of a real-valued number.
Before we delve deep into the topic of this blog, let’s look at some key takeaways:
Key Takeaways
Linear regression: Predicts continuous values (e.g., weight, price) by fitting a straight line through data.
Logistic regression: Classifies data by predicting the probability of an outcome (e.g., overweight or not).
Key differences: Linear regression predicts values, while logistic regression predicts probabilities for classification. Linear regression needs the input and output variables to be correlated, while logistic regression works best when the input variables are not correlated with each other.
Applications: Linear regression is used for value forecasting; logistic regression is used for classification tasks like weather prediction.
“The coldest winter I have ever spent was a summer in San Francisco.”
The above quote is commonly attributed to Mark Twain. Let’s not digress further and get back to the data, the predictions, and everything related to numbers. We're going to dive deep into daily temperatures, starting with the summer of 2024 in San Francisco and its daily minimum and maximum temperatures.
Let's assume I want to see whether the daily maximum temperature and the daily minimum temperature are related or not. It seems that on cold days, the minimum, as well as the maximum temperatures will go down. Similarly, on warmer days, both these numbers will increase, and my guess will be that these numbers should have some relation between them.
If these numbers are related, then with the knowledge of one of the values, we should be able to predict the other pretty accurately. This is exactly what we do in linear regression. First, we find out whether the two variables are related and what the relation between them is, i.e., we find a linear relation (defined by a straight line) that maps one value to the other.
In the above chart, I plotted the daily minimum and maximum temperatures for the month of September 2024 in San Francisco. Now, we want to check whether the values shown above are related. For this, we use a concept from statistics called correlation, and we ask: are the minimum and maximum temperatures correlated?
If the answer is yes, we can use the data to make predictions. With a high correlation, we can be more confident about our predictions.
The correlation coefficient between two variables is a number between -1 and 1. Values close to zero mean very low correlation. A positive value means that increasing one variable will most likely increase the other; a negative value means that as one variable increases, the other tends to decrease.
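As an illustration, here is how the correlation coefficient can be computed in Python with NumPy. The temperature values below are made up for illustration and are not the actual San Francisco data:

```python
import numpy as np

# Hypothetical daily minimum and maximum temperatures (°F)
min_temps = np.array([48, 50, 47, 52, 55, 53, 49, 51])
max_temps = np.array([58, 61, 57, 63, 66, 64, 59, 62])

# Pearson correlation coefficient: a number between -1 and 1
r = np.corrcoef(min_temps, max_temps)[0, 1]
print(round(r, 3))
```

Because the two made-up series move together almost in lockstep, the coefficient comes out close to 1.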
The correlation coefficient of this data is close to zero.
To keep this blog simple and more accessible, mathematical details have been left out intentionally.
In linear regression, there is another keyword: the regression line. Linear regression draws a line—the regression line—through the data points that passes as close as possible to them, in the sense that the sum of the (squared) distances of these points from the regression line is the least. This line shows the overall trend and where that trend will lead us. Let's look at the regression line on the points that we discussed earlier.
The red line is the trend line and it shows that the maximum temperature will remain close to 70. This does not make much sense, and again, it cannot be used for prediction because the correlation is close to zero.
Now, let's look at similar data, but for the month of January 2024, in the diagram below.
The correlation coefficient for this data is much higher, so the regression line can be used for prediction with more confidence.
As we know, there are 31 days in January. The above plot was constructed using the first 30 days. Now let's see if we can use the minimum temperature of January 31, 2024 to predict the maximum temperature for the same day.
The minimum temperature on that day was 53. If we look at the red line where the minimum temperature is 53, the maximum temperature is around 60.5. In fact, the actual maximum temperature on that day was 60. This is a pretty good guess.
Now, let's list the steps in linear regression:
Separate the variables (input and output).
Load the data (in an Excel sheet or in any programming language).
Find the regression line—the line that best fits the data.
For a new point, use the regression line to predict the output.
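The steps above can be sketched in Python. The temperature values below are invented for illustration, not the actual January data:

```python
import numpy as np

# Steps 1-2: hypothetical input (daily minimum temp) and output (daily maximum temp)
min_temps = np.array([44, 46, 48, 50, 52, 54, 56])
max_temps = np.array([54, 55, 58, 59, 62, 63, 66])

# Step 3: find the regression line (slope and intercept of the best-fit line)
slope, intercept = np.polyfit(min_temps, max_temps, 1)

# Step 4: use the regression line to predict the output for a new point
new_min = 53
predicted_max = slope * new_min + intercept
print(round(predicted_max, 1))
```

With this made-up data, a minimum temperature of 53 maps to a predicted maximum somewhere in the low 60s, mirroring the January example above.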
Now, we want to observe what other information we can extract on the basis of these temperatures.
The diagram above shows minimum and maximum temperatures for the first 30 days of January 2024. Red color indicates that it did not rain on that day, and blue means that it rained. Now, I want to see whether the minimum and maximum temperatures can be used to guess whether it rained on January 31, 2024 or not. The temperature for this day is shown with a black dot.
We can use logistic regression in such scenarios, where we want to predict the possibility of a specific outcome. In fact, logistic regression returns the probability of a particular outcome. It does this by assigning a score to each combination of inputs. This score can range from negative infinity to positive infinity. A score close to zero means that the probability is around 50%. Scores with a high positive value indicate that the event is likely to occur, and scores with a large negative value indicate that the event has a low chance of happening. This score can be converted to a probability, as we'll see later on.
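The score-to-probability conversion is done with the sigmoid (logistic) function. A minimal sketch:

```python
import math

def sigmoid(score):
    """Convert a raw score in (-inf, +inf) into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

print(sigmoid(0))    # a score of zero gives probability exactly 0.5
print(sigmoid(4))    # a large positive score gives a probability near 1
print(sigmoid(-4))   # a large negative score gives a probability near 0
```

Note how the function squashes the entire real line into the interval (0, 1), which is what lets a raw score be read as a probability.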
Logistic regression requires labeled data, such as the one we saw earlier, where each data point is labeled into one of two classes. Let's call them positive and negative. Using this data, we want to classify a new point.
For logistic regression, apply the following steps.
Load the data, along with their class or label.
Divide the data into two parts: training and testing.
Use the training data to train the model, i.e., calculate the model's weights.
Use the testing data to evaluate the effectiveness of the model.
Input the test point to calculate its score, using the weights found above.
Apply the logistic function to convert the score into a probability.
This probability is the probability that the input belongs to a particular class.
If the probability is greater than 1/2, the function returns a positive result, otherwise it returns negative. In case of multiple classes, the model returns the class with the highest probability.
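The steps above can be sketched with scikit-learn, which handles the weight calculation and the sigmoid internally. The labeled data below is invented for illustration (1 = it rained, 0 = it did not; in this made-up data, the colder days are the rainy ones):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Hypothetical labeled data: [min_temp, max_temp] for each day
X = np.array([[40, 52], [42, 55], [38, 50], [48, 60], [50, 63],
              [41, 53], [49, 62], [39, 51], [51, 64], [47, 59]])
y = np.array([1, 1, 1, 0, 0, 1, 0, 1, 0, 0])  # 1 = rain, 0 = no rain

# Divide the data into training and testing parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Train the model: the weights are calculated from the training data
model = LogisticRegression().fit(X_train, y_train)

# Probability that each test point belongs to the positive (rain) class
probs = model.predict_proba(X_test)[:, 1]

# Probability > 0.5 -> positive class, otherwise negative
preds = (probs > 0.5).astype(int)
print(preds)
```

This is only a sketch under the assumptions above; real weather data would need far more points and features.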
For simplicity, some of the steps of the logistic regression have been omitted. For implementation, you can read the process given in the project Implement Logistic Regression in Python from Scratch.
Let's look at the differences between linear regression and logistic regression in the table below before discussing each of these points.
| | Linear Regression | Logistic Regression |
| --- | --- | --- |
| Function in action | Linear (straight line) | Sigmoid (S-shaped) |
| Output | A number | A probability |
| Typically used for | Predicting the output | Classifying the input |
| Requires correlation | Yes | No |
| Evaluation metrics | Correlation coefficient, root mean square error | Accuracy, precision, recall, F1 score |
In linear regression, a best-fit straight line is drawn, making sure that the distance between the available data points and the line is minimized. This straight line can be drawn in more than two dimensions, hence there is no restriction on the number of independent (input) variables. When there is more than one independent variable, such a regression is called multiple linear regression. This line is used for output prediction.
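As a sketch of multiple linear regression, here is a fit with two hypothetical input variables using scikit-learn. The data (say, minimum temperature and humidity predicting maximum temperature) is entirely made up:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: two input variables per day -> one output
X = np.array([[48, 80], [50, 70], [52, 78], [54, 66], [56, 72]])
y = np.array([58, 62, 61, 66, 67])

# With two inputs, the "line" becomes a plane; the API is unchanged
model = LinearRegression().fit(X, y)
predicted = model.predict(np.array([[53, 70]]))[0]
print(round(predicted, 1))
```

The same code works for any number of input columns, which is the point of the paragraph above.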
In logistic regression, an S-shaped logistic function – the sigmoid function – is used to convert the raw score into a probability. If the probability is more than half, the data more likely belongs to the positive class; a probability of less than half means otherwise.
In linear regression, the output is a real-valued number. This number can even be negative.
In logistic regression, after the probability is computed, the data is assigned a particular label (class with highest probability), which is the actual output of the logistic regression.
In linear regression, if the input variables are not correlated with the output variable, then the results will be inconclusive. A high correlation, whether positive or negative, indicates that the predicted output will be close to the actual value.
In logistic regression, it is not a good idea to use input variables that are highly correlated with each other. Using such variables splits the effect on the outcome (class) between them, making the individual weights unreliable.
In linear regression, finding the best-fit line involves reducing the error, or more technically, the root mean square error (RMSE). A lower RMSE means better prediction. Also, the correlation coefficient (a number between -1 and 1) indicates how well the line fits the data.
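A quick sketch of computing RMSE, using invented actual and predicted values:

```python
import numpy as np

actual = np.array([60, 62, 59, 65, 63])     # hypothetical true values
predicted = np.array([61, 61, 60, 64, 65])  # hypothetical model predictions

# Root mean square error: square the errors, average them, take the square root
rmse = np.sqrt(np.mean((actual - predicted) ** 2))
print(round(rmse, 3))
```

Squaring before averaging penalizes large misses more heavily than small ones, which is why RMSE is preferred over a plain average of errors.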
In logistic regression, accuracy, precision, recall, and F1 score are different metrics used to measure the effectiveness of the results. Each of them has its own applications and is used in different situations.
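These metrics are readily available in scikit-learn. A sketch with invented true and predicted labels:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # hypothetical model predictions

acc = accuracy_score(y_true, y_pred)    # fraction of correct predictions
prec = precision_score(y_true, y_pred)  # of predicted positives, how many were right
rec = recall_score(y_true, y_pred)      # of actual positives, how many were found
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall
print(acc, prec, rec, f1)
```

With one false positive and one false negative in this toy data, all four metrics happen to come out equal; on real data they usually differ, which is why each has its own use.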
Linear regression is used for predicting a number, e.g. stock forecasting, operational efficiency of machines, number of subscribers, etc.
Logistic regression is used for weather prediction, text analysis, etc.
If you want to learn how to implement linear regression or logistic regression, here are some useful resources.
For hands-on practice of linear regression and logistic regression, I recommend having a look at the following projects.
For learning some related concepts, and the applications of these topics, you can consult the courses below.
Or you can become a machine learning engineer by going through the following learning path.
Learn to Code: Become a Machine Learning Engineer
Start your journey to becoming a machine learning engineer by mastering the fundamentals of coding with Python. Learn machine learning techniques, data manipulation, and visualization. As you progress, you'll explore object-oriented programming and the machine learning process, gaining hands-on experience with machine learning algorithms and tools like scikit-learn. Tackle practical projects, including predicting auto insurance payments and customer segmentation using K-means clustering. Finally, explore deep learning models such as convolutional neural networks and apply your skills to an AI-powered image colorization project.
Free Resources