Univariate Linear Regression

Here, you’ll learn more about Regression and the concepts of Univariate Linear Regression.

Univariate Linear Regression

In Univariate Linear Regression, we have one independent variable $x$, which we use to predict a dependent variable $y$.

We will be using the Tips Dataset from Seaborn’s Datasets to illustrate theoretical concepts.



We will be using the following columns from the dataset for univariate analysis.

  • total_bill: the total bill for the food served.

  • tip: the tip given on that bill.


Goal of Univariate Linear Regression: The goal is to predict the tip given for a particular total_bill. The regression model constructs an equation to do so.

If we plot a scatter plot of the independent variable (total_bill) against the dependent variable (tip), we get the plot below.
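A minimal sketch that loads the dataset and reproduces this plot, assuming seaborn and matplotlib are installed:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Load the built-in Tips dataset that ships with Seaborn.
tips = sns.load_dataset("tips")

# Scatter plot of the independent variable (total_bill)
# against the dependent variable (tip).
plt.scatter(tips["total_bill"], tips["tip"])
plt.xlabel("total_bill")
plt.ylabel("tip")
plt.title("tip vs. total_bill")
plt.show()
```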



  • We can see that the points in the scatter plot are mostly spread along an upward diagonal.

  • This indicates a positive correlation between total_bill and tip, which will be useful for modeling. The sketch below checks this numerically.
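A one-line check of that correlation using pandas (the printed value is only expected to be clearly positive; the exact number depends on the dataset):

```python
import seaborn as sns

tips = sns.load_dataset("tips")

# Pearson correlation between total_bill and tip; prints a clearly positive value.
print(tips["total_bill"].corr(tips["tip"]))
```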


Working

The Univariate Linear Regression model fits the following straight-line equation.

$\hat{y} = w_0 + w_1 \cdot x$

Or

$tip\_predicted = w_0 + w_1 \cdot total\_bill$
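Expressed as a tiny Python sketch (the function name predict_tip is a hypothetical choice for this example):

```python
def predict_tip(w0: float, w1: float, total_bill: float) -> float:
    """Univariate linear regression hypothesis: w0 + w1 * total_bill."""
    return w0 + w1 * total_bill
```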

Goal: Find the values of the parameters $w_0$ and $w_1$ so that the predicted tip ($\hat{y}$) is as close to the actual tip ($y$) as possible. Mathematically, we can model the problem as seen below.

$J(w_0, w_1) = \frac{1}{2m}\sum_{i=1}^{m}(\hat{y}^i - y^i)^2$

  • $J(w_0, w_1)$ is the cost function, which the algorithm tries to minimize by finding the values of $w_0$ and $w_1$ that give the minimum value of the function above.

  • $y^i$ is the actual output value of training instance $i$, where $i = 1, 2, 3, \ldots$

  • $\hat{y}^i$ is the predicted output value of training instance $i$, where $i = 1, 2, 3, \ldots$

  • $\sum_{i=1}^{m}$ denotes the sum across all the training samples.

  • $\sum_{i=1}^{m}(\hat{y}^i - y^i)^2$ denotes the sum of the squared differences between the predicted and actual values across all the training instances.

  • $\frac{1}{2m}$ is a normalization term, where $m$ is the number of training instances.

Let’s understand how this formula works with the help of some hypothetical data.

Notation    Actual Value    Predicted Value    (Predicted − Actual)²
$y^1$       10.5            15                 20.25
$y^2$       15.5            17                 2.25
$y^3$       7.34            5                  5.48

Let’s suppose $w_0 = 10$ and $w_1 = 15$.

Here, $m = 3$.

$J(10, 15) = \frac{1}{2m}\sum_{i=1}^{m}(\hat{y}^i - y^i)^2 = \frac{1}{2(3)}(20.25 + 2.25 + 5.48) \approx 4.66$
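A minimal NumPy sketch of this cost function, checked against the hypothetical numbers above (the function name compute_cost is an assumption for this example):

```python
import numpy as np

def compute_cost(y_pred: np.ndarray, y_actual: np.ndarray) -> float:
    """J = (1 / 2m) * sum((y_pred - y_actual)^2)."""
    m = len(y_actual)
    return float(np.sum((y_pred - y_actual) ** 2) / (2 * m))

# Hypothetical values from the table above.
y_actual = np.array([10.5, 15.5, 7.34])
y_pred = np.array([15.0, 17.0, 5.0])

print(compute_cost(y_pred, y_actual))  # approximately 4.66
```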

Gradient descent

Gradient descent is an optimization algorithm that helps us find the optimal values of $w_0$ and $w_1$. It serves as the backbone of many machine learning algorithms. Let’s rephrase our goal.

Goal: Find the values of $w_0$ and $w_1$ that give us the minimum value of the cost function $J(w_0, w_1)$.

Gradient descent helps us find the optimal values. It is outlined below.

  • Start with initial values of $w_0$ and $w_1$.

  • Keep changing $w_0$ and $w_1$ until we reach the minimum of $J(w_0, w_1)$.

Intuition

Gradient descent works as follows:


Repeat until convergence {

$w_j = w_j - \alpha \frac{\partial}{\partial w_j} J(w_0, w_1)$

}


  • Here, $j = 0, 1$.
  • $\frac{\partial}{\partial w_j} J(w_0, w_1)$ is the partial derivative of the cost function with respect to $w_j$.
  • Both $w_0$ and $w_1$ are updated simultaneously, as seen below.

$temp_0 = w_0 - \alpha \frac{\partial}{\partial w_0} J(w_0, w_1)$

$temp_1 = w_1 - \alpha \frac{\partial}{\partial w_1} J(w_0, w_1)$

$w_0 = temp_0$

$w_1 = temp_1$

  • Here, $\alpha$ is the learning rate. Its value is chosen after some careful experimentation.

  • Setting $\alpha$ too small can lead to slow convergence of gradient descent.

  • Setting $\alpha$ too large can prevent gradient descent from converging to the optimal values of $w_0$ and $w_1$ that minimize the cost function.

There are Python libraries that help us choose a good value of $\alpha$. Because $\alpha$ is set before training rather than learned from the data, it is called a hyperparameter, and the process of searching for good values, using established techniques or guidance from research papers, is called hyperparameter optimization.
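To make the simultaneous update concrete, here is a small Python sketch; grad_j is a hypothetical placeholder for the partial-derivative computation, and the default learning rate is purely illustrative:

```python
def gradient_descent_step(w0, w1, grad_j, alpha=0.01):
    """One simultaneous update of w0 and w1.

    grad_j(j, w0, w1) is a hypothetical placeholder for the partial
    derivative of the cost function with respect to w_j.
    """
    # Both gradients are evaluated at the *current* (w0, w1) ...
    temp0 = w0 - alpha * grad_j(0, w0, w1)
    temp1 = w1 - alpha * grad_j(1, w0, w1)
    # ... and only then are the parameters overwritten.
    return temp0, temp1
```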

Gradient descent for Univariate Linear Regression

Now, we can apply gradient descent to Univariate Linear Regression by expanding the derivative terms, as seen below.


Cost function:

$J(w_0, w_1) = \frac{1}{2m}\sum_{i=1}^{m}(\hat{y}^i - y^i)^2$


Derivatives:

$\frac{\partial}{\partial w_0} J(w_0, w_1) = \frac{1}{m} \sum_{i=1}^{m}(\hat{y}^i - y^i)$

$\frac{\partial}{\partial w_1} J(w_0, w_1) = \frac{1}{m} \sum_{i=1}^{m}(\hat{y}^i - y^i) \cdot x^i$


Gradient descent:

Repeat until convergence {

$w_0 = w_0 - \alpha \frac{1}{m} \sum_{i=1}^{m}(\hat{y}^i - y^i)$

$w_1 = w_1 - \alpha \frac{1}{m} \sum_{i=1}^{m}(\hat{y}^i - y^i) \cdot x^i$

}
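Putting the pieces together, here is a minimal NumPy sketch of these update rules applied to the tips data. The learning rate and iteration count are illustrative choices, not tuned values; with the raw (unscaled) total_bill values, a small $\alpha$ is needed for the updates to converge.

```python
import numpy as np
import seaborn as sns

tips = sns.load_dataset("tips")
x = tips["total_bill"].to_numpy()
y = tips["tip"].to_numpy()
m = len(y)

w0, w1 = 0.0, 0.0   # initial parameter values
alpha = 0.0005      # learning rate (illustrative, not tuned)

# "Repeat until convergence" is approximated here with a fixed iteration budget.
for _ in range(100_000):
    y_hat = w0 + w1 * x                       # current predictions
    error = y_hat - y
    temp0 = w0 - alpha * np.mean(error)       # dJ/dw0 = (1/m) * sum(error)
    temp1 = w1 - alpha * np.mean(error * x)   # dJ/dw1 = (1/m) * sum(error * x)
    w0, w1 = temp0, temp1                     # simultaneous update

# The learned intercept and slope; they should be close to the
# ordinary least-squares fit for this dataset.
print(w0, w1)
```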


Versions of gradient descent

  • Batch Gradient Descent: Each step of gradient descent (i.e., updating the parameters $w_0, w_1$) uses the whole training dataset.

  • Stochastic Gradient Descent: Gradient descent takes a step (i.e., updates the parameters $w_0, w_1$) after every individual training instance, or sample. This is also called online or incremental learning, and because it processes one instance at a time, it also enables out-of-core learning.

  • Mini-Batch Gradient Descent: Gradient descent takes a step after each small, fixed-size subset of the training set; the size of this subset is called the batch size. The sketch below contrasts the three update loops.
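The difference between the three variants is only how much data each parameter update sees. A schematic sketch, where the helper update applies the update rules above to whatever slice of the data it is given:

```python
import numpy as np

def update(w0, w1, x_batch, y_batch, alpha=0.0005):
    """Apply the univariate update rules to one batch of data."""
    error = (w0 + w1 * x_batch) - y_batch
    return w0 - alpha * np.mean(error), w1 - alpha * np.mean(error * x_batch)

def batch_step(w0, w1, x, y):
    # Batch gradient descent: one update uses the whole training set.
    return update(w0, w1, x, y)

def stochastic_epoch(w0, w1, x, y):
    # Stochastic gradient descent: one update per training instance.
    for i in range(len(y)):
        w0, w1 = update(w0, w1, x[i:i + 1], y[i:i + 1])
    return w0, w1

def mini_batch_epoch(w0, w1, x, y, batch_size=32):
    # Mini-batch gradient descent: one update per batch of `batch_size` instances.
    for start in range(0, len(y), batch_size):
        w0, w1 = update(w0, w1, x[start:start + batch_size], y[start:start + batch_size])
    return w0, w1
```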

Conclusion

Once gradient descent has done its job of finding the parameters $w_0$ and $w_1$, we can place their values back into the equation and get the predicted tip for a particular total_bill.

$tip\_predicted = w_0 + w_1 \cdot total\_bill$

Let’s suppose gradient descent returns $w_0 = 10$ and $w_1 = 15$.

$tip\_predicted = 10 + 15 \cdot total\_bill$
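For example, with these (purely hypothetical) parameter values, a total_bill of 20 would give $tip\_predicted = 10 + 15 \cdot 20 = 310$.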
