Kernel Logistic Regression

Learn how to implement kernel logistic regression along with its derivation.


We can kernelize logistic regression just like other linear models by observing that the parameter vector $\bold w$ is a linear combination of the feature vectors (the columns of $\Phi(X)$), that is:

$$\bold w = \Phi(X)\, \bold a$$

Here, $\bold a$ is the dual parameter vector; in this case, the loss function depends on $\bold a$ rather than $\bold w$.
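Substituting $\bold w = \Phi(X)\bold a$ into the linear model's score makes the role of the kernel explicit (this short derivation follows directly from the definitions above):

$$z_i = \bold w^T \phi(\bold x_i) = \bold a^T \Phi(X)^T \phi(\bold x_i) = \sum_{j=1}^{n} a_j\, \phi(\bold x_j)^T \phi(\bold x_i) = \sum_{j=1}^{n} a_j\, k(\bold x_j, \bold x_i)$$

Here, $k(\bold x_j, \bold x_i) = \phi(\bold x_j)^T \phi(\bold x_i)$ denotes the kernel function, so predictions only ever require kernel evaluations between pairs of data points.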

Minimizing BCE Loss

To minimize the binary cross-entropy (BCE) loss, we need to find the model parameters $\bold a$ that result in the smallest loss value. The BCE loss is defined as:

$$\begin{align*} L_{BCE}(\bold{a}) &= \sum_{i=1}^n L_i\\ L_i &= -\left(y_i\log(\hat y_i)+(1-y_i)\log(1-\hat y_i)\right) \\ \hat{y}_i &= \sigma(z_i)=\frac{1}{1+e^{-z_i}} \\ z_i &= \bold a^T \Phi(X)^T\phi(\bold{x}_i) \end{align*}$$
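As a point of reference for the implementation later on, here is a minimal NumPy sketch of how this loss could be evaluated once the Gram matrix is available. The helper names `sigmoid` and `bce_loss`, and the `eps` clipping, are illustrative choices, not part of the derivation:

```python
import numpy as np

def sigmoid(z):
    # Logistic sigmoid; clipping z guards against overflow in exp.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def bce_loss(a, K, y, eps=1e-12):
    # a: (n,) dual parameters, K: (n, n) Gram matrix, y: (n,) labels in {0, 1}.
    z = K @ a                                   # z_i = a^T Phi(X)^T phi(x_i)
    y_hat = np.clip(sigmoid(z), eps, 1 - eps)   # clip to avoid log(0)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```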

To minimize $L_{BCE}$, we need to compute its gradient with respect to the parameters $\bold a$ as follows:

$$\begin{align*} \frac{\partial L_{BCE}}{\partial \bold{a}} &= \sum_{i=1}^{n}\frac{\partial L_i}{\partial \hat{y}_i}\, \frac{\partial \hat{y}_i}{\partial z_i}\, \frac{\partial z_i}{\partial \bold{a}}\\ &= -\sum_{i=1}^{n} \left[ y_i \frac{1}{\hat{y}_i} - (1 - y_i) \frac{1}{1 - \hat{y}_i} \right] \hat{y}_i (1 - \hat{y}_i)\, \Phi(X)^T\phi(\bold{x}_i)\\ &= -\sum_{i=1}^{n} \left[ y_i-\hat{y}_i \right] \Phi(X)^T\phi(\bold{x}_i)\\ &= \sum_{i=1}^{n} \left[ \hat{y}_i-y_i \right] \Phi(X)^T\phi(\bold{x}_i)\\ &= K(\hat{\bold y}-\bold y) \end{align*}$$
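In matrix form, the gradient is simply the Gram matrix times the residual vector. Continuing the sketch above (and reusing its hypothetical `sigmoid` helper), the gradient and a plain gradient-descent loop could look like this; the learning rate `lr` and iteration count are assumed hyperparameters:

```python
def bce_gradient(a, K, y):
    # Gradient of the BCE loss w.r.t. the dual parameters: K (y_hat - y).
    y_hat = sigmoid(K @ a)
    return K @ (y_hat - y)

def gradient_descent(K, y, lr=0.01, n_iter=1000):
    # Plain full-batch gradient descent on the dual parameter vector a.
    a = np.zeros(K.shape[0])
    for _ in range(n_iter):
        a -= lr * bce_gradient(a, K, y)
    return a
```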

Here, $K$ is the Gram matrix, a square matrix containing the dot products of all pairs of data points after mapping them to a higher-dimensional feature space using a kernel function:

$$K = \Phi(X)^T\Phi(X)$$

and

$$\Phi(X) = \begin{bmatrix}\phi(\bold x_1) & \phi(\bold x_2) & \dots & \phi(\bold x_n)\end{bmatrix}$$

Note: The entries of the Gram matrix are dot products in the feature space, computed using kernel functions without ever forming the feature vectors themselves, which makes the approach computationally efficient.
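For instance, with an RBF kernel the feature space is infinite-dimensional, so the feature vectors cannot be computed explicitly at all, yet the Gram matrix remains cheap to build. A minimal sketch follows; the RBF kernel and its `gamma` parameter are illustrative assumptions, and any valid kernel could be used instead:

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    # Matrix of pairwise RBF kernel values: K[i, j] = exp(-gamma * ||x_i - x_j||^2).
    # X1: (n, d), X2: (m, d) -> K: (n, m); no explicit feature map is ever formed.
    sq_dists = (
        np.sum(X1**2, axis=1)[:, None]
        + np.sum(X2**2, axis=1)[None, :]
        - 2 * X1 @ X2.T
    )
    return np.exp(-gamma * sq_dists)
```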

Implementation

The code below implements kernel logistic regression for binary classification:
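The interactive code widget is not reproduced here; the following is a self-contained sketch of one way such an implementation could look, combining the pieces derived above. The class name, the RBF kernel, the hyperparameters, and the synthetic dataset are illustrative assumptions rather than the course's exact code:

```python
import numpy as np

def sigmoid(z):
    # Logistic sigmoid; clipping z guards against overflow in exp.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def rbf_kernel(X1, X2, gamma=1.0):
    # K[i, j] = exp(-gamma * ||x_i - x_j||^2), computed without explicit feature maps.
    sq_dists = (
        np.sum(X1**2, axis=1)[:, None]
        + np.sum(X2**2, axis=1)[None, :]
        - 2 * X1 @ X2.T
    )
    return np.exp(-gamma * sq_dists)

class KernelLogisticRegression:
    """Kernel logistic regression trained by full-batch gradient descent on the dual parameters."""

    def __init__(self, gamma=1.0, lr=0.01, n_iter=2000):
        self.gamma = gamma      # RBF kernel width (assumed kernel choice)
        self.lr = lr            # learning rate for gradient descent
        self.n_iter = n_iter    # number of gradient-descent iterations

    def fit(self, X, y):
        self.X_train = X
        K = rbf_kernel(X, X, self.gamma)              # Gram matrix K = Phi(X)^T Phi(X)
        self.a = np.zeros(X.shape[0])                 # dual parameter vector a
        for _ in range(self.n_iter):
            y_hat = sigmoid(K @ self.a)               # predicted probabilities on training data
            self.a -= self.lr * (K @ (y_hat - y))     # gradient step: grad = K (y_hat - y)
        return self

    def predict_proba(self, X):
        # z = a^T Phi(X_train)^T phi(x), evaluated via kernels against the training points.
        K_test = rbf_kernel(X, self.X_train, self.gamma)
        return sigmoid(K_test @ self.a)

    def predict(self, X):
        return (self.predict_proba(X) >= 0.5).astype(int)

# Illustrative usage on a small synthetic dataset (assumed, not from the course).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)   # labels that are not linearly separable
    model = KernelLogisticRegression(gamma=1.0, lr=0.01, n_iter=2000).fit(X, y)
    print("Training accuracy:", np.mean(model.predict(X) == y))
```

At prediction time, the score for a new point only requires kernel evaluations against the training points, mirroring $z = \bold a^T\Phi(X)^T\phi(\bold x)$ from the derivation above.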
