Single target example
It’s possible to reformulate generalized linear regression to incorporate the kernel trick. For example, the loss function L(\bold w) for generalized linear regression with a single target is as follows:
L(\bold w)= \|\phi(X) \bold w-\bold y\|_2^2 + \lambda \|\bold w\|_2^2
Note:
\bold w^T\bold w = \|\bold w\|_2^2
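As a quick illustration, the sketch below evaluates this loss with NumPy. The feature map \phi(x) = [1, x, x^2], the toy data, and the value of \lambda are illustrative assumptions, not part of the original derivation.

```python
import numpy as np

def ridge_loss(Phi, y, w, lam):
    """L(w) = ||Phi w - y||^2 + lam * ||w||^2 for a single target."""
    residual = Phi @ w - y
    return residual @ residual + lam * (w @ w)

# Toy data and an illustrative feature map phi(x) = [1, x, x^2].
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=10)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=10)
Phi = np.stack([np.ones_like(x), x, x**2], axis=1)  # shape (n, 3)

w = rng.normal(size=3)
print(ridge_loss(Phi, y, w, lam=0.1))
```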
Setting the derivative of the loss with respect to \bold w to \bold 0 results in the following:
\begin{align*}
& \phi(X)^T(\phi(X)\bold w-\bold y)+\lambda \bold w = \bold 0 \\
& \bold w = -\frac{1}{\lambda}\phi(X)^T(\phi(X)\bold w-\bold y) \\
& \bold w = \phi(X)^T\bold a
\end{align*}
Here, \bold a=-\frac{1}{\lambda}(\phi(X)\bold w-\bold y).
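This identity can be checked numerically: solve the ridge problem for \bold w directly, form \bold a from its definition, and confirm that \phi(X)^T\bold a recovers \bold w. A minimal sketch, assuming the same toy data and feature map as above:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=10)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=10)
Phi = np.stack([np.ones_like(x), x, x**2], axis=1)  # phi(X), shape (n, d)
lam = 0.1

# Primal ridge solution: (Phi^T Phi + lam I) w = Phi^T y
d = Phi.shape[1]
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

# a = -(1/lam) * (Phi w - y); then phi(X)^T a should equal w
a = -(Phi @ w - y) / lam
print(np.allclose(Phi.T @ a, w))  # True
```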
Reparameterization
We can now parametrize the loss function with the parameter vector \bold a by replacing \bold w with \phi(X)^T\bold a, as follows:
\begin{align*}
L(\bold a)&= \|\phi(X) \phi(X)^T\bold a-\bold y\|_2^2 + \lambda \|\phi(X)^T\bold a\|_2^2 \\
&= \|\phi(X) \phi(X)^T\bold a-\bold y\|_2^2 + \lambda \bold a^T \phi(X)\phi(X)^T\bold a \\
&= \|K\bold a-\bold y\|_2^2 + \lambda \bold a^T K\bold a
\end{align*}
Here, K = \phi(X)\phi(X)^T is the n \times n Gram matrix whose entries are K_{ij} = \phi(\bold x_i)^T\phi(\bold x_j) = k(\bold x_i, \bold x_j).
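To make the notation concrete, the sketch below forms the Gram matrix K = \phi(X)\phi(X)^T and checks that L(\bold a) matches the original loss when \bold w = \phi(X)^T\bold a. The feature map and data are the same illustrative choices used earlier:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=10)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=10)
Phi = np.stack([np.ones_like(x), x, x**2], axis=1)
lam = 0.1

K = Phi @ Phi.T              # Gram matrix, K_ij = phi(x_i)^T phi(x_j)

a = rng.normal(size=len(x))  # any dual parameter vector
w = Phi.T @ a                # corresponding primal weights

primal = np.sum((Phi @ w - y) ** 2) + lam * (w @ w)
dual = np.sum((K @ a - y) ** 2) + lam * (a @ K @ a)
print(np.allclose(primal, dual))  # True
```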
Closed-form solution
Setting the derivative of the loss L(\bold a) with respect to \bold a to \bold 0 results in the following:
K^T(K\bold a - \bold y)+\lambda K \bold a = \bold 0
As the Gram matrix K is symmetric, that is, K^T=K, the above equation can be written as follows:
\begin{align*}
& K(K\bold a - \bold y)+\lambda K \bold a = \bold 0 \\
& K(K\bold a - \bold y + \lambda \bold a) = \bold 0 \\
& (K + \lambda I)\bold a = \bold y \\
& \bold a = (K + \lambda I)^{-1} \bold y
\end{align*}
In the third step, the factor K is dropped: setting the term in parentheses to \bold 0 always satisfies the equation, and it is the unique solution whenever K is invertible. Moreover, K + \lambda I is invertible for any \lambda > 0 because K is positive semi-definite.
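The closed-form solution translates into a single linear solve. Below is a minimal sketch; the polynomial kernel k(x, z) = (1 + xz)^2, the toy data, and the regularization strength are illustrative assumptions:

```python
import numpy as np

def fit_dual(K, y, lam):
    """Dual coefficients a = (K + lam*I)^(-1) y, computed via a linear solve."""
    n = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(n), y)

# Toy 1-D data with an illustrative polynomial kernel k(x, z) = (1 + x z)^2.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=10)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=10)
K = (1.0 + np.outer(x, x)) ** 2
a = fit_dual(K, y, lam=0.1)
```

Using a linear solve rather than forming the explicit inverse is numerically preferable; since K + \lambda I is symmetric positive definite for \lambda > 0, a Cholesky-based solver would also work.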
Prediction
Once \bold a is computed, the prediction \hat y_t on an input vector \bold x_t can be made as follows:
\begin{align*}
\hat y_t &= \bold w^T \phi(\bold x_t)\\
&= \bold a^T \phi(X) \phi(\bold x_t) \\
&= \begin{bmatrix}a_1 & a_2 & \dots & a_n\end{bmatrix} \begin{bmatrix}\phi(\bold x_1)^T\phi(\bold x_t) \\ \phi(\bold x_2)^T\phi(\bold x_t) \\ \vdots \\ \phi(\bold x_n)^T\phi(\bold x_t)\end{bmatrix} \\
&= \begin{bmatrix}a_1 & a_2 & \dots & a_n\end{bmatrix} \begin{bmatrix}k(\bold x_1,\bold x_t) \\ k(\bold x_2,\bold x_t) \\ \vdots \\ k(\bold x_n,\bold x_t)\end{bmatrix}
\end{align*}
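In code, the prediction is a dot product between \bold a and the vector of kernel evaluations against the training points. A minimal sketch, again using the illustrative polynomial kernel (the function names here are assumptions for this example):

```python
import numpy as np

def poly_kernel(x, z):
    """Illustrative kernel choice: k(x, z) = (1 + x z)^2."""
    return (1.0 + x * z) ** 2

def predict(a, X_train, x_t, kernel=poly_kernel):
    """y_hat_t = sum_i a_i * k(x_i, x_t)."""
    k_t = np.array([kernel(x_i, x_t) for x_i in X_train])
    return a @ k_t

# Usage with a and x from the previous snippet:
# y_hat = predict(a, x, 0.3)
```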
Implementation
We now implement generalized linear regression for a single target using the kernel trick.
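Below is a minimal end-to-end sketch. The RBF kernel, the toy data, and the names (`KernelRidge`, `fit`, `predict`, `gamma`) are illustrative choices for this example, not a reference implementation:

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    """k(x, z) = exp(-gamma * ||x - z||^2), evaluated for all pairs of rows."""
    sq_dists = (
        np.sum(X1**2, axis=1)[:, None]
        + np.sum(X2**2, axis=1)[None, :]
        - 2.0 * X1 @ X2.T
    )
    return np.exp(-gamma * sq_dists)

class KernelRidge:
    """Generalized linear regression with the kernel trick, single target."""

    def __init__(self, lam=0.1, gamma=1.0):
        self.lam = lam
        self.gamma = gamma

    def fit(self, X, y):
        self.X_train = X
        K = rbf_kernel(X, X, self.gamma)
        # a = (K + lam I)^(-1) y, via a linear solve
        self.a = np.linalg.solve(K + self.lam * np.eye(len(X)), y)
        return self

    def predict(self, X_new):
        # y_hat = K(X_new, X_train) a
        return rbf_kernel(X_new, self.X_train, self.gamma) @ self.a

# Toy usage on noisy sine data.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)

model = KernelRidge(lam=0.1, gamma=0.5).fit(X, y)
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
print(model.predict(X_test))
```

Note that the training data must be kept around at prediction time, since the prediction is a weighted sum of kernel evaluations against every training point.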