What is KL divergence?

Introduction

In many real-world applications, we need a way to compare probability distributions. Ordinary distance metrics are not well suited to this task, so we need a different kind of measure.

Divergence measures are typically used for this purpose, and the Kullback-Leibler (KL) divergence is the most commonly used one.

KL divergence is a way of measuring the deviation between two probability distributions. In the case of discrete distributions, KL is defined as:

D_{KL}(p\|q) = \sum_{k=1}^{K}\left(p_k\log\left(\frac{p_k}{q_k}\right) - p_k + q_k\right)

And for continuous distributions:

D_{KL}(p\|q) = \int \left(p(x)\log\left(\frac{p(x)}{q(x)}\right) - p(x) + q(x)\right)dx

where $D_{KL}(p\|q) \ge 0$, and $D_{KL}(p\|q) = 0$ only when $p = q$.

The symbol || denotes that this is a divergence.
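
For a quick worked example (with two arbitrarily chosen two-point distributions), take $p = (0.5, 0.5)$ and $q = (0.75, 0.25)$. Both are normalized, so the $-p_k + q_k$ terms cancel, and using the natural logarithm:

D_{KL}(p\|q) = 0.5\log\left(\frac{0.5}{0.75}\right) + 0.5\log\left(\frac{0.5}{0.25}\right) = 0.5\log\left(\frac{4}{3}\right) \approx 0.144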

Note: Some authors use the term relative entropy for this quantity, but here we'll follow the convention used in Convex Optimization by Stephen Boyd and Lieven Vandenberghe, Cambridge University Press, 2004.

Difference between divergence and metric

Even though KL divergence gives us a way to measure how far apart two probability distributions are, it is not a metric. The differences between divergences and metrics are:

  • Metrics are symmetric, whereas divergences are, in general, asymmetric; the short check after this list illustrates this for KL divergence.
  • Metrics satisfy the triangle inequality, that is, $d(x,y)\leq d(x,z) + d(z,y)$, where $d$ is the distance function; divergences need not satisfy it.
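
As a minimal sketch of the asymmetry (the two distributions below are made up for illustration), we can compute the KL divergence in both directions and see that the results differ:

import numpy as np

p = np.asarray([0.6, 0.3, 0.1])  # made-up "true" distribution
q = np.asarray([0.2, 0.5, 0.3])  # made-up approximating distribution

# Generalized KL divergence in each direction: sum of p_k*log(p_k/q_k) - p_k + q_k
d_pq = np.sum(p * np.log(p / q) - p + q)
d_qp = np.sum(q * np.log(q / p) - q + p)

print(d_pq, d_qp)  # the two values differ, so KL divergence is not symmetric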

Applications

Data compression

When compressing a data file such as a .txt file, short codes are assigned to the most frequently appearing symbols. Working with a well-known probability distribution that closely matches the data makes this much more manageable, and KL divergence is a good way to compare the true distribution of the data with candidate well-known distributions.
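
As a rough illustration (the symbol frequencies below are invented), a base-2 KL divergence estimates how many extra bits per symbol we would pay, on average, by designing codes for a candidate distribution q instead of the true distribution p:

import numpy as np

p = np.asarray([0.5, 0.25, 0.125, 0.125])  # invented true symbol frequencies
q = np.asarray([0.25, 0.25, 0.25, 0.25])   # candidate well-known (uniform) model

# Base-2 KL divergence; both distributions are normalized, so the -p + q terms cancel
extra_bits = np.sum(p * np.log2(p / q))
print(extra_bits)  # 0.25 extra bits per symbol, on average, when coding with q instead of p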

Variational inference

Variational inference recasts inference as an optimization problem: we use KL divergence to measure how close a tractable approximating distribution is to an intractable target distribution, and then choose the approximation that minimizes this divergence.
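
A minimal sketch of this idea (with a made-up discrete target and a hypothetical one-parameter approximating family) is to search for the family member that minimizes the KL divergence to the target:

import numpy as np

target = np.asarray([0.1, 0.2, 0.4, 0.3])  # made-up target distribution

def geometric_like(r):
    # Hypothetical one-parameter family over four outcomes, normalized to sum to 1
    w = r ** np.arange(4)
    return w / w.sum()

# Grid search for the parameter whose distribution is closest to the target in KL(q||target)
candidates = np.linspace(0.1, 3.0, 300)
best_r = min(candidates,
             key=lambda r: np.sum(geometric_like(r) * np.log(geometric_like(r) / target)))
print(best_r, geometric_like(best_r))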

Variational autoencoders

The above-mentioned use of KL divergence makes it a natural loss term in variational autoencoders, where we need to keep the learned latent distribution close to a chosen prior (typically a standard normal). The KL term is added to the reconstruction loss, as sketched below.
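
As a minimal sketch (assuming, as is common, a diagonal Gaussian encoder output and a standard normal prior, with made-up numbers), the KL term added to a VAE's reconstruction loss has a closed form:

import numpy as np

# Hypothetical encoder outputs for one sample: mean and log-variance of the latent distribution
mu = np.asarray([0.3, -0.8, 0.1])
log_var = np.asarray([-0.2, 0.5, 0.0])

# Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )
kl_term = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
print(kl_term)  # this term is added to the reconstruction loss when training a VAE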

Implementation

We'll implement KL divergence using NumPy in the code below:

import numpy as np

def kl_divergence(p, q):
    """
    Parameters:
    p: true distribution
    q: known distribution
    ---------------------
    Returns:
    KL divergence of p and q distributions.
    """
    # Where p_k is 0, the p_k*log(p_k/q_k) term is treated as 0, leaving the q_k term
    return np.sum(np.where(p != 0, p * np.log(p / q) - p + q, q))

p = np.asarray([0.4629, 0.2515, 0.9685])
q = np.asarray([0.1282, 0.8687, 0.4996])

print("KL divergence between p and q is: ", kl_divergence(p, q))  # this should return 0.7372678653853546
print("KL divergence between p and p is: ", kl_divergence(p, p))  # this should return 0
Implementation of KL divergence in Python

Code explanation

  • The return statement in kl_divergence implements the following equation of KL divergence:
D_{KL}(p\|q) = \sum_{k=1}^{K}\left(p_k\log\left(\frac{p_k}{q_k}\right) - p_k + q_k\right)

Note: The KL divergence between identical probability distributions is 0.

