The Variational Objective

Learn about the elements that allow us to create effective encodings to sample new images from a space of random numerical vectors.

Let’s examine how to optimally compress information in numerical vectors using neural networks. To do so, each element of the vector should encode distinct information from the others, a property we can achieve using a variational objective. This variational objective is the building block for creating VAE networks.

Creating efficient encodings

Let’s start by quantifying more rigorously what makes such an encoding “good” and allows us to recreate images well. We'll need to maximize the posterior probability of the latent code z given the data x, which Bayes' rule expresses as:

p(z|x) = \frac{p(x|z)\,p(z)}{p(x)}

A problem occurs when the data x is extremely high dimensional. As you saw, this can occur even in simple data such as binary MNIST digits, where we have 2^{\text{number of pixels}} possible configurations to integrate over (in the mathematical sense of integrating over a probability distribution) to get a measure of the probability of an individual image. In other words, the density p(x) is intractable, making the posterior p(z|x), which depends on p(x), likewise intractable.
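To get a sense of the scale, here is a minimal back-of-the-envelope sketch, assuming the standard 28×28 MNIST resolution, of how many binary configurations we would have to sum over:

```python
# Number of possible binary images at MNIST's 28x28 resolution
num_pixels = 28 * 28
num_configurations = 2 ** num_pixels  # exact integer arithmetic in Python

# Roughly 10^236 configurations -- far too many to enumerate directly
print(f"2^{num_pixels} has {len(str(num_configurations))} digits")
```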

In some cases, such as models with binary units, we can compute an approximation like contrastive divergence (CD), which allows us to compute a gradient even when we can’t calculate a closed form. However, this becomes challenging for very large datasets, where we would need to make many passes over the data to compute an average gradient using CD (Kingma, Diederik, and Max Welling. 2014. “Auto-Encoding Variational Bayes.” https://arxiv.org/pdf/1312.6114.pdf).

If we can’t calculate the distribution of our encoder p(z|x) directly, maybe we could optimize an approximation that is close enough; let’s call this q(z|x). Then, we need a measure to determine if the two distributions are close. One useful measure of closeness is whether they encode similar information; we can quantify information using the Shannon information equation:

I(p(x)) = -\log(p(x))

Consider why this is a good measure: as p(x) decreases, an event becomes rarer; therefore, observation of the event communicates more information about the system or dataset, leading to a larger positive value of -\log(p(x)). Conversely, as the probability of an event nears 1, that event encodes less information about the dataset, and the value of -\log(p(x)) approaches 0:

[Figure: Shannon information]
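As a quick numerical illustration of this behavior, here is a minimal sketch that evaluates -\log(p(x)) for a few arbitrary example probabilities (chosen for illustration, not taken from the lesson):

```python
import numpy as np

# Shannon information of an event with probability p: I(p) = -log(p)
def shannon_information(p):
    return -np.log(p)

# Rare events carry more information; near-certain events carry almost none
for p in [0.001, 0.1, 0.5, 0.9, 0.999]:
    print(f"p(x) = {p:>5}: -log(p(x)) = {shannon_information(p):.4f}")
```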

Therefore, if we want to measure the difference between the information encoded in two distributions, p and q, we can use the difference in their information:

\log(p(x)) - \log(q(x)) = \log\left(\frac{p(x)}{q(x)}\right)

Finally, if we want to find the expected difference in information between the distributions for all elements of x, we can take the average over p(x):

\mathbb{E}_{p(x)}\left[\log\left(\frac{p(x)}{q(x)}\right)\right] = \int p(x)\log\left(\frac{p(x)}{q(x)}\right)dx

This quantity is known as the Kullback-Leibler (KL) divergence. It has a few interesting properties, illustrated numerically in the sketch after this list:

  • It is not symmetric:  ...
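As a minimal sketch, using two arbitrary discrete distributions chosen purely for illustration (not taken from the lesson), we can compute the KL divergence directly from its definition and confirm the asymmetry noted above:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) = sum over x of p(x) * log(p(x) / q(x)) for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

# Two example distributions over the same three outcomes
p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

print(f"KL(p || q) = {kl_divergence(p, q):.4f}")  # ~0.1838
print(f"KL(q || p) = {kl_divergence(q, p):.4f}")  # ~0.1920 -- not equal, hence asymmetric
```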