Variational Autoencoder: Theory
Dive into the mathematics behind variational autoencoders.
In simple terms, a variational autoencoder is a probabilistic version of an autoencoder.
Why?
Because we want to be able to sample from the latent vector ($z$) space to generate new data, which is not possible with vanilla autoencoders.
Each latent variable $z$ that is generated from the input $x$ will now represent a probability distribution (what we call the posterior distribution, denoted as $p(z|x)$).
All we need to do is find the posterior $p(z|x)$ or, in other words, solve the inference problem.
In fact, the encoder will try to approximate the posterior by computing another distribution $q(z|x)$, known as the variational posterior.
Note that a probability distribution is fully characterized by its parameters. In the case of the Gaussian, these are the mean $\mu$ and the standard deviation $\sigma$.
So it is enough to pass the parameters (the mean $\mu$ and the standard deviation $\sigma$) of the normal probability distribution, denoted as $\mathcal{N}(\mu, \sigma^2)$, to the decoder instead of simply passing the latent vector $z$ as in the simple autoencoder.
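To make the idea of "passing distribution parameters" concrete, here is a minimal sketch of an encoder with two output heads. The class name `GaussianEncoder`, the layer sizes, and the choice to predict the log-variance instead of the standard deviation directly are all assumptions made for illustration; they are not taken from the lesson.

```python
import torch
from torch import nn


class GaussianEncoder(nn.Module):
    """Hypothetical encoder that maps an input to the parameters of a diagonal Gaussian."""

    def __init__(self, input_dim: int = 784, hidden_dim: int = 256, latent_dim: int = 2):
        super().__init__()
        self.hidden = nn.Linear(input_dim, hidden_dim)
        self.mu_head = nn.Linear(hidden_dim, latent_dim)
        self.log_var_head = nn.Linear(hidden_dim, latent_dim)

    def forward(self, x: torch.Tensor):
        h = torch.relu(self.hidden(x))
        mu = self.mu_head(h)
        # Predicting the log-variance keeps sigma strictly positive after exponentiation.
        sigma = torch.exp(0.5 * self.log_var_head(h))
        return mu, sigma


encoder = GaussianEncoder()
x = torch.rand(4, 784)  # a dummy batch of 4 flattened inputs
mu, sigma = encoder(x)
print(mu.shape, sigma.shape)
```

Each input in the batch now yields one mean and one standard deviation per latent dimension, rather than a single fixed latent vector.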
Then, the decoder will receive the distribution parameters and try to reconstruct the input $x$. However, this picture is incomplete: the decoder actually needs a latent vector $z$ sampled from that distribution, and you cannot compute gradients through a stochastic operation. In other words, you cannot backpropagate through a sampling operation. Overcoming this is at the heart of training variational autoencoders.
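The problem can be observed directly in PyTorch. The short sketch below uses illustrative values for the mean and standard deviation, draws a sample with `torch.distributions.Normal.sample()`, and inspects whether the result is still connected to the autograd graph.

```python
import torch
from torch.distributions import Normal

# Illustrative parameters that we would like to learn by gradient descent.
mu = torch.zeros(3, requires_grad=True)
sigma = torch.ones(3, requires_grad=True)

# sample() draws z without recording the operation in the autograd graph.
z = Normal(mu, sigma).sample()

print(z.requires_grad)  # False
print(z.grad_fn)        # None
```

Because `z` carries no gradient history, any loss computed from it cannot send gradients back to `mu` and `sigma`.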
Let’s see how we can make it possible. (Hint: Check the reparameterization trick section below.)
Train a variational autoencoder
First things first.
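Since our goal is for the variational posterior $q(z|x)$ to be as close as possible to the true posterior $p(z|x)$, the following objective is used to train the model. It is maximized, so in practice its negative is minimized as the loss.

$$\mathcal{L}(x) = \mathbb{E}_{q(z|x)}\left[\log p(x|z)\right] - \mathrm{KL}\left(q(z|x) \,\|\, p(z)\right)$$

It is known in the literature as the evidence lower bound (ELBO), and its derivation requires some fairly involved math.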
- The first term controls how well the VAE reconstructs a data point from a sample of the variational posterior, and it is known as negative reconstruction error.
- The second term controls how close the variational posterior is to the prior $p(z)$.
$\mathbb{E}$ denotes the expected value, or expectation. The expectation of a random variable $X$ is a generalization of the weighted average and can be thought of as the arithmetic mean of a large number of independent samples of $X$.
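As a quick illustration of this "mean of many samples" view, the sketch below compares the exact expectation of a fair six-sided die with a sample average. The die is just a convenient example chosen here, not something from the lesson.

```python
import random

random.seed(0)

# X is a fair six-sided die, so E[X] = (1 + 2 + ... + 6) / 6 = 3.5.
exact = sum(range(1, 7)) / 6

# Approximate the expectation with the arithmetic mean of many samples.
samples = [random.randint(1, 6) for _ in range(100_000)]
estimate = sum(samples) / len(samples)

print(exact, estimate)
```

The estimate approaches the exact value as the number of samples grows, which is the same idea used when the expectation in the ELBO is approximated with samples from the variational posterior.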
KL refers to Kullback–Leibler divergence and, in simple terms, is a measure of how different a probability distribution is from a second one.
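For intuition, here is a small sketch of the KL divergence between two discrete distributions, using the standard definition $\mathrm{KL}(p \,\|\, q) = \sum_i p_i \log (p_i / q_i)$. The helper name `kl_divergence` and the example probabilities are assumptions chosen for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.5]
q = [0.9, 0.1]

print(kl_divergence(p, p))  # 0.0, identical distributions
print(kl_divergence(p, q))  # positive, the distributions differ
print(kl_divergence(q, p))  # a different value, KL is not symmetric
```

Note that the divergence is zero only when the two distributions match, and that swapping the arguments changes the result, so it is not a distance in the strict sense.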
In practice, we use closed-form analytical expressions to compute the ELBO:
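The reconstruction term can be shown to equal $\log p(x|z) = \sum_{j} \left[ x_j \log \hat{x}_j + (1 - x_j) \log(1 - \hat{x}_j) \right]$ when the data points are binary (that is, they follow a Bernoulli distribution), where $\hat{x}_j$ is the decoder output for component $j$. This is simply the negative of the binary cross-entropy, which can be implemented in PyTorch using torch.nn.BCELoss(reduction='sum').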
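The relationship between the Bernoulli log-likelihood and `torch.nn.BCELoss(reduction='sum')` can be checked numerically. In the sketch below, `x` and `x_hat` are random stand-ins for a batch of binary data and the corresponding decoder outputs; no real model is involved.

```python
import torch
from torch import nn

torch.manual_seed(0)

x = torch.randint(0, 2, (4, 6)).float()     # binary "data points"
x_hat = torch.rand(4, 6).clamp(0.01, 0.99)  # stand-in for decoder outputs in (0, 1)

# Bernoulli log-likelihood written out term by term.
log_likelihood = (x * torch.log(x_hat) + (1 - x) * torch.log(1 - x_hat)).sum()

# BCELoss takes (prediction, target) and returns the negative of that sum.
bce = nn.BCELoss(reduction='sum')(x_hat, x)

print(torch.allclose(bce, -log_likelihood, atol=1e-5))
```

Minimizing the summed binary cross-entropy is therefore the same as maximizing the reconstruction term under the Bernoulli assumption.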
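The KL divergence also has a closed form if we assume that the prior is a standard Gaussian and the variational posterior is a Gaussian with diagonal covariance. It can be written as $\mathrm{KL}\left(q(z|x) \,\|\, p(z)\right) = -\frac{1}{2} \sum_{j=1}^{J} \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$, where $\mu_j$ is the mean and $\sigma_j^2$ is the variance of the $j$-th latent dimension.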
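The closed form for a diagonal Gaussian posterior and a standard normal prior can be sanity-checked against PyTorch's own `torch.distributions.kl_divergence`. The tensors `mu` and `log_var` below are random placeholders for encoder outputs, and parameterizing by the log-variance is an assumption made here for numerical convenience.

```python
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)

mu = torch.randn(4, 2)       # placeholder encoder means
log_var = torch.randn(4, 2)  # placeholder encoder log-variances
sigma = torch.exp(0.5 * log_var)

# Closed form: -1/2 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
kl_closed = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())

# Reference value from PyTorch's built-in Gaussian-to-Gaussian KL.
prior = Normal(torch.zeros_like(mu), torch.ones_like(sigma))
kl_reference = kl_divergence(Normal(mu, sigma), prior).sum()

print(torch.allclose(kl_closed, kl_reference, atol=1e-5))
```

Both values are summed over the batch and the latent dimensions, which matches the `reduction='sum'` convention used for the reconstruction term.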
Given that, can you try and implement ELBO from scratch?