Introduction to Deep Convolutional GANs
Get a brief overview of deep convolutional GANs.
In this chapter, we will introduce a classic GAN model called the Deep Convolutional Generative Adversarial Network (DCGAN) to generate 2D images. DCGAN is one of the earliest well-performing and stable approaches to generating images with adversarial training.
Recall that even when we only trained a GAN to manipulate 1D data, we had to use multiple techniques to ensure stable training. A lot of things can go wrong in the training of GANs. For example, the generator or the discriminator can overfit when the two networks fail to converge together. Sometimes, the generator generates only a handful of sample varieties; this is called mode collapse. The following is an example of mode collapse, where we train a GAN on some popular meme images in China called Baozou.
We can see that our GAN is only capable of generating one or two memes at a time. Problems that commonly occur in other machine learning algorithms, such as vanishing/exploding gradients and underfitting, also appear in the training of GANs. Therefore, simply replacing 1D data with 2D images does not guarantee successful training:
To ensure the stable training of GANs on image data like this, a DCGAN uses three techniques:
Getting rid of fully connected layers and only using convolution layers
Using strided convolution layers to perform downsampling instead of using pooling layers (see the sketch after this list)
Using ReLU/LeakyReLU activation functions instead of tanh between hidden layers
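To make the second guideline concrete, here is a minimal sketch (not from the original text) comparing downsampling with a pooling layer versus a stride-2 convolution in PyTorch; the channel count and input size are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

# Both operations halve the spatial size of a feature map,
# but the strided convolution has learnable weights while
# max pooling is a fixed operation.
x = torch.randn(1, 16, 32, 32)

pooled = nn.MaxPool2d(kernel_size=2)(x)                             # fixed
strided = nn.Conv2d(16, 16, kernel_size=4, stride=2, padding=1)(x)  # learnable

print(pooled.shape, strided.shape)  # both torch.Size([1, 16, 16, 16])
```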
In this section, we will introduce the architectures of the generator and discriminator of the DCGAN and learn how to generate images with it.
The architecture of a generator
The generator network of a DCGAN contains 4 hidden layers (we treat the input layer as the 1st hidden layer for simplicity) and 1 output layer. Transposed convolution layers are used in the hidden layers and are followed by batch normalization layers and ReLU activation functions. The output layer is also a transposed convolution layer, with tanh as its activation function. The architecture of the generator is shown in the following diagram:
The 2nd, 3rd, and 4th hidden layers and the output layer have a stride value of 2. The 1st layer has a padding value of 0, and the other layers have a padding value of 1. As the size of the feature maps doubles in deeper layers, the number of channels is halved. This is a common convention in the architecture design of neural networks. All kernel sizes of the transposed convolution layers are set to 4 × 4: for an n × n input, a transposed convolution with a 4 × 4 kernel, a stride of 2, and a padding of 1 produces an output of size (n − 1) × 2 − 2 × 1 + 4 = 2n, exactly double the input.
The transposed convolution layer can be considered the reverse process of a normal convolution. It was once called a deconvolution layer by some, which is misleading because a transposed convolution is not the inverse of a convolution. Most convolution layers are not invertible because they are many-to-one mappings: many different inputs can produce the same output, so the original input cannot be recovered exactly.
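Putting the pieces together, the following is a minimal PyTorch sketch of a generator following the architecture described above. The latent dimension (100), the channel counts (512 down to 64), and the 3-channel 64 × 64 output are assumptions for illustration, not values fixed by the text:

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim=100, out_channels=3):  # assumed values
        super().__init__()
        self.net = nn.Sequential(
            # 1st hidden layer: 1x1 -> 4x4 (stride 1, padding 0)
            nn.ConvTranspose2d(latent_dim, 512, 4, 1, 0, bias=False),
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True),
            # 2nd hidden layer: 4x4 -> 8x8 (stride 2, padding 1)
            nn.ConvTranspose2d(512, 256, 4, 2, 1, bias=False),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            # 3rd hidden layer: 8x8 -> 16x16
            nn.ConvTranspose2d(256, 128, 4, 2, 1, bias=False),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            # 4th hidden layer: 16x16 -> 32x32
            nn.ConvTranspose2d(128, 64, 4, 2, 1, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            # output layer: 32x32 -> 64x64, tanh maps pixels to [-1, 1]
            nn.ConvTranspose2d(64, out_channels, 4, 2, 1, bias=False),
            nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z)

# Latent vectors are shaped (batch, latent_dim, 1, 1).
g = Generator()
fake = g(torch.randn(16, 100, 1, 1))
print(fake.shape)  # torch.Size([16, 3, 64, 64])
```

Note how each stride-2 layer doubles the spatial size while halving the channel count, matching the convention described above.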
The architecture of a discriminator
The discriminator network of a DCGAN consists of 4 hidden layers (again, we treat the input layer as the 1st hidden layer) and 1 output layer. Convolution layers are used in all layers, each followed by a batch normalization layer, except that the first layer does not have batch normalization. LeakyReLU activation functions are used in the hidden layers, and sigmoid is used for the output layer. The architecture of the discriminator is shown in the following:
The input channel can be either 1 or 3, depending on whether we are dealing with grayscale or color images. All hidden layers have a stride value of 2 and a padding value of 1, so their output feature maps are half the size of their inputs. As the feature maps shrink in deeper layers, the number of channels doubles. All kernels in the convolution layers have a size of 4 × 4. At the output layer, a sigmoid function transforms the value into prediction confidence.
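As a counterpart to the generator sketch, here is a minimal PyTorch sketch of the discriminator under the same assumptions (3-channel 64 × 64 inputs, channel counts mirroring the generator); the final 4 × 4 stride-1 convolution that collapses the feature map to a single score is also an assumption, since the text does not specify the output layer's parameters:

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, in_channels=3):  # 1 for grayscale, 3 for color
        super().__init__()
        self.net = nn.Sequential(
            # 1st hidden layer: 64x64 -> 32x32, no batch normalization
            nn.Conv2d(in_channels, 64, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),
            # 2nd hidden layer: 32x32 -> 16x16
            nn.Conv2d(64, 128, 4, 2, 1, bias=False),
            nn.BatchNorm2d(128),
            nn.LeakyReLU(0.2, inplace=True),
            # 3rd hidden layer: 16x16 -> 8x8
            nn.Conv2d(128, 256, 4, 2, 1, bias=False),
            nn.BatchNorm2d(256),
            nn.LeakyReLU(0.2, inplace=True),
            # 4th hidden layer: 8x8 -> 4x4
            nn.Conv2d(256, 512, 4, 2, 1, bias=False),
            nn.BatchNorm2d(512),
            nn.LeakyReLU(0.2, inplace=True),
            # output layer (assumed): 4x4 -> 1x1, sigmoid gives confidence
            nn.Conv2d(512, 1, 4, 1, 0, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # Flatten (batch, 1, 1, 1) to a vector of per-image scores.
        return self.net(x).view(-1)

d = Discriminator()
scores = d(torch.randn(16, 3, 64, 64))
print(scores.shape)  # torch.Size([16])
```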