Introduction to Deep Convolutional GANs

Get a brief overview of deep convolutional GANs.

In this chapter, we will introduce a classic GAN model for generating 2D images: the Deep Convolutional Generative Adversarial Network (DCGAN), one of the earliest stable and well-performing approaches to generating images with adversarial training.

Recall that even when we train a GAN on mere 1D data, we have to use multiple techniques to ensure stable training. A lot can go wrong in the training of GANs. For example, either the generator or the discriminator can overfit if the other fails to converge. Sometimes, the generator produces only a handful of sample varieties; this is called mode collapse. The following is an example of mode collapse, where we train a GAN on popular Chinese meme images called Baozou.

Some samples from the Baozou dataset

We can see that our GAN is only capable of generating one or two memes at a time. Problems that commonly occur in other machine learning algorithms, such as vanishing/exploding gradients and underfitting, also show up in the training of GANs. Therefore, simply replacing 1D data with 2D images does not guarantee successful training:

Mode collapse in GAN training (left: results at the 492nd iteration; right: results at the 500th iteration)

To ensure the stable training of GANs on image data like this, a DCGAN uses three techniques:

  • Getting rid of fully connected layers and only using convolution layers

  • Using strided convolution layers to perform downsampling instead of using pooling layers (see the sketch after this list)

  • Using ReLU/LeakyReLU activation functions instead of tanh between hidden layers
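To make the second technique concrete, here is a minimal PyTorch sketch (the channel count and feature map size are arbitrary examples, not values from the DCGAN architecture): a convolution with a stride of 2 halves the spatial size just as pooling does, but with learnable weights.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 16, 16)  # a batch of 16x16 feature maps with 64 channels

# Pooling-based downsampling: a fixed operation with no learnable parameters
pooled = nn.MaxPool2d(kernel_size=2)(x)

# Strided convolution: the network learns its own downsampling
strided = nn.Conv2d(64, 64, kernel_size=4, stride=2, padding=1)(x)

print(pooled.shape)   # torch.Size([1, 64, 8, 8])
print(strided.shape)  # torch.Size([1, 64, 8, 8])
```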

In this section, we will introduce the architectures of the generator and discriminator of the DCGAN and learn how to generate images with it. We'll use samples from MNIST (the Modified National Institute of Standards and Technology database, a large collection of handwritten digits that is commonly used for training image processing systems) to illustrate the architecture of a DCGAN and to train the model.
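As a starting point, here is one possible way to load the MNIST samples with torchvision. Resizing to 64×64 and normalizing pixels to [-1, 1] are assumptions on our part, chosen to match the tanh output of the generator described below.

```python
import torch
import torchvision
import torchvision.transforms as transforms

# Resize the 28x28 MNIST digits to 64x64 and scale pixels to [-1, 1],
# matching the tanh output range of the generator.
transform = transforms.Compose([
    transforms.Resize(64),
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
])

dataset = torchvision.datasets.MNIST(
    root="./data", train=True, download=True, transform=transform
)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=128, shuffle=True)
```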

The architecture of a generator

The generator network of a DCGAN contains 4 hidden layers (we treat the input layer as the 1st hidden layer for simplicity) and 1 output layer. Transposed convolution layers are used in the hidden layers, each followed by a batch normalization layer and a ReLU activation function. The output layer is also a transposed convolution layer, with tanh as its activation function. The architecture of the generator is shown in the following diagram:

Generator architecture in DCGAN

The 2nd, 3rd, and 4th hidden layers and the output layer have a stride value of 2. The 1st layer has a padding value of 0, and the other layers have a padding value of 1. As the feature map sizes double in deeper layers, the number of channels is halved. This is a common convention in the architecture design of neural networks. All kernels in the transposed convolution layers have a size of 4×4. The number of output channels can be either 1 or 3, depending on whether we want to generate grayscale or color images.
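Put together, the generator might look like the following PyTorch sketch. The strides, paddings, and kernel sizes follow the description above; the 100-dimensional latent vector and the base channel count of 64 are assumptions for illustration.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """DCGAN generator: maps a latent vector to a 64x64 image."""
    def __init__(self, latent_dim=100, channels=1, base=64):
        super().__init__()
        self.main = nn.Sequential(
            # 1st hidden layer: 1x1 -> 4x4 (stride 1, padding 0)
            nn.ConvTranspose2d(latent_dim, base * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(base * 8),
            nn.ReLU(True),
            # 2nd hidden layer: 4x4 -> 8x8 (stride 2, padding 1)
            nn.ConvTranspose2d(base * 8, base * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base * 4),
            nn.ReLU(True),
            # 3rd hidden layer: 8x8 -> 16x16
            nn.ConvTranspose2d(base * 4, base * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base * 2),
            nn.ReLU(True),
            # 4th hidden layer: 16x16 -> 32x32
            nn.ConvTranspose2d(base * 2, base, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base),
            nn.ReLU(True),
            # Output layer: 32x32 -> 64x64, tanh keeps pixels in [-1, 1]
            nn.ConvTranspose2d(base, channels, 4, 2, 1, bias=False),
            nn.Tanh(),
        )

    def forward(self, z):
        return self.main(z)

# A latent batch of shape (N, latent_dim, 1, 1) yields (N, 1, 64, 64) images.
fake = Generator()(torch.randn(16, 100, 1, 1))
print(fake.shape)  # torch.Size([16, 1, 64, 64])
```

Note how the channel counts (512, 256, 128, 64) are halved each time the feature maps double in size, as described above.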

The transposed convolution layer can be considered the reverse process of a normal convolution. It has sometimes been called a deconvolution layer, which is misleading because a transposed convolution is not the inverse of a convolution. Most convolution layers are not invertible because they are ill-conditioned (they have extremely large condition numbers, from the linear algebra perspective), which makes their pseudoinverse matrices unfit for representing the inverse process.
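For reference, the spatial output size of a transposed convolution (with PyTorch's defaults) is out = (in − 1) × stride − 2 × padding + kernel_size. A quick sanity check on the first two generator layers:

```python
import torch
import torch.nn as nn

# out = (in - 1) * stride - 2 * padding + kernel_size
z = torch.randn(1, 100, 1, 1)
h = nn.ConvTranspose2d(100, 512, kernel_size=4, stride=1, padding=0)(z)
print(h.shape)  # (1 - 1) * 1 - 0 + 4 = 4  -> torch.Size([1, 512, 4, 4])
h = nn.ConvTranspose2d(512, 256, kernel_size=4, stride=2, padding=1)(h)
print(h.shape)  # (4 - 1) * 2 - 2 + 4 = 8  -> torch.Size([1, 256, 8, 8])
```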

The architecture of a discriminator

The discriminator network of a DCGAN consists of 4 hidden layers (again, we treat the input layer as the 1st hidden layer) and 1 output layer. Convolution layers are used in all layers, each followed by a batch normalization layer, except the first layer, which has no batch normalization. LeakyReLU activation functions are used in the hidden layers, and sigmoid is used for the output layer. The architecture of the discriminator is shown in the following diagram:

Discriminator architecture in DCGAN

The number of input channels can be either 1 or 3, depending on whether we are dealing with grayscale or color images. All hidden layers have a stride value of 2 and a padding value of 1, so their output feature maps are half the size of their inputs. As the feature map sizes shrink in deeper layers, the number of channels doubles. All kernels in the convolution layers have a size of 4×4. The output layer has a stride value of 1 and a padding value of 0. It maps the 4×4 feature maps to single values so that the sigmoid function can transform each value into a prediction confidence.
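A matching PyTorch sketch of the discriminator, under the same assumptions as the generator above (64×64 grayscale inputs and a base channel count of 64):

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """DCGAN discriminator: maps a 64x64 image to a real/fake probability."""
    def __init__(self, channels=1, base=64):
        super().__init__()
        self.main = nn.Sequential(
            # 1st hidden layer: 64x64 -> 32x32 (no batch normalization here)
            nn.Conv2d(channels, base, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),
            # 2nd hidden layer: 32x32 -> 16x16
            nn.Conv2d(base, base * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base * 2),
            nn.LeakyReLU(0.2, inplace=True),
            # 3rd hidden layer: 16x16 -> 8x8
            nn.Conv2d(base * 2, base * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base * 4),
            nn.LeakyReLU(0.2, inplace=True),
            # 4th hidden layer: 8x8 -> 4x4
            nn.Conv2d(base * 4, base * 8, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base * 8),
            nn.LeakyReLU(0.2, inplace=True),
            # Output layer: 4x4 -> 1x1 (stride 1, padding 0), then sigmoid
            nn.Conv2d(base * 8, 1, 4, 1, 0, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.main(x).view(-1)

# A batch of (N, 1, 64, 64) images yields N confidence values in (0, 1).
scores = Discriminator()(torch.randn(16, 1, 64, 64))
print(scores.shape)  # torch.Size([16])
```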