Introduction to Deep Convolutional GANs

Get a brief overview of deep convolutional GANs.

In this chapter, we will introduce a classic GAN model for generating 2D images: the Deep Convolutional Generative Adversarial Network (DCGAN), one of the earliest stable and well-performing approaches to generating images with adversarial training.

Recall that even when we train a GAN on mere 1D data, we have to use multiple techniques to ensure stable training. A lot can go wrong in the training of GANs. For example, either the generator or the discriminator can overfit if the other fails to converge. Sometimes, the generator produces only a handful of sample varieties; this is called mode collapse. The following is an example of mode collapse, where we train a GAN on popular Chinese meme images called Baozou.

Some samples from the Baozou dataset

We can see that our GAN is only capable of generating one or two memes at a time. Problems that commonly occur in other machine learning algorithms, such as vanishing/exploding gradients and underfitting, also show up in the training of GANs. Therefore, simply replacing 1D data with 2D images does not guarantee successful training:

Mode collapse in GAN training (left: results at the 492nd iteration; right: results at the 500th iteration)

To ensure the stable training of GANs on image data like this, a DCGAN uses three techniques:

  • Getting rid of fully connected layers and only using convolution layers

  • Using strided convolution layers to perform downsampling instead of using pooling layers (see the sketch after this list)

  • Using ReLU/LeakyReLU activation functions instead of tanh between hidden layers
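To make the second technique concrete, here is a minimal PyTorch sketch (the channel count and feature map size are arbitrary examples, not values from the DCGAN architecture): a convolution with a stride of 2 halves the spatial size just as pooling does, but with learnable weights.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 16, 16)  # a batch of 16x16 feature maps with 64 channels

# Pooling-based downsampling: a fixed operation with no learnable parameters
pooled = nn.MaxPool2d(kernel_size=2)(x)

# Strided convolution: the network learns its own downsampling
strided = nn.Conv2d(64, 64, kernel_size=4, stride=2, padding=1)(x)

print(pooled.shape)   # torch.Size([1, 64, 8, 8])
print(strided.shape)  # torch.Size([1, 64, 8, 8])
```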

In this section, we will introduce the architectures of the generator and discriminator of the DCGAN and learn how to generate images with it. We'll use samples from MNIST (the Modified National Institute of Standards and Technology database, a large collection of handwritten digits that is commonly used for training image processing systems) to illustrate the architecture of a DCGAN and to train the model.
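As a starting point, here is one possible way to load the MNIST samples with torchvision. Resizing to 64×64 and normalizing pixels to [-1, 1] are assumptions on our part, chosen to match the tanh output of the generator described below.

```python
import torch
import torchvision
import torchvision.transforms as transforms

# Resize the 28x28 MNIST digits to 64x64 and scale pixels to [-1, 1],
# matching the tanh output range of the generator.
transform = transforms.Compose([
    transforms.Resize(64),
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
])

dataset = torchvision.datasets.MNIST(
    root="./data", train=True, download=True, transform=transform
)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=128, shuffle=True)
```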

The architecture of a generator

The generator network of a DCGAN contains 4 hidden layers (we treat the input layer as the 1st hidden layer for simplicity) and 1 output layer. Transposed convolution layers are used in the hidden layers, each followed by a batch normalization layer and a ReLU activation function. The output layer is also a transposed convolution layer, with tanh as its activation function. The architecture of the generator is shown in the following diagram:

Generator architecture in DCGAN

The 2nd, 3rd, and 4th hidden layers and the output layer have a stride value of 2. The 1st layer has a padding value of 0, and the other layers have a padding value of 1. As the feature map sizes double in deeper layers, the number of channels is halved. This is a common convention in the architecture design of neural networks. All kernels in the transposed convolution layers have a size of 4×4. The number of output channels can be either 1 or 3, depending on whether we want to generate grayscale or color images.
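Put together, the generator might look like the following PyTorch sketch. The strides, paddings, and kernel sizes follow the description above; the 100-dimensional latent vector and the base channel count of 64 are assumptions for illustration.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """DCGAN generator: maps a latent vector to a 64x64 image."""
    def __init__(self, latent_dim=100, channels=1, base=64):
        super().__init__()
        self.main = nn.Sequential(
            # 1st hidden layer: 1x1 -> 4x4 (stride 1, padding 0)
            nn.ConvTranspose2d(latent_dim, base * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(base * 8),
            nn.ReLU(True),
            # 2nd hidden layer: 4x4 -> 8x8 (stride 2, padding 1)
            nn.ConvTranspose2d(base * 8, base * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base * 4),
            nn.ReLU(True),
            # 3rd hidden layer: 8x8 -> 16x16
            nn.ConvTranspose2d(base * 4, base * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base * 2),
            nn.ReLU(True),
            # 4th hidden layer: 16x16 -> 32x32
            nn.ConvTranspose2d(base * 2, base, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base),
            nn.ReLU(True),
            # Output layer: 32x32 -> 64x64, tanh keeps pixels in [-1, 1]
            nn.ConvTranspose2d(base, channels, 4, 2, 1, bias=False),
            nn.Tanh(),
        )

    def forward(self, z):
        return self.main(z)

# A latent batch of shape (N, latent_dim, 1, 1) yields (N, 1, 64, 64) images.
fake = Generator()(torch.randn(16, 100, 1, 1))
print(fake.shape)  # torch.Size([16, 1, 64, 64])
```

Note how the channel counts (512, 256, 128, 64) are halved each time the feature maps double in size, as described above.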

The transposed convolution layer can be considered the reverse process of a normal convolution. It has sometimes been called a deconvolution layer, which is misleading because a transposed convolution is not the inverse of a convolution. Most convolution layers are not invertible because they are ill-conditioned (they have extremely large condition numbers, from the linear algebra perspective), which makes their pseudoinverse matrices unfit for representing the inverse process.
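For reference, the spatial output size of a transposed convolution (with PyTorch's defaults) is out = (in − 1) × stride − 2 × padding + kernel_size. A quick sanity check on the first two generator layers:

```python
import torch
import torch.nn as nn

# out = (in - 1) * stride - 2 * padding + kernel_size
z = torch.randn(1, 100, 1, 1)
h = nn.ConvTranspose2d(100, 512, kernel_size=4, stride=1, padding=0)(z)
print(h.shape)  # (1 - 1) * 1 - 0 + 4 = 4  -> torch.Size([1, 512, 4, 4])
h = nn.ConvTranspose2d(512, 256, kernel_size=4, stride=2, padding=1)(h)
print(h.shape)  # (4 - 1) * 2 - 2 + 4 = 8  -> torch.Size([1, 256, 8, 8])
```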

The architecture of a discriminator

The discriminator network of a DCGAN consists of 4 hidden layers (again, we treat the input layer as the 1st hidden layer) and 1 output layer. Convolution layers are used in all layers, each followed by a batch normalization layer, except the first layer, which has no batch normalization. LeakyReLU activation functions are used in the hidden layers, and sigmoid is used for the output layer. The architecture of the discriminator is shown in the following diagram:

Discriminator architecture in DCGAN

The number of input channels can be either 1 or 3, depending on whether we are dealing with grayscale or color images. All hidden layers have a stride value of 2 and a padding value of 1, so their output feature maps are half the size of their inputs. As the feature map sizes shrink in deeper layers, the number of channels doubles. All kernels in the convolution layers have a size of 4×4. The output layer has a stride value of 1 and a padding value of 0. It maps the 4×4 feature maps to single values so that the sigmoid function can transform each value into a prediction confidence.
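A matching PyTorch sketch of the discriminator, under the same assumptions as the generator above (64×64 grayscale inputs and a base channel count of 64):

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """DCGAN discriminator: maps a 64x64 image to a real/fake probability."""
    def __init__(self, channels=1, base=64):
        super().__init__()
        self.main = nn.Sequential(
            # 1st hidden layer: 64x64 -> 32x32 (no batch normalization here)
            nn.Conv2d(channels, base, 4, 2, 1, bias=False),
            nn.LeakyReLU(0.2, inplace=True),
            # 2nd hidden layer: 32x32 -> 16x16
            nn.Conv2d(base, base * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base * 2),
            nn.LeakyReLU(0.2, inplace=True),
            # 3rd hidden layer: 16x16 -> 8x8
            nn.Conv2d(base * 2, base * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base * 4),
            nn.LeakyReLU(0.2, inplace=True),
            # 4th hidden layer: 8x8 -> 4x4
            nn.Conv2d(base * 4, base * 8, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base * 8),
            nn.LeakyReLU(0.2, inplace=True),
            # Output layer: 4x4 -> 1x1 (stride 1, padding 0), then sigmoid
            nn.Conv2d(base * 8, 1, 4, 1, 0, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.main(x).view(-1)

# A batch of (N, 1, 64, 64) images yields N confidence values in (0, 1).
scores = Discriminator()(torch.randn(16, 1, 64, 64))
print(scores.shape)  # torch.Size([16])
```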