GAN Architecture and Training

Understand the GAN architecture of the text-to-image model and follow the step-by-step model training process.

The design of the GAN model in this section is based on the text-to-image model of Reed et al. (Reed, Scott, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. "Generative adversarial text to image synthesis." In International Conference on Machine Learning, pp. 1060–1069. PMLR, 2016). Here, we will describe and define the architectures of the generator and discriminator networks, along with the training process.

Generator architecture

The generator network has two inputs: a latent noise vector, z, and the embedding vector, t, of the description sentence. The embedding vector, t, has a length of 1,024 and is mapped by a fully connected layer to a vector of length 128. This vector is concatenated with the noise vector, z (of length 100), to form a tensor of size [B, 228, 1, 1] (in which B represents the batch size and is omitted from now on). Five transposed convolution layers gradually expand the feature map (while decreasing the channel width) to [3, 64, 64]: the first layer (kernel size 4, stride 1, no padding) expands the 1×1 input to 4×4, and each of the remaining four layers (kernel size 4, stride 2, padding 1) doubles the spatial size. A Tanh activation function on the final layer produces the generated image. Batch normalization layers and ReLU activation functions are used in the hidden layers.

Let’s create a new file named gan.py to define the networks. Here is the code definition of the generator network:
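A minimal PyTorch sketch consistent with the description above is shown below. The class name `Generator`, the channel widths, and the 100-dimensional noise vector are assumptions for illustration; the feature-map shapes follow the dimensions stated in the text.

```python
import torch
import torch.nn as nn


class Generator(nn.Module):
    """Maps a noise vector z and a sentence embedding t to a 64x64 RGB image."""

    def __init__(self, z_dim=100, embed_dim=1024, proj_dim=128, base_ch=512):
        super().__init__()
        # Map the 1,024-dim sentence embedding down to a 128-dim vector.
        self.projection = nn.Linear(embed_dim, proj_dim)
        self.net = nn.Sequential(
            # [228, 1, 1] -> [512, 4, 4]: kernel 4, stride 1, no padding
            nn.ConvTranspose2d(z_dim + proj_dim, base_ch, 4, 1, 0, bias=False),
            nn.BatchNorm2d(base_ch),
            nn.ReLU(inplace=True),
            # [512, 4, 4] -> [256, 8, 8]: kernel 4, stride 2, padding 1
            nn.ConvTranspose2d(base_ch, base_ch // 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base_ch // 2),
            nn.ReLU(inplace=True),
            # [256, 8, 8] -> [128, 16, 16]
            nn.ConvTranspose2d(base_ch // 2, base_ch // 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base_ch // 4),
            nn.ReLU(inplace=True),
            # [128, 16, 16] -> [64, 32, 32]
            nn.ConvTranspose2d(base_ch // 4, base_ch // 8, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base_ch // 8),
            nn.ReLU(inplace=True),
            # [64, 32, 32] -> [3, 64, 64], squashed to [-1, 1] by Tanh
            nn.ConvTranspose2d(base_ch // 8, 3, 4, 2, 1, bias=False),
            nn.Tanh(),
        )

    def forward(self, z, t):
        t_proj = self.projection(t)
        # Concatenate along channels and add 1x1 spatial dimensions.
        x = torch.cat([z, t_proj], dim=1).unsqueeze(-1).unsqueeze(-1)
        return self.net(x)
```

Passing a batch of noise vectors of shape `[B, 100]` and embeddings of shape `[B, 1024]` yields images of shape `[B, 3, 64, 64]`.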
