The design of the GAN model in this section is based on the text-to-image model of Reed et al. (Reed, Scott, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. "Generative Adversarial Text to Image Synthesis." In International Conference on Machine Learning, pp. 1060-1069. PMLR, 2016). Here, we will define the architectures of the generator and discriminator networks and describe the training process.
Generator architecture
The generator network has two inputs: a latent noise vector, z, and the embedding vector, t, of the description sentence. The embedding vector, t, has a length of 1,024 and is mapped by a fully-connected layer to a vector of length 128. This vector is concatenated with the noise vector, z, to form a tensor of size [B, 228, 1, 1] (in which B represents the batch size and is omitted from now on). Five transposed convolution layers then gradually expand the feature map (while decreasing the channel width) to [3, 64, 64]: the first layer uses a kernel size of 4, a stride of 1, and no padding to expand the 1 x 1 input to 4 x 4, and the remaining four layers (each with a kernel size of 4, a stride of 2, and a padding of 1) double the spatial size at each step. The output is the generated image after a Tanh activation function. Batch normalization layers and ReLU activation functions are used in the hidden layers.
Let’s create a new file named gan.py to define the networks. Here is the code definition of the generator network:
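The following is a minimal PyTorch sketch consistent with the description above, not the book's exact code. The channel progression (512, 256, 128, 64), the bias settings, and the leaky ReLU after the embedding projection are assumptions borrowed from common DCGAN-style implementations; the noise dimension of 100 follows from 228 - 128.

```python
import torch
import torch.nn as nn


class Generator(nn.Module):
    def __init__(self, z_dim=100, embed_dim=1024, projected_embed_dim=128,
                 base_channels=64):
        super().__init__()
        # Map the 1,024-dim sentence embedding to a 128-dim vector.
        # The LeakyReLU here is an assumption, not specified in the text.
        self.projection = nn.Sequential(
            nn.Linear(embed_dim, projected_embed_dim),
            nn.LeakyReLU(0.2, inplace=True),
        )
        in_channels = z_dim + projected_embed_dim  # 100 + 128 = 228
        self.main = nn.Sequential(
            # [228, 1, 1] -> [512, 4, 4] (stride 1, no padding)
            nn.ConvTranspose2d(in_channels, base_channels * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(base_channels * 8),
            nn.ReLU(True),
            # [512, 4, 4] -> [256, 8, 8]
            nn.ConvTranspose2d(base_channels * 8, base_channels * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base_channels * 4),
            nn.ReLU(True),
            # [256, 8, 8] -> [128, 16, 16]
            nn.ConvTranspose2d(base_channels * 4, base_channels * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base_channels * 2),
            nn.ReLU(True),
            # [128, 16, 16] -> [64, 32, 32]
            nn.ConvTranspose2d(base_channels * 2, base_channels, 4, 2, 1, bias=False),
            nn.BatchNorm2d(base_channels),
            nn.ReLU(True),
            # [64, 32, 32] -> [3, 64, 64], squashed to [-1, 1] by Tanh
            nn.ConvTranspose2d(base_channels, 3, 4, 2, 1, bias=False),
            nn.Tanh(),
        )

    def forward(self, z, t):
        # Project the embedding, reshape to [B, 128, 1, 1], and
        # concatenate with the noise vector along the channel axis.
        projected = self.projection(t).unsqueeze(2).unsqueeze(3)
        x = torch.cat([z, projected], dim=1)  # [B, 228, 1, 1]
        return self.main(x)


if __name__ == "__main__":
    # Quick shape check with a batch of 4.
    g = Generator()
    z = torch.randn(4, 100, 1, 1)
    t = torch.randn(4, 1024)
    print(g(z, t).shape)  # torch.Size([4, 3, 64, 64])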