Paired Style Transfer Using pix2pix GAN
Learn about a variant of conditional GANs used in the context of style transfer.
Style transfer is an intriguing research area that pushes the boundaries of creativity and deep learning together. In their work, “Image-to-Image Translation with Conditional Adversarial Networks,” Isola et al. present a conditional GAN, popularly known as pix2pix, that learns a mapping from an input image to a corresponding output image.
It is called pair-wise style transfer because the training set needs matched samples from both the source and target domains. This generic approach has been shown to effectively synthesize high-quality images from label maps and edge maps, and even to colorize images. The authors highlight the importance of developing an architecture capable of understanding the dataset at hand and learning mapping functions without hand-engineering (as has typically been the case).
The U-Net generator
Since CNNs are optimized for computer vision tasks, using them for both the generator and the discriminator has a number of advantages. This work focuses on two related architectures for the generator: the vanilla encoder-decoder architecture and an encoder-decoder architecture with skip connections. The latter has more in common with the U-Net architecture, which is why it is referred to as the U-Net generator.
A typical encoder (in the encoder-decoder setup) takes an input and passes it through a series of downsampling layers to generate a condensed vector form. This condensed vector is termed the bottleneck feature. The decoder part then upsamples the bottleneck features to generate the final output. This setup is extremely useful in a number of scenarios, such as language translation and image reconstruction. The bottleneck features condense the overall input into a lower-dimensional space.
Theoretically, the bottleneck features capture all the required information, but in practice this becomes difficult when the input space is sufficiently large.
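To make the vanilla setup concrete, here is a minimal sketch of such an encoder-decoder network, assuming TensorFlow/Keras. The filter counts, kernel sizes, and input resolution are illustrative choices rather than the exact configuration from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

def vanilla_encoder_decoder(input_shape=(256, 256, 3)):
    """Vanilla encoder-decoder: every piece of information must pass
    through the low-dimensional bottleneck."""
    inputs = tf.keras.Input(shape=input_shape)

    # Encoder: strided convolutions progressively condense the input.
    x = layers.Conv2D(64, 4, strides=2, padding="same", activation="relu")(inputs)  # 128x128
    x = layers.Conv2D(128, 4, strides=2, padding="same", activation="relu")(x)      # 64x64
    x = layers.Conv2D(256, 4, strides=2, padding="same", activation="relu")(x)      # 32x32

    # Bottleneck: the condensed, lower-dimensional representation.
    bottleneck = layers.Conv2D(512, 4, strides=2, padding="same",
                               activation="relu")(x)                                # 16x16

    # Decoder: transposed convolutions upsample back to the image size.
    x = layers.Conv2DTranspose(256, 4, strides=2, padding="same", activation="relu")(bottleneck)
    x = layers.Conv2DTranspose(128, 4, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2DTranspose(64, 4, strides=2, padding="same", activation="relu")(x)
    outputs = layers.Conv2DTranspose(3, 4, strides=2, padding="same", activation="tanh")(x)

    return tf.keras.Model(inputs, outputs, name="vanilla_encoder_decoder")
```

Note that the decoder here can only work with whatever survives the bottleneck; nothing else from the input reaches the output.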
Additionally, for our task of image-to-image translation, there are a number of important features that need to be consistent between the input and output images. For example, if we are training our GAN to generate aerial photos from outline maps, the information associated with roads, water bodies, and other low-level features needs to be preserved between inputs and outputs, as shown below.
The U-Net architecture uses skip connections to shuttle important features between the input and output (see the figures above). In the case of the pix2pix GAN, skip connections are added between every ith down-sampling layer and the (n - i)th up-sampling layer, where n is the total number of layers in the generator. Each skip connection simply concatenates all channels at layer i with those at layer n - i.
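The sketch below extends the encoder-decoder above with such skip connections, again assuming TensorFlow/Keras with illustrative layer sizes. Each encoder activation is concatenated with the mirrored decoder activation, giving low-level features a direct path around the bottleneck.

```python
import tensorflow as tf
from tensorflow.keras import layers

def unet_generator(input_shape=(256, 256, 3)):
    """Illustrative U-Net generator: encoder activations are concatenated
    with the mirrored decoder activations (layer i with layer n - i)."""
    inputs = tf.keras.Input(shape=input_shape)

    # Encoder: keep a handle on each downsampled activation.
    e1 = layers.Conv2D(64, 4, strides=2, padding="same", activation="relu")(inputs)      # 128x128
    e2 = layers.Conv2D(128, 4, strides=2, padding="same", activation="relu")(e1)         # 64x64
    e3 = layers.Conv2D(256, 4, strides=2, padding="same", activation="relu")(e2)         # 32x32
    bottleneck = layers.Conv2D(512, 4, strides=2, padding="same", activation="relu")(e3) # 16x16

    # Decoder: each upsampled activation is concatenated with its
    # mirror-image encoder activation before being upsampled further.
    d3 = layers.Conv2DTranspose(256, 4, strides=2, padding="same", activation="relu")(bottleneck)
    d3 = layers.Concatenate()([d3, e3])
    d2 = layers.Conv2DTranspose(128, 4, strides=2, padding="same", activation="relu")(d3)
    d2 = layers.Concatenate()([d2, e2])
    d1 = layers.Conv2DTranspose(64, 4, strides=2, padding="same", activation="relu")(d2)
    d1 = layers.Concatenate()([d1, e1])
    outputs = layers.Conv2DTranspose(3, 4, strides=2, padding="same", activation="tanh")(d1)

    return tf.keras.Model(inputs, outputs, name="unet_generator")
```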
The generator presented in the paper follows a repeating block structure for both the encoder and the decoder. Each encoder block consists of a convolutional layer followed by batch normalization and leaky ReLU activation; the decoder blocks mirror this with transposed convolutions, batch normalization, dropout (in the first few blocks), and ReLU activation. Every such block downsamples (or upsamples) its input by a factor of 2, using convolutions with a stride of 2, as sketched below.
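A pair of such blocks might look like the following sketch, again in TensorFlow/Keras. The 4x4 kernel and stride of 2 follow the paper; the helper names and option flags are our own illustrative choices.

```python
import tensorflow as tf
from tensorflow.keras import layers

def encoder_block(x, filters, apply_batchnorm=True):
    """One downsampling block: strided Conv -> BatchNorm -> LeakyReLU.
    The stride of 2 halves the spatial resolution."""
    x = layers.Conv2D(filters, kernel_size=4, strides=2, padding="same",
                      use_bias=not apply_batchnorm)(x)
    if apply_batchnorm:
        x = layers.BatchNormalization()(x)
    return layers.LeakyReLU(0.2)(x)

def decoder_block(x, skip, filters, apply_dropout=False):
    """One upsampling block: transposed Conv -> BatchNorm -> (Dropout)
    -> ReLU, followed by concatenation with the mirrored encoder
    activation, which forms the skip connection."""
    x = layers.Conv2DTranspose(filters, kernel_size=4, strides=2,
                               padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    if apply_dropout:
        x = layers.Dropout(0.5)(x)
    x = layers.ReLU()(x)
    return layers.Concatenate()([x, skip])
```

Stacking eight such encoder blocks reduces a 256x256 input to a 1x1 bottleneck, which a matching stack of decoder blocks then expands back to full resolution.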