Generating Photo-Realistic Images with StackGAN++

Learn about photo-realistic image generation with StackGAN and StackGAN++.

Generating images from a text description can be treated as a conditional GAN (CGAN) process in which the embedding vector of the description sentence serves as the additional label information. The remaining problem is how to generate large images with a CGAN. One approach is to stack two CGANs together to obtain high-quality images. This is exactly what StackGAN does (Zhang, Han, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N. Metaxas. "StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks." In Proceedings of the IEEE International Conference on Computer Vision, pp. 5907-5915. 2017).

High-resolution text-to-image synthesis with StackGAN

The embedding vector, $\varphi_t$, of the description sentence is processed by the conditioning augmentation step to create a conditional vector, $c$. In conditioning augmentation, a mean vector, $\mu$, and a standard deviation vector, $\sigma$, are computed from the embedding vector, $\varphi_t$, and the conditional vector, $c$, is sampled from the Gaussian distribution $\mathcal N(\mu,\sigma^2)$. This process lets us create many more unique conditional vectors from a limited set of text embeddings, and it ensures that all the conditional variables follow the same family of Gaussian distributions. At the same time, $\mu$ and $\sigma$ are constrained so that $\mathcal N(\mu,\sigma^2)$ does not drift too far from the standard Gaussian $\mathcal N(0,I)$. This is done by adding a Kullback-Leibler divergence (KL divergence) term to the generator's loss function.
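
The conditioning augmentation step can be illustrated with a minimal, dependency-free sketch. It is not the StackGAN authors' implementation: the function name `conditioning_augmentation`, the weight matrices `w_mu` and `w_logvar` (stand-ins for a learned fully connected layer), and the choice to predict the log-variance rather than the standard deviation directly are all assumptions made for illustration. The sketch samples the conditional vector with the reparameterization trick and evaluates the closed-form KL divergence between a diagonal Gaussian and the standard Gaussian.

```python
import math
import random


def conditioning_augmentation(embedding, w_mu, w_logvar, rng):
    """Map a text embedding to a sampled conditional vector c.

    w_mu and w_logvar are hypothetical weight matrices (lists of rows)
    standing in for a learned fully connected layer.
    """
    # Linear projections of the embedding give the mean and log-variance.
    mu = [sum(w * x for w, x in zip(row, embedding)) for row in w_mu]
    logvar = [sum(w * x for w, x in zip(row, embedding)) for row in w_logvar]
    sigma = [math.exp(0.5 * lv) for lv in logvar]

    # Reparameterization: c = mu + sigma * eps, with eps ~ N(0, I).
    c = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

    # Closed-form KL(N(mu, diag(sigma^2)) || N(0, I)):
    # 0.5 * sum(mu^2 + sigma^2 - log(sigma^2) - 1)
    kl = 0.5 * sum(m * m + s * s - lv - 1.0
                   for m, s, lv in zip(mu, sigma, logvar))
    return c, mu, sigma, kl


# Toy usage: a 3-dimensional embedding mapped to a 2-dimensional c.
rng = random.Random(0)
embedding = [0.5, -1.0, 0.25]
w_mu = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]
w_logvar = [[0.0, 0.0, 0.0], [0.2, 0.1, -0.2]]

c1, mu, sigma, kl = conditioning_augmentation(embedding, w_mu, w_logvar, rng)
c2, _, _, _ = conditioning_augmentation(embedding, w_mu, w_logvar, rng)
```

Calling the function twice on the same embedding yields two different conditional vectors, which is how a limited set of text embeddings produces many training conditions. In a real model, the `kl` value would be scaled by a weighting hyperparameter and added to the generator's loss.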
