Text-to-Image Synthesis

Learn how neural networks can be used to transform text descriptions into images.

Text-to-image synthesis is the task of generating an image that satisfies the specifications described in a text sentence. It can be interpreted as a translation problem in which the source domain and the target domain are not the same.

In this approach, the problem of text-to-image synthesis is tackled by solving two subproblems. The first is learning a representation of text that encodes the visual specifications described within the text. The second is learning a model capable of using that learned text representation to synthesize images that satisfy those specifications.
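As a rough illustration of these two subproblems, the following PyTorch sketch pairs a text encoder with a conditional generator. The module names, layer sizes, and character-level input are illustrative assumptions, not the architecture from any specific paper.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: the two subproblems as two modules.
# TextEncoder addresses subproblem 1 (text -> visually grounded embedding);
# ConditionalGenerator addresses subproblem 2 (embedding + noise -> image).

class TextEncoder(nn.Module):
    def __init__(self, vocab_size=128, embed_dim=64, text_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, text_dim, batch_first=True)

    def forward(self, char_ids):
        # char_ids: (batch, seq_len) integer-encoded characters
        emb = self.embed(char_ids)
        _, hidden = self.rnn(emb)
        return hidden.squeeze(0)           # (batch, text_dim)

class ConditionalGenerator(nn.Module):
    def __init__(self, noise_dim=100, text_dim=256, img_pixels=64 * 64 * 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim + text_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, img_pixels),
            nn.Tanh(),                     # pixel values in [-1, 1]
        )

    def forward(self, noise, text_embedding):
        x = torch.cat([noise, text_embedding], dim=1)
        return self.net(x).view(-1, 3, 64, 64)

# Usage: encode a description, then condition the generator on it.
encoder = TextEncoder()
generator = ConditionalGenerator()
chars = torch.randint(0, 128, (1, 60))     # stand-in for an encoded sentence
z = torch.randn(1, 100)
fake_image = generator(z, encoder(chars))  # (1, 3, 64, 64)
```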

For example, consider this text description: “The petals on this flower are white with a yellow center.”

Although broad and leaving many aspects of the target flower undefined, this description provides a few hard specifications:

  • The petals are white

  • The center is yellow

Historical perspective

In early computer vision, this type of representation would be encoded in hand-engineered attribute representations (Farhadi et al. 2009; Kumar et al. 2009). These attribute representations were normally used to enable zero-shot visual recognition (Fu et al. 2014; Akata et al. 2015) and conditional image generation (Yan et al. 2015).

Like most hand-engineered approaches, specifying attribute representations by hand is cumbersome and requires domain-specific knowledge. Fortunately, it is possible to learn a vector representation of a text description of an image, which allows us to manipulate that representation simply by modifying the text.

Lately, neural networks have been used to obtain vector representations from words and characters in an unsupervised manner. A related approach has been used to learn discriminative and generalizable text representations for images, as seen in the work of Reed et al. (2016).
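To make this concrete, here is a minimal sketch of how a character-level encoder could be trained to produce discriminative text embeddings, in the spirit of (but not identical to) the approach of Reed et al. (2016). The `CharCNNEncoder` module, the `matching_loss` function, and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch of learning a discriminative text embedding:
# a character-level CNN maps descriptions into the same space as
# image features, and a matching loss pulls paired (image, text)
# embeddings together.

class CharCNNEncoder(nn.Module):
    def __init__(self, vocab_size=70, embed_dim=128, out_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Sequential(
            nn.Conv1d(embed_dim, 256, kernel_size=4), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),                # pool over the character axis
        )
        self.fc = nn.Linear(256, out_dim)

    def forward(self, char_ids):                    # (batch, seq_len)
        x = self.embed(char_ids).transpose(1, 2)    # (batch, embed_dim, seq_len)
        x = self.conv(x).squeeze(-1)                # (batch, 256)
        return self.fc(x)                           # (batch, out_dim)

def matching_loss(text_emb, image_feat):
    # Encourage each description to score highest against its own image.
    scores = text_emb @ image_feat.t()              # (batch, batch) similarity
    targets = torch.arange(scores.size(0))
    return F.cross_entropy(scores, targets)

# Usage with stand-in data: 8 encoded captions and 8 image feature vectors.
encoder = CharCNNEncoder()
chars = torch.randint(0, 70, (8, 50))
image_feat = torch.randn(8, 1024)
loss = matching_loss(encoder(chars), image_feat)
```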

Generative models for image synthesis

There is an extensive body of work on generative models for image synthesis.

The text-to-image approach combines work in natural language processing with generative models for image synthesis to produce images from a text description.

Conditional GANs had been investigated before, for example in the work of Denton et al. (2015), in which the authors explored image synthesis conditioned on class labels instead of text descriptions. Radford et al. (2016) achieved impressive results using vector arithmetic in the learned latent space. Nonetheless, as the authors of “Generative Adversarial Text-to-Image Synthesis” note, theirs is the first end-to-end differentiable architecture from the character level to the pixel level.
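The sketch below illustrates the conditional GAN idea in PyTorch: a discriminator that receives both an image and a text embedding, plus one discriminator update of the adversarial game. It is a simplified stand-in rather than the architecture from the paper; the module names, dimensions, and the `discriminator_step` helper are assumptions, and the generator is assumed to follow the earlier sketch’s `generator(noise, text_embedding)` interface.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of a text-conditional discriminator and one
# adversarial training step. The text embedding is concatenated with
# the flattened image so the discriminator judges both realism and
# agreement with the description.

class ConditionalDiscriminator(nn.Module):
    def __init__(self, img_pixels=64 * 64 * 3, text_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_pixels + text_dim, 512),
            nn.LeakyReLU(0.2),
            nn.Linear(512, 1),              # real/fake score (logit)
        )

    def forward(self, image, text_embedding):
        x = torch.cat([image.flatten(1), text_embedding], dim=1)
        return self.net(x)

def discriminator_step(disc, generator, text_emb, real_images, loss_fn, opt):
    # One discriminator update: real images scored as real,
    # generated images (conditioned on the same text) scored as fake.
    noise = torch.randn(real_images.size(0), 100)
    fake_images = generator(noise, text_emb).detach()
    real_logits = disc(real_images, text_emb)
    fake_logits = disc(fake_images, text_emb)
    loss = loss_fn(real_logits, torch.ones_like(real_logits)) + \
           loss_fn(fake_logits, torch.zeros_like(fake_logits))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

In the full approach described in the paper, the discriminator is also shown real images paired with mismatched descriptions, so it learns to judge text-image alignment as well as realism; the sketch above omits that detail for brevity.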

The following image, taken from the paper “Generative Adversarial Text-to-Image Synthesis” (https://arxiv.org/abs/1605.05396), shows results produced by their baseline and improved models for image synthesis conditioned on text:
