The DALL·E Model
Learn about the DALL·E transformer, another task-agnostic transformer model that can process images and text.
Overview
DALL·E, like CLIP, is a task-agnostic foundation model. However, where CLIP processes text-image pairs, DALL·E processes text and image tokens together as a single input stream of 1,280 tokens: 256 tokens for the text and 1,024 tokens for the image.
DALL·E was named after Salvador Dalí and Pixar's WALL-E. When we use DALL·E, we enter a text prompt and it produces an image. Before it can do this, however, DALL·E must first learn how to generate images from text.
DALL·E is a 12-billion-parameter version of GPT-3.
This transformer is trained to generate images from text descriptions, using a dataset of text-image pairs.
The basic architecture of DALL·E
Unlike CLIP, DALL·E concatenates up to 256 BPE-encoded text tokens with 32 × 32 = 1,024 image tokens into a single input stream.
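To make the token layout concrete, here is a minimal Python sketch (not OpenAI's implementation) that pads a BPE-encoded prompt to 256 positions, flattens a 32 × 32 grid of image tokens, and concatenates them into the 1,280-token stream described above. The function name, padding id, and random "codebook" ids are illustrative assumptions.

```python
import numpy as np

# Hypothetical sketch of DALL·E's single input stream:
# up to 256 BPE text tokens followed by 32 x 32 = 1,024 image tokens.

MAX_TEXT_TOKENS = 256                       # text part of the stream
IMAGE_GRID = 32                             # image tokens form a 32 x 32 grid
MAX_IMAGE_TOKENS = IMAGE_GRID * IMAGE_GRID  # 1,024
PAD_TOKEN = 0                               # assumed padding id for short prompts

def build_input_stream(text_token_ids, image_token_grid):
    """Concatenate padded text tokens and flattened image tokens into one stream."""
    # Truncate or pad the BPE text tokens to exactly 256 positions.
    text = list(text_token_ids)[:MAX_TEXT_TOKENS]
    text += [PAD_TOKEN] * (MAX_TEXT_TOKENS - len(text))

    # Flatten the 32 x 32 grid of image tokens into 1,024 positions.
    image = np.asarray(image_token_grid).reshape(-1)
    assert image.size == MAX_IMAGE_TOKENS

    # The full stream is 256 + 1,024 = 1,280 tokens long.
    return np.concatenate([np.asarray(text), image])

# Toy usage: a short prompt and a dummy grid of image-token ids.
prompt_tokens = [512, 874, 1033, 7]                    # pretend BPE ids
image_tokens = np.random.randint(0, 8192, (32, 32))    # pretend image-codebook ids
stream = build_input_stream(prompt_tokens, image_tokens)
print(stream.shape)  # (1280,)
```

The key design point this sketch illustrates is that text and image occupy fixed, contiguous regions of one sequence, so a single decoder-style transformer can attend across both modalities.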