Introduction: From NLP to Task-Agnostic Transformer Models

Get an overview of what we will cover in this chapter.


Up to now, we have examined variations of the original transformer model with encoder and decoder layers, and we have explored other models with encoder-only or decoder-only stacks of layers. The sizes of the layers and the number of parameters have also increased. However, the fundamental architecture of the transformer retains its original structure, with identical layers and parallel computation of the attention heads.

Chapter overview

In this chapter, we will explore innovative transformer models that respect the basic structure of the original transformer but make some significant changes. Scores of transformer models have appeared, offering as many possibilities as a box of LEGO® bricks: you can assemble the pieces in hundreds of ways. Transformer sublayers and layers are the LEGO® bricks of advanced AI.

We will begin by asking which transformer model to choose among the many on offer and which ecosystem to implement it in.

Then we will discover Locality Sensitive Hashing (LSH) buckets and chunking in Reformer models. Next, we will learn what disentanglement is in DeBERTa models. DeBERTa also introduces an alternative way of managing positions in the decoder. DeBERTa's high-powered transformer model exceeds the human baseline on the SuperGLUE benchmark.
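As a preview of the idea behind LSH attention, here is a minimal Python sketch (an illustration of random-hyperplane hashing, not the actual Reformer implementation): vectors that point in similar directions tend to land in the same bucket, so attention can later be restricted to tokens that share a bucket.

```python
import numpy as np

def lsh_buckets(vectors, n_planes=4, seed=0):
    """Assign each vector to a bucket using random-hyperplane (angular) LSH.

    Vectors with a small angle between them usually fall on the same side
    of every random hyperplane and therefore share a bucket. This is the
    core idea that lets a query attend only to keys in its own bucket
    instead of the full sequence.
    """
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(vectors.shape[1], n_planes))  # random hyperplanes
    signs = (vectors @ planes) > 0                           # side of each plane
    # Interpret the sign pattern as a binary number -> bucket id
    return signs @ (1 << np.arange(n_planes))

# Toy example: two nearly parallel vectors and one pointing elsewhere
x = np.array([[1.0, 0.2, 0.0],
              [0.9, 0.25, 0.05],
              [-1.0, 0.0, 0.8]])
print(lsh_buckets(x))  # the first two vectors usually share a bucket
```

Reformer then sorts tokens by bucket and splits them into fixed-size chunks, with each chunk attending only to itself and its neighbor, which is how it avoids the quadratic cost of full attention.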

Our last step will be to discover powerful computer vision transformers such as ViT, CLIP, and DALL·E. We can add CLIP and DALL·E, alongside OpenAI's GPT-3 and Google's BERT, to the very small group of foundation models.

These powerful foundation models prove that transformers are task-agnostic. A transformer learns sequences, and those sequences can represent vision, sound, or any other type of data.

Images contain sequences of data, just like language. We will run ViT, CLIP, and DALL·E models to see this in action and take vision models to innovative levels.
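To make the idea of an image as a sequence concrete, the following sketch (a simplified illustration in the spirit of ViT patch embedding, not a full model) cuts an image into patches and flattens each patch into a token vector:

```python
import numpy as np

def image_to_patch_sequence(image, patch_size=16):
    """Split an image (H, W, C) into non-overlapping patches and flatten each
    one into a vector, producing a sequence of 'visual tokens' that a
    transformer encoder can process much like word embeddings."""
    h, w, c = image.shape
    patches = []
    for top in range(0, h - h % patch_size, patch_size):
        for left in range(0, w - w % patch_size, patch_size):
            patch = image[top:top + patch_size, left:left + patch_size]
            patches.append(patch.reshape(-1))  # flatten the patch to 1D
    return np.stack(patches)  # shape: (n_patches, patch_size * patch_size * c)

# A 224x224 RGB image becomes a sequence of 196 tokens of dimension 768
image = np.random.rand(224, 224, 3)
tokens = image_to_patch_sequence(image)
print(tokens.shape)  # (196, 768)
```

In ViT, these flattened patches are passed through a learned linear projection and combined with positional embeddings before entering a standard transformer encoder.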
