ALBERT
Learn about the ALBERT variant of BERT and the different techniques it uses to reduce the number of parameters.
We will start by learning how A Lite version of BERT (ALBERT) works. One of the challenges with BERT is its huge number of parameters: BERT-base alone consists of 110 million parameters, which makes it harder to train and gives it a high inference time. Increasing the model size generally gives us better results, but it is limited by the available computational resources. To combat this, ALBERT was introduced. ALBERT is a lite version of BERT with fewer parameters compared to BERT. It uses the following two techniques to reduce the number of parameters:
Cross-layer parameter sharing
Factorized embedding layer parameterization
By using the preceding two techniques, we can reduce the training time and inference time of the BERT model. First, let's understand how these two techniques work in detail, and then we will see how ALBERT is pre-trained.
Cross-layer parameter sharing
Cross-layer parameter sharing is an interesting method for reducing the number of parameters of the BERT model. We know that BERT consists of a stack of N encoder layers; for instance, BERT-base consists of 12 encoder layers.
The following figure shows the BERT model with its stack of N encoder layers:
We know that the structure of every encoder layer is identical; that is, each encoder consists of two sublayers: multi-head attention and a feedforward network. So, instead of learning separate parameters for every encoder, we can learn the parameters of only encoder 1 and share them with all the other encoders. This is known as cross-layer parameter sharing. We have several options for performing cross-layer parameter sharing, as listed here:
All-shared: In all-shared, we share the parameters of all the sublayers of the first encoder with all the sublayers of the other encoders.
Shared feedforward network: Here, we only share the parameters of the feedforward network of the first encoder layer with the feedforward network of the other encoder layers.
Shared attention: In this option, we only share the parameters of the multi-head attention of the first encoder layer with the multi-head attention of other encoder layers.
Note: By default, ALBERT uses the all-shared option; that is, the parameters of all the sublayers of the first encoder layer are shared with all the other encoder layers.
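To make the idea concrete, here is a minimal sketch of the all-shared option in PyTorch. It builds a single encoder layer and reuses it for every pass through the stack, and contrasts it with a BERT-style stack of independent layers. The class names and layer sizes are illustrative assumptions, not ALBERT's exact configuration.

```python
# A minimal sketch of cross-layer parameter sharing (the all-shared option).
# Sizes and class names are illustrative, not ALBERT's exact configuration.
import torch.nn as nn


class SharedEncoder(nn.Module):
    def __init__(self, hidden_size=768, num_heads=12, num_layers=12):
        super().__init__()
        # Only ONE encoder layer is created; its parameters (multi-head
        # attention + feedforward network) are reused at every layer.
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads, batch_first=True
        )
        self.num_layers = num_layers

    def forward(self, x):
        # Apply the same layer repeatedly: 12 passes, 1 set of parameters.
        for _ in range(self.num_layers):
            x = self.shared_layer(x)
        return x


class UnsharedEncoder(nn.Module):
    def __init__(self, hidden_size=768, num_heads=12, num_layers=12):
        super().__init__()
        # BERT-style stack: 12 independent layers, 12 sets of parameters.
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(
                d_model=hidden_size, nhead=num_heads, batch_first=True
            )
            for _ in range(num_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


def count_params(model):
    return sum(p.numel() for p in model.parameters())


print("shared  :", count_params(SharedEncoder()))    # parameters of 1 layer
print("unshared:", count_params(UnsharedEncoder()))  # roughly 12x more
```

Running the two parameter counts shows the effect directly: the shared encoder stores only one layer's worth of attention and feedforward weights, while the unshared stack stores twelve.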
Now that we have learned how the cross-layer parameter sharing technique works, let's look into another interesting parameter reduction technique.
Factorized embedding parameterization
In BERT, we use the WordPiece tokenizer and create WordPiece tokens. The embedding size of the WordPiece tokens is set to be the same as the hidden layer embedding size (representation size). The WordPiece embedding is a non-contextual representation, learned from the one-hot-encoded vectors of the vocabulary, whereas the hidden layer embedding is a contextual representation returned by the encoder.
Let's denote the vocabulary size as V.
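To see why factorizing the embedding matrix saves parameters, here is a minimal sketch in PyTorch. It compares an unfactorized V x H embedding with a factorized one that first embeds the vocabulary into a small WordPiece embedding size E and then projects it up to the hidden size H. The concrete values V = 30,000, H = 768, and E = 128, along with the class names, are illustrative assumptions rather than part of the original text.

```python
# A minimal sketch of factorized embedding parameterization.
# V, H, and E are illustrative values, not a full ALBERT configuration.
import torch.nn as nn

V, H, E = 30_000, 768, 128  # vocabulary, hidden, and WordPiece embedding sizes


class UnfactorizedEmbedding(nn.Module):
    """BERT-style: project the vocabulary straight to the hidden size H."""

    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(V, H)  # V x H parameters

    def forward(self, token_ids):
        return self.embedding(token_ids)


class FactorizedEmbedding(nn.Module):
    """ALBERT-style: embed into a small size E, then project E -> H."""

    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(V, E)  # V x E parameters
        self.projection = nn.Linear(E, H)    # E x H (plus H bias) parameters

    def forward(self, token_ids):
        return self.projection(self.embedding(token_ids))


def count_params(model):
    return sum(p.numel() for p in model.parameters())


print("V x H        :", count_params(UnfactorizedEmbedding()))  # 23,040,000
print("V x E + E x H:", count_params(FactorizedEmbedding()))    # ~3.9 million
```

With these illustrative values, the unfactorized embedding needs V x H = 30,000 x 768 = 23,040,000 parameters, whereas the factorized version needs only V x E + E x H = 30,000 x 128 + 128 x 768, which is roughly 3.9 million, because E can be kept much smaller than H.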