ALBERT
Learn about the ALBERT variant of BERT and the different techniques it uses to reduce the number of parameters.
We will start by learning how A Lite version of BERT (ALBERT) works. One of the challenges with BERT is its huge number of parameters: BERT-base alone consists of 110 million parameters, which makes it harder to train and gives it a high inference time. Increasing the model size generally gives us better results, but it is limited by the available computational resources. To combat this, ALBERT was introduced. ALBERT is a lite version of BERT with fewer parameters compared to BERT. It uses the following two techniques to reduce the number of parameters:
Cross-layer parameter sharing
Factorized embedding layer parameterization
By using the preceding two techniques, we can reduce the training time and inference time of the BERT model. First, let's understand how these two techniques work in detail, and then we will see how ALBERT is pre-trained.
Cross-layer parameter sharing
Cross-layer parameter sharing is an interesting method for reducing the number of parameters of the BERT model. We know that BERT consists of a stack of N encoder layers; for instance, BERT-base consists of 12 encoder layers.
The following figure shows the BERT model with its stack of N encoder layers:
We know that the structure of every encoder layer is identical; that is, each encoder consists of two sublayers: multi-head attention and a feedforward network. So, instead of learning separate parameters for every encoder, we can learn the parameters of only encoder 1 and share them with all the other encoders. This is known as cross-layer parameter sharing. We have several options for performing cross-layer parameter sharing, as listed here:
All-shared: In all-shared, we share the parameters of all the sublayers of the first encoder with all the sublayers of the other encoders.
Shared feedforward network: Here, we only share the parameters of the feedforward network of the first encoder layer with the feedforward network of the other encoder layers.
Shared attention: In this option, we only share the parameters of the multi-head attention of the first encoder layer with the multi-head attention of other encoder layers.
Note: By default, ALBERT uses the all-shared option; that is, the parameters of all the sublayers of the first encoder layer are shared with all the other encoder layers.
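To make the idea concrete, here is a minimal sketch of the all-shared option in PyTorch. It builds a single encoder layer and reuses it for every pass through the stack, and contrasts it with a BERT-style stack of independent layers. The class names and layer sizes are illustrative assumptions, not ALBERT's exact configuration.

```python
# A minimal sketch of cross-layer parameter sharing (the all-shared option).
# Sizes and class names are illustrative, not ALBERT's exact configuration.
import torch.nn as nn


class SharedEncoder(nn.Module):
    def __init__(self, hidden_size=768, num_heads=12, num_layers=12):
        super().__init__()
        # Only ONE encoder layer is created; its parameters (multi-head
        # attention + feedforward network) are reused at every layer.
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads, batch_first=True
        )
        self.num_layers = num_layers

    def forward(self, x):
        # Apply the same layer repeatedly: 12 passes, 1 set of parameters.
        for _ in range(self.num_layers):
            x = self.shared_layer(x)
        return x


class UnsharedEncoder(nn.Module):
    def __init__(self, hidden_size=768, num_heads=12, num_layers=12):
        super().__init__()
        # BERT-style stack: 12 independent layers, 12 sets of parameters.
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(
                d_model=hidden_size, nhead=num_heads, batch_first=True
            )
            for _ in range(num_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


def count_params(model):
    return sum(p.numel() for p in model.parameters())


print("shared  :", count_params(SharedEncoder()))    # parameters of 1 layer
print("unshared:", count_params(UnsharedEncoder()))  # roughly 12x more
```

Running the two parameter counts shows the effect directly: the shared encoder stores only one layer's worth of attention and feedforward weights, while the unshared stack stores twelve.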
Now that we have learned how the cross-layer parameter sharing technique works, let's look into another interesting parameter reduction technique.
Factorized embedding parameterization
In BERT, we use the WordPiece tokenizer and create WordPiece tokens. The embedding size of the WordPiece tokens is set to be the same as the hidden layer embedding size (representation size). The WordPiece embedding is a non-contextual representation, learned from the one-hot-encoded vectors of the vocabulary, whereas the hidden layer embedding is a contextual representation returned by the encoder.
Let's denote the vocabulary size as V.
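To see why factorizing the embedding matrix saves parameters, here is a minimal sketch in PyTorch. It compares an unfactorized V x H embedding with a factorized one that first embeds the vocabulary into a small WordPiece embedding size E and then projects it up to the hidden size H. The concrete values V = 30,000, H = 768, and E = 128, along with the class names, are illustrative assumptions rather than part of the original text.

```python
# A minimal sketch of factorized embedding parameterization.
# V, H, and E are illustrative values, not a full ALBERT configuration.
import torch.nn as nn

V, H, E = 30_000, 768, 128  # vocabulary, hidden, and WordPiece embedding sizes


class UnfactorizedEmbedding(nn.Module):
    """BERT-style: project the vocabulary straight to the hidden size H."""

    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(V, H)  # V x H parameters

    def forward(self, token_ids):
        return self.embedding(token_ids)


class FactorizedEmbedding(nn.Module):
    """ALBERT-style: embed into a small size E, then project E -> H."""

    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(V, E)  # V x E parameters
        self.projection = nn.Linear(E, H)    # E x H (plus H bias) parameters

    def forward(self, token_ids):
        return self.projection(self.embedding(token_ids))


def count_params(model):
    return sum(p.numel() for p in model.parameters())


print("V x H        :", count_params(UnfactorizedEmbedding()))  # 23,040,000
print("V x E + E x H:", count_params(FactorizedEmbedding()))    # ~3.9 million
```

With these illustrative values, the unfactorized embedding needs V x H = 30,000 x 768 = 23,040,000 parameters, whereas the factorized version needs only V x E + E x H = 30,000 x 128 + 128 x 768, which is roughly 3.9 million, because E can be kept much smaller than H.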