ALBERT

Learn about the ALBERT variant of BERT and the different techniques it uses to reduce the number of parameters.

We will start by learning how A Lite BERT (ALBERT) works. One of the challenges with BERT is that it consists of millions of parameters. BERT-base consists of 110 million parameters, which makes it harder to train and also gives it a high inference time. Increasing the model size improves results, but it also increases the demand on computational resources. To combat this, ALBERT was introduced. ALBERT is a lite version of BERT with fewer parameters. It uses the following two techniques to reduce the number of parameters:

  • Cross-layer parameter sharing

  • Factorized embedding layer parameterization

By using the preceding two techniques, we can reduce the training time and inference time of the BERT model. First, let's understand how these two techniques work in detail, and then we will see how ALBERT is pre-trained.
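As a quick preview of the second technique, here is a minimal sketch (the specific sizes are illustrative assumptions, with typical values of a ~30K-token vocabulary, a hidden size of 768 as in BERT-base, and a low-rank embedding size of 128 as used in ALBERT) comparing the parameter count of a full embedding matrix with a factorized one:

```python
# Minimal sketch (illustrative): parameter counts for the embedding layer
# with and without factorized embedding layer parameterization.

V = 30000   # vocabulary size (assumed; BERT uses a ~30K WordPiece vocabulary)
H = 768     # hidden size of BERT-base
E = 128     # low-rank embedding size (assumed; value used by ALBERT)

# BERT: one big matrix projecting the vocabulary directly to the hidden size.
bert_embedding = V * H

# ALBERT: factorize into two smaller matrices, V -> E and E -> H.
albert_embedding = V * E + E * H

print(f"BERT embedding:   {bert_embedding:,}")    # 23,040,000
print(f"ALBERT embedding: {albert_embedding:,}")  # 3,938,304
```

With these assumed sizes, the factorization shrinks the embedding parameters by roughly a factor of six.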

Cross-layer parameter sharing

Cross-layer parameter sharing is an interesting method for reducing the number of parameters of the BERT model. We know that BERT consists of N encoder layers. For instance, BERT-base consists of 12 encoder layers. During training, we learn the parameters of all the encoder layers. But with cross-layer parameter sharing, instead of learning the parameters of all the encoder layers, we only learn the parameters of the first encoder layer and share them with all the other encoder layers. Let's explore this in detail.
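To get a feel for the savings, here is a minimal sketch (not the actual ALBERT implementation; the per-layer count is a rough approximation that ignores biases and layer normalization) comparing the encoder parameter count with and without cross-layer parameter sharing:

```python
# Minimal sketch (illustrative): encoder parameter counts with and
# without cross-layer parameter sharing.

HIDDEN = 768   # hidden size of BERT-base
LAYERS = 12    # number of encoder layers (N) in BERT-base

def encoder_layer_params(hidden):
    """Rough per-layer parameter count for one Transformer encoder layer:
    four hidden x hidden projection matrices (query, key, value, output)
    plus a feed-forward block of hidden x 4*hidden and 4*hidden x hidden.
    Biases and layer-norm parameters are omitted for simplicity."""
    attention = 4 * hidden * hidden
    feed_forward = 2 * hidden * (4 * hidden)
    return attention + feed_forward

per_layer = encoder_layer_params(HIDDEN)

without_sharing = LAYERS * per_layer  # BERT: every layer has its own weights
with_sharing = per_layer              # ALBERT: one set of weights, reused N times

print(f"Per layer:          {per_layer:,}")        # 7,077,888
print(f"12 layers (BERT):   {without_sharing:,}")  # 84,934,656
print(f"Shared (ALBERT):    {with_sharing:,}")     # 7,077,888
```

Note that sharing the weights across all 12 layers divides the encoder's parameter count by N, although each of the N layers is still computed at inference time.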

The following figure shows the BERT model with N encoder layers; only the first encoder layer is expanded to reduce the clutter:
