ALBERT
Learn about the ALBERT variant of BERT and the different techniques it uses to reduce the number of parameters.
We will start by learning how A Lite BERT (ALBERT) works. One of the challenges with BERT is its sheer number of parameters: BERT-base alone has 110 million, which makes the model harder to train and slow at inference. Increasing the model size improves results, but it also strains computational resources. To address this, ALBERT was introduced as a lite version of BERT with far fewer parameters. It uses the following two techniques to reduce the parameter count:
Cross-layer parameter sharing
Factorized embedding layer parameterization
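As a rough illustration of the second technique, factorized embedding parameterization replaces BERT's single vocabulary-by-hidden-size embedding matrix with two smaller matrices: a vocabulary-by-E matrix followed by an E-by-hidden projection. The sketch below uses assumed, BERT-base-like sizes (a ~30,000-token WordPiece vocabulary, hidden size 768, and a smaller embedding size of 128) just to show the scale of the savings:

```python
# Toy parameter-count comparison (assumed sizes, not exact model configs).
V = 30_000   # vocabulary size (WordPiece, approximate)
H = 768      # hidden size in BERT-base
E = 128      # smaller embedding size used by ALBERT's factorization

bert_embedding_params = V * H            # one V x H embedding matrix
albert_embedding_params = V * E + E * H  # V x E matrix, then E x H projection

print(bert_embedding_params)    # 23,040,000
print(albert_embedding_params)  # 3,938,304
```

With these numbers the embedding layer shrinks by roughly a factor of six, because the large vocabulary dimension is multiplied by the small size E rather than the full hidden size H.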
Both of these techniques shrink the model, which in turn reduces the training and inference time of the BERT model. First, let's understand how these two techniques work in detail, and then we will see how ALBERT is pre-trained.
Cross-layer parameter sharing
Cross-layer parameter sharing is an interesting method for reducing the number of parameters of the BERT model. We know that BERT consists of a stack of encoder layers; BERT-base, for instance, has 12 of them. With cross-layer parameter sharing, instead of learning separate parameters for every encoder layer, ALBERT learns the parameters of only the first encoder layer and shares them with all the other layers.
The following figure shows the BERT model with its stacked encoder layers.