Summary: Fine-Tuning BERT Models

Get a quick recap of what we covered in this chapter.

BERT brings bidirectional attention to transformers. Training a model to predict a sequence from left to right while masking the future tokens has serious limitations: if the meaning we are looking for lies in the masked tokens ahead, the model cannot see it and will produce errors. BERT instead attends to all of the tokens of a sequence at the same time.
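As a quick illustration of this bidirectional behavior, the sketch below uses a masked-language-modeling pipeline. It assumes the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint, which are not necessarily the exact tools used in the chapter; the point is only that BERT's prediction for the masked slot draws on context from both sides.

```python
# Minimal sketch: BERT predicting a masked token from left AND right context.
# Assumes the Hugging Face `transformers` library is installed.
from transformers import pipeline

# Load a pretrained BERT masked-language model.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The best fit for [MASK] depends on tokens both before and after it.
predictions = fill_mask("The doctor prescribed a [MASK] for the infection.")

for pred in predictions:
    print(f"{pred['token_str']:>15}  score={pred['score']:.3f}")
```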

We explored the architecture of BERT, which uses only the encoder stack of the transformer. BERT was designed as a two-step framework: the first step is to pretrain a model, and the second step is to fine-tune it. We then fine-tuned a BERT model for an acceptability judgment downstream task. The fine-tuning went through every phase of the process: first, we loaded the dataset and the necessary pretrained modules of the model; then, we trained the model and measured its performance.
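The sketch below condenses those fine-tuning phases into one script. It is not the chapter's exact code: it assumes the Hugging Face `transformers` and `datasets` libraries, the `bert-base-uncased` checkpoint, and a tiny in-memory toy dataset standing in for the acceptability-judgment data, but the sequence of steps (load data, load pretrained modules, train, evaluate) mirrors the process described above.

```python
# Fine-tuning sketch under the assumptions stated above; the toy dataset is
# purely illustrative and stands in for real acceptability-judgment data.
import numpy as np
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# 1. Load the dataset (toy stand-in with binary acceptability labels).
data = Dataset.from_dict({
    "sentence": ["The cat sat on the mat.", "Cat the mat sat on the."],
    "label": [1, 0],  # 1 = acceptable, 0 = unacceptable
}).train_test_split(test_size=0.5)

# 2. Load the pretrained modules: tokenizer and encoder with a classification head.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True,
                     padding="max_length", max_length=64)

data = data.map(tokenize, batched=True)

# 3. Train the model.
def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-finetuned",
                           num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=data["train"],
    eval_dataset=data["test"],
    compute_metrics=accuracy,
)
trainer.train()

# 4. Measure performance on the held-out split.
print(trainer.evaluate())
```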
