Preparing the Pretraining Input Environment

Learn how to prepare the pretraining input environment for BERT models.

The BERT model has no decoder stack of layers, so it has no masked multi-head attention sublayer. BERT's designers state that a masked multi-head attention layer, which hides the rest of the sequence from each position, impedes the attention process.
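To make this architectural point concrete, the sketch below loads a pretrained BERT checkpoint with the Hugging Face transformers library and inspects its configuration and first encoder layer. The library and the bert-base-uncased checkpoint are assumptions made here for illustration; the lesson itself does not prescribe them.

```python
# Illustrative sketch: assumes the Hugging Face `transformers` package is installed
# and that the public `bert-base-uncased` checkpoint can be downloaded.
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# BERT is an encoder-only stack: it is not configured as a decoder, and each
# layer contains a single self-attention sublayer (no masked multi-head
# attention, no cross-attention) followed by a feed-forward block.
print(model.config.is_decoder)   # False
print(model.encoder.layer[0])    # BertLayer(attention, intermediate, output)
```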

A masked multi-head attention layer masks all of the tokens beyond the present position, so each position can attend only to itself and to the tokens that precede it.

For example, take the following sentence:

"The cat sat on it because it was a nice rug."

If the model has just reached the word "it," it can only see "The cat sat on it" because the rest of the sequence is masked. Yet to work out what "it" refers to, the model needs to read the whole sentence and reach the word "rug." BERT's unmasked attention lets every position see the entire sequence.
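The following sketch shows what a look-ahead mask would hide at the position of the first "it." It uses plain NumPy and naive whitespace tokenization rather than BERT's WordPiece tokenizer, purely to illustrate the masking pattern.

```python
# Illustrative sketch of a look-ahead (causal) mask; whitespace tokenization
# is a simplification and not what BERT actually uses.
import numpy as np

tokens = "The cat sat on it because it was a nice rug .".split()
n = len(tokens)

# Causal mask: position i may attend only to positions <= i.
causal_mask = np.tril(np.ones((n, n), dtype=bool))

pos = tokens.index("it")  # first occurrence of "it"
visible = [t for t, ok in zip(tokens, causal_mask[pos]) if ok]
print(visible)  # ['The', 'cat', 'sat', 'on', 'it'] -- "rug" is masked out

# BERT's encoder applies no look-ahead mask: every position sees every token,
# so "it" can be related to "rug" on its right as well as to "cat" on its left.
bidirectional_mask = np.ones((n, n), dtype=bool)
print(bidirectional_mask[pos].all())  # True: the whole sentence is visible
```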
