Extracting Embeddings From All Encoder Layers of BERT

So far, we've extracted the embeddings from the final encoder layer of the pre-trained model. Now the question is: should we consider only the embeddings obtained from the final encoder layer (the final hidden state), or should we also consider the embeddings obtained from all the encoder layers (all hidden states)? Let's explore this.

Let's represent the input embedding layer as $h_0$, the first encoder layer (first hidden layer) as $h_1$, the second encoder layer (second hidden layer) as $h_2$, and so on, up to the final (twelfth) encoder layer, $h_{12}$, as shown in the following figure:
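
To make this concrete, here's a minimal sketch of how to obtain the hidden states from every layer. This assumes the Hugging Face `transformers` library with a PyTorch backend, and the input sentence is just an illustrative example:

```python
import torch
from transformers import BertModel, BertTokenizer

# Load the tokenizer and the pre-trained BERT-base model.
# Setting output_hidden_states=True makes the model return the
# embeddings from all encoder layers, not just the final one.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states=True)

inputs = tokenizer('I love Paris', return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)

# outputs.hidden_states is a tuple of 13 tensors:
#   hidden_states[0]  -> h_0, the input embedding layer
#   hidden_states[1]  -> h_1, the first encoder layer
#   ...
#   hidden_states[12] -> h_12, the final encoder layer
hidden_states = outputs.hidden_states
print(len(hidden_states))       # 13
print(hidden_states[0].shape)   # [batch_size, sequence_length, 768]
```

Each tensor in the tuple has the shape `[batch_size, sequence_length, 768]`, so every token gets a 768-dimensional representation from each of the 13 layers.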
