Transformers

Learn about Hugging Face's transformers library and how to generate and extract sentence and token embeddings with it.

Hugging Face's transformers

Hugging Face is an organization on a mission to democratize AI through natural language. Its open-source transformers library is very popular in the natural language processing (NLP) community and is useful and powerful for many NLP and natural language understanding (NLU) tasks. It includes thousands of pre-trained models in more than 100 languages. One of the library's many advantages is that it is compatible with both PyTorch and TensorFlow.

We can install transformers directly using pip as shown here:

pip install transformers==4.30.0

As we can see, we use transformers version 4.30.0. Now that we have installed transformers, let's get started.
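Once installed, we can confirm from Python that the expected version is in place (a quick check; any transformers release exposes its version string this way):

```python
import transformers

# Print the installed library version, e.g. '4.30.0'
print(transformers.__version__)
```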

Generating BERT embeddings

Consider the sentence 'I love Paris'. Let's see how to obtain the contextualized word embedding of all the words in the sentence using the pre-trained BERT model with Hugging Face's transformers library.

Import the modules

Let's import the necessary modules:

from transformers import BertModel, BertTokenizer
import torch

Download and load the pre-trained model

We download the pre-trained BERT model. We can browse all the available pre-trained BERT models on the Hugging Face Model Hub. We use the bert-base-uncased model. As the name suggests, it is the BERT-base model with 12 encoder layers, and it was trained on uncased (lowercased) text. Since we are using BERT-base, the representation size will be 768.

Download and load the pre-trained bert-base-uncased model:

model = BertModel.from_pretrained('bert-base-uncased')
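As a quick sanity check on the architecture described above, the default BertConfig in transformers corresponds to BERT-base, so we can confirm the 12 encoder layers and the 768-dimensional representation size without downloading any weights (a minimal sketch using the config attributes num_hidden_layers and hidden_size):

```python
from transformers import BertConfig

# The default BertConfig matches the BERT-base architecture:
# 12 encoder layers and 768-dimensional hidden representations
config = BertConfig()

print(config.num_hidden_layers)  # 12
print(config.hidden_size)        # 768
```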

Download and load the tokenizer

We download and load the tokenizer that was used to pre-train the bert-base-uncased model:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

Now, let's see how to preprocess the input before feeding it to BERT.

Preprocessing the input

...