Use Case: Implementing BERT
Learn to implement BERT to answer questions.
To use a pretrained transformer model from the Hugging Face repository, we need three components:
Tokenizer
: Responsible for splitting a long bit of text (such as a sentence) into smaller tokens.

config
: Contains the configuration of the model.

Model
: Takes in the tokens, looks up the embeddings, and produces the final outputs using the provided inputs.
We could ignore the config because we're using the pretrained model as is. However, to show all aspects of the process, let's download the configuration as well.
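As a minimal sketch (assuming the same bert-base-uncased checkpoint used throughout this section), the configuration can be fetched with the same from_pretrained() pattern:

```python
from transformers import BertConfig

# Download (or load from the local cache) the model's configuration
config = BertConfig.from_pretrained('bert-base-uncased')

# A few of the BERT base hyperparameters stored in the config
print(config.hidden_size)          # 768
print(config.num_hidden_layers)    # 12
print(config.num_attention_heads)  # 12
```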
Implementing and using the tokenizer
First, we'll look at how to download the tokenizer. We can do this using the transformers library. Simply call the from_pretrained() function provided by the PreTrainedTokenizerFast base class:
```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
```
We'll be using a tokenizer called bert-base-uncased. It's the tokenizer developed for the BERT base model and is uncased (that is, there's no distinction between uppercase and lowercase characters).
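We can see the uncased behavior for ourselves with a quick sketch; the exact tokens shown are what the bert-base-uncased vocabulary happens to produce:

```python
# Mixed-case input is lowercased before being split into tokens
print(tokenizer.tokenize("Hello WORLD"))  # ['hello', 'world']
```

Next, let's see the tokenizer in action on a multipart input: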
```python
context = "This is the context"
question = "This is the question"

token_ids = tokenizer(text=context, text_pair=question, padding=False, return_tensors='tf')
print(token_ids)
```
Let’s look at the arguments we’ve provided to the tokenizer’s call:
text
: A single text sequence or a batch of text sequences to be encoded by the tokenizer. Each text sequence is a string.

text_pair
: An optional single text sequence or batch of text sequences to be encoded by the tokenizer. It's useful in situations where the model takes a multipart input (such as a question and a context in question answering).

padding
: Indicates the padding strategy. If set to True, each sequence is padded to the length of the longest sequence in the batch. If set to max_length, it's padded to the length specified by the max_length argument. If set to False, no padding is done.

return_tensors
: An argument that defines the type of the returned tensors: 'tf' for TensorFlow tensors, 'pt' for PyTorch tensors, and 'np' for NumPy arrays. We pass 'tf' since we're working with TensorFlow.
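To make the output concrete, here's a sketch of what inspecting the result might look like; the exact token IDs depend on the bert-base-uncased vocabulary:

```python
# The tokenizer returns a dict-like object with three TensorFlow tensors:
#   input_ids      - token IDs, arranged as [CLS] context [SEP] question [SEP]
#   token_type_ids - 0 for tokens of the first sequence (context), 1 for the second (question)
#   attention_mask - 1 for real tokens, 0 for padding (no padding here, so all 1s)
print(token_ids.keys())  # dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

# Decoding the IDs back to text reveals the special tokens the tokenizer inserted
print(tokenizer.decode(token_ids['input_ids'][0]))
# [CLS] this is the context [SEP] this is the question [SEP]
```

Note how the tokenizer automatically adds the [CLS] and [SEP] special tokens and marks the context and question with different token_type_ids, which is exactly the multipart input format BERT expects for question answering.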