Introduction: Matching Tokenizers and Datasets
Get an overview of what we will cover in this chapter.
When studying transformer models, we tend to focus on the models’ architecture and the datasets provided to train them. We have explored the original transformer, fine-tuned a BERT-like model, trained a RoBERTa model, explored a GPT-3 model, trained a GPT-2 model, implemented a T5 model, and more. We have also gone through the main benchmark tasks and datasets.
We trained a RoBERTa tokenizer and used tokenizers to encode data. However, we did not explore the limits of tokenizers to evaluate how well they fit the models we build. AI is data-driven.
Raffel et al. (2019), like all the authors cited in this course, spent time preparing datasets for transformer models.
Chapter overview
In this chapter, we will go through some of the limits of tokenizers that hinder the quality of downstream transformer tasks. Do not take pretrained tokenizers at face value. We might work with a specific vocabulary (advanced medical terminology, for example) containing words that a generic pretrained tokenizer does not handle well.
We will start by introducing some tokenizer-agnostic best practices to measure the quality of a tokenizer. We will describe basic guidelines for datasets and tokenizers from a tokenization perspective.
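As a concrete illustration of such a guideline, the short sketch below checks how badly a generic pretrained tokenizer fragments domain-specific words. The word list and the tokenizer choice (GPT-2 via Hugging Face transformers) are illustrative assumptions, not the chapter's exact procedure.

```python
# A minimal, tokenizer-agnostic fitness check (illustrative word list and
# tokenizer choice; not the chapter's exact procedure).
from transformers import AutoTokenizer

# A generic pretrained tokenizer we want to evaluate against our domain vocabulary.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Hypothetical domain-specific (e.g., medical) words plus a common control word.
domain_words = ["amoxicillin", "electrocardiogram", "chloroquine", "house"]

for word in domain_words:
    pieces = tokenizer.tokenize(word)
    # A word shattered into many subword pieces is a sign that the tokenizer's
    # vocabulary does not fit this domain well.
    print(f"{word!r} -> {len(pieces)} pieces: {pieces}")
```

The more pieces a domain word is broken into, the weaker the fit between the pretrained vocabulary and our dataset.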
Then, we will use a Word2Vec tokenizer to illustrate the problems we face with any tokenizing method. The limits will be demonstrated with a Python program.
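To preview the kind of problem that program exposes, here is a minimal sketch of the out-of-vocabulary issue with word-level representations, assuming gensim 4.x and a made-up toy corpus:

```python
# A small sketch of the word2vec out-of-vocabulary problem
# (assumes gensim 4.x; the toy corpus is invented for illustration).
from gensim.models import Word2Vec

corpus = [
    ["the", "patient", "received", "treatment"],
    ["the", "treatment", "was", "effective"],
]

# Train a tiny word2vec model on the toy corpus.
model = Word2Vec(corpus, vector_size=32, min_count=1, seed=1)

# Works: both words were seen during training.
print(model.wv.similarity("patient", "treatment"))

try:
    model.wv["amoxicillin"]  # never seen during training
except KeyError:
    # Unlike subword tokenizers, word-level vocabularies simply fail
    # on words that were absent from the training data.
    print("'amoxicillin' is out of vocabulary")
```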
We will continue our investigation by running a GPT-2 model on a dataset containing specific vocabulary with unconditional and conditional samples.
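The sketch below shows one way to produce conditional and (approximately) unconditional samples with GPT-2 through Hugging Face transformers; the prompt and sampling parameters are assumptions for illustration, not the chapter's exact setup.

```python
# A sketch of conditional vs. (approximately) unconditional GPT-2 sampling
# with Hugging Face transformers; the prompt is an illustrative assumption.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Conditional sample: generation continues a domain-specific prompt.
prompt = "The electrocardiogram showed"
inputs = tokenizer(prompt, return_tensors="pt")
conditional = model.generate(**inputs, max_length=30, do_sample=True, top_k=40)
print(tokenizer.decode(conditional[0], skip_special_tokens=True))

# Unconditional sample: start from the beginning-of-sequence token only.
bos = torch.tensor([[tokenizer.bos_token_id]])
unconditional = model.generate(bos, max_length=30, do_sample=True, top_k=40)
print(tokenizer.decode(unconditional[0], skip_special_tokens=True))
```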
We will go further and examine the limits of byte-level BPE methods. We will build a Python program that displays the results produced by a GPT-2 tokenizer and go through the problems that occur during the data encoding process. This will show that GPT-3's superiority is not always necessary for common NLP analysis.
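As a preview, the following sketch prints the byte-level BPE pieces that GPT-2's tokenizer produces for a sentence containing a rare word; the sentence is made up for illustration.

```python
# A sketch of how GPT-2's byte-level BPE encodes text
# (illustrative sentence; the chapter's probe data differs).
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

text = "The chloroquine dosage was adjusted."
ids = tokenizer.encode(text)

# Byte-level BPE never produces an unknown token, but rare words are split
# into several pieces, and leading spaces surface as the 'Ġ' marker.
for token_id in ids:
    print(token_id, repr(tokenizer.convert_ids_to_tokens(token_id)))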
However, at the end of the chapter, we will probe a GPT-3 engine with a Part-of-Speech (POS) task to see how much the model understands and if a ready-to-use tokenized dictionary fits our needs.
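A minimal sketch of such a probe is shown below, assuming the legacy (pre-1.0) openai Python package and a historical GPT-3 completion engine name; the prompt, engine name, and API key are placeholders, not the chapter's exact settings.

```python
# A sketch of probing a GPT-3 engine with a Part-of-Speech question
# (assumes the legacy pre-1.0 `openai` package; engine name, prompt,
# and API key are placeholders).
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

prompt = (
    "Tag each word of the sentence with its part of speech:\n"
    "Sentence: The patient received amoxicillin yesterday.\n"
    "Tags:"
)

response = openai.Completion.create(
    engine="davinci",   # historical GPT-3 completion engine name
    prompt=prompt,
    max_tokens=40,
    temperature=0,      # deterministic output for a probing task
)

print(response["choices"][0]["text"])
```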
This chapter covers the following topics:

- Basic, tokenizer-agnostic guidelines for measuring the quality of a tokenizer and preparing datasets
- The problems and limits of Word2Vec tokenization, illustrated in Python
- Running a GPT-2 model on specific vocabulary with unconditional and conditional samples
- The limits of byte-level BPE, shown through a GPT-2 tokenizer's encoding output
- Probing a GPT-3 engine with a Part-of-Speech (POS) task