Best Practices

Learn about the best practices involved when dealing with datasets and their relevant tokenizers.

Our first step will be to explore the text-to-text methodology defined by Raffel et al. (2019).

Introduction

Downloading benchmark datasets to train transformers has many advantages. The data has been prepared, and every research lab uses the same references. Also, the performance of a transformer model can be compared to another model with the same data.

However, more needs to be done to improve the performance of transformers. Furthermore, implementing a transformer model in production requires careful planning and defining of best practices.

In this lesson, we will define some best practices to avoid critical stumbling blocks. Then we will go through a few examples in Python using cosine similarity to measure the limits of tokenization and encoding datasets.

Raffel et al. (2019) defined a standard text-to-text T5 transformer model. They also went further. They began destroying the myth of using raw data without preprocessing it first. Preprocessing data reduces training time. Common Crawl, for example, contains unlabeled text obtained through web extraction. Non-text and markup have been removed from the dataset.

However, the Google T5 team found that much of the text obtained through Common Crawl did not reach the level of natural language or English. So they decided that datasets need to be cleaned before using them.

We will take the recommendations Raffel et al. (2019) made and apply corporate quality control best practices to the preprocessing and quality control phases. Among many other rules to apply, the examples described show the tremendous work required to obtain acceptable real-life project datasets.

The figure below lists some of the key quality control processes to apply to datasets:

Get hands-on with 1400+ tech skills courses.