We will build KantaiBERT from scratch in 14 steps and run it on a masked language modeling (MLM) example.

The titles of the 14 steps of this section are similar to the titles of the notebook cells, which makes them easy to follow.

In this lesson, we will discuss the first seven steps in detail.

Let’s start by loading the dataset.

Step 1: Loading the dataset

Ready-to-use datasets provide an objective way to train and compare transformers. We will explore several datasets later. This chapter aims to help you understand the training process of a transformer, with notebook cells that run in real time rather than taking hours to produce a result.

We'll use the works of Immanuel Kant (1724-1804), the German philosopher who was the epitome of the Age of Enlightenment. The idea is to introduce human-like logic and pretrained reasoning for downstream reasoning tasks.

Project Gutenberg (https://www.gutenberg.org) offers a wide range of free eBooks that can be downloaded in text format. You can use other books if you want to create customized datasets of your own.

The following three books by Immanuel Kant have been compiled into a text file named kant.txt (see the sketch after this list for one way to assemble such a file):

  • The Critique of Pure Reason

  • The Critique of Practical Reason

  • Fundamental Principles of the Metaphysic of Morals
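
If you want to rebuild kant.txt yourself, one approach is to download the plain-text versions of these books from Project Gutenberg and concatenate them. The sketch below is only an illustration: the filenames are hypothetical placeholders for whatever you actually downloaded, and the Project Gutenberg license headers and footers would still need to be trimmed before training.

```python
from pathlib import Path

# Hypothetical local copies of the Project Gutenberg plain-text downloads.
# Adjust the filenames to match the files you saved.
source_files = [
    "critique_of_pure_reason.txt",
    "critique_of_practical_reason.txt",
    "metaphysic_of_morals.txt",
]

# Concatenate the raw texts into a single training file named kant.txt.
with open("kant.txt", "w", encoding="utf-8") as outfile:
    for name in source_files:
        text = Path(name).read_text(encoding="utf-8")
        outfile.write(text + "\n")

print(f"kant.txt size: {Path('kant.txt').stat().st_size} bytes")
```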

The kant.txt file provides a small dataset for training the transformer model of this section. The results obtained remain experimental. For a real-life project, we might add the complete works of Immanuel Kant, René Descartes, Pascal, and Leibniz, for example.

The text file contains the raw text of the books.
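
Before training, it is worth checking that the file loads correctly. Here is a minimal sketch, assuming kant.txt sits in the current working directory:

```python
# Quick sanity check: read the compiled dataset and preview its contents.
with open("kant.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print(f"Characters: {len(raw_text):,}")
print(raw_text[:300])  # preview the first few hundred characters of raw text
```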
