The Cross-Lingual Language Model (XLM)
Learn about the XLM model, including its training dataset, different pre-training strategies, and the process for both pre-training and evaluation.
The M-BERT model is pre-trained just like the regular BERT model, without any specific cross-lingual objective. In this lesson, let's learn how to pre-train BERT with a cross-lingual objective. We refer to BERT trained with a cross-lingual objective as a cross-lingual language model (XLM). Because it learns cross-lingual representations, the XLM model performs better than M-BERT.
Training dataset
The XLM model is pre-trained using both monolingual and parallel datasets. A parallel dataset consists of text in a language pair; that is, it contains the same text in two different languages. Say we have an English sentence; we will also have the corresponding sentence in another language, French, for example. We can call this parallel dataset a cross-lingual dataset.
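To make this concrete, here is a minimal sketch of how a parallel dataset might be represented in code. The English-French sentence pairs below are made-up examples, not taken from any actual XLM training corpus.

```python
# A parallel (cross-lingual) dataset pairs each sentence with its translation.
# These English-French pairs are illustrative examples only.
parallel_dataset = [
    ("How are you?", "Comment allez-vous ?"),
    ("I love reading books.", "J'adore lire des livres."),
    ("The weather is nice today.", "Il fait beau aujourd'hui."),
]

for english, french in parallel_dataset:
    print(f"EN: {english}  |  FR: {french}")
```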
The monolingual dataset is obtained from Wikipedia, and the parallel dataset is obtained from several sources, including MultiUN (a multilingual corpus of United Nations documents), OPUS (the open parallel corpus), and the IIT Bombay corpus. XLM uses byte pair encoding (BPE) and creates a shared vocabulary across all languages.
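The following sketch illustrates the idea of a single BPE vocabulary learned jointly over several languages, using the Hugging Face tokenizers library. The corpus, vocabulary size, and special tokens here are assumptions chosen for illustration; this is not the exact tooling or data used to pre-train XLM.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Tiny made-up corpus mixing English and French sentences.
corpus = [
    "the cat sat on the mat",
    "le chat est assis sur le tapis",
    "machine translation needs parallel data",
    "la traduction automatique a besoin de données parallèles",
]

# Build one BPE tokenizer whose vocabulary is learned from both languages together.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=200,  # deliberately tiny; chosen only for this example
    special_tokens=["[UNK]", "<s>", "</s>", "<pad>", "<mask>"],
)
tokenizer.train_from_iterator(corpus, trainer)

# The same tokenizer (and the same vocabulary) now encodes text in either language.
print(tokenizer.encode("le chat sat on the mat").tokens)
```

Sharing a single subword vocabulary helps align the representation spaces of different languages, since tokens that appear in several languages (digits, proper nouns, common subwords) map to the same embeddings.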
Pre-training strategies
The XLM model is pre-trained using the following tasks:
Causal language modeling
Masked language modeling
Translation language modeling
Let's take a look at how each of the preceding tasks works.
Causal language modeling
Causal language modeling (CLM) is the simplest pre-training method. In CLM, the goal of our model is to predict the probability of a word given the set of words that precede it. It is represented as