Learning Multilingual Embeddings Through Knowledge Distillation

Learn how to apply Sentence-BERT to languages other than English using the teacher-student architecture.

Let's understand how to make monolingual sentence embeddings multilingual through knowledge distillation. We learned how M-BERT, XLM, and XLM-R work and how they produce representations for different languages. In all these models, the vector spaces of different languages are not aligned. That is, the representation of the same sentence in different languages will be mapped to different locations in the vector space. Now, we will see how to map similar sentences in different languages to the same location in the vector space.

We learned how Sentence-BERT works and how it generates the representation of a sentence. But how do we use Sentence-BERT for languages other than English?

Sentence-BERT for other languages

We can apply Sentence-BERT to different languages by making the monolingual sentence embeddings generated by Sentence-BERT multilingual through knowledge distillation. To do this, we transfer the knowledge of Sentence-BERT to a multilingual model, say, XLM-R, and make the multilingual model generate embeddings just like the pre-trained Sentence-BERT. Let's explore this in more detail.

The XLM-R model generates embeddings for 100 different languages. Now, we take the pre-trained XLM-R model and teach it to generate sentence embeddings for different languages just like Sentence-BERT. We use the pre-trained Sentence-BERT as the teacher and the pre-trained XLM-R as the student model.
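To make this concrete, here is a minimal sketch of how the two models could be set up with the sentence-transformers library. The model names are only illustrative choices; any pre-trained Sentence-BERT teacher and pre-trained multilingual student would work the same way:

```python
from sentence_transformers import SentenceTransformer, models

# Teacher: a pre-trained English Sentence-BERT model (name is illustrative)
teacher = SentenceTransformer('bert-base-nli-stsb-mean-tokens')

# Student: pre-trained XLM-R with a mean-pooling layer on top, so that it
# outputs one fixed-size sentence vector, just like the teacher
xlmr = models.Transformer('xlm-roberta-base')
pooling = models.Pooling(xlmr.get_word_embedding_dimension(),
                         pooling_mode_mean_tokens=True)
student = SentenceTransformer(modules=[xlmr, pooling])
```

With these particular checkpoints, both models produce vectors of the same size, so the teacher's and student's embeddings can be compared directly; a student with a different embedding size would need an extra projection layer.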

Say we have a source sentence in English and the corresponding target sentence in French: [How are you, Comment ça va]. First, we feed the source sentence to the teacher (Sentence-BERT) and get the sentence representation. Next, we feed both the source and target sentences to the student (XLM-R) and get the sentence representations, as shown in the following figure:

Teacher-student architecture
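Continuing the sketch above, we can feed this example pair through the two models and collect the three representations shown in the figure:

```python
source = "How are you"
target = "Comment ça va"   # French translation of the source sentence

# Source sentence representation from the teacher
teacher_source = teacher.encode(source, convert_to_tensor=True)

# Source and target sentence representations from the student
student_source = student.encode(source, convert_to_tensor=True)
student_target = student.encode(target, convert_to_tensor=True)
```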

Now we have sentence representations generated by the teacher and student models. We can observe that the source sentence representations generated by the teacher and the student are different. We need to teach our student model (XLM-R) to generate representations similar to the teacher model. In order to do that, we compute the mean squared difference between the source sentence representation generated by the teacher and the source sentence representation generated by the student. Then, we train the student network to minimize the mean squared error (MSE).

As shown in the following figure, in order to make our student generate representations the same as the teacher, we compute the MSE between the source sentence representation returned by the teacher and the source sentence representation returned by the student:

Computing the MSE loss between the source sentence representation by teacher and student
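Using the representations from the previous snippet, this first term of the loss could be computed as follows (a sketch for illustration only):

```python
import torch.nn.functional as F

# MSE between the teacher's and the student's representation
# of the same (source) sentence
mse_source = F.mse_loss(student_source, teacher_source)
```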

We also need to compute the MSE between the source sentence representation returned by the teacher and the target sentence representation returned by the student. But why? The reason is that the target French sentence is the equivalent of the source English sentence. So, we need our target sentence representation to be the same as the source sentence representation returned by the teacher. So, we compute the MSE between the source sentence representation returned by the teacher and the target sentence representation returned by the student, as shown in the following figure:

Computing the MSE loss between the source sentence representation by teacher and the target sentence representation by student
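The second term is computed in the same way, and the two terms are summed to form the training objective:

```python
# MSE between the teacher's source representation and the
# student's representation of the target (French) sentence
mse_target = F.mse_loss(student_target, teacher_source)

# Overall distillation objective: the student has to map both the source
# and the target sentence onto the teacher's source embedding
loss = mse_source + mse_target
```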

After computing the MSE, we train our student network by minimizing it. By minimizing the MSE, the student network learns to generate embeddings that match those of the teacher network. In this way, we make our student (XLM-R) generate multilingual embeddings in the same way that the teacher (Sentence-BERT) generates monolingual embeddings.
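In practice, we don't have to write this training loop by hand: the sentence-transformers library ships utilities that implement exactly this objective. The sketch below uses its classic fit-based training API; the file name is hypothetical and is assumed to hold tab-separated source/target sentence pairs. Note that the encode() calls in the earlier snippets run without gradient tracking, so they only illustrate the loss, whereas fit() runs the student with gradients enabled:

```python
from torch.utils.data import DataLoader
from sentence_transformers import losses
from sentence_transformers.datasets import ParallelSentencesDataset

# Hypothetical file of tab-separated "source<TAB>target" sentence pairs
train_data = ParallelSentencesDataset(student_model=student, teacher_model=teacher)
train_data.load_data('parallel-sentences.tsv')
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=32)

# MSELoss trains the student to reproduce the teacher embeddings that
# ParallelSentencesDataset provides for both source and target sentences
train_loss = losses.MSELoss(model=student)

student.fit(train_objectives=[(train_dataloader, train_loss)],
            epochs=1, warmup_steps=100)
```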

Teacher-student architecture

Let's suppose we have parallel translated source-target sentence pairs $[(s_1, t_1), (s_2, t_2), \ldots, (s_i, t_i), \ldots, (s_n, t_n)]$ ...