...


Understanding Neural Machine Translation


Learn the workings of neural machine translation.

Now that we have an appreciation for how MT has evolved over time, let’s try to understand how state-of-the-art NMT works. First, we’ll take a look at the model architecture used by neural machine translators and then move on to understanding the actual training algorithm.

Intuition behind NMT systems

First, let’s understand the intuition underlying an NMT system’s design. Say we’re fluent in both English and German and are asked to translate the following sentence into German:

I went home.

This sentence translates to the following:

Ich ging nach Hause.

Although a fluent speaker might need only a few seconds to translate this, a certain process produces the translation. First, we read the English sentence; then, we form a thought or concept in our mind about what the sentence represents or implies; and finally, we translate the sentence into German. The same idea is used for building NMT systems (see the figure below). The encoder reads the source sentence (similar to reading the English sentence). Then, the encoder outputs a context vector (which corresponds to the thought or concept we formed after reading the sentence). Finally, the decoder takes in the context vector and outputs the translation in German:

Conceptual architecture of an NMT system
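This read–summarize–generate flow can be sketched as two functions, where `encode` and `decode` stand in for the encoder and decoder networks described above. The names and the toy "context vector" below are purely illustrative, a minimal sketch of the pipeline rather than a real model:

```python
def encode(source_sentence):
    """Stand-in for the encoder: reads the source sentence and
    summarizes it into a single context ("thought") vector."""
    # A real encoder would be a neural network; here we return a
    # placeholder summary so the pipeline runs end to end.
    return {"summary_of": source_sentence}

def decode(context_vector):
    """Stand-in for the decoder: generates the target-language
    sentence word by word from the context vector."""
    return f"<German translation of: {context_vector['summary_of']}>"

def translate(source_sentence):
    # The three conceptual steps: read -> form a thought -> generate.
    return decode(encode(source_sentence))

print(translate("I went home."))
```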

NMT architecture

Now, we’ll look at the architecture in more detail. The sequence-to-sequence approach was originally proposed by Sutskever, Vinyals, and Le in their paper “Sequence to Sequence Learning with Neural Networks,” Proceedings of the 27th International Conference on Neural Information Processing Systems, Volume 2, pp. 3104–3112.

From the diagram in the figure above, we can see that there are two major components in the NMT architecture. These are called the encoder and decoder. In other words, NMT can be seen as an encoder-decoder architecture. The encoder converts a sentence from a given source language into a thought vector (i.e., a contextualized representation), and the decoder decodes or translates the thought into a target language.

As we can see, this shares some features with the interlingual machine translation method we briefly talked about. This is illustrated in the figure below. The part to the left of the context vector denotes the encoder, which consumes the source sentence word by word as a time-series model. The part to the right denotes the decoder, which outputs the corresponding translation of the source sentence word by word, using the previously generated word as the current input. We’ll also use embedding layers (for both the source and target languages) where the semantics of the individual tokens will be learned and fed as inputs to the models:

Unrolling the source and target sentences over time
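As a rough illustration of this architecture, the snippet below wires up an encoder and a decoder with embedding layers in TensorFlow/Keras. The vocabulary sizes, layer dimensions, and the choice of LSTM cells are assumptions made for the sketch, not details specified in this lesson:

```python
import tensorflow as tf

# Illustrative sizes (assumed, not from the lesson).
src_vocab, tgt_vocab = 8000, 8000
embed_dim, hidden_dim = 128, 256

# Encoder: embeds the source tokens and summarizes them into a context (thought) vector.
encoder_tokens = tf.keras.Input(shape=(None,), dtype="int32", name="source_ids")
enc_embedded = tf.keras.layers.Embedding(src_vocab, embed_dim)(encoder_tokens)
_, state_h, state_c = tf.keras.layers.LSTM(hidden_dim, return_state=True)(enc_embedded)
context_vector = [state_h, state_c]  # the encoder's summary of the source sentence

# Decoder: starts from the context vector and predicts the translation word by word,
# receiving the previous target word as its current input during training.
decoder_tokens = tf.keras.Input(shape=(None,), dtype="int32", name="target_ids")
dec_embedded = tf.keras.layers.Embedding(tgt_vocab, embed_dim)(decoder_tokens)
dec_outputs = tf.keras.layers.LSTM(hidden_dim, return_sequences=True)(
    dec_embedded, initial_state=context_vector)
logits = tf.keras.layers.Dense(tgt_vocab)(dec_outputs)  # scores over the target vocabulary

model = tf.keras.Model([encoder_tokens, decoder_tokens], logits)
model.summary()
```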

With a basic understanding of what NMT looks like, let’s formally define the objective of NMT. The ultimate objective of an NMT system is to maximize the log likelihood of a target sentence $y_t$ given its corresponding source sentence $x_s$. That is, to maximize the following:

$$\frac{1}{N}\sum_{i=1}^{N} \log P\left(y_t^{(i)} \mid x_s^{(i)}\right)$$

Here, $N$ refers to the number of source and target sentence pairs we have as training data.
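To make the objective concrete, here is one way the training loss could be computed in TensorFlow: the cross-entropy of the correct target token at each position is exactly $-\log P(y_t \mid x_s)$, so minimizing its mean maximizes the average log likelihood above. The function and tensor names are illustrative assumptions, not part of the lesson:

```python
import tensorflow as tf

def nmt_loss(target_ids, decoder_logits):
    """target_ids: [batch, tgt_len] integer IDs of the correct target tokens.
    decoder_logits: [batch, tgt_len, tgt_vocab] unnormalized scores from the decoder."""
    # Per-position negative log likelihood of the correct token: -log P(y_t | x_s).
    per_token_nll = tf.keras.losses.sparse_categorical_crossentropy(
        target_ids, decoder_logits, from_logits=True)
    # Minimizing the mean NLL is equivalent to maximizing the average
    # log likelihood over the N training pairs.
    return tf.reduce_mean(per_token_nll)
```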

Then, during inference, for a given source sentence $x_S^{infer}$ ...