English to Spanish translation with transformer model

Language translation has always been a fascinating challenge in natural language processing (NLP). In this answer, we delve into a project that utilizes the sequence-to-sequence transformer model to perform English to Spanish translation.

Libraries

Here's a quick overview of the libraries we'll be using:

import pathlib
import random
import string
import re
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers import TextVectorization
  • pathlib: This library simplifies working with file paths and directories, making it easier to manage data and files.

  • random: The random library allows us to introduce randomness in our code, which is essential for tasks like shuffling data.

  • string: The string library provides a set of useful constants containing various character sets (e.g., letters, digits, punctuation), which can be handy for text processing.

  • re: The re library supports regular expressions, enabling us to efficiently manipulate and transform text data.

  • numpy: The numpy library is a fundamental package for numerical computations in Python. It offers powerful array operations and mathematical functions.

  • tensorflow: TensorFlow is a powerful machine learning library that provides tools for building and training neural networks. It includes various functionalities for data manipulation, model creation, and optimization.

  • keras: Keras is a high-level API that runs on top of TensorFlow (or other backend engines) and simplifies the process of building, training, and evaluating deep learning models.

  • TextVectorization: The TextVectorization layer streamlines the process of converting text data into numeric format, making it suitable for feeding into neural networks.

Dataset

The backbone of any machine learning project is the data. For our translation task, we used a bilingual dataset, "spa-eng," containing pairs of English and Spanish sentences to train and evaluate our transformer model. Each English sentence was paired with its corresponding Spanish translation. This diverse dataset helps the model learn the intricate nuances of language translation.

dataset = keras.utils.get_file(
    fname="spa-eng.zip",
    origin="http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip",
    extract=True,
)
dataset = pathlib.Path(dataset).parent / "spa-eng" / "spa.txt"
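
Each line of spa.txt holds an English sentence and its Spanish translation separated by a tab. As a rough sketch (assuming this tab-separated format), the file can be parsed into sentence pairs like this:

# Read the tab-separated file and split it into (English, Spanish) sentence pairs.
# This sketch assumes each line has the form "english_sentence<TAB>spanish_sentence".
with open(dataset, encoding="utf-8") as f:
    lines = f.read().split("\n")[:-1]

text_pairs = []
for line in lines:
    eng, spa = line.split("\t")
    text_pairs.append((eng, spa))

random.shuffle(text_pairs)
print(random.choice(text_pairs))  # inspect a random pair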

Data preprocessing

We conducted a series of preprocessing steps before feeding the data into the model. This involved tokenization, adding special tokens like [start] and [end] for the decoder inputs, and standardizing the text by converting it to lowercase and removing unnecessary characters.
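
A minimal sketch of these steps, modeled on the usual Keras approach, might look like the following. The vocabulary size, sequence length, and variable names are illustrative assumptions rather than the project's exact configuration:

strip_chars = string.punctuation + "¿"
# Keep the brackets so the [start] and [end] tokens survive standardization
strip_chars = strip_chars.replace("[", "").replace("]", "")

vocab_size = 15000       # illustrative values
sequence_length = 20

def custom_standardization(input_string):
    # Lowercase the text and strip punctuation
    lowercase = tf.strings.lower(input_string)
    return tf.strings.regex_replace(lowercase, f"[{re.escape(strip_chars)}]", "")

# Separate vectorizers for the source (English) and target (Spanish) text;
# the target is one token longer to provide shifted decoder inputs and labels.
eng_vectorization = TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length,
)
spa_vectorization = TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length + 1,
    standardize=custom_standardization,
)

# The Spanish side is wrapped with the special tokens before vectorization, e.g.:
# spa = "[start] " + spa + " [end]"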

Model architecture

Transformer architecture

Our model's core architecture is based on the transformer, a revolutionary model introduced in "Attention Is All You Need" by Vaswani et al. The transformer introduces the concept of self-attention, which allows the model to capture relationships between words across the entire sentence rather than processing them strictly in order. This enables better understanding and translation of context-rich sentences.

The transformer model architecture is depicted in the following illustration:

The architecture of the transformer model

Positional embeddings

Positional embedding is a critical part of the transformer architecture. It ensures the model understands the sequence order by adding positional information to the token embeddings. This is important because transformers don't inherently have any built-in notion of order due to their attention mechanism.
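
A minimal Keras sketch of such a layer (the names and arguments are illustrative, not the project's exact code) combines a token embedding with a learned position embedding:

class PositionalEmbedding(layers.Layer):
    """Adds a learned position embedding to the token embeddings."""

    def __init__(self, sequence_length, vocab_size, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.position_embeddings = layers.Embedding(input_dim=sequence_length, output_dim=embed_dim)

    def call(self, inputs):
        # positions = [0, 1, ..., length - 1], shared across the batch
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        return self.token_embeddings(inputs) + self.position_embeddings(positions)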

Encoder and decoder layers

The transformer model architecture consists of two main components: the encoder and the decoder, each utilizing multi-head self-attention and feedforward neural networks. These components work together to process input sequences and generate translated output sequences. The encoder processes the input sentence and generates contextual representations for each word. The decoder then generates the target translation word by word, with the help of both the representation from the encoder and the previously generated words.
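
To illustrate the encoder side, a single Keras encoder layer can be sketched roughly as follows. This is a simplified version without masking or dropout; the decoder layer additionally applies causal masking and cross-attention over the encoder output:

class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.attention = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="relu"), layers.Dense(embed_dim)]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()

    def call(self, inputs):
        # Multi-head self-attention with a residual connection and layer normalization
        attention_output = self.attention(query=inputs, value=inputs, key=inputs)
        proj_input = self.layernorm_1(inputs + attention_output)
        # Position-wise feedforward network with a second residual connection
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)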

The encoder-decoder stack

In this section, we delve into the workings of the encoder and decoder stack, exploring how the architecture processes input sequences and generates meaningful translations.

  1. The process begins with the word embeddings of the input sequence, which represents the English sentence. These embeddings capture the semantic meaning of each word.

  2. The word embeddings are then fed into the first encoder in the encoder stack. Each encoder in this stack processes the embeddings by applying self-attention, which enables the model to understand the relationships between words.

  3. After the first encoder transforms the embeddings, the output is propagated to the next encoder in the stack. Each subsequent encoder builds upon the representations generated by the previous encoder, allowing the model to capture increasingly complex context and meaning.

  4. This sequential transformation and propagation continue as the input embeddings move through each encoder in the stack. The encoders work collaboratively to distill the context and information present in the input sequence.

  5. Once the entire encoder stack has processed the input embeddings, the output from the last encoder is obtained. This output encapsulates the comprehensive understanding of the input sequence's context and meaning.

  6. This final output from the encoder stack is then used as the initial context for all the decoders in the decoder stack. The decoders utilize this contextual information to generate the corresponding translated output.

Model creation

The encoder and decoder are integrated into the complete transformer model. During training, the model takes both the source (English) and target (Spanish) sequences as inputs and generates the translated Spanish sequence. The model is trained to minimize the difference between the predicted and actual translations.
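
A high-level sketch of how the pieces might be wired together with the Keras functional API is shown below. The hyperparameter values and the PositionalEmbedding, TransformerEncoder, and TransformerDecoder layer names are assumptions for illustration:

embed_dim = 256       # illustrative hyperparameters
latent_dim = 2048
num_heads = 8

# Encoder: English token IDs -> contextual representations
encoder_inputs = keras.Input(shape=(None,), dtype="int64", name="encoder_inputs")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(encoder_inputs)
encoder_outputs = TransformerEncoder(embed_dim, latent_dim, num_heads)(x)

# Decoder: Spanish tokens generated so far + encoder output -> next-token probabilities
decoder_inputs = keras.Input(shape=(None,), dtype="int64", name="decoder_inputs")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(decoder_inputs)
x = TransformerDecoder(embed_dim, latent_dim, num_heads)(x, encoder_outputs)
decoder_outputs = layers.Dense(vocab_size, activation="softmax")(x)

transformer = keras.Model(
    [encoder_inputs, decoder_inputs], decoder_outputs, name="transformer"
)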

Model training

Training configuration

We configure the model for training by specifying the hyperparameters, optimizer, loss function, and metrics. The transformer is then compiled so the training process can begin.

Training process

During training, the model learns to align English and Spanish words and capture their semantic meaning, while we monitor validation accuracy to check the model's generalization to unseen data. At each step, the model computes the loss between its predictions and the actual translations, and the optimizer updates the model's parameters to minimize this loss, improving its translation accuracy.

Epochs

Training occurs over multiple epochs, where each epoch involves going through the entire training dataset once. As training progresses, the model learns to generate more accurate translations and the loss generally decreases. We trained our model on the "spa-eng" dataset for 30 epochs, during which it repeatedly adjusts its weights and learns patterns from the data.

epochs = 30

transformer.summary()
transformer.compile(
    "rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"]
)
transformer.fit(training_dataset, epochs=epochs, validation_data=validation_dataset)

Here, we could set epochs to a value lower than 30, but a shorter training run generally results in lower translation accuracy because the model has less time to learn.

Validation

To monitor the model's performance and prevent overfitting, we validate the model's accuracy on a separate validation dataset after each epoch. This helps identify where the model starts to perform better on the training data but worse on unseen data (overfitting).
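
The training call above passes validation_data directly to fit(). If you want training to stop automatically once validation accuracy stops improving, a standard Keras callback can be added; this is an optional addition, not part of the original setup:

# Optional early stopping: halt training when validation accuracy plateaus and
# restore the best weights seen so far (not part of the original training loop).
early_stopping = keras.callbacks.EarlyStopping(
    monitor="val_accuracy", patience=3, restore_best_weights=True
)
transformer.fit(
    training_dataset,
    epochs=epochs,
    validation_data=validation_dataset,
    callbacks=[early_stopping],
)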

Model inference

Our trained model is now a language translator. Let's observe its translation process in action (a minimal decoding sketch follows the steps below):

Input: "Good luck!"

  1. English input: "Good luck!"

  2. Tokenization and positional embedding: The input is tokenized and embedded with positional information.

  3. Decoder's turn: The decoder starts with the [start] token and generates translations step by step.

  4. Attention mechanisms: At each step, the decoder attends to the most relevant words in the input sentence and in the translation generated so far.

  5. Predicting tokens: At each step, the model predicts the next token based on its learned context.

  6. Spanish output: The process continues until the model predicts the [end] token, resulting in a Spanish translation.
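
A minimal greedy-decoding sketch along these lines is shown below. It assumes the trained transformer and the eng_vectorization and spa_vectorization layers from the preprocessing step; the variable names and the 20-token limit are illustrative:

spa_vocab = spa_vectorization.get_vocabulary()
spa_index_lookup = dict(zip(range(len(spa_vocab)), spa_vocab))
max_decoded_sentence_length = 20

def decode(input_sentence):
    # Vectorize the English input once; the Spanish side grows one token per step
    tokenized_input_sentence = eng_vectorization([input_sentence])
    decoded_sentence = "[start]"
    for i in range(max_decoded_sentence_length):
        tokenized_target_sentence = spa_vectorization([decoded_sentence])[:, :-1]
        predictions = transformer([tokenized_input_sentence, tokenized_target_sentence])
        # Greedily pick the most probable next token at position i
        sampled_token_index = np.argmax(predictions[0, i, :])
        sampled_token = spa_index_lookup[sampled_token_index]
        decoded_sentence += " " + sampled_token
        if sampled_token == "[end]":
            break
    return decoded_sentence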

original_sentence = "Good luck!"
print("English: ", original_sentence)
translated_sentence = decode(original_sentence)
print("Espanol: ", translated_sentence)

Output

Model translation

Decoding sentences:

After training, we can put our model to use. Given an English sentence, the model uses a decoding process to generate the corresponding Spanish translation. This is achieved by predicting one word at a time, considering both the input and previously generated words.

Capturing the context:

The transformer's self-attention mechanism shines here. It allows the model to focus on relevant parts of the input sentence while generating each translation word. This enables the model to produce coherent and contextually accurate translations.

Self-attention mechanism

Self-attention helps the model identify relevant parts of the sequence by comparing positions using query, key, and value matrices. By assigning weights to each position, self-attention calculates a weighted sum of the values and creates a context vector representing the sequence's most important parts. 
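
To make the weighted-sum idea concrete, here is a tiny NumPy sketch of scaled dot-product attention for a single head, using small random matrices purely for illustration:

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax turns scores into attention weights that sum to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy example: a sequence of 3 positions with 4-dimensional queries, keys, and values
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4)), rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
context, attention_weights = scaled_dot_product_attention(Q, K, V)
print(attention_weights)  # each row shows how much one position attends to the others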

Applications

Following are some of the key areas where this advanced language translation model can make a meaningful impact:

Applications of the English to Spanish translation model

Translation examples

We put the trained model to the test by translating a variety of English sentences into Spanish. Here are a few examples that showcase the model's ability to generate accurate and contextually appropriate translations:

Example 1
Example 2
Example 3

These translation examples demonstrate the model's proficiency in capturing the essence of the input sentences and producing coherent and meaningful Spanish translations.

You can put your learning to the test by trying out your own examples and experimenting with English to Spanish translations using the provided model.

Code

Click the 'Run' button to access and interact with the Jupyter file. Then, you can open the file by clicking on the link below the 'Run' button.


Conclusion

In conclusion, transformer architecture has revolutionized the field of natural language processing, and this article has provided a foundational understanding of how it can be employed for translation tasks. We've navigated through creating an English to Spanish translation model using a sequence-to-sequence transformer architecture. The power of the transformer lies in its ability to understand context, even in complex sentences, enabling accurate and meaningful translations.
