Machine Translation
Get an overview of machine translation and learn to perform it using Hugging Face.
Overview
Another excellent application of NLP is machine translation, where text written in one natural language is automatically translated into another. We've all experienced the significant and continuous improvement in the results produced by common translation services like Google Translate and Bing Translator. This improvement can be attributed to transformers, bigger datasets, and better models.
Machine translation was one of the earliest intended applications of AI. Despite AI's initial success in games and some trivial tasks, it was unable to perform machine translation, and this failure was one significant reason behind the first AI winter.
Translation with Hugging Face
Translation is another sequence-to-sequence task. We perform translation in Hugging Face as follows:
from transformers import pipeline

# The default English-to-French translation pipeline
en_fr_translator = pipeline("translation_en_to_fr")
en_fr_translator("It's a pleasant day.")  # Returns a list like [{'translation_text': ...}]
Choose a specific model
The example above uses the default translation model, T5-base. While T5 is a frequently used model, it is trained to translate from English into only three languages (French, German, and Romanian), so we often need more diverse models.
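Those three target languages correspond to the three built-in translation tasks that T5-base can back. As a minimal sketch, using the same API as above with a different task name:

from transformers import pipeline

# t5-base also powers the English-to-German (and English-to-Romanian) tasks
en_de_translator = pipeline("translation_en_to_de")
en_de_translator("It's a pleasant day.")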
Hugging Face also provides us the luxury of choosing among many translation models. As of August 2022, the Hub hosts 1,600+ models for translation alone, and the number keeps growing.
Note: Before using a specific model, we first need to install its text tokenizer/detokenizer, SentencePiece. SentencePiece is an unsupervised text tokenizer and detokenizer mainly for neural network-based text generation systems, where the vocabulary size is predetermined prior to the neural model training.
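SentencePiece ships as a separate Python package, so installing it is typically a one-liner:

pip install sentencepiece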
To use a particular model, we specify it as follows:
from transformers import pipeline

# An OPUS-MT model from the popular Helsinki-NLP family on the Hub
# (the English-to-Chinese direction, so it matches our English input)
mandarinModel = "Helsinki-NLP/opus-mt-en-zh"
translator = pipeline("translation", model=mandarinModel)
translator("All the variety, all the charm, all the beauty of life is made up of light and shadow.")
Note: Hugging Face allows us to override the default translation model by passing the model argument to pipeline(), as shown above.
Non-native languages
These classical models are trained to translate between English and some commonly used Indo-European languages. However, what about a language like Lhasa Tibetan, or even an Indo-European language (like Punjabi or Pashto) that has many speakers but is handicapped by a lack of trained models?
Fret not! We can use a pre-trained multilingual model in these scenarios and fine-tune it to the desired language.
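Even before any fine-tuning, such a model can often translate out of the box. Here is a minimal sketch, assuming the facebook/mbart-large-50-many-to-many-mmt checkpoint; the language codes en_XX and ps_AF follow its model card, which lists Pashto among its fifty languages:

from transformers import pipeline

# A multilingual many-to-many model; we pick source and target languages ourselves
translator = pipeline(
    "translation",
    model="facebook/mbart-large-50-many-to-many-mmt",
    src_lang="en_XX",  # English, per the mBART-50 model card
    tgt_lang="ps_AF",  # Pashto, per the mBART-50 model card
)
translator("It's a pleasant day.")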
Datasets
There are a couple of datasets available if we want to train or fine-tune a model of our own:
opus_books: This is a collection of (copyright-free) books translated into sixteen different languages.

code_x_glue_cc_code_to_code_trans: It provides some functions in Java and C# as a basic example of translation between programming languages.
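As a quick sketch of pulling one of these down with the datasets library (the en-fr configuration name follows the opus_books dataset card):

from datasets import load_dataset

# Load the English-French portion of opus_books
books = load_dataset("opus_books", "en-fr")
books["train"][0]  # One record: {'id': ..., 'translation': {'en': ..., 'fr': ...}}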
Examples
Let’s run some working examples to wrap it up:
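Below is a minimal end-to-end sketch combining the pieces above, assuming transformers and sentencepiece are installed; the model choices are the ones discussed in this lesson:

from transformers import pipeline

# Default English-to-French pipeline (backed by t5-base)
en_fr_translator = pipeline("translation_en_to_fr")
print(en_fr_translator("It's a pleasant day."))

# An explicitly chosen OPUS-MT model for English-to-Chinese
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-zh")
print(translator("All the variety, all the charm, all the beauty of life is made up of light and shadow."))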