...
The BLEU Score: Evaluating Machine Translation Systems
Learn how the BLEU score is used to evaluate machine translation systems.
BLEU stands for “bilingual evaluation understudy” and is a way of automatically evaluating machine translation systems. This metric was first introduced in the paper “BLEU: A Method for Automatic Evaluation of Machine Translation” by Papineni et al. (2002).
Let’s consider an example to learn the calculations of the BLEU score. Say we have two candidate sentences (that is, sentences predicted by our MT system) and a reference sentence (that is, the corresponding actual translation) for some given source sentence:
Reference 1: The cat sat on the mat.
Candidate 1: The cat is on the mat.
To see how good the translation is, we can use the precision measure. Precision is a measure of how many words in the candidate are actually present in the reference. In general, if we consider a classification problem with two classes (denoted by negative and positive), precision is given by the following formula:

precision = TP / (TP + FP)

Here, TP is the number of true positives and FP is the number of false positives.
Let’s now calculate the precision for candidate 1. Treating a candidate word as a “positive” match if it appears anywhere in the reference, five of the six words (“the,” “cat,” “on,” “the,” “mat”) are found in the reference; only “is” is not:

precision = 5 / 6 ≈ 0.83
Mathematically, this can be given by the following formula:

precision = (number of candidate words that appear in the reference) / (total number of words in the candidate)
This is also known as 1-gram precision since we consider a single word at a time.
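As a quick sanity check, the 1-gram precision above can be computed in a few lines of Python. This is a minimal sketch using the example sentences; the helper name `unigram_precision` is ours, not from the original paper:

```python
def unigram_precision(candidate, reference):
    """Fraction of candidate words that appear anywhere in the reference."""
    cand_words = candidate.lower().split()
    ref_words = set(reference.lower().split())
    matches = sum(1 for word in cand_words if word in ref_words)
    return matches / len(cand_words)

reference = "The cat sat on the mat"
candidate1 = "The cat is on the mat"

# Five of six candidate words appear in the reference ("is" does not).
print(unigram_precision(candidate1, reference))  # 5/6 ≈ 0.833
```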
Now, let’s introduce a new candidate:
Candidate 2: The the the cat cat cat.
It’s not hard for a human to see that candidate 1 is far better than candidate 2. Let’s calculate the precision. Every word in candidate 2 (“the” three times and “cat” three times) appears in the reference, so:

precision = 6 / 6 = 1
As we can see, the precision score disagrees with the judgment we made. Therefore, precision alone can’t be trusted to be a good measure of the quality of a translation.
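The failure is easy to reproduce: applying the same word-matching precision to candidate 2 yields a perfect score. This short sketch (variable names are ours) makes the point concrete:

```python
reference = set("the cat sat on the mat".split())
candidate2 = "the the the cat cat cat".split()

# Every word in candidate 2 ("the" x3, "cat" x3) occurs in the reference,
# so naive precision is maxed out despite the sentence being nonsense.
matches = sum(1 for word in candidate2 if word in reference)
precision = matches / len(candidate2)
print(precision)  # 1.0
```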
Modified precision
To address the precision limitation, we can use a modified 1-gram precision. The modified ...