The BLEU Score: Evaluating Machine Translation Systems

Learn how the BLEU score is used to evaluate machine translation systems.

BLEU stands for “bilingual evaluation understudy” and is a way of automatically evaluating machine translation (MT) systems. This metric was first introduced in the paper BLEU: A Method for Automatic Evaluation of Machine Translation (Papineni and others, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 311-318). We'll be using an implementation of the BLEU score found on GitHub. Let’s learn how the score is calculated in the context of MT.

Let’s work through an example to see how the BLEU score is calculated. Say we have a candidate sentence (that is, a sentence predicted by our MT system) and a reference sentence (that is, the corresponding actual translation) for some given source sentence:

  • Reference 1: The cat sat on the mat.

  • Candidate 1: The cat is on the mat.

To see how good the translation is, we can use the precision measure. Here, precision is the fraction of words in the candidate that are also present in the reference. In general, if we consider a classification problem with two classes (denoted by negative and positive), precision is given by the following formula, where TP and FP are the numbers of true positives and false positives:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Let’s now calculate the precision for candidate 1:
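
Counting by hand, and assuming we ignore case and the final period: candidate 1 has six words (the, cat, is, on, the, mat), and five of them (all except “is”) also occur in the reference. This gives:

$$\text{Precision}(\text{candidate 1}) = \frac{\text{number of candidate words found in the reference}}{\text{number of words in the candidate}} = \frac{5}{6} \approx 0.83$$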

Mathematically, this can be given by the following formula:
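
One way to write this is sketched below. The notation is an assumption made for illustration: the sum runs over every word position $w$ in the candidate, $\mathbb{1}[\cdot]$ is 1 when $w$ occurs anywhere in the reference and 0 otherwise, and $|\text{Candidate}|$ is the number of words in the candidate.

$$\text{Precision} = \frac{\sum_{w \in \text{Candidate}} \mathbb{1}[w \in \text{Reference}]}{|\text{Candidate}|}$$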

This is also known as 1-gram precision since we consider a single word at a time.

Now, let’s introduce a new candidate:

  • Candidate 2: The the the cat cat cat.

It’s not hard for a human to see that candidate 1 is far better than candidate 2. Let’s calculate the precision:
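
Applying the same count to candidate 2, again ignoring case and the final period: all six of its words (the, the, the, cat, cat, cat) occur somewhere in the reference, so:

$$\text{Precision}(\text{candidate 2}) = \frac{6}{6} = 1$$

That is higher than the $5/6$ obtained when the same count is applied to candidate 1.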

As we can see, the precision score disagrees with the judgment we made. Therefore, precision alone can’t be trusted to be a good measure of the quality of a translation.
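
The comparison above can be reproduced with a few lines of Python. This is a minimal sketch, not the GitHub implementation mentioned earlier: the function name `unigram_precision` and the simple tokenization (lowercasing, dropping the final period, splitting on whitespace) are assumptions made for illustration.

```python
def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate words that also occur in the reference."""
    # Assumed tokenization: lowercase, drop the trailing period, split on spaces.
    cand_tokens = candidate.lower().rstrip(".").split()
    ref_vocab = set(reference.lower().rstrip(".").split())
    if not cand_tokens:
        return 0.0  # avoid dividing by zero for an empty candidate
    # Every candidate word found in the reference counts, however often it repeats.
    hits = sum(1 for token in cand_tokens if token in ref_vocab)
    return hits / len(cand_tokens)


reference = "The cat sat on the mat."
candidate_1 = "The cat is on the mat."
candidate_2 = "The the the cat cat cat."

p1 = unigram_precision(candidate_1, reference)
p2 = unigram_precision(candidate_2, reference)
print(f"candidate 1: {p1:.3f}")
print(f"candidate 2: {p2:.3f}")
```

Because repeated words are counted every time they appear, the degenerate candidate 2 is not penalized for repeating “the” and “cat”, which is exactly the weakness described here.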

Modified precision

To address the precision limitation, we can use a modified 1-gram precision. The modified ...