The BLEU Score: Evaluating Machine Translation Systems

Learn how the BLEU score is used to evaluate machine translation systems.

BLEU stands for “bilingual evaluation understudy” and is a way of automatically evaluating machine translation systems. This metric was first introduced in the paper “BLEU: a Method for Automatic Evaluation of Machine Translation” (Papineni et al., Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 311–318). We’ll be using an implementation of the BLEU score found on GitHub. Let’s learn how this is calculated in the context of MT.

Let’s consider an example to learn how the BLEU score is calculated. Say we have a candidate sentence (that is, a sentence predicted by our MT system) and a reference sentence (that is, the corresponding actual translation) for a given source sentence:

  • Reference 1: The cat sat on the mat.

  • Candidate 1: The cat is on the mat.
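
Before working through the calculation by hand, it can help to see the end result. The sketch below is a minimal example that scores Candidate 1 against Reference 1 using NLTK’s `sentence_bleu`; this is shown for illustration only and is not the GitHub implementation referenced above:

```python
# A minimal sketch using NLTK's BLEU implementation (illustration only;
# not the GitHub implementation referenced in this lesson).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat sat on the mat".split()  # tokenized, lowercased reference
candidate = "the cat is on the mat".split()   # tokenized, lowercased candidate

# sentence_bleu expects a list of references. Smoothing keeps the score from
# collapsing to zero: this candidate shares no 4-grams with the reference,
# and an unsmoothed 4-gram precision of zero would zero out the whole score.
smooth = SmoothingFunction().method1
score = sentence_bleu([reference], candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")
```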

To see how good the translation is, we can use the precision measure. Precision measures how many of the words in the candidate are actually present in the reference. In general, if we consider a classification problem with two classes (denoted negative and positive), precision is given by the following formula:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Here, $TP$ is the number of true positives and $FP$ is the number of false positives.
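
In our word-matching setting, this amounts to the fraction of candidate words that appear in the reference. Here’s a minimal sketch for the example above:

```python
# A minimal sketch: word-level precision as described above, i.e., the
# fraction of candidate words that also appear in the reference.
reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()

# Every candidate word except "is" appears in the reference.
matches = sum(1 for word in candidate if word in reference)
precision = matches / len(candidate)
print(f"Precision: {matches}/{len(candidate)} = {precision:.2f}")  # 5/6 ≈ 0.83
```

Note that this simple membership check counts a candidate word as a match whenever it appears anywhere in the reference; BLEU itself refines this with a clipped (“modified”) n-gram precision so that repeated words aren’t over-rewarded.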
