What is Word Error Rate?

Automated Speech Recognition (ASR)

As the name suggests, Automated Speech Recognition (ASR) is software that interprets spoken words from an input device (such as a microphone) or an audio file and outputs them as text. ASR relieves users of tedious data entry by letting them dictate to their computer instead of typing. Many industries rely on ASR every day. One of the best-known examples is Amazon’s Alexa.

Word Error Rate (WER)

A dedicated metric, Word Error Rate (WER), is used to compare the accuracy of different ASR software. WER is computed by comparing the transcript produced by the ASR software against the correct transcript: the lower the WER, the more accurate the output. The formula consists of four components:

Component   Stands for
S           Substitution: the number of words in the correct transcript that were replaced by a different word.
D           Deletion: the number of words in the correct transcript that are missing from the output.
I           Insertion: the number of extra words in the output that are not in the correct transcript.
N           Number: the total number of words in the correct transcript.

By combining the above components, we get the following formula to compute WER:

$$WER = \frac{S + D + I}{N}$$
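To make the arithmetic concrete, here is a minimal Python sketch of the formula. The function name `wer_from_counts` and its parameter names are hypothetical, chosen only for illustration; the error counts are assumed to be already known.

```python
def wer_from_counts(substitutions, deletions, insertions, reference_length):
    """Compute WER = (S + D + I) / N from already-counted errors.

    Hypothetical helper for illustration only.
    """
    if reference_length <= 0:
        # N is the denominator, so the correct transcript must not be empty.
        raise ValueError("reference_length must be a positive integer")
    return (substitutions + deletions + insertions) / reference_length


# A 100-word correct transcript with 5 substitutions, 3 deletions, 2 insertions:
print(wer_from_counts(5, 3, 2, 100))  # → 0.1
```

Because insertions are counted in the numerator but not in N, the result can be greater than 1.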

Let’s look at an example. Suppose the actual phrase "Please turn around" gets transcribed as "Please burn a round" by some ASR software. Here, we can notice that:

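  • WER counts the smallest set of edits that turns the original transcript into the output. The word turn got substituted by burn, and around got substituted by a. Therefore, we have two substitutions.
  • There is one extra word inserted - round. Therefore, we have one insertion.
  • No word was dropped without a replacement. Therefore, we have zero deletions. (Counting around as deleted and both a and round as inserted would take four edits instead of three, so it is not the minimal count.)
  • There are three words in total in the original transcript.

After putting all of this together, the computed WER for the conversion above turns out to be:

$$WER = \frac{S + D + I}{N} = \frac{2 + 0 + 1}{3} = 1.0$$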
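In practice, S, D, and I are usually not counted by eye but read off a minimum-edit (Levenshtein) alignment of the two word sequences. The sketch below is one way to do that in plain Python. The function name `word_error_rate`, the lowercasing, and the whitespace tokenization are assumptions made for illustration and do not come from any particular ASR toolkit. For the example phrase, the cheapest alignment pairs around with one of the new words as a substitution, so it reports two substitutions, no deletions, and one insertion.

```python
def word_error_rate(reference, hypothesis):
    """Return (wer, substitutions, deletions, insertions).

    Illustrative sketch: words are lowercased and split on whitespace
    (an assumption; real evaluations often normalize text further).
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # dist[i][j] = fewest edits turning the first i reference words
    # into the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dist[i][0] = i  # delete every reference word
    for j in range(1, len(hyp) + 1):
        dist[0][j] = j  # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
            )

    # Walk back through the table to count each kind of edit.
    i, j = len(ref), len(hyp)
    subs = dels = ins = 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1            # words match, no error
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            subs += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1

    return (subs + dels + ins) / len(ref), subs, dels, ins


print(word_error_rate("Please turn around", "Please burn a round"))
# → (1.0, 2, 0, 1)
```

When several alignments tie for the minimum, the split between S, D, and I can depend on the order of the checks in the walk-back loop, but the total number of edits, and therefore the WER, does not.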

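Fun Fact: Humans are not perfect either! Human transcribers are commonly cited as having a WER of roughly 0.04 to 0.05 (about 4% to 5%) on conversational speech.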

Copyright ©2024 Educative, Inc. All rights reserved