# Automated Speech Recognition (ASR)

As the name suggests, **Automated Speech Recognition** (**ASR**) is software that takes spoken words from an input device (such as a microphone) or an audio file and outputs them as text. ASR relieves users of tedious data entry by letting them dictate to their device instead of typing. Many industries rely on *ASR* every day. One of the best-known examples is Amazon's *Alexa*.

# Word Error Rate (WER)

**Word Error Rate** (**WER**) is the standard metric for comparing the accuracy of different *ASR* software. It is computed by comparing the transcript produced by the *ASR* software against the correct (original) transcript of what was actually said. The formula consists of four components:

| Component | Stands For |
| - | - |
| **S** | **Substitution**: The number of words in the original transcript that the *ASR* output replaced with a different word. |
| **D** | **Deletion**: The number of words from the original transcript that are missing in the *ASR* output. |
| **I** | **Insertion**: The number of extra words in the *ASR* output that do not appear in the original transcript. |
| **N** | **Number**: The total number of words in the original (correct) transcript. |

By combining the above components, we get the following formula to compute *WER*:

$$WER = \frac{S + D + I}{N}$$
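
The formula translates directly into code. Below is a minimal sketch, assuming the error counts have already been tallied; the function name `word_error_rate` and its parameter names are illustrative, not taken from any particular library.

```python
def word_error_rate(substitutions: int, deletions: int,
                    insertions: int, reference_length: int) -> float:
    """Return WER = (S + D + I) / N for error counts tallied elsewhere."""
    if reference_length <= 0:
        # N is the denominator, so an empty reference has no defined WER.
        raise ValueError("reference_length (N) must be positive")
    if min(substitutions, deletions, insertions) < 0:
        raise ValueError("error counts must be non-negative")
    return (substitutions + deletions + insertions) / reference_length


# One substitution and one deletion against a ten-word reference.
print(word_error_rate(1, 1, 0, 10))  # 0.2
```

Because insertions are not limited by the length of the original transcript, the result can be greater than 1.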




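Let's look at an example. Suppose the actual phrase *Please turn around* gets converted into *Please burn a round* by some *ASR* software. *WER* is based on the smallest set of edits that turns the original transcript into the output, so here we can notice that:

- The word **turn** got substituted by **burn**. Therefore, we have one *substitution*.
- The word **around** got substituted by **a**. Therefore, we have a second *substitution*.
- There is one new word inserted - **round**. Therefore, we have one *insertion*.
- No word was dropped outright. Therefore, we have no *deletions*.
- There are three words in total in the original transcript.

Counting **around** as deleted and both **a** and **round** as inserted would also explain the output, but that takes four edits instead of three, so it is not the count *WER* uses.

After putting all of this together, the computed *WER* for the conversion above turns out to be:

$$WER = \frac{S + D + I}{N} = \frac{2 + 0 + 1}{3} = 1$$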
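
Counting edits by hand gets error-prone on longer transcripts, and *WER* is conventionally based on the alignment that needs the fewest edits. The sketch below finds that alignment with a word-level edit-distance table and then walks back through it to label each edit. It assumes whitespace tokenization and case-insensitive matching, and the name `count_word_errors` is hypothetical; real evaluation tools typically apply more text normalization.

```python
def count_word_errors(reference: str, hypothesis: str) -> dict:
    """Count S, D, I for a minimum-edit alignment, plus N (reference length)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # dist[i][j] = fewest edits needed to turn ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dist[i][0] = i  # delete every reference word
    for j in range(1, len(hyp) + 1):
        dist[0][j] = j  # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
            )

    # Walk back from the bottom-right corner to label each edit.
    counts = {"S": 0, "D": 0, "I": 0, "N": len(ref)}
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        cost = 1 if i > 0 and j > 0 and ref[i - 1] != hyp[j - 1] else 0
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + cost:
            counts["S"] += cost  # adds nothing when the words match
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["D"] += 1
            i -= 1
        else:
            counts["I"] += 1
            j -= 1
    return counts


counts = count_word_errors("turn the lights off", "turn lights of")
wer = (counts["S"] + counts["D"] + counts["I"]) / counts["N"]
print(counts, wer)  # {'S': 1, 'D': 1, 'I': 0, 'N': 4} 0.5
```

When several alignments tie for the fewest edits, the split between substitutions, deletions, and insertions can depend on the tie-breaking order used in the walk back, but the total number of edits (and therefore the *WER*) stays the same.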

> Fun Fact: Human transcribers are commonly cited as having a *WER* of roughly 0.04 to 0.05 (about 4-5%) on conversational speech!
