
Speech emotion recognition: 5-minute guide

Nimra Zaheer
Jun 19, 2023
7 min read


Emotion is a quality we closely associate with human beings. Paired with speech, emotions allow us to communicate and articulate our feelings. In this blog, you’ll be introduced to a very interesting speech-processing problem along with its applications, challenges, and a state-of-the-art dataset and model used for this task.

Given that emotions are essential to the human experience, our applications are likely to serve users much better if they can correctly detect what we’re feeling. Speech emotion recognition (SER) is the task of determining the emotion expressed in an audio recording of human speech.

Speech emotion recognition (SER)

Applications that use SER

A lot of applications use emotions to enhance existing systems.

Intelligent tutoring systems: SER systems can enable tutors to automatically identify students’ saturation points, confusions, and expressions of boredom to modify content and lecturing styles to maximize learning among students.

Lie detection: Emotions and expressions help detect lies, which can be a valuable tool for law enforcement departments.

In-car emotion recognition: Emotion recognition can save us from accidents by detecting drivers’ anger, tiredness, etc.

Customer care: Call centers can read customers’ emotions for better interactivity and increased satisfaction by prioritizing dissatisfied customers.

Robots: Robots also use SER. We’ve moved from the flat, traditional robot voice (“Hello human, how are you?”) to more naturally voiced and interactive robots.

Smart homes: Emotions like distress and fear can be detected in homes, which can help alert appropriate rescue services.

Gender inequity and hate speech: SER can be used to detect acoustic cues for gender discrimination and hate speech.

Example: Speech over text

An example of associating emotions with a number

Let’s discuss how emotions are essential to context. Consider the phrase written inside the note in the image above. Semantically, it’s just a number. However, if we associate it with a scenario where a sale is concerned, it ceases to be just a number. For example, “Is this really worth 1,000?” (anger) or, “Thank God! It’s been reduced to 1,000!” (happiness). Speech can bring text to life when associated with emotions.

Formal introduction

Let $U$ be a vector of utterances where each instance $u_i \in U$ is an audio clip of length $t_i$. Let $L$ be a vector of labels so that $l_i \in L$ is the emotion associated with the input instance $u_i$. Let the training set be $V$ and the testing set be $W$ so that $V, W \subseteq U$, $V \cap W = \emptyset$, and $V \cup W = U$. Our goal is to learn a function $\mathcal{F}: U \rightarrow L$ so that the following occurs:

  1. $\mathcal{F}$ outputs the correct emotion $l_j$ for an input instance $v_j \in V$ for the maximum number of instances in $V$, which denotes training accuracy.

  2. $\mathcal{F}$ outputs the correct emotion for an input instance $w_j \in W$ that corresponds to an unseen audio clip, assuming the unseen clip is drawn randomly from the distribution of $W$, which denotes test accuracy.
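
To make the notation concrete, here is a minimal sketch in Python. The features, labels, and the logistic-regression classifier are placeholders chosen for illustration, not part of any real SER pipeline; the point is simply the disjoint split of $U$ into $V$ and $W$ and the two accuracy measurements.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

# Placeholder data: one feature row and one emotion label per utterance u_i.
U_features = np.random.rand(200, 40)
L_labels = np.random.randint(0, 4, size=200)  # 4 emotion classes

# V and W partition U: V ∪ W = U and V ∩ W = ∅.
V_x, W_x, V_y, W_y = train_test_split(U_features, L_labels, test_size=0.2, random_state=0)

# Learn F: U -> L on the training set V.
F = LogisticRegression(max_iter=1000).fit(V_x, V_y)

print("Training accuracy:", accuracy_score(V_y, F.predict(V_x)))
print("Test accuracy:", accuracy_score(W_y, F.predict(W_x)))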

Dependence on datasets

An SER model depends heavily on the dataset it is trained on. A good dataset should approximate the dialects, accents, lexical complexity, phoneme coverage, demographics, and diverse emotions of the targeted language.

Once that sample is acquired, the second objective is to learn a function, using a representation learning algorithm, that accurately recognizes emotion when a random instance is selected from that approximated sample. Many datasets are now available for training SERs. The aim is to employ powerful deep learning models to predict emotions from a speaker-independent audio clip, that is, a clip whose speaker does not appear in the training data.
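
A speaker-independent split can be built by grouping utterances by speaker so that no speaker appears in both sets. Below is a small sketch using scikit-learn's GroupShuffleSplit; the feature matrix, labels, and speaker IDs are made up for illustration.

import numpy as np
from sklearn.model_selection import GroupShuffleSplit

utterances = np.random.rand(100, 40)           # placeholder acoustic features
labels = np.random.randint(0, 4, size=100)     # placeholder emotion labels
speakers = np.random.randint(0, 10, size=100)  # hypothetical speaker ID per utterance

# Split so that every speaker lands entirely in either the train or the test set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(utterances, labels, groups=speakers))

# No speaker appears on both sides of the split.
assert set(speakers[train_idx]).isdisjoint(speakers[test_idx])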

A brief history of SERs

SER has been around for over two decades. SER applications for human-computer interaction (HCI) first appeared in 2001. SER research covers a variety of topics, ranging from acted and natural datasets to prosodic and spectral acoustic features.

Besides speech, other modalities such as text, visuals (images), and keystrokes are also used for emotion recognition. The next section focuses on a state-of-the-art dataset for the English language and a publicly available model trained on it.

Dataset

The Interactive Emotional Dyadic Motion Capture (IEMOCAP) database is the most commonly used emotional speech repository for the English language, containing nine emotions. It consists of scripted and improvised dyadic sessions performed by pairs of actors. A benchmark for English SER systems, it contains 12 hours of audiovisual recordings from ten actors, comprising 10,039 utterances.

Each utterance was annotated by three evaluators. The emotions include happiness, anger, sadness, frustration, neutral, disgust, fear, excitement, and surprise.
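
In practice, most four-class experiments on IEMOCAP (including the model discussed below) use only a subset of these labels. A common convention, stated here as an assumption rather than something the dataset mandates, is to merge excitement into happiness and drop the remaining categories, as sketched below.

# A common (but not universal) reduction of IEMOCAP's nine categorical labels
# to the four classes used in many SER experiments. This mapping is illustrative;
# check the exact recipe you follow.
FOUR_CLASS_MAP = {
    "neutral": "neutral",
    "happiness": "happiness",
    "excitement": "happiness",  # often merged with happiness
    "sadness": "sadness",
    "anger": "anger",
    # frustration, disgust, fear, and surprise are typically dropped
}

def keep_utterance(label: str) -> bool:
    """Return True if an utterance with this label is kept in a four-class setup."""
    return label in FOUR_CLASS_MAP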

State-of-the-art SER model

SpeechBrain is a deep learning framework designed for speech-processing tasks. It includes various pre-built models for speech recognition, speech enhancement, speaker recognition, and more. One of the models trained on the IEMOCAP dataset is the SpeechBrain emotion recognition model. It's based on wav2vec 2.0, which pairs a convolutional neural network (CNN) feature encoder with a transformer and learns its representations directly from the raw speech waveform rather than from hand-crafted acoustic features such as Mel-frequency cepstral coefficients (MFCCs), pitch, or energy. Fully connected layers and a softmax activation function sit on top of these representations and predict the speaker's emotional state.

During training, the model is optimized using cross-entropy loss, which measures the difference between predicted and ground-truth emotions. It is trained on a subset of the IEMOCAP dataset covering the primary labeled emotional states of the speakers: neutral, happiness, sadness, and anger. The average testing accuracy is 75.3% with four emotion classes.

SpeechBrain works well for this task because it combines advanced techniques, such as learned feature extraction, transfer learning from pretrained models, and data augmentation, to achieve state-of-the-art performance on speech emotion recognition.
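
To make the architecture easier to picture, here is a simplified sketch of the same idea: a pretrained wav2vec 2.0 encoder, average pooling over time, and a linear classification head trained with cross-entropy. This is not SpeechBrain's actual implementation, and the "facebook/wav2vec2-base" checkpoint is a stand-in, not necessarily the one SpeechBrain fine-tunes.

import torch
import torch.nn as nn
from transformers import Wav2Vec2Model  # Hugging Face transformers, used here for illustration

class Wav2Vec2EmotionClassifier(nn.Module):
    def __init__(self, num_emotions: int = 4):
        super().__init__()
        # Pretrained wav2vec 2.0 encoder; the checkpoint name is a placeholder.
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.head = nn.Linear(self.encoder.config.hidden_size, num_emotions)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) of 16 kHz audio
        hidden = self.encoder(waveform).last_hidden_state  # (batch, frames, hidden)
        pooled = hidden.mean(dim=1)                        # average pooling over time
        return self.head(pooled)                           # emotion logits

model = Wav2Vec2EmotionClassifier()
logits = model(torch.randn(1, 16000))                      # one second of dummy audio
loss = nn.CrossEntropyLoss()(logits, torch.tensor([2]))    # ground-truth class index 2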

Note: You can also fine-tune the pre-trained model using a dataset in a different human language and evaluate the results.

Here’s how to use the trained model to run a test audio sample. The following installs SpeechBrain:

pip install speechbrain

The following passes one instance on the pretrained model:

from speechbrain.pretrained.interfaces import foreign_class
classifier = foreign_class(source="speechbrain/emotion-recognition-wav2vec2-IEMOCAP", pymodule_file="custom_interface.py", classname="CustomEncoderWav2vec2Classifier")
out_prob, score, index, text_lab = classifier.classify_file("speechbrain/emotion-recognition-wav2vec2-IEMOCAP/anger.wav")
print(text_lab)
  • Line 1: Import the necessary library for using a pretrained model.

  • Line 2: Specify the model and class name.

  • Line 3: Classify a sample audio file; the call returns four values, but we are only interested in the predicted label.

  • Line 4: Print the predicted label of the sample.
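
If you want to look beyond the predicted label, the same call exposes all four return values. The file path below is hypothetical; substitute a recording of your own.

# Classify a local recording (path is hypothetical) and inspect all four outputs.
out_prob, score, index, text_lab = classifier.classify_file("my_recording.wav")
print(out_prob)   # model output scores across the emotion classes
print(score)      # score of the best-scoring class
print(index)      # index of the predicted class
print(text_lab)   # human-readable predicted label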

Conclusion

In this blog, you were introduced to an interesting problem: predicting the emotions present in speech. As you can see, the state-of-the-art accuracy for the SpeechBrain SER model is 75.3% for just four of IEMOCAP's emotions (neutral, sadness, anger, and happiness), even though the dataset contains a diverse set of emotions and speakers. For more accurate emotion predictions, better speaker-independent and language-agnostic SER models still need to be developed.

If you’re interested in learning more about speech and emotion recognition, look no further! Check out the following project on the Educative platform: Recognize Emotions from Speech using Librosa.