
Speech emotion recognition: 5-minute guide

Nimra Zaheer
Jun 19, 2023
7 min read


Emotion is a quality we closely associate with human beings. Paired with speech, emotions allow us to communicate and articulate our feelings. In this blog, you’ll be introduced to a very interesting speech-processing problem along with its applications, challenges, and a state-of-the-art dataset and model used for this task.

Given that emotions are essential to the human experience, our applications are likely to serve users much better if they can correctly detect what we’re feeling. Speech emotion recognition (SER) is the task of determining the emotion expressed in an audio recording of human speech.

Speech emotion recognition (SER)

Applications that use SER

A lot of applications use emotions to enhance existing systems.

Intelligent tutoring systems: SER systems can enable tutors to automatically identify students’ saturation points, confusions, and expressions of boredom to modify content and lecturing styles to maximize learning among students.

Lie detection: Emotions and expressions help detect lies, which can be a valuable tool for law enforcement departments.

In-car emotion recognition: Emotion recognition can save us from accidents by detecting drivers’ anger, tiredness, etc.

Customer care: Call centers can read customers’ emotions for better interactivity and increased satisfaction by prioritizing dissatisfied customers.

Robots: Robots also use SER. We’ve moved from the flat, traditional robot voice (“Hello human, how are you?”) to more naturally voiced and interactive robots.

Smart homes: Emotions like distress and fear can be detected in homes, which can help alert appropriate rescue services.

Gender inequity and hate speech: SER can be used to detect acoustic cues for gender discrimination and hate speech.

Example: Speech over text

An example of associating emotions with a number

Let’s discuss how emotions are essential to context. Consider the phrase written inside the note in the image above. Semantically, it’s just a number. However, if we associate it with a scenario where a sale is concerned, it ceases to be just a number. For example, “Is this really worth 1,000?” (anger) or, “Thank God! It’s been reduced to 1,000!” (happiness). Speech can bring text to life when associated with emotions.

Formal introduction

Let $U$ be a vector of utterances where each instance $u_i \in U$ is an audio clip of length $t_i$. Let $L$ be a vector of labels so that $l_i \in L$ is the emotion associated with the input instance $u_i$. Let the training set be $V$ and the testing set be $W$ so that $V, W \subseteq U$, $V \cap W = \emptyset$, and $V \cup W = U$. Our goal is to learn a function $\mathcal{F}: U \rightarrow L$ so that the following occurs:

  1. $\mathcal{F}$ outputs the correct emotion $l_j$ for an input instance $v_j \in V$ for the maximum number of instances in $V$, which denotes training accuracy.

  2. $\mathcal{F}$ outputs the correct emotion for an input instance $w_j \in W$ that corresponds to an unseen audio clip, assuming the unseen clip is drawn randomly from the distribution of $W$, which denotes test accuracy.
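
To make the notation concrete, here is a minimal sketch in Python. The features, labels, and the logistic-regression classifier are placeholders chosen for illustration, not part of any real SER pipeline; the point is simply the disjoint split of $U$ into $V$ and $W$ and the two accuracy measurements.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

# Placeholder data: one feature row and one emotion label per utterance u_i.
U_features = np.random.rand(200, 40)
L_labels = np.random.randint(0, 4, size=200)  # 4 emotion classes

# V and W partition U: V ∪ W = U and V ∩ W = ∅.
V_x, W_x, V_y, W_y = train_test_split(U_features, L_labels, test_size=0.2, random_state=0)

# Learn F: U -> L on the training set V.
F = LogisticRegression(max_iter=1000).fit(V_x, V_y)

print("Training accuracy:", accuracy_score(V_y, F.predict(V_x)))
print("Test accuracy:", accuracy_score(W_y, F.predict(W_x)))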

Dependence on datasets

An SER model depends heavily on the dataset it is trained on. A good dataset should approximate the dialects, accents, lexical complexity, phoneme coverage, demographics, and diverse emotions of the targeted language.

Once that sample is acquired, the second objective is to learn a function, using a representation learning algorithm, that accurately recognizes emotion when a random instance is selected from that approximated sample. Many datasets are now available for training SERs. The aim is to employ powerful deep learning models to predict emotions from a speaker-independent audio clip, that is, a clip whose speaker does not appear in the training data.
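
A speaker-independent split can be built by grouping utterances by speaker so that no speaker appears in both sets. Below is a small sketch using scikit-learn's GroupShuffleSplit; the feature matrix, labels, and speaker IDs are made up for illustration.

import numpy as np
from sklearn.model_selection import GroupShuffleSplit

utterances = np.random.rand(100, 40)           # placeholder acoustic features
labels = np.random.randint(0, 4, size=100)     # placeholder emotion labels
speakers = np.random.randint(0, 10, size=100)  # hypothetical speaker ID per utterance

# Split so that every speaker lands entirely in either the train or the test set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(utterances, labels, groups=speakers))

# No speaker appears on both sides of the split.
assert set(speakers[train_idx]).isdisjoint(speakers[test_idx])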

A brief history of SERs

SER has been around for over two decades. SER applications for human-computer interaction (HCI) first appeared in 2001. SER research covers a variety of topics, ranging from acted and natural datasets to prosodic and spectral acoustic features.

Besides speech, other modalities such as text, visuals (images), and keystrokes are also used for emotion recognition. The next section focuses on a state-of-the-art dataset for the English language and a publicly available model trained on it.

Dataset

The Interactive Emotional Dyadic Motion Capture (IEMOCAP) database is the most commonly used emotional speech repository for the English language, containing nine emotions. It consists of scripted and improvised dyadic sessions performed by pairs of actors. A benchmark for English SER systems, it contains 12 hours of audiovisual recordings from ten actors, comprising 10,039 utterances.

Each utterance was annotated by three evaluators. The emotions include happiness, anger, sadness, frustration, neutral, disgust, fear, excitement, and surprise.
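
In practice, most four-class experiments on IEMOCAP (including the model discussed below) use only a subset of these labels. A common convention, stated here as an assumption rather than something the dataset mandates, is to merge excitement into happiness and drop the remaining categories, as sketched below.

# A common (but not universal) reduction of IEMOCAP's nine categorical labels
# to the four classes used in many SER experiments. This mapping is illustrative;
# check the exact recipe you follow.
FOUR_CLASS_MAP = {
    "neutral": "neutral",
    "happiness": "happiness",
    "excitement": "happiness",  # often merged with happiness
    "sadness": "sadness",
    "anger": "anger",
    # frustration, disgust, fear, and surprise are typically dropped
}

def keep_utterance(label: str) -> bool:
    """Return True if an utterance with this label is kept in a four-class setup."""
    return label in FOUR_CLASS_MAP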

State-of-the-art SER model

SpeechBrain is a deep learning framework designed for speech-processing tasks. It includes various pre-built models for speech recognition, speech enhancement, speaker recognition, and more. One of the models trained on the IEMOCAP dataset is the SpeechBrain emotion recognition model. It's based on wav2vec 2.0, which pairs a convolutional neural network (CNN) feature encoder with a transformer and learns its representations directly from the raw speech waveform rather than from hand-crafted acoustic features such as Mel-frequency cepstral coefficients (MFCCs), pitch, or energy. Fully connected layers and a softmax activation function sit on top of these representations and predict the speaker's emotional state.

During training, the model is optimized using cross-entropy loss, which measures the difference between predicted and ground-truth emotions. It is trained on a subset of the IEMOCAP dataset covering the primary labeled emotional states of the speakers: neutral, happiness, sadness, and anger. The average testing accuracy is 75.3% with four emotion classes.

SpeechBrain works well for this task because it combines advanced techniques, such as learned feature extraction, transfer learning from pretrained models, and data augmentation, to achieve state-of-the-art performance on speech emotion recognition.
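
To make the architecture easier to picture, here is a simplified sketch of the same idea: a pretrained wav2vec 2.0 encoder, average pooling over time, and a linear classification head trained with cross-entropy. This is not SpeechBrain's actual implementation, and the "facebook/wav2vec2-base" checkpoint is a stand-in, not necessarily the one SpeechBrain fine-tunes.

import torch
import torch.nn as nn
from transformers import Wav2Vec2Model  # Hugging Face transformers, used here for illustration

class Wav2Vec2EmotionClassifier(nn.Module):
    def __init__(self, num_emotions: int = 4):
        super().__init__()
        # Pretrained wav2vec 2.0 encoder; the checkpoint name is a placeholder.
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.head = nn.Linear(self.encoder.config.hidden_size, num_emotions)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) of 16 kHz audio
        hidden = self.encoder(waveform).last_hidden_state  # (batch, frames, hidden)
        pooled = hidden.mean(dim=1)                        # average pooling over time
        return self.head(pooled)                           # emotion logits

model = Wav2Vec2EmotionClassifier()
logits = model(torch.randn(1, 16000))                      # one second of dummy audio
loss = nn.CrossEntropyLoss()(logits, torch.tensor([2]))    # ground-truth class index 2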

Note: You can also fine-tune the pre-trained model using a dataset in a different human language and evaluate the results.

Here’s how to use the trained model to run a test audio sample. The following installs SpeechBrain:

pip install speechbrain

The following passes one instance on the pretrained model:

from speechbrain.pretrained.interfaces import foreign_class
classifier = foreign_class(source="speechbrain/emotion-recognition-wav2vec2-IEMOCAP", pymodule_file="custom_interface.py", classname="CustomEncoderWav2vec2Classifier")
out_prob, score, index, text_lab = classifier.classify_file("speechbrain/emotion-recognition-wav2vec2-IEMOCAP/anger.wav")
print(text_lab)
  • Line 1: Import the necessary library for using a pretrained model.

  • Line 2: Specify the model and class name.

  • Line 3: Classify a sample audio file; the call returns four values, but we are only interested in the predicted label.

  • Line 4: Print the predicted label of the sample.
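
If you want to look beyond the predicted label, the same call exposes all four return values. The file path below is hypothetical; substitute a recording of your own.

# Classify a local recording (path is hypothetical) and inspect all four outputs.
out_prob, score, index, text_lab = classifier.classify_file("my_recording.wav")
print(out_prob)   # model output scores across the emotion classes
print(score)      # score of the best-scoring class
print(index)      # index of the predicted class
print(text_lab)   # human-readable predicted label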

Conclusion

In this blog, you were introduced to an interesting problem: predicting the emotions present in speech. As you can see, the state-of-the-art accuracy for the SpeechBrain SER model is 75.3% for just four of IEMOCAP's emotions (neutral, sadness, anger, and happiness), even though the dataset contains a diverse set of emotions and speakers. For more accurate emotion predictions, better speaker-independent and language-agnostic SER models still need to be developed.

If you’re interested in learning more about speech and emotion recognition, look no further! Check out the following project on the Educative platform: Recognize Emotions from Speech using Librosa.