Speech is one of the most natural and information-rich ways humans communicate. But how can machines understand us? Despite the vast difference between machine and human languages, a spoken command such as “Hey Siri, find this” can be understood by a machine. How is this made possible?
Sound waves are vibrations that propagate through a medium such as air or water. These waves are created by sound sources, such as musical instruments, speakers, or the human voice, and are perceived by the human ear as sound. Digitizing audio refers to converting analog sound waves into digital signals that a machine can store and manipulate. This process involves several steps:
Sampling: The analog sound wave is measured at regular intervals, and each measurement is assigned a numerical value. The rate at which these measurements are taken is called the sampling rate and is typically measured in kilohertz (kHz).
Quantization: The sampled values are rounded to the nearest of a finite set of discrete levels. The number of bits used for quantization determines how many levels are available and, therefore, the dynamic range of the digital signal, or the range between the quietest and loudest sounds that can be represented.
Encoding: The quantized values are then encoded into a digital format, such as a WAV or MP3 file, that can be stored on a computer or other digital storage media.
The resulting digital signal can be manipulated in various ways: edited, mixed with other audio signals, or played back through speakers or headphones. Consider the following audio signal; note that this audio clip will be used for feature extraction in later sections.
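To make the digitization steps above concrete, here is a minimal sketch of loading a clip with librosa. The file name `speech.wav` and the 16 kHz sampling rate are assumptions for illustration; the `signal` and `sr` variables defined here are reused in the sketches that follow.

```python
import librosa
import numpy as np

# Load the clip and resample it to 16 kHz mono (file name is a placeholder).
signal, sr = librosa.load("speech.wav", sr=16000)

print(f"Sampling rate: {sr} Hz")
print(f"Number of samples: {len(signal)}")
print(f"Duration: {len(signal) / sr:.2f} s")

# librosa returns floating-point samples in [-1.0, 1.0]; a 16-bit quantized
# version would map them onto 2**16 discrete levels.
quantized = np.round(signal * 32767).astype(np.int16)
```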
An audio clip that is to be processed by machine learning algorithms needs descriptors, because its raw digitized form does not expose the information a model needs to learn the patterns hidden in speech. These descriptors are known as acoustic features. There are two primary categories of acoustic features: prosodic and spectral. Prosodic features describe the rhythmic, intonational, and stress-related aspects of speech, such as its pitch, intonation, and speed.
On the other hand, spectral features pertain to the energy distribution across different frequencies. These features can provide information about the quality or timbre of the sound, as well as other characteristics like pitch and loudness. Analyzing the frequencies shaped by the vocal tract using spectral features can reveal details about the speaker's gender, age, and other attributes. In this discussion, the focus will be on spectral features due to their ability to depict vocal tract characteristics.
The terms sound wave, audio signal, raw audio, and audio clip are used interchangeably throughout this text.
A digital waveform exists in the time domain, with time on the x-axis and amplitude on the y-axis. This signal can be transformed into the frequency domain using the Fourier transform: a method that converts the audio signal, represented as a function of time, into a representation of the frequencies it contains and their magnitudes.
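As a rough sketch (reusing the `signal` and `sr` variables from the loading example), the magnitude spectrum of the whole clip can be computed with NumPy's real FFT:

```python
import numpy as np

spectrum = np.abs(np.fft.rfft(signal))            # magnitude per frequency bin
freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)  # corresponding frequencies in Hz

# The spectrum tells us which frequencies are present and how strong they are,
# but not when they occur within the clip.
```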
As evident from the illustration above, the time information has been lost. To preserve information about amplitude, frequency, and time, we use the short-time Fourier transform (STFT). This technique analyzes signals in the time-frequency domain and is based on the Fourier transform. Instead of analyzing the entire signal at once, the STFT analyzes short, overlapping segments. A spectrogram is obtained after applying the STFT to the signal:
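A minimal spectrogram sketch with librosa, again reusing `signal` and `sr`; the window size (`n_fft`) and hop length are arbitrary but common choices:

```python
import librosa
import numpy as np

stft = librosa.stft(signal, n_fft=1024, hop_length=256)  # complex time-frequency matrix
spectrogram = np.abs(stft)                               # magnitude spectrogram
spectrogram_db = librosa.amplitude_to_db(spectrogram, ref=np.max)  # expressed in decibels
```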
A decibel (dB) is a logarithmic unit of measurement used to express the ratio between two values, such as the loudness of a sound.
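For reference, the standard amplitude-ratio definition of the decibel (not specific to this post) is:

```latex
L_{\mathrm{dB}} = 20 \log_{10}\!\left(\frac{A}{A_{\mathrm{ref}}}\right)
```

so doubling an amplitude adds roughly 6 dB to its level.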
The way humans perceive pitch is non-linear, so a linear frequency scale in hertz does not accurately represent pitch perception. To address this, the Mel scale was developed as a perceptual scale based on how humans perceive pitch rather than on the physical properties of sound. Its name comes from the word melody, and it is constructed so that pitches judged by listeners to be equally spaced are placed at equal distances on the scale. The Mel scale can be converted to hertz and vice versa using specific formulas:
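One commonly used pair of conversion formulas (the variant implemented in many toolkits, given here for reference) is:

```latex
m = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right),
\qquad
f = 700 \left(10^{m / 2595} - 1\right)
```

where f is the frequency in hertz and m is the corresponding value on the Mel scale.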
The three major acoustic features used to train models for speech-related tasks are discussed in detail below.
The Mel-spectrogram is a 2D representation of energy over time-frequency bins, obtained by mapping the spectrogram's frequency axis onto the Mel scale, as demonstrated in the figure:
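A minimal Mel-spectrogram sketch with librosa, reusing `signal` and `sr`; the number of Mel bands is an arbitrary but typical choice:

```python
import librosa
import numpy as np

mel_spec = librosa.feature.melspectrogram(
    y=signal, sr=sr, n_fft=1024, hop_length=256, n_mels=128
)
mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)  # log-scaled Mel-spectrogram
```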
Taking the logarithm of a magnitude spectrum and transforming it back produces a cepstrum. Mel-frequency cepstral coefficients (MFCCs) belong to this cepstral domain. MFCCs model the human auditory system well: the shape of the vocal tract filters the sounds generated by a human, and that shape determines the resulting sound, or phoneme. To obtain MFCCs, the logarithm of the Mel-spectrogram is computed first, followed by an inverse Fourier transform (in practice, usually a discrete cosine transform), as illustrated in the figure below.
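A short MFCC sketch with librosa, reusing `signal` and `sr`; 13 coefficients per frame is a common default, not a requirement:

```python
import librosa

mfccs = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
print(mfccs.shape)  # (13, number_of_frames)
```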
The chroma feature (or chromagram) is an acoustic feature used in speech processing and music analysis. It represents the distribution of energy across the 12 pitch classes of the Western musical scale. This feature helps identify the tonality of a piece and can also reveal pitch patterns and chord progressions in speech or music. The chromagram is obtained by taking the STFT of the audio signal, mapping each frequency bin onto its closest pitch class, and, for every frame, summing the energy of all bins that share a pitch class.
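A chromagram sketch with librosa, reusing `signal` and `sr`:

```python
import librosa

chroma = librosa.feature.chroma_stft(y=signal, sr=sr)
print(chroma.shape)  # (12, number_of_frames), one row per pitch class
```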
After extracting the sound features, the next step is to identify a speech-processing task and use these acoustic features to train a model. Consider the problem of speech emotion recognition (SER), where the model determines the emotion expressed in an audio clip. For this example, speech is classified into four basic emotions: happy, sad, neutral, and angry. The following illustration shows how speech descriptors can be used to train a model:
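The sketch below shows one way such a pipeline could look: clip-level features (mean MFCC and chroma values) feed a simple classifier. The file names, labels, and the choice of an SVM are illustrative assumptions, not the specific setup used in the referenced project.

```python
import numpy as np
import librosa
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def extract_features(path):
    """Summarize a clip as its mean MFCC and chroma values over time."""
    y, sr = librosa.load(path, sr=16000)
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    return np.concatenate([mfccs.mean(axis=1), chroma.mean(axis=1)])

# Hypothetical labeled clips: (path, emotion) pairs.
clips = [
    ("clip_001.wav", "happy"),
    ("clip_002.wav", "sad"),
    # ... more labeled clips covering all four emotions
]

X = np.array([extract_features(path) for path, _ in clips])
y = np.array([label for _, label in clips])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = SVC(kernel="rbf").fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```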
This blog discussed how audio is digitized, how humans perceive pitch, and how acoustic features can be extracted from speech. Furthermore, it demonstrated, with the help of an example, how these acoustic features can be used to perform emotion recognition. Similar steps can be applied to other speech-related tasks such as speech recognition, speech synthesis, speaker identification and verification, speech enhancement, and speech-to-speech translation.
If you’re interested in learning more about audio signal processing and speech processing, look no further! Check out the following project on the Educative platform: Recognize Emotions from Speech using Librosa