Speech is one of the most natural and information-rich ways humans communicate. But how can machines understand us? Despite the vast difference between machine and human languages, a spoken command such as “Hey Siri, find this” can be understood by a machine. How is this made possible?
Sound waves are vibrations that propagate through a medium such as air or water. These waves are created by sound sources, such as musical instruments, speakers, or the human voice, and are perceived by the human ear as sound. Digitizing audio refers to converting analog sound waves into digital signals that a machine can store and manipulate. This process involves several steps:
Sampling: The analog sound wave is measured at regular intervals, and each measurement is assigned a numerical value. The rate at which these measurements are taken is called the sampling rate and is typically measured in kilohertz (kHz).
Quantization: The sampled values are rounded to the nearest of a finite set of discrete levels. The number of bits used for quantization determines how many levels are available and, therefore, the dynamic range of the digital signal, or the range between the quietest and loudest sounds that can be represented.
Encoding: The quantized values are then encoded into a digital format, such as a WAV or MP3 file, that can be stored on a computer or other digital storage media.
The resulting digital signal can be manipulated in various ways: edited, mixed with other audio signals, or played back through speakers or headphones. Consider the following audio signal; note that this audio clip will be used for feature extraction in later sections.
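To make the digitization steps above concrete, here is a minimal sketch of loading a clip with librosa. The file name `speech.wav` and the 16 kHz sampling rate are assumptions for illustration; the `signal` and `sr` variables defined here are reused in the sketches that follow.

```python
import librosa
import numpy as np

# Load the clip and resample it to 16 kHz mono (file name is a placeholder).
signal, sr = librosa.load("speech.wav", sr=16000)

print(f"Sampling rate: {sr} Hz")
print(f"Number of samples: {len(signal)}")
print(f"Duration: {len(signal) / sr:.2f} s")

# librosa returns floating-point samples in [-1.0, 1.0]; a 16-bit quantized
# version would map them onto 2**16 discrete levels.
quantized = np.round(signal * 32767).astype(np.int16)
```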
An audio clip that is to be processed by machine learning algorithms needs descriptors, because its raw digitized form does not expose the information a model needs to learn the patterns hidden in speech. These descriptors are known as acoustic features. There are two primary categories of acoustic features: prosodic and spectral. Prosodic features describe the rhythmic, intonational, and stress-related aspects of speech, such as its pitch, intonation, and speed.
On the other hand, spectral features pertain to the energy distribution across different frequencies. These features can provide information about the quality or timbre of the sound, as well as other characteristics like pitch and loudness. Analyzing the frequencies shaped by the vocal tract using spectral features can reveal details about the speaker's gender, age, and other attributes. In this discussion, the focus will be on spectral features due to their ability to depict vocal tract characteristics.
The terms sound wave, audio signal, raw audio, and audio clip are used interchangeably throughout this text.
A digital waveform exists in the time domain, with time on the x-axis and amplitude on the y-axis. This signal can be transformed into the frequency domain using the Fourier transform: a method that converts the audio signal, represented as a function of time, into a representation of the frequencies it contains and their magnitudes.
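As a rough sketch (reusing the `signal` and `sr` variables from the loading example), the magnitude spectrum of the whole clip can be computed with NumPy's real FFT:

```python
import numpy as np

spectrum = np.abs(np.fft.rfft(signal))            # magnitude per frequency bin
freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)  # corresponding frequencies in Hz

# The spectrum tells us which frequencies are present and how strong they are,
# but not when they occur within the clip.
```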
As evident from the illustration above, the time information has been lost. To preserve information about amplitude, frequency, and time, we use the short-time Fourier transform (STFT). This technique analyzes signals in the time-frequency domain and is based on the Fourier transform. Instead of analyzing the entire signal at once, the STFT analyzes short, overlapping segments. A spectrogram is obtained after applying the STFT to the signal:
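A minimal spectrogram sketch with librosa, again reusing `signal` and `sr`; the window size (`n_fft`) and hop length are arbitrary but common choices:

```python
import librosa
import numpy as np

stft = librosa.stft(signal, n_fft=1024, hop_length=256)  # complex time-frequency matrix
spectrogram = np.abs(stft)                               # magnitude spectrogram
spectrogram_db = librosa.amplitude_to_db(spectrogram, ref=np.max)  # expressed in decibels
```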
A decibel (dB) is a logarithmic unit of measurement used to express the ratio between two values, such as the loudness of a sound.
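For reference, the standard amplitude-ratio definition of the decibel (not specific to this post) is:

```latex
L_{\mathrm{dB}} = 20 \log_{10}\!\left(\frac{A}{A_{\mathrm{ref}}}\right)
```

so doubling an amplitude adds roughly 6 dB to its level.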
The way humans perceive pitch is non-linear, so a linear frequency scale in hertz does not accurately represent pitch perception. To address this, the Mel scale was developed as a perceptual scale based on how humans perceive pitch rather than on the physical properties of sound. Its name comes from the word melody, and it is constructed so that pitches judged by listeners to be equally spaced are placed at equal distances on the scale. The Mel scale can be converted to hertz and vice versa using specific formulas:
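One commonly used pair of conversion formulas (the variant implemented in many toolkits, given here for reference) is:

```latex
m = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right),
\qquad
f = 700 \left(10^{m / 2595} - 1\right)
```

where f is the frequency in hertz and m is the corresponding value on the Mel scale.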
The three major acoustic features used to train models for speech-related tasks are discussed in detail below.
The Mel-spectrogram is a 2D representation of energy over time-frequency bins, obtained by mapping the spectrogram's frequency axis onto the Mel scale, as demonstrated in the figure:
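A minimal Mel-spectrogram sketch with librosa, reusing `signal` and `sr`; the number of Mel bands is an arbitrary but typical choice:

```python
import librosa
import numpy as np

mel_spec = librosa.feature.melspectrogram(
    y=signal, sr=sr, n_fft=1024, hop_length=256, n_mels=128
)
mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)  # log-scaled Mel-spectrogram
```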
Taking the logarithm of a magnitude spectrum and transforming it back produces a cepstrum. Mel-frequency cepstral coefficients (MFCCs) belong to this cepstral domain. MFCCs model the human auditory system well: the shape of the vocal tract filters the sounds generated by a human, and that shape determines the resulting sound, or phoneme. To obtain MFCCs, the logarithm of the Mel-spectrogram is computed first, followed by an inverse Fourier transform (in practice, usually a discrete cosine transform), as illustrated in the figure below.
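A short MFCC sketch with librosa, reusing `signal` and `sr`; 13 coefficients per frame is a common default, not a requirement:

```python
import librosa

mfccs = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
print(mfccs.shape)  # (13, number_of_frames)
```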
The chroma feature (or chromagram) is an acoustic feature used in speech processing and music analysis. It represents the distribution of energy across the 12 pitch classes of the Western musical scale. This feature helps identify the tonality of a piece and can also reveal pitch patterns and chord progressions in speech or music. The chromagram is obtained by taking the STFT of the audio signal, mapping each frequency bin onto its closest pitch class, and, for every frame, summing the energy of all bins that share a pitch class.
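A chromagram sketch with librosa, reusing `signal` and `sr`:

```python
import librosa

chroma = librosa.feature.chroma_stft(y=signal, sr=sr)
print(chroma.shape)  # (12, number_of_frames), one row per pitch class
```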
After extracting the sound features, the next step is to identify a speech-processing task and use these acoustic features to train a model. Consider the problem of speech emotion recognition (SER), where the model determines the emotion expressed in an audio clip. For this example, speech is classified into four basic emotions: happy, sad, neutral, and angry. The following illustration shows how speech descriptors can be used to train a model:
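The sketch below shows one way such a pipeline could look: clip-level features (mean MFCC and chroma values) feed a simple classifier. The file names, labels, and the choice of an SVM are illustrative assumptions, not the specific setup used in the referenced project.

```python
import numpy as np
import librosa
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def extract_features(path):
    """Summarize a clip as its mean MFCC and chroma values over time."""
    y, sr = librosa.load(path, sr=16000)
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    return np.concatenate([mfccs.mean(axis=1), chroma.mean(axis=1)])

# Hypothetical labeled clips: (path, emotion) pairs.
clips = [
    ("clip_001.wav", "happy"),
    ("clip_002.wav", "sad"),
    # ... more labeled clips covering all four emotions
]

X = np.array([extract_features(path) for path, _ in clips])
y = np.array([label for _, label in clips])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = SVC(kernel="rbf").fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```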
This blog discussed how audio is digitized, how humans perceive pitch, and how acoustic features can be extracted from speech. Furthermore, it demonstrated, with the help of an example, how these acoustic features can be used to perform emotion recognition. Similar steps can be applied to other speech-related tasks such as speech recognition, speech synthesis, speaker identification and verification, speech enhancement, and speech-to-speech translation.
If you’re interested in learning more about audio signal processing and speech processing, look no further! Check out the following project on the Educative platform: Recognize Emotions from Speech using Librosa