Audio Generation
Explore the fundamentals of audio processing and the audio generation model AudioCraft.
Generative AI isn’t limited to images or text—it has also made huge strides in the audio domain.
Fundamentals of audio processing
Before exploring the specifics of audio generation, it’s essential to understand how AI systems represent and process audio data.
Audio signals and feature extraction:
At its most basic, an audio signal is a waveform that represents sound over time. However, raw audio isn’t typically fed directly into AI models. Instead, we extract features: numerical representations that capture the essence of the sound. Common techniques include:

Spectrograms: visual representations of the spectrum of frequencies in a sound signal as they vary with time. The mel spectrogram is a 2D image over time-frequency bins, obtained by mapping frequency onto the mel scale. The mel scale was developed as a perceptual scale based on how humans perceive pitch, rather than on the physical properties of sound.

Mel-frequency cepstral coefficients (MFCCs): features that mimic how the human ear perceives sound, emphasizing the frequencies most important for hearing.

Chroma features (chromagrams): acoustic features used in speech processing and music analysis that represent the distribution of energy across the 12 pitch classes of the Western musical scale.
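To make the spectrogram and mel-scale ideas concrete, here is a minimal NumPy-only sketch. The helper names, the common 2595·log10(1 + f/700) mel formula, and the 440 Hz test tone are illustrative assumptions; libraries such as librosa provide production-grade implementations.

```python
import numpy as np

def hz_to_mel(f):
    # Widely used mel-scale formula: a perceptual pitch scale,
    # not a physical frequency scale (illustrative helper)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # Inverse mapping back from mels to Hz
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def spectrogram(signal, n_fft=512, hop=128):
    # Short-time Fourier transform: window the signal, FFT each frame,
    # keep the magnitudes -> a (freq_bins, time_frames) image
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(frames, axis=1)).T

# One second of a 440 Hz tone at a 16 kHz sample rate (illustrative input)
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)

spec = spectrogram(tone)
peak_hz = spec[:, 0].argmax() * sr / 512  # frequency of the strongest bin
# peak_hz lands near 440 Hz; hz_to_mel(440) is roughly 550 mel
```

A full mel spectrogram would additionally pool these frequency bins through a bank of triangular filters spaced evenly on the mel scale, which is what gives the representation its perceptual weighting.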
Preprocessing:
Just as images might be resized or normalized, audio data undergoes preprocessing. This can include ...
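The analogy to image preprocessing can be made concrete with a small sketch of two typical steps, amplitude normalization and resampling. The function names and the naive linear-interpolation resampler are illustrative assumptions; real pipelines usually rely on a dedicated library resampler.

```python
import numpy as np

def peak_normalize(signal):
    # Scale so the loudest sample has magnitude 1.0
    # (the audio analogue of normalizing image pixel values)
    peak = np.max(np.abs(signal))
    return signal / peak if peak > 0 else signal

def resample_linear(signal, orig_sr, target_sr):
    # Naive linear-interpolation resampling (the audio analogue of
    # resizing an image); production code would use a polyphase or
    # sinc resampler to avoid aliasing
    duration = len(signal) / orig_sr
    n_out = int(round(duration * target_sr))
    t_out = np.arange(n_out) / target_sr
    t_in = np.arange(len(signal)) / orig_sr
    return np.interp(t_out, t_in, signal)

x = np.array([0.0, 0.5, -2.0, 1.0])
y = peak_normalize(x)                # peak magnitude becomes 1.0
z = resample_linear(y, 8000, 16000)  # doubles the number of samples
```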