Goyalayus

audio data has two parts: amplitude and frequency.

let's see how audio data is stored in your wav files

it is mostly a list of integers, each representing the amplitude of the sound at one instant.

sampling rate: how many times per second the amplitude is measured and stored in the list of integers. it's commonly 44,100 Hz, meaning 44,100 samples per second

the other is bit depth (often loosely called bit rate): how precisely each amplitude value is stored. commonly it's 16 bits, meaning each value can range from −32,768 to +32,767
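
to make the storage format concrete, here is a minimal sketch that writes one second of a 440 Hz tone as a mono 16-bit wav and reads it back as a list of integers. it assumes numpy is available; the tone, the in-memory buffer and the variable names are illustrative choices, not something wav itself prescribes.

```python
import io
import wave

import numpy as np

SAMPLE_RATE = 44_100  # samples per second

# one second of a 440 Hz sine wave, scaled into the 16-bit integer range
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
samples = (0.5 * 32767 * np.sin(2 * np.pi * 440 * t)).astype("<i2")

# write it as a mono 16-bit wav (in memory, so no file is left behind)
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav_out:
    wav_out.setnchannels(1)            # mono
    wav_out.setsampwidth(2)            # 2 bytes = 16 bits per sample
    wav_out.setframerate(SAMPLE_RATE)
    wav_out.writeframes(samples.tobytes())

# read it back: the payload is just the list of amplitude integers
buffer.seek(0)
with wave.open(buffer, "rb") as wav_in:
    rate = wav_in.getframerate()
    width = wav_in.getsampwidth()
    raw = wav_in.readframes(wav_in.getnframes())

decoded = np.frombuffer(raw, dtype="<i2")  # little-endian 16-bit integers
print(rate, width * 8, len(decoded), decoded.min(), decoded.max())
```

for a real file you would pass a path to `wave.open` instead of the buffer.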

this format, while okay, is not that great for machine learning models.

we try to turn this 1d data into 2d using the fourier transform

a fourier transform tells us how strong different frequency bands (for example 0–1 kHz, 1–2 kHz, 10–20 kHz, etc.) are in the signal

applying it once on the whole audio only gives the overall frequency distribution (no information about when a frequency occurred) → still 1d
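
a small sketch of that limitation, assuming numpy: the signal below plays 440 Hz and then 1000 Hz, and a single fourier transform over the whole thing finds both tones but cannot say which came first. the sample rate and tone frequencies are arbitrary choices for illustration.

```python
import numpy as np

sample_rate = 8000
t = np.arange(sample_rate) / sample_rate  # one second of time stamps

# first half second: 440 Hz, second half second: 1000 Hz
signal = np.where(t < 0.5,
                  np.sin(2 * np.pi * 440 * t),
                  np.sin(2 * np.pi * 1000 * t))

# one fourier transform over the whole signal gives a 1d array of strengths
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)

# the two strongest frequencies are the two tones
top_two = np.sort(freqs[np.argsort(spectrum)[-2:]])
print(top_two)

# playing the signal backwards (1000 Hz first) gives the same magnitudes,
# so the ordering in time is not visible in this representation
reversed_spectrum = np.abs(np.fft.rfft(signal[::-1]))
print(np.allclose(spectrum, reversed_spectrum))
```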

to capture time-specific frequency information, we split the audio into short chunks (commonly 20–30 ms, e.g. 25 ms)

then we run the fourier transform on each chunk → this is called short-time fourier transform (STFT)

stacking these results gives us a 2d representation:

one axis = time (chunks)
one axis = frequency
values in the grid = amplitude/energy for each frequency at each time

this 2d representation is often visualized as a spectrogram

example:

pure 440 Hz tone → spectrogram shows a straight line at 440 Hz across time
speech → spectrogram shows changing frequency bands (formants) over time
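
the chunk-then-transform idea can be sketched in a few lines of numpy. this is a minimal version under some assumptions that are not from the text above: chunks overlap (a new 25 ms chunk every 10 ms) and each chunk is multiplied by a hann window before the transform, both of which are common but optional choices. `stft_magnitude` is a made-up helper name.

```python
import numpy as np

def stft_magnitude(signal, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Return (spectrogram, freqs); spectrogram has shape (frequency, time)."""
    frame_len = int(sample_rate * frame_ms / 1000)  # samples per chunk
    hop_len = int(sample_rate * hop_ms / 1000)      # step between chunk starts
    window = np.hanning(frame_len)                  # assumption: hann window
    n_frames = 1 + (len(signal) - frame_len) // hop_len

    # cut the signal into chunks and taper each one
    frames = np.stack([
        signal[i * hop_len : i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])

    # one fourier transform per chunk, then put frequency on the first axis
    spec = np.abs(np.fft.rfft(frames, axis=1)).T
    freqs = np.fft.rfftfreq(frame_len, d=1 / sample_rate)
    return spec, freqs

# the pure-tone example: one second of 440 Hz
sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440 * t)

spec, freqs = stft_magnitude(tone, sample_rate)
peak_freqs = freqs[np.argmax(spec, axis=0)]  # strongest frequency per chunk
print(spec.shape, peak_freqs[:5])
```

the strongest frequency is the same in every chunk, which is the straight line you would see in the spectrogram.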

mel-spectrogram

mel-spectrogram is derived from the spectrogram by mapping frequencies onto the mel scale

mel scale is a perceptual scale where equal steps sound equally spaced to human ears

humans perceive pitch roughly logarithmically (we notice differences at low frequencies more than at high)

so instead of linear frequency bins, mel-spectrogram compresses high-frequency ranges and expands low ones

result = representation closer to how humans actually hear sound

widely used in speech recognition and music analysis
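
a minimal numpy sketch of that mapping. it assumes one common formula for the mel scale, mel = 2595 · log10(1 + f / 700); other variants exist. the triangular filters, the helper names and the random stand-in spectrogram are illustrative assumptions, not the only way to do it.

```python
import numpy as np

def hz_to_mel(f):
    # one common mel formula (assumption: other variants exist)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters whose centers are equally spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.fft.rfftfreq(n_fft, d=1 / sample_rate)

    bank = np.zeros((n_mels, len(fft_freqs)))
    for i in range(n_mels):
        left, center, right = hz_points[i : i + 3]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank, hz_points

bank, hz_points = mel_filterbank(n_mels=40, n_fft=400, sample_rate=16_000)

# band edges get further apart as frequency grows:
# narrow bands at the low end, wide bands at the high end
print(np.round(np.diff(hz_points)[[0, -1]], 1))

# applying it: (40, 201) @ (201, time) -> (40, time)
rng = np.random.default_rng(0)
power_spec = rng.random((201, 98))  # placeholder for a real power spectrogram
mel_spec = bank @ power_spec
print(mel_spec.shape)
```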

mel-frequency cepstral coefficients (mfccs)

derived from the mel-spectrogram by taking the log of the energies and then applying a discrete cosine transform (dct) to decorrelate frequency bands
captures the overall spectral envelope (timbre/shape of sound) rather than fine detail
commonly used in speech recognition because the spectral envelope reflects the shape of the vocal tract, which is what distinguishes one phoneme from another
typical dimension: 12–13 coefficients per frame (sometimes + energy + delta + delta-delta → ~39 features per frame)
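
a rough sketch of the mfcc step in numpy, assuming you already have a mel-spectrogram of shape (mel bands, frames). the dct is written out by hand so the example needs nothing beyond numpy. the deltas here use `np.gradient` as a simple stand-in (real toolkits usually fit a slope over a few neighbouring frames), and the input is random placeholder data.

```python
import numpy as np

def dct_ii_matrix(n_out, n_in):
    """First n_out rows of the orthonormal DCT-II basis for length-n_in inputs."""
    n = np.arange(n_in)
    k = np.arange(n_out)[:, None]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_in))
    basis *= np.sqrt(2.0 / n_in)
    basis[0] *= np.sqrt(0.5)  # the k = 0 row gets the 1/sqrt(n_in) scaling
    return basis

def mfcc_from_mel(mel_spec, n_mfcc=13, eps=1e-10):
    """mel_spec: (n_mels, n_frames) -> mfccs: (n_mfcc, n_frames)."""
    log_mel = np.log(mel_spec + eps)  # eps avoids log(0)
    return dct_ii_matrix(n_mfcc, mel_spec.shape[0]) @ log_mel

rng = np.random.default_rng(0)
mel_spec = rng.random((40, 98))  # placeholder for a real mel-spectrogram

mfcc = mfcc_from_mel(mel_spec)              # 13 coefficients per frame
delta = np.gradient(mfcc, axis=1)           # simplified delta (assumption)
delta_delta = np.gradient(delta, axis=1)    # simplified delta-delta
features = np.vstack([mfcc, delta, delta_delta])
print(mfcc.shape, features.shape)
```

keeping only the first 13 rows of the dct is what throws away fine detail and leaves the smooth envelope.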

chroma features

group the spectrum into 12 bins, one for each pitch class (c, c#, d … b), regardless of octave
captures harmonic and melodic characteristics, useful for music analysis (e.g., chord detection, key recognition)
ignores exact frequency scale, focuses on pitch class energy distribution
typical dimension: 12 features per frame
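
a small sketch of the folding idea for a single frame, assuming numpy and standard tuning (a4 = 440 Hz). each fft bin is assigned to its nearest note and then to that note's pitch class, so octaves collapse together. `chroma_from_spectrum` is a made-up helper, and real implementations usually spread each bin across neighbouring pitch classes more smoothly.

```python
import numpy as np

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def chroma_from_spectrum(magnitudes, freqs, a4_hz=440.0):
    """Fold one frame's spectrum into 12 pitch-class energies that sum to 1."""
    chroma = np.zeros(12)
    valid = freqs > 20.0  # skip 0 Hz (log2 undefined) and sub-audible bins

    # nearest midi note for each bin (69 = a4), then drop the octave
    midi = 69 + 12 * np.log2(freqs[valid] / a4_hz)
    pitch_class = np.round(midi).astype(int) % 12  # 0 = C, 9 = A

    np.add.at(chroma, pitch_class, magnitudes[valid] ** 2)
    total = chroma.sum()
    return chroma / total if total > 0 else chroma

# the note a played in three octaves at once: 220, 440 and 880 Hz
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
signal = sum(np.sin(2 * np.pi * f * t) for f in (220.0, 440.0, 880.0))

magnitudes = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)

chroma = chroma_from_spectrum(magnitudes, freqs)
print(PITCH_CLASSES[int(np.argmax(chroma))])
```

all three octaves land in the same bin, which is what "regardless of octave" means in practice.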

mel-spectrograms, mfccs and chroma features are all 2d grids (features × time), much like images, so CNNs work very well on them.

the hitchhiker’s guide to working with Audio Data