audio data has two parts: amplitude and frequency.
let's see how audio data is stored in your wav files.
it's mostly a list of integers, each representing the amplitude of the sound at one instant.
sampling rate: how many times per second the amplitude is measured and stored in the list of integers. it's commonly 44,100 Hz, meaning 44,100 samples per second
the other is bit depth: how precisely each amplitude value is stored. commonly it's 16 bits, meaning each value can range from −32,768 to +32,767
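a minimal sketch of the storage described above, using only python's standard library. it builds a 440 Hz tone as 16-bit integers, writes it as wav bytes in memory, and reads it back. the helper names (make_tone, write_wav, read_wav), the mono channel, and the 0.5 volume are assumptions made for illustration.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 44_100                        # samples per second
BITS_PER_SAMPLE = 16                        # bits used to store each amplitude
MAX_AMPLITUDE = 2 ** (BITS_PER_SAMPLE - 1) - 1   # 32767 for 16 bits


def make_tone(freq_hz, seconds, volume=0.5):
    """Return a sine tone as a list of signed 16-bit integers."""
    n_samples = int(SAMPLE_RATE * seconds)
    return [
        int(volume * MAX_AMPLITUDE * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE))
        for n in range(n_samples)
    ]


def write_wav(samples):
    """Encode mono 16-bit samples as wav bytes held in memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(BITS_PER_SAMPLE // 8)   # sample width in bytes
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    return buffer.getvalue()


def read_wav(wav_bytes):
    """Decode mono 16-bit wav bytes into (sample_rate, list of integers)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    samples = list(struct.unpack(f"<{len(raw) // 2}h", raw))
    return rate, samples


tone = make_tone(440, 0.1)
rate, decoded = read_wav(write_wav(tone))
print(rate, len(decoded), decoded[:5])
```

the decoded list is exactly the list of integers that went in, which is all a wav file really holds besides a small header.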
this format, while workable, is not ideal for machine learning models: the list is very long and the frequency content is not directly visible in it.
so we turn this 1d data into a 2d representation using the fourier transform
a fourier transform tells us how strong different frequency bands (for example 0–1 kHz, 1–2 kHz, 10–20 kHz, etc.) are in the signal
applying it once on the whole audio only gives the overall frequency distribution (no information about when a frequency occurred) → still 1d
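a small numpy sketch of that limitation. the test signal (440 Hz for the first half second, then 880 Hz) and the function name magnitude_spectrum are made up for illustration. one fourier transform over the whole signal finds both frequencies but says nothing about their order.

```python
import numpy as np

SAMPLE_RATE = 44_100


def magnitude_spectrum(signal, sample_rate):
    """Return (frequencies in Hz, magnitude at each frequency) for a real signal."""
    magnitudes = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return freqs, magnitudes


# one second of audio: 440 Hz for the first half, 880 Hz for the second half
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
signal = np.where(t < 0.5, np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 880 * t))

freqs, magnitudes = magnitude_spectrum(signal, SAMPLE_RATE)

# the two strongest frequencies in the whole clip, sorted low to high
strongest = np.sort(freqs[np.argsort(magnitudes)[-2:]])
print(strongest)
```

swapping the two halves of the signal would give the same two peaks, which is why the result is still 1d.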
to capture time-specific frequency information, we split the audio into short chunks (commonly 20–30 ms, e.g. 25 ms)
then we run the fourier transform on each chunk → this is called short-time fourier transform (STFT)
stacking these results gives us a 2d representation:
one axis = time (chunks)
one axis = frequency
values in the grid = amplitude/energy for each frequency at each time
this 2d representation is often visualized as a spectrogram
example:
pure 440 Hz tone → spectrogram shows a single constant line at 440 Hz across time
speech → spectrogram shows changing frequency bands (formants) over time
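the chunk-and-transform idea can be sketched in numpy like this. the 25 ms chunk matches the notes; the 10 ms hop (so chunks overlap), the hann window, and the 16 kHz sample rate are assumptions chosen to keep the numbers clean, and stft_magnitude is a hypothetical helper name.

```python
import numpy as np


def stft_magnitude(signal, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Split the signal into short windowed chunks and transform each one.

    Returns (freqs, spectrogram), where spectrogram has shape
    (n_frames, n_freq_bins): rows are time, columns are frequency.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    window = np.hanning(frame_len)           # tapers chunk edges to reduce leakage
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    frames = np.stack([
        signal[i * hop_len: i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])
    spectrogram = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    return freqs, spectrogram


# one second of a pure 440 Hz tone
sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440 * t)

freqs, spectrogram = stft_magnitude(tone, sample_rate)

# strongest frequency in each chunk: the "line" seen in the spectrogram
peak_per_frame = freqs[spectrogram.argmax(axis=1)]
print(spectrogram.shape, peak_per_frame[:5])
```

for the pure tone the strongest frequency is the same in every chunk, which is the constant line from the example above. for speech, peak_per_frame would move around from chunk to chunk.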
mel-spectrogram
mel-spectrogram is derived from the spectrogram by mapping frequencies onto the mel scale
mel scale is a perceptual scale where equal steps in mels sound like equal steps in pitch to human ears
humans perceive pitch roughly logarithmically (a 100 Hz difference is far more noticeable at 200 Hz than at 10 kHz)
so instead of equal-width linear frequency bins, mel-spectrogram uses narrow bands at low frequencies and wide bands at high frequencies
result = representation closer to how humans actually hear sound
widely used in speech recognition and music analysis
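a rough sketch of the mel mapping in numpy. it assumes the common formula mel = 2595 · log10(1 + f / 700), which is one of several variants in use, and triangular filters spaced equally in mel. the names hz_to_mel, mel_to_hz, mel_filterbank, mel_spectrogram and the choice of 40 bands are illustrative.

```python
import numpy as np


def hz_to_mel(hz):
    """Convert Hz to mel (one common variant of the formula)."""
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters of shape (n_mels, n_fft // 2 + 1), plus band edges in Hz."""
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    # n_mels + 2 points equally spaced in mel: left edge, centre, right edge per band
    edges_mel = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2)
    edges_hz = mel_to_hz(edges_mel)
    bank = np.zeros((n_mels, len(fft_freqs)))
    for m in range(n_mels):
        left, centre, right = edges_hz[m:m + 3]
        rising = (fft_freqs - left) / (centre - left)
        falling = (right - fft_freqs) / (right - centre)
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank, edges_hz


def mel_spectrogram(power_spectrogram, bank):
    """Map (n_frames, n_freq_bins) onto (n_frames, n_mels)."""
    return power_spectrogram @ bank.T


bank, edges_hz = mel_filterbank(n_mels=40, n_fft=400, sample_rate=16_000)
band_widths = np.diff(edges_hz)
print(band_widths[:3], band_widths[-3:])   # narrow bands first, wide bands last
```

the band widths grow with frequency, so high frequencies get squeezed into fewer bands and low frequencies get more of them.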
mel-frequency cepstral coefficients (mfccs)
derived from the mel-spectrogram by taking the log of each value and then applying a discrete cosine transform (dct) along the frequency axis to decorrelate the frequency bands
captures the overall spectral envelope (timbre/shape of sound) rather than fine detail
commonly used in speech recognition because the spectral envelope reflects the shape of the vocal tract, which is what distinguishes one phoneme from another
typical dimension: 12–13 coefficients per frame (sometimes + energy + delta + delta-delta → ~39 features per frame)
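a minimal numpy sketch of the mfcc step, assuming a mel-spectrogram is already available as an array of shape (n_frames, n_mels). the log before the dct is the usual convention. the deltas here use np.gradient as a simple stand-in for the regression-based deltas found in real toolkits, and the function names are hypothetical.

```python
import numpy as np


def dct_basis(n_coeffs, n_bands):
    """Orthonormal DCT-II basis of shape (n_coeffs, n_bands)."""
    n = np.arange(n_bands)
    k = np.arange(n_coeffs)[:, None]
    basis = np.cos(np.pi * (n + 0.5) * k / n_bands) * np.sqrt(2.0 / n_bands)
    basis[0] /= np.sqrt(2.0)
    return basis


def mfcc(mel_spec, n_coeffs=13, eps=1e-10):
    """Log-compress each mel band, then keep the first n_coeffs DCT coefficients."""
    log_mel = np.log(mel_spec + eps)          # eps avoids log(0)
    return log_mel @ dct_basis(n_coeffs, mel_spec.shape[1]).T


def add_deltas(features):
    """Append first and second time differences (delta and delta-delta)."""
    delta = np.gradient(features, axis=0)     # change from frame to frame
    delta2 = np.gradient(delta, axis=0)       # change of the change
    return np.hstack([features, delta, delta2])


# stand-in mel-spectrogram: 50 frames, 40 mel bands of random positive values
rng = np.random.default_rng(0)
mel_spec = rng.random((50, 40)) + 0.1

coeffs = mfcc(mel_spec)
features = add_deltas(coeffs)
print(coeffs.shape, features.shape)
```

keeping only the first 13 coefficients is what throws away the fine detail: low-order dct coefficients describe the smooth shape across bands, high-order ones describe rapid wiggles.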
chroma features
group the spectrum into 12 bins, one for each pitch class (c, c#, d … b), regardless of octave
captures harmonic and melodic characteristics, useful for music analysis (e.g., chord detection, key recognition)
discards absolute pitch height (which octave a note is in) and keeps only how energy is distributed across the 12 pitch classes
typical dimension: 12 features per frame
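a simple sketch of chroma for a single frame in numpy. it assumes equal temperament with A4 = 440 Hz and assigns each fft bin to its nearest note, which is cruder than what dedicated libraries do. chroma_from_spectrum and the fmin cutoff are made up for illustration.

```python
import numpy as np

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def chroma_from_spectrum(magnitudes, freqs, fmin=27.5):
    """Sum spectral energy into 12 pitch-class bins for one frame."""
    chroma = np.zeros(12)
    usable = freqs >= fmin                    # also skips 0 Hz, where log2 is undefined
    # midi note number of each bin (A4 = 440 Hz = note 69); note 60 is a C
    midi = 69 + 12 * np.log2(freqs[usable] / 440.0)
    pitch_class = np.round(midi).astype(int) % 12
    np.add.at(chroma, pitch_class, magnitudes[usable] ** 2)
    total = chroma.sum()
    return chroma / total if total > 0 else chroma


# the note A in three different octaves: 220, 440 and 880 Hz
sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate
signal = sum(np.sin(2 * np.pi * f * t) for f in (220.0, 440.0, 880.0))

magnitudes = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)

chroma = chroma_from_spectrum(magnitudes, freqs)
print(PITCH_CLASSES[int(np.argmax(chroma))])
```

all three octaves land in the same bin, so the frame is described as "mostly A" no matter which octave was played.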
mel-spectrograms, mfccs, and chroma features are all 2d grids (feature × time), so they can be treated like images, and CNNs tend to work well on them.