Audio Processing & Feature Extraction
Overview
Audio processing is the foundation of all voice-based AI models. Before any machine learning can happen, raw audio must be converted into representations that neural networks can understand and process efficiently.
1. Raw Waveform (Time-Domain) Processing
Concept
This is direct processing of the original audio signal. Audio is treated as a sequence of amplitude values over time.
- Example: 44,100 samples per second in 44.1kHz audio
- No loss of information (everything the mic captured is preserved)
- High-dimensional input → needs large models and lots of compute
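To make the scale concrete, here is a minimal sketch of what one second of raw 44.1kHz audio looks like as data (the 440 Hz tone and 1-second duration are illustrative choices, not anything a model requires):

```python
import numpy as np

# One second of a 44.1 kHz waveform: one amplitude value per sample,
# so a single second is already a 44,100-dimensional input.
sample_rate = 44_100                    # samples per second
duration = 1.0                          # seconds
t = np.arange(int(sample_rate * duration)) / sample_rate
waveform = 0.5 * np.sin(2 * np.pi * 440.0 * t)   # illustrative 440 Hz tone

print(waveform.shape)                   # (44100,)
```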
Advantages
- Complete information preservation
- Captures subtle acoustic details
- No preprocessing loss
Disadvantages
- Computationally expensive
- Requires large models
- High memory usage
Models Using Raw Waveforms
- Wav2Vec 2.0 (Meta) - Self-supervised representation learning
- Whisper (OpenAI) - Decodes the raw waveform only during preprocessing; its encoder consumes log-mel spectrograms
- WaveNet (DeepMind) - For TTS generation
2. Frequency-Domain Processing
Concept
Instead of using raw audio samples, we convert the waveform into the frequency spectrum using techniques like Fourier Transform.
The key insight: Audio is easier to work with when decomposed into its frequency components.
Processing Steps
- Take raw waveform
- Apply Fourier Transform (or Short-Time Fourier Transform)
- Get frequency information
- Optionally transform to human-perceptual scale (Mel)
- Convert to image-like format for neural networks
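The steps above can be sketched with SciPy's Short-Time Fourier Transform (the 16kHz rate, 25ms window, and 10ms hop are common but illustrative parameter choices):

```python
import numpy as np
from scipy.signal import stft

sr = 16_000
x = np.random.randn(sr)                 # 1 second of stand-in audio

# 400 samples = 25 ms window; 160-sample hop = 10 ms between frames
f, t, Z = stft(x, fs=sr, nperseg=400, noverlap=240)

magnitude = np.abs(Z)                   # energy per (frequency, time) bin
print(magnitude.shape)                  # (n_freq_bins, n_time_frames)
```

The result is exactly the image-like 2D grid described above: one axis is frequency, the other is time.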
Spectrogram
A spectrogram is a visual representation of the spectrum of frequencies in a signal as they vary with time.
- X-axis: Time
- Y-axis: Frequency
- Color intensity: Amplitude/Energy at that frequency-time point
- Creates a 2D image from 1D audio signal
Mel Spectrogram
A Mel Spectrogram applies the Mel scale, which models how the human ear perceives frequencies (we’re better at distinguishing differences in lower frequencies than higher ones).
- Maps frequency scale to mel scale (human perception)
- Reduces dimensions while preserving important perceptual information
- More efficient than raw spectrogram for speech tasks
- Standard choice for TTS and ASR models
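The mapping itself is a simple formula. This sketch uses the HTK-style mel convention, one of several in common use:

```python
import numpy as np

# HTK-style mel scale: mel = 2595 * log10(1 + hz / 700)
def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

# Roughly linear below ~1 kHz, logarithmic above -- matching the observation
# that humans distinguish low frequencies more finely than high ones.
print(hz_to_mel([100, 1000, 8000]))
```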
MFCC (Mel-Frequency Cepstral Coefficients)
MFCCs are a compact, hand-engineered feature set originally designed for speech processing:
- Derived from Mel spectrograms
- Further compressed using cepstral analysis
- Historically used in traditional speech models
- Still useful for small embedded systems
- Being replaced by learned representations in modern models
Advantages of Frequency-Domain Processing
- Reduced dimensionality (~44,100 samples per second → ~100 spectrogram frames per second)
- Aligns with human hearing perception
- Efficient for neural networks
- Works great with CNNs (2D grids like images)
Models Using Frequency-Domain
- Whisper (OpenAI) - Converts raw waveform → log-mel spectrogram
- Tacotron (TTS) - Uses mel spectrograms
- DeepSpeech - Uses MFCC-like features
3. Convolutional Neural Networks (CNN) for Audio
Why CNNs for Audio?
CNNs are specialized neural networks that:
- Detect patterns like edges, curves, textures
- Work best with grid-like data (images, spectrograms)
- Use filters (also called kernels) to slide over the input and extract local features
Audio as Images
When we convert audio into a 2D spectrogram, it becomes like an image:
- X-axis: Time
- Y-axis: Frequency
- Pixel Intensity: Energy at that time-frequency point
So CNNs can analyze patterns across time and frequency, just like they detect features in images.
CNN Architecture for Audio
Raw Audio Waveform
↓
Mel Spectrogram (2D grid)
↓
CNN Layers (Convolutions)
├─ Conv1: Detect basic patterns (pitch, texture)
├─ Conv2: Detect mid-level patterns (phonemes)
├─ Conv3: Detect high-level patterns (words, context)
↓
Feature Maps
↓
Fully Connected / Attention Layers
↓
Output (Classification, Transcription, etc.)
Key Concepts
- Local Connectivity: Convolutional filters only look at local regions, capturing patterns like “this frequency jumped at this time”
- Parameter Sharing: Same filter applied everywhere, learning general patterns
- Hierarchical Feature Learning: Early layers find simple patterns, later layers find complex patterns
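Local connectivity and parameter sharing can be seen in a single hand-made filter slid over a toy spectrogram (the filter weights and spectrogram size here are illustrative):

```python
import numpy as np
from scipy.signal import convolve2d

# Toy "spectrogram": 80 mel bins x 100 time frames
spectrogram = np.random.rand(80, 100)

# One 3x3 filter: the SAME 9 weights are reused at every
# time-frequency position (parameter sharing), and each output value
# depends only on a 3x3 local patch (local connectivity).
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

feature_map = convolve2d(spectrogram, kernel, mode='valid')
print(feature_map.shape)   # (78, 98): one response per local patch
```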
4. Noise Cancellation & Denoising
Problem
Real-world audio contains background noise:
- Ambient noise
- Microphone noise
- Interference from other speakers
Noise Cancellation Models
Models designed to separate speech from noise in audio:
| Model | Parameters | RTF (CPU) | Architecture | Quality (PESQ) |
|---|---|---|---|---|
| RNNoise | 0.06M | 0.03 | GRU + DSP Hybrid | 2.33 |
| DeepFilterNet2 | 2.31M | 0.04 | Two-stage ERB | 3.08 |
| SpeechBrain SepFormer | 22.3M | ~0.15 | Transformer-based | 3.15 |
| Facebook Denoiser | 60.8M | 0.8 | U-Net Encoder-Decoder | 3.07 |
What is RTF (Real-Time Factor)?
RTF measures how fast a model runs relative to audio duration:
- RTF = 0.03 means the model processes 1 hour of audio in ~1.8 minutes
- RTF = 0.8 means the model processes 1 hour of audio in ~48 minutes
- Lower RTF = Faster = Better for real-time applications
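The arithmetic behind those numbers is just multiplication of the RTF by the audio duration:

```python
# RTF = processing_time / audio_duration,
# so processing_time = RTF * audio_duration.
def processing_minutes(rtf, audio_minutes):
    return rtf * audio_minutes

print(processing_minutes(0.03, 60))   # 1.8 minutes for 1 hour of audio
print(processing_minutes(0.8, 60))    # 48 minutes for 1 hour of audio
```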
How Denoising Works
- Input: Noisy audio waveform
- Analysis: Split into frequency bands or time frames
- Feature Extraction: Compute statistics of noise and signal
- Separation: Isolate speech from noise (using learned patterns or signal processing)
- Synthesis: Reconstruct clean speech
- Output: Denoised waveform
Denoising Techniques
- Spectral Subtraction: Subtract estimated noise spectrum from signal
- Wiener Filtering: Apply adaptive filter based on signal-to-noise ratio
- Deep Learning: Train neural networks to discriminate speech from noise
- Source Separation: Decompose audio into separate sources (speech, music, noise)
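Spectral subtraction, the simplest of these techniques, can be sketched in a few lines. This is a deliberately simplistic version: it assumes the first few frames contain only noise, which is a common but crude way to estimate the noise spectrum:

```python
import numpy as np
from scipy.signal import stft

sr = 16_000
noisy = np.random.randn(sr)             # stand-in for noisy speech

f, t, Z = stft(noisy, fs=sr, nperseg=400)
magnitude = np.abs(Z)

# Crude noise estimate: average magnitude of the first 10 frames,
# assumed (for this sketch) to contain no speech.
noise_estimate = magnitude[:, :10].mean(axis=1, keepdims=True)

# Subtract the noise estimate from every frame, flooring at zero
# (negative magnitudes are not physically meaningful).
cleaned = np.maximum(magnitude - noise_estimate, 0.0)
```

A real denoiser would also reconstruct the waveform (e.g. via an inverse STFT using the noisy phase) and track the noise estimate over time.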
5. Complete Audio Processing Pipeline (Example)
Whisper’s Approach
Whisper (OpenAI) is a good reference for a complete pipeline:
Audio File (MP3, WAV, etc.)
↓
[Decoding] → Raw PCM waveform (16kHz sampling)
↓
[Windowing] → Frames (25ms windows with a 10ms hop between them)
↓
[FFT] → Frequency representation for each frame
↓
[Mel Scale] → Map to mel scale (human perception)
↓
[Log Scale] → Apply log compression (normalize amplitude)
↓
[Normalization] → Standardize values (mean=0, std=1)
↓
[Output] → Log-mel spectrogram (80 mel bins × time steps)
↓
Input to Speech Model (Transformer Encoder)
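The last two preprocessing steps (log compression and normalization) are worth seeing concretely. The random input here is a stand-in for a real mel spectrogram:

```python
import numpy as np

# Stand-in mel spectrogram: 80 mel bins x 300 time frames
mel = np.random.rand(80, 300) + 1e-6    # small offset avoids log(0)

log_mel = np.log(mel)                   # compress the huge dynamic range of energy

# Standardize so the model sees inputs with mean 0 and std 1
normalized = (log_mel - log_mel.mean()) / log_mel.std()

print(normalized.mean(), normalized.std())   # ~0 and ~1
```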
MFCC Extraction Example
Audio Signal
↓
[Pre-emphasis] → Boost high frequencies
↓
[Windowing] → Apply window function (Hamming)
↓
[FFT] → Frequency spectrum
↓
[Mel Filter Bank] → Apply 40 overlapping triangular filters
↓
[Log Energy] → Take log of filter outputs
↓
[DCT] → Discrete Cosine Transform
↓
[Output] → 13 MFCC coefficients + derivatives
↓
Input to Speech Model (RNN, SVM, etc.)
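The final DCT step of this pipeline can be sketched directly; the 40 filter energies and 13 kept coefficients match the conventional choices above, and the random input stands in for real log filter-bank outputs:

```python
import numpy as np
from scipy.fftpack import dct

# Stand-in for the log energies of 40 mel filter-bank outputs
log_energies = np.log(np.random.rand(40) + 1e-6)

# DCT-II decorrelates the energies; keeping the first 13 coefficients
# discards fine spectral detail while retaining the overall envelope.
mfcc = dct(log_energies, type=2, norm='ortho')[:13]
print(mfcc.shape)   # (13,)
```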
6. Feature Extraction Concepts
Short-Time Fourier Transform (STFT)
Used to analyze frequency content of signals that vary over time:
- Divides audio into short overlapping windows
- Applies FFT to each window
- Produces time-frequency representation
- Foundation for spectrograms
Zero Crossing Rate (ZCR)
Counts how many times the audio signal crosses the zero amplitude line:
- High ZCR → High-frequency content (consonants, fricatives)
- Low ZCR → Low-frequency content (vowels)
- Used in VAD (voice activity detection)
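ZCR is straightforward to compute, and the high-vs-low-frequency behavior is easy to verify on two sine tones (the frequencies chosen here are illustrative):

```python
import numpy as np

# Fraction of adjacent sample pairs whose signs differ
def zcr(x):
    return np.mean(np.abs(np.diff(np.sign(x)))) / 2.0

sr = 16_000
t = np.arange(sr) / sr
low = np.sin(2 * np.pi * 100 * t)     # vowel-like low-frequency tone
high = np.sin(2 * np.pi * 4000 * t)   # fricative-like high-frequency tone

print(zcr(low), zcr(high))            # the high tone crosses zero far more often
```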
Short-Time Energy (STE)
Measure of energy/power in a window of audio:
- Speech has higher energy than silence
- Useful for speech detection
- Sensitive to volume changes
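A minimal sketch of STE on a loud frame versus a near-silent one (the frame length and noise levels are illustrative):

```python
import numpy as np

# Mean squared amplitude within one frame
def short_time_energy(frame):
    return np.mean(frame ** 2)

np.random.seed(0)                             # fixed seed for reproducibility
speech_like = 0.5 * np.random.randn(400)      # louder, speech-like frame
silence_like = 0.01 * np.random.randn(400)    # near-silent frame

print(short_time_energy(speech_like), short_time_energy(silence_like))
```

This also shows the sensitivity to volume noted above: scaling a frame by 2 quadruples its energy.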
Spectral Centroid
Center of mass of the frequency spectrum:
- High value → Bright, high-frequency sound
- Low value → Dark, low-frequency sound
- Useful for speech/music classification
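The "center of mass" description translates directly into a magnitude-weighted mean of the FFT frequencies (the two test tones are illustrative):

```python
import numpy as np

# Magnitude-weighted mean frequency of the spectrum
def spectral_centroid(x, sr):
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    return np.sum(freqs * spectrum) / np.sum(spectrum)

sr = 16_000
t = np.arange(sr) / sr
dark = np.sin(2 * np.pi * 200 * t)      # low-frequency, "dark" tone
bright = np.sin(2 * np.pi * 6000 * t)   # high-frequency, "bright" tone

print(spectral_centroid(dark, sr), spectral_centroid(bright, sr))
```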
Key Takeaways
- Raw waveform processing preserves all information but is computationally expensive
- Frequency-domain processing (mel spectrograms) is the standard for modern speech models
- CNNs are effective for analyzing spectrograms as 2D images
- Noise cancellation is often a preprocessing step
- Different tasks benefit from different feature representations