Audio Processing & Feature Extraction
Overview
Audio processing is the foundation of all voice-based AI models. Before any machine learning can happen, raw audio must be converted into representations that neural networks can understand and process efficiently.
1. Raw Waveform (Time-Domain) Processing
Concept
This is direct processing of the original audio signal. Audio is treated as a sequence of amplitude values over time.
- Example: 44,100 samples per second in 44.1kHz audio
- No loss of information (everything the mic captured is preserved)
- High-dimensional input → needs large models and lots of compute
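To make the scale concrete, here is a minimal sketch of what one second of raw 44.1kHz audio looks like as data (the 440 Hz tone and 1-second duration are illustrative choices, not anything a model requires):

```python
import numpy as np

# One second of a 44.1 kHz waveform: one amplitude value per sample,
# so a single second is already a 44,100-dimensional input.
sample_rate = 44_100                    # samples per second
duration = 1.0                          # seconds
t = np.arange(int(sample_rate * duration)) / sample_rate
waveform = 0.5 * np.sin(2 * np.pi * 440.0 * t)   # illustrative 440 Hz tone

print(waveform.shape)                   # (44100,)
```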
Advantages
- Complete information preservation
- Captures subtle acoustic details
- No preprocessing loss
Disadvantages
- Computationally expensive
- Requires large models
- High memory usage
Models Using Raw Waveforms
- Wav2Vec 2.0 (Meta) - Self-supervised representation learning
- Whisper (OpenAI) - Decodes the raw waveform only during preprocessing; its encoder consumes log-mel spectrograms
- WaveNet (DeepMind) - For TTS generation
2. Frequency-Domain Processing
Concept
Instead of using raw audio samples, we convert the waveform into the frequency spectrum using techniques like Fourier Transform.
The key insight: Audio is easier to work with when decomposed into its frequency components.
Processing Steps
- Take raw waveform
- Apply Fourier Transform (or Short-Time Fourier Transform)
- Get frequency information
- Optionally transform to human-perceptual scale (Mel)
- Convert to image-like format for neural networks
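The steps above can be sketched with SciPy's Short-Time Fourier Transform (the 16kHz rate, 25ms window, and 10ms hop are common but illustrative parameter choices):

```python
import numpy as np
from scipy.signal import stft

sr = 16_000
x = np.random.randn(sr)                 # 1 second of stand-in audio

# 400 samples = 25 ms window; 160-sample hop = 10 ms between frames
f, t, Z = stft(x, fs=sr, nperseg=400, noverlap=240)

magnitude = np.abs(Z)                   # energy per (frequency, time) bin
print(magnitude.shape)                  # (n_freq_bins, n_time_frames)
```

The result is exactly the image-like 2D grid described above: one axis is frequency, the other is time.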
Spectrogram
A spectrogram is a visual representation of the spectrum of frequencies in a signal as they vary with time.
- X-axis: Time
- Y-axis: Frequency
- Color intensity: Amplitude/Energy at that frequency-time point
- Creates a 2D image from 1D audio signal
Mel Spectrogram
A Mel Spectrogram applies the Mel scale, which models how the human ear perceives frequencies (we’re better at distinguishing differences in lower frequencies than higher ones).
- Maps frequency scale to mel scale (human perception)
- Reduces dimensions while preserving important perceptual information
- More efficient than raw spectrogram for speech tasks
- Standard choice for TTS and ASR models
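The mapping itself is a simple formula. This sketch uses the HTK-style mel convention, one of several in common use:

```python
import numpy as np

# HTK-style mel scale: mel = 2595 * log10(1 + hz / 700)
def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

# Roughly linear below ~1 kHz, logarithmic above -- matching the observation
# that humans distinguish low frequencies more finely than high ones.
print(hz_to_mel([100, 1000, 8000]))
```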
MFCC (Mel-Frequency Cepstral Coefficients)
MFCCs are a compact, hand-engineered feature set originally designed for speech processing:
- Derived from Mel spectrograms
- Further compressed using cepstral analysis
- Historically used in traditional speech models
- Still useful for small embedded systems
- Being replaced by learned representations in modern models
Advantages of Frequency-Domain Processing
- Reduced dimensionality (~44,100 samples per second → ~100 spectrogram frames per second)
- Aligns with human hearing perception
- Efficient for neural networks
- Works great with CNNs (2D grids like images)
Models Using Frequency-Domain
- Whisper (OpenAI) - Converts raw waveform → log-mel spectrogram
- Tacotron (TTS) - Uses mel spectrograms
- DeepSpeech - Uses MFCC-like features
3. Convolutional Neural Networks (CNN) for Audio
Why CNNs for Audio?
CNNs are specialized neural networks that:
- Detect patterns like edges, curves, textures
- Work best with grid-like data (images, spectrograms)
- Use filters (also called kernels) to slide over the input and extract local features
Audio as Images
When we convert audio into a 2D spectrogram, it becomes like an image:
- X-axis: Time
- Y-axis: Frequency
- Pixel Intensity: Energy at that time-frequency point
So CNNs can analyze patterns across time and frequency, just like they detect features in images.
CNN Architecture for Audio
Raw Audio Waveform
↓
Mel Spectrogram (2D grid)
↓
CNN Layers (Convolutions)
├─ Conv1: Detect basic patterns (pitch, texture)
├─ Conv2: Detect mid-level patterns (phonemes)
├─ Conv3: Detect high-level patterns (words, context)
↓
Feature Maps
↓
Fully Connected / Attention Layers
↓
Output (Classification, Transcription, etc.)
Key Concepts
- Local Connectivity: Convolutional filters only look at local regions, capturing patterns like “this frequency jumped at this time”
- Parameter Sharing: Same filter applied everywhere, learning general patterns
- Hierarchical Feature Learning: Early layers find simple patterns, later layers find complex patterns
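Local connectivity and parameter sharing can be seen in a single hand-made filter slid over a toy spectrogram (the filter weights and spectrogram size here are illustrative):

```python
import numpy as np
from scipy.signal import convolve2d

# Toy "spectrogram": 80 mel bins x 100 time frames
spectrogram = np.random.rand(80, 100)

# One 3x3 filter: the SAME 9 weights are reused at every
# time-frequency position (parameter sharing), and each output value
# depends only on a 3x3 local patch (local connectivity).
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

feature_map = convolve2d(spectrogram, kernel, mode='valid')
print(feature_map.shape)   # (78, 98): one response per local patch
```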
4. Noise Cancellation & Denoising
Problem
Real-world audio contains background noise:
- Ambient noise
- Microphone noise
- Interference from other speakers
Noise Cancellation Models
Models designed to separate speech from noise in audio:
| Model | Parameters | RTF (CPU) | Architecture | Quality (PESQ) |
|---|---|---|---|---|
| RNNoise | 0.06M | 0.03 | GRU + DSP Hybrid | 2.33 |
| DeepFilterNet2 | 2.31M | 0.04 | Two-stage ERB | 3.08 |
| SpeechBrain SepFormer | 22.3M | ~0.15 | Transformer-based | 3.15 |
| Facebook Denoiser | 60.8M | 0.8 | U-Net Encoder-Decoder | 3.07 |
What is RTF (Real-Time Factor)?
RTF measures how fast a model runs relative to audio duration:
- RTF = 0.03 means the model processes 1 hour of audio in ~1.8 minutes
- RTF = 0.8 means the model processes 1 hour of audio in ~48 minutes
- Lower RTF = Faster = Better for real-time applications
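The arithmetic behind those numbers is just multiplication of the RTF by the audio duration:

```python
# RTF = processing_time / audio_duration,
# so processing_time = RTF * audio_duration.
def processing_minutes(rtf, audio_minutes):
    return rtf * audio_minutes

print(processing_minutes(0.03, 60))   # 1.8 minutes for 1 hour of audio
print(processing_minutes(0.8, 60))    # 48 minutes for 1 hour of audio
```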
How Denoising Works
- Input: Noisy audio waveform
- Analysis: Split into frequency bands or time frames
- Feature Extraction: Compute statistics of noise and signal
- Separation: Isolate speech from noise (using learned patterns or signal processing)
- Synthesis: Reconstruct clean speech
- Output: Denoised waveform
Denoising Techniques
- Spectral Subtraction: Subtract estimated noise spectrum from signal
- Wiener Filtering: Apply adaptive filter based on signal-to-noise ratio
- Deep Learning: Train neural networks to discriminate speech from noise
- Source Separation: Decompose audio into separate sources (speech, music, noise)
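Spectral subtraction, the simplest of these techniques, can be sketched in a few lines. This is a deliberately simplistic version: it assumes the first few frames contain only noise, which is a common but crude way to estimate the noise spectrum:

```python
import numpy as np
from scipy.signal import stft

sr = 16_000
noisy = np.random.randn(sr)             # stand-in for noisy speech

f, t, Z = stft(noisy, fs=sr, nperseg=400)
magnitude = np.abs(Z)

# Crude noise estimate: average magnitude of the first 10 frames,
# assumed (for this sketch) to contain no speech.
noise_estimate = magnitude[:, :10].mean(axis=1, keepdims=True)

# Subtract the noise estimate from every frame, flooring at zero
# (negative magnitudes are not physically meaningful).
cleaned = np.maximum(magnitude - noise_estimate, 0.0)
```

A real denoiser would also reconstruct the waveform (e.g. via an inverse STFT using the noisy phase) and track the noise estimate over time.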
5. Complete Audio Processing Pipeline (Example)
Whisper’s Approach
Whisper (OpenAI) is a good reference for a complete pipeline:
Audio File (MP3, WAV, etc.)
↓
[Decoding] → Raw PCM waveform (16kHz sampling)
↓
[Windowing] → Frames (25ms windows with a 10ms hop between them)
↓
[FFT] → Frequency representation for each frame
↓
[Mel Scale] → Map to mel scale (human perception)
↓
[Log Scale] → Apply log compression (normalize amplitude)
↓
[Normalization] → Standardize values (mean=0, std=1)
↓
[Output] → Log-mel spectrogram (80 mel bins × time steps)
↓
Input to Speech Model (Transformer Encoder)
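The last two preprocessing steps (log compression and normalization) are worth seeing concretely. The random input here is a stand-in for a real mel spectrogram:

```python
import numpy as np

# Stand-in mel spectrogram: 80 mel bins x 300 time frames
mel = np.random.rand(80, 300) + 1e-6    # small offset avoids log(0)

log_mel = np.log(mel)                   # compress the huge dynamic range of energy

# Standardize so the model sees inputs with mean 0 and std 1
normalized = (log_mel - log_mel.mean()) / log_mel.std()

print(normalized.mean(), normalized.std())   # ~0 and ~1
```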
MFCC Extraction Example
Audio Signal
↓
[Pre-emphasis] → Boost high frequencies
↓
[Windowing] → Apply window function (Hamming)
↓
[FFT] → Frequency spectrum
↓
[Mel Filter Bank] → Apply 40 overlapping triangular filters
↓
[Log Energy] → Take log of filter outputs
↓
[DCT] → Discrete Cosine Transform
↓
[Output] → 13 MFCC coefficients + derivatives
↓
Input to Speech Model (RNN, SVM, etc.)
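The final DCT step of this pipeline can be sketched directly; the 40 filter energies and 13 kept coefficients match the conventional choices above, and the random input stands in for real log filter-bank outputs:

```python
import numpy as np
from scipy.fftpack import dct

# Stand-in for the log energies of 40 mel filter-bank outputs
log_energies = np.log(np.random.rand(40) + 1e-6)

# DCT-II decorrelates the energies; keeping the first 13 coefficients
# discards fine spectral detail while retaining the overall envelope.
mfcc = dct(log_energies, type=2, norm='ortho')[:13]
print(mfcc.shape)   # (13,)
```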
6. Feature Extraction Concepts
Short-Time Fourier Transform (STFT)
Used to analyze frequency content of signals that vary over time:
- Divides audio into short overlapping windows
- Applies FFT to each window
- Produces time-frequency representation
- Foundation for spectrograms
Zero Crossing Rate (ZCR)
Counts how many times the audio signal crosses the zero amplitude line:
- High ZCR → High-frequency content (consonants, fricatives)
- Low ZCR → Low-frequency content (vowels)
- Used in VAD (voice activity detection)
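ZCR is straightforward to compute, and the high-vs-low-frequency behavior is easy to verify on two sine tones (the frequencies chosen here are illustrative):

```python
import numpy as np

# Fraction of adjacent sample pairs whose signs differ
def zcr(x):
    return np.mean(np.abs(np.diff(np.sign(x)))) / 2.0

sr = 16_000
t = np.arange(sr) / sr
low = np.sin(2 * np.pi * 100 * t)     # vowel-like low-frequency tone
high = np.sin(2 * np.pi * 4000 * t)   # fricative-like high-frequency tone

print(zcr(low), zcr(high))            # the high tone crosses zero far more often
```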
Short-Time Energy (STE)
Measure of energy/power in a window of audio:
- Speech has higher energy than silence
- Useful for speech detection
- Sensitive to volume changes
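A minimal sketch of STE on a loud frame versus a near-silent one (the frame length and noise levels are illustrative):

```python
import numpy as np

# Mean squared amplitude within one frame
def short_time_energy(frame):
    return np.mean(frame ** 2)

np.random.seed(0)                             # fixed seed for reproducibility
speech_like = 0.5 * np.random.randn(400)      # louder, speech-like frame
silence_like = 0.01 * np.random.randn(400)    # near-silent frame

print(short_time_energy(speech_like), short_time_energy(silence_like))
```

This also shows the sensitivity to volume noted above: scaling a frame by 2 quadruples its energy.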
Spectral Centroid
Center of mass of the frequency spectrum:
- High value → Bright, high-frequency sound
- Low value → Dark, low-frequency sound
- Useful for speech/music classification
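The "center of mass" description translates directly into a magnitude-weighted mean of the FFT frequencies (the two test tones are illustrative):

```python
import numpy as np

# Magnitude-weighted mean frequency of the spectrum
def spectral_centroid(x, sr):
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    return np.sum(freqs * spectrum) / np.sum(spectrum)

sr = 16_000
t = np.arange(sr) / sr
dark = np.sin(2 * np.pi * 200 * t)      # low-frequency, "dark" tone
bright = np.sin(2 * np.pi * 6000 * t)   # high-frequency, "bright" tone

print(spectral_centroid(dark, sr), spectral_centroid(bright, sr))
```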
Key Takeaways
- Raw waveform processing preserves all information but is computationally expensive
- Frequency-domain processing (mel spectrograms) is the standard for modern speech models
- CNNs are effective for analyzing spectrograms as 2D images
- Noise cancellation is often a preprocessing step
- Different tasks benefit from different feature representations