Audio Processing & Feature Extraction

Overview

Audio processing is the foundation of all voice-based AI models. Before any machine learning can happen, raw audio must be converted into representations that neural networks can understand and process efficiently.

1. Raw Waveform (Time-Domain) Processing

Concept

This is direct processing of the original audio signal. Audio is treated as a sequence of amplitude values over time.

  • Example: 44,100 samples per second in 44.1kHz audio
  • No loss of information (everything the mic captured is preserved)
  • High-dimensional input → needs large models and lots of compute
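To make the dimensionality concrete, here is a minimal numpy sketch that synthesizes one second of a 440 Hz tone at 44.1 kHz (a stand-in for microphone input): the "raw waveform" is nothing more than an array of amplitude values, and a single second already contains 44,100 of them.

```python
import numpy as np

# Synthesize 1 second of a 440 Hz tone at 44.1 kHz to illustrate what
# "raw waveform" input looks like: just amplitude values over time.
sample_rate = 44100
t = np.arange(sample_rate) / sample_rate      # 44,100 time points in [0, 1)
waveform = 0.5 * np.sin(2 * np.pi * 440 * t)  # amplitudes in [-0.5, 0.5]

# One second of audio is already a 44,100-dimensional input vector.
print(waveform.shape)
```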

Advantages

  • Complete information preservation
  • Captures subtle acoustic details
  • No preprocessing loss

Disadvantages

  • Computationally expensive
  • Requires large models
  • High memory usage

Models Using Raw Waveforms

  • Wav2Vec 2.0 (Meta) - Self-supervised representation learning
  • Whisper (OpenAI) - Accepts raw waveforms as input, but converts them to log-mel spectrograms before modeling
  • WaveNet (DeepMind) - For TTS generation

2. Frequency-Domain Processing

Concept

Instead of using raw audio samples, we convert the waveform into the frequency spectrum using techniques like Fourier Transform.

The key insight: Audio is easier to work with when decomposed into its frequency components.

Processing Steps

  1. Take raw waveform
  2. Apply Fourier Transform (or Short-Time Fourier Transform)
  3. Get frequency information
  4. Optionally transform to human-perceptual scale (Mel)
  5. Convert to image-like format for neural networks
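Steps 1–3 above can be sketched with a minimal numpy STFT: window the signal into short overlapping frames, FFT each frame, and keep the magnitudes. The frame and hop sizes here (400 and 160 samples, i.e. 25 ms and 10 ms at 16 kHz) are illustrative choices, not the only option.

```python
import numpy as np

def stft_spectrogram(x, n_fft=400, hop=160):
    """Short-Time Fourier Transform: window the signal, FFT each frame."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Magnitude spectrum per frame -> 2D (time x frequency) array
    return np.abs(np.fft.rfft(frames, axis=1))

# 1 second of a 440 Hz tone at 16 kHz
x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
spec = stft_spectrogram(x)
print(spec.shape)  # (n_frames, n_fft // 2 + 1)
```

The result is exactly the time-frequency grid described below: each row is one moment in time, each column one frequency bin.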

Spectrogram

A spectrogram is a visual representation of the spectrum of frequencies in a signal as they vary with time.

  • X-axis: Time
  • Y-axis: Frequency
  • Color intensity: Amplitude/Energy at that frequency-time point
  • Creates a 2D image from 1D audio signal

Mel Spectrogram

A Mel Spectrogram applies the Mel scale, which models how the human ear perceives frequencies (we’re better at distinguishing differences in lower frequencies than higher ones).

  • Maps frequency scale to mel scale (human perception)
  • Reduces dimensions while preserving important perceptual information
  • More efficient than raw spectrogram for speech tasks
  • Standard choice for TTS and ASR models
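The standard (HTK-style) mel formula makes the perceptual warping concrete: equal steps in Hz are not equal steps in mel, so low frequencies get proportionally finer resolution, matching human hearing.

```python
import numpy as np

def hz_to_mel(f):
    # Standard (HTK) mel-scale formula
    return 2595.0 * np.log10(1.0 + f / 700.0)

# The first 1 kHz spans ~1000 mel...
print(hz_to_mel(1000) - hz_to_mel(0))
# ...but the same 1 kHz span higher up covers far fewer mel,
# i.e. the scale compresses high frequencies.
print(hz_to_mel(9000) - hz_to_mel(8000))
```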

MFCC (Mel-Frequency Cepstral Coefficients)

MFCCs are a compact, hand-engineered feature set originally designed for speech processing:

  • Derived from Mel spectrograms
  • Further compressed using cepstral analysis
  • Historically used in traditional speech models
  • Still useful for small embedded systems
  • Being replaced by learned representations in modern models
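The "further compressed" step is a DCT-II applied to each log-mel frame, keeping only the first few coefficients. A minimal sketch (the 40-filter frame and 13-coefficient count are conventional defaults, and the random input is just a stand-in for a real log-mel frame):

```python
import numpy as np

def mfcc_from_log_mel(log_mel_frame, n_coeffs=13):
    """DCT-II of one log-mel frame -> compact cepstral coefficients."""
    n = len(log_mel_frame)
    k = np.arange(n_coeffs)[:, None]
    basis = np.cos(np.pi * k * (np.arange(n) + 0.5) / n)  # DCT-II basis rows
    return basis @ log_mel_frame

log_mel = np.log(np.random.rand(40) + 1e-6)  # stand-in 40-filter log-mel frame
print(mfcc_from_log_mel(log_mel).shape)      # 40 values compressed to 13
```

Because the DCT concentrates slowly-varying spectral shape into the first coefficients, truncation discards fine detail while keeping the broad envelope that matters for speech.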

Advantages of Frequency-Domain Processing

  • Reduced dimensionality (e.g., one second at 44.1 kHz shrinks from 44,100 samples to roughly 100 spectrogram frames)
  • Aligns with human hearing perception
  • Efficient for neural networks
  • Works great with CNNs (2D grids like images)

Models Using Frequency-Domain

  • Whisper (OpenAI) - Converts raw waveform → log-mel spectrogram
  • Tacotron (TTS) - Uses mel spectrograms
  • DeepSpeech - Uses MFCC-like features

3. Convolutional Neural Networks (CNN) for Audio

Why CNNs for Audio?

CNNs are specialized neural networks that:

  • Detect patterns like edges, curves, textures
  • Work best with grid-like data (images, spectrograms)
  • Use filters (also called kernels) to slide over the input and extract local features

Audio as Images

When we convert audio into a 2D spectrogram, it becomes like an image:

  • X-axis: Time
  • Y-axis: Frequency
  • Pixel Intensity: Energy at that time-frequency point

So CNNs can analyze patterns across time and frequency, just like they detect features in images.

CNN Architecture for Audio

Raw Audio Waveform
    ↓
Mel Spectrogram (2D grid)
    ↓
CNN Layers (Convolutions)
    ├─ Conv1: Detect basic patterns (pitch, texture)
    ├─ Conv2: Detect mid-level patterns (phonemes)
    ├─ Conv3: Detect high-level patterns (words, context)
    ↓
Feature Maps
    ↓
Fully Connected / Attention Layers
    ↓
Output (Classification, Transcription, etc.)

Key Concepts

  • Local Connectivity: Convolutional filters only look at local regions, capturing patterns like “this frequency jumped at this time”
  • Parameter Sharing: Same filter applied everywhere, learning general patterns
  • Hierarchical Feature Learning: Early layers find simple patterns, later layers find complex patterns
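Local connectivity and parameter sharing can both be seen in a bare-bones numpy convolution: one small kernel is reused at every position, and each output value depends only on a local patch of the spectrogram. The "edge" kernel below is an illustrative choice that responds to changes along the frequency axis.

```python
import numpy as np

def conv2d(spec, kernel):
    """Slide one shared kernel over the input (valid convolution)."""
    kh, kw = kernel.shape
    h, w = spec.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Each output depends only on a local (kh x kw) region
            out[i, j] = np.sum(spec[i:i + kh, j:j + kw] * kernel)
    return out

spec = np.random.rand(80, 100)           # 80 mel bins x 100 time frames
edge_kernel = np.array([[1.0], [-1.0]])  # detects frequency-axis changes
features = conv2d(spec, edge_kernel)
print(features.shape)  # (79, 100)
```

A real model learns the kernel values instead of hand-picking them, and stacks many such layers to build the hierarchy described above.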

4. Noise Cancellation & Denoising

Problem

Real-world audio contains background noise:

  • Ambient noise
  • Microphone noise
  • Interference from other speakers

Noise Cancellation Models

Models designed to separate speech from noise in audio:

Model                   Parameters   RTF (CPU)   Architecture             Quality (PESQ)
RNNoise                 0.06M        0.03        GRU + DSP Hybrid         2.33
DeepFilterNet2          2.31M        0.04        Two-stage ERB            3.08
SpeechBrain SepFormer   22.3M        ~0.15       Transformer-based        3.15
Facebook Denoiser       60.8M        0.8         U-Net Encoder-Decoder    3.07

What is RTF (Real-Time Factor)?

RTF measures how fast a model runs relative to audio duration:

  • RTF = 0.03 means the model processes 1 hour of audio in ~1.8 minutes
  • RTF = 0.8 means the model processes 1 hour of audio in ~48 minutes
  • Lower RTF = Faster = Better for real-time applications
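The definition is just a ratio, as this one-liner shows (checking the 1-hour examples above):

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time / audio duration; RTF < 1 is faster than real time."""
    return processing_seconds / audio_seconds

# 1 hour of audio processed in 1.8 minutes -> RTF 0.03
print(real_time_factor(1.8 * 60, 3600))   # 0.03
# 1 hour of audio processed in 48 minutes -> RTF 0.8
print(real_time_factor(48 * 60, 3600))    # 0.8
```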

How Denoising Works

  1. Input: Noisy audio waveform
  2. Analysis: Split into frequency bands or time frames
  3. Feature Extraction: Compute statistics of noise and signal
  4. Separation: Isolate speech from noise (using learned patterns or signal processing)
  5. Synthesis: Reconstruct clean speech
  6. Output: Denoised waveform

Denoising Techniques

  • Spectral Subtraction: Subtract estimated noise spectrum from signal
  • Wiener Filtering: Apply adaptive filter based on signal-to-noise ratio
  • Deep Learning: Train neural networks to discriminate speech from noise
  • Source Separation: Decompose audio into separate sources (speech, music, noise)
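Spectral subtraction, the simplest of these techniques, can be sketched in a few lines on toy magnitude spectra: estimate the noise floor from a silent segment, subtract it from the noisy spectrum, and clamp at zero so no bin goes negative.

```python
import numpy as np

def spectral_subtraction(noisy_mag, noise_mag_estimate):
    """Subtract an estimated noise magnitude spectrum, clamped at zero."""
    return np.maximum(noisy_mag - noise_mag_estimate, 0.0)

# Toy magnitude spectra: a speech peak at bin 2 on top of a flat noise floor
noisy = np.array([1.0, 1.0, 5.0, 1.0, 1.0])
noise = np.full(5, 1.0)  # noise floor estimated from a silent segment
print(spectral_subtraction(noisy, noise))  # [0. 0. 4. 0. 0.]
```

Real systems apply this per STFT frame and resynthesize with the original phase; the clamping is what produces the "musical noise" artifacts that motivated the learned approaches above.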

5. Complete Audio Processing Pipeline (Example)

Whisper’s Approach

Whisper (OpenAI) is a good reference for a complete pipeline:

Audio File (MP3, WAV, etc.)
    ↓
[Decoding] → Raw PCM waveform (16kHz sampling)
    ↓
[Windowing] → Overlapping frames (~25ms window, 10ms hop)
    ↓
[FFT] → Frequency representation for each frame
    ↓
[Mel Scale] → Map to mel scale (human perception)
    ↓
[Log Scale] → Apply log compression (normalize amplitude)
    ↓
[Normalization] → Standardize values (mean=0, std=1)
    ↓
[Output] → Log-mel spectrogram (80 mel bins × time steps)
    ↓
Input to Speech Model (Transformer Encoder)
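The last two preprocessing steps (log compression and normalization) can be sketched generically in numpy. Note this is a simplified stand-in, not Whisper's exact recipe (Whisper clamps the log-mel relative to its maximum rather than standardizing per utterance):

```python
import numpy as np

def log_compress_and_normalize(mel_spec, eps=1e-10):
    """Generic sketch of the last pipeline steps: log compression, then standardization."""
    log_mel = np.log(mel_spec + eps)                       # compress dynamic range
    return (log_mel - log_mel.mean()) / (log_mel.std() + eps)  # mean=0, std=1

mel = np.random.rand(80, 100)   # stand-in 80-bin mel spectrogram
features = log_compress_and_normalize(mel)
print(features.mean(), features.std())  # ~0 and ~1
```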

MFCC Extraction Example

Audio Signal
    ↓
[Pre-emphasis] → Boost high frequencies
    ↓
[Windowing] → Apply window function (Hamming)
    ↓
[FFT] → Frequency spectrum
    ↓
[Mel Filter Bank] → Apply 40 overlapping triangular filters
    ↓
[Log Energy] → Take log of filter outputs
    ↓
[DCT] → Discrete Cosine Transform
    ↓
[Output] → 13 MFCC coefficients + derivatives
    ↓
Input to Speech Model (RNN, SVM, etc.)
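The first two stages of this pipeline are simple enough to show directly: pre-emphasis is a first-order difference filter y[n] = x[n] − 0.97·x[n−1] that attenuates low frequencies (equivalently, boosts high ones), and windowing tapers each frame's edges before the FFT. The 0.97 coefficient and 400-sample frame are conventional choices.

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1]: relatively boosts high frequencies."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

# A low-frequency tone loses most of its energy under pre-emphasis
x = np.sin(2 * np.pi * 50 * np.arange(400) / 16000)
emphasized = pre_emphasis(x)
windowed = emphasized * np.hamming(400)  # taper frame edges before the FFT
print(emphasized.shape, windowed.shape)
```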

6. Feature Extraction Concepts

Short-Time Fourier Transform (STFT)

Used to analyze frequency content of signals that vary over time:

  • Divides audio into short overlapping windows
  • Applies FFT to each window
  • Produces time-frequency representation
  • Foundation for spectrograms

Zero Crossing Rate (ZCR)

Counts how many times the audio signal crosses the zero amplitude line:

  • High ZCR → High-frequency content (consonants, fricatives)
  • Low ZCR → Low-frequency content (vowels)
  • Used in VAD (voice activity detection)
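ZCR is a one-liner in numpy; comparing a low tone (vowel-like) against a high tone (fricative-like) shows why it discriminates the two:

```python
import numpy as np

def zero_crossing_rate(x):
    """Fraction of consecutive sample pairs whose sign differs."""
    return float(np.mean(np.sign(x[:-1]) != np.sign(x[1:])))

t = np.arange(16000) / 16000
low = np.sin(2 * np.pi * 100 * t)    # vowel-like: few zero crossings
high = np.sin(2 * np.pi * 4000 * t)  # fricative-like: many zero crossings
print(zero_crossing_rate(low), zero_crossing_rate(high))
```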

Short-Time Energy (STE)

Measure of energy/power in a window of audio:

  • Speech has higher energy than silence
  • Useful for speech detection
  • Sensitive to volume changes
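A short-time energy sketch on a half-silence, half-tone signal shows the speech-detection idea: energy stays at zero through the silent frames and jumps where the "speech" begins (frame and hop sizes here are illustrative).

```python
import numpy as np

def short_time_energy(x, frame_len=400, hop=160):
    """Mean squared amplitude per frame."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.array([np.mean(x[i * hop : i * hop + frame_len] ** 2)
                     for i in range(n_frames)])

speech = 0.5 * np.sin(2 * np.pi * 200 * np.arange(8000) / 16000)
signal = np.concatenate([np.zeros(8000), speech])  # silence, then "speech"
energy = short_time_energy(signal)
# Thresholding this curve is a crude but workable speech/silence detector
print(energy[:3], energy[-3:])
```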

Spectral Centroid

Center of mass of the frequency spectrum:

  • High value → Bright, high-frequency sound
  • Low value → Dark, low-frequency sound
  • Useful for speech/music classification
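The centroid is the amplitude-weighted mean frequency of the magnitude spectrum, which a few lines of numpy make explicit:

```python
import numpy as np

def spectral_centroid(x, sample_rate):
    """Amplitude-weighted mean frequency of the magnitude spectrum."""
    mag = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate)
    return float(np.sum(freqs * mag) / np.sum(mag))

sr = 16000
t = np.arange(sr) / sr
dark = np.sin(2 * np.pi * 200 * t)     # centroid lands near 200 Hz
bright = np.sin(2 * np.pi * 4000 * t)  # centroid lands near 4000 Hz
print(spectral_centroid(dark, sr), spectral_centroid(bright, sr))
```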

Key Takeaways

  1. Raw waveform processing preserves all information but is computationally expensive
  2. Frequency-domain processing (mel spectrograms) is the standard for modern speech models
  3. CNNs are effective for analyzing spectrograms as 2D images
  4. Noise cancellation is often a preprocessing step
  5. Different tasks benefit from different feature representations