What is DDSP?

Most generative models for audio operate directly on waveform samples (i.e., raw audio) or on frequency-domain representations (e.g., Fourier transforms). While these approaches are general enough to express any signal, they do not exploit existing knowledge of how sound is produced and perceived, forcing the model to learn the physics from scratch. Differentiable Digital Signal Processing (DDSP), introduced by Engel et al. (2020), takes a different approach: it integrates classic signal processing components (e.g., oscillators, filters, reverb) into automatic differentiation frameworks. Gradients can then flow from the audio output back through the entire synthesis pipeline, making every stage amenable to backpropagation.

Most of what makes DDSP effective in practice comes from inductive bias. Rather than asking a neural network to learn the physics of sound generation from data, DDSP encodes that structure directly into the model architecture, and the network need only learn to control it. This page accompanies our reading group session on Engel et al. (2020), where we walked through the paper’s core ideas and built a minimal DDSP synthesizer from scratch in PyTorch to help ground our understanding. Below we present interactive visualizations of the main concepts, audio examples at various stages of training, and resources for further reading.

Core Concepts

f0, 2f0, 3f0

Harmonic Oscillator

A bank of sinusoids at integer multiples of f0. Phase is accumulated via cumulative sum, which keeps oscillators continuous when frequency changes (i.e., the oscillator adjusts its speed from its current position rather than jumping). Implemented as cumsum + sin in PyTorch.

H(f)

Filtered Noise

White noise shaped by a learned, time-varying filter (i.e., a per-frame EQ curve). We FFT each noise frame, multiply by the predicted filter magnitudes, and IFFT back. This captures the stochastic components of sound (e.g., bow noise, breath) that sinusoidal harmonics cannot easily represent.

64 .. 2048

Multi-Scale Spectral Loss

We compare magnitude spectrograms, not raw waveforms, at six FFT sizes (64, 128, 256, 512, 1024, and 2048 samples). The linear term penalizes errors in loud components, while the log term catches subtle but perceptually salient details. In a production analogy, this is roughly akin to checking a mix on six different monitoring systems.
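As a concrete sketch, here is a minimal PyTorch version of such a loss. This is our own illustration, not the paper's exact implementation; the hop length and the shared weighting of the linear and log terms are assumptions.

```python
import torch

def multiscale_spectral_loss(pred, target,
                             fft_sizes=(64, 128, 256, 512, 1024, 2048),
                             eps=1e-7):
    """L1 distance between magnitude spectrograms at several FFT sizes:
    a linear term (loud components) plus a log term (quiet detail)."""
    loss = 0.0
    for n_fft in fft_sizes:
        window = torch.hann_window(n_fft)

        def mag(x):
            spec = torch.stft(x, n_fft=n_fft, hop_length=n_fft // 4,
                              window=window, return_complex=True)
            return spec.abs()

        s_pred, s_target = mag(pred), mag(target)
        loss = loss + (s_pred - s_target).abs().mean()
        loss = loss + (torch.log(s_pred + eps) - torch.log(s_target + eps)).abs().mean()
    return loss
```

Because all six scales are summed, errors that hide at one time-frequency resolution still show up at another.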

300x fewer

Inductive Bias

The oscillator bank encodes the physics of sound production (e.g., harmonic structure, phase coherence) that a neural network would otherwise need to learn from data. We posit that this is the paper's most salient contribution: DDSP models achieve high-fidelity synthesis at roughly 300x fewer parameters than WaveNet.

Phase Accumulation

A challenge in oscillator design is maintaining phase continuity when frequency changes. The naive approach, computing phase as 2π · f · t, produces a discontinuity (an audible click) because the phase value jumps when f changes. Phase accumulation resolves this by maintaining a running sum of frequency-dependent increments, allowing the oscillator to adjust its rate of oscillation from its current position. An analogy is a turntable: changing the platter speed does not move the needle, but it changes how fast the groove passes beneath it. Click the button below to trigger a frequency change and observe the difference.

Phase Accumulation

Naive: sin(2πft)
Accumulated: sin(cumsum(Δφ))
Click "Change freq" to jump from 3 Hz to 7 Hz. Notice the phase discontinuity in the naive approach.
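The same comparison can be reproduced in a few lines. A minimal sketch, where the sample rate and jump time are arbitrary choices for illustration:

```python
import torch

sr = 1000                                  # illustration-only sample rate (Hz)
t = torch.arange(sr) / sr                  # one second of time stamps
f = torch.where(t < 0.6, torch.tensor(3.0), torch.tensor(7.0))  # 3 Hz -> 7 Hz at t = 0.6 s

naive = torch.sin(2 * torch.pi * f * t)               # phase value jumps with f: a click
phase = torch.cumsum(2 * torch.pi * f / sr, dim=0)    # running sum of per-sample increments
accumulated = torch.sin(phase)                        # continuous through the change

# The largest sample-to-sample step exposes the discontinuity in the naive version:
print(naive.diff().abs().max().item(), accumulated.diff().abs().max().item())
```

The accumulated oscillator's step size is bounded by its instantaneous frequency, while the naive version shows a jump on the order of the full waveform amplitude at the moment the frequency changes.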

Harmonic Series

In the Harmonic plus Noise model (Serra and Smith, 1990), all harmonic frequencies are constrained to be integer multiples of a fundamental frequency f0. This means that a single parameter controls the position of every harmonic in the spectrum, and the relative amplitudes c_k (bar heights) determine the timbre independently of pitch. Drag the slider to observe how adjusting f0 moves all harmonics in lockstep; this is why the neural network can control pitch and timbre as separate, interpretable dimensions.

Harmonic Series

Drag the harmonic bars to reshape the timbre. Toggle "Play" to hear the result.
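In code, the harmonic model above reduces to a short additive synthesizer. This is a minimal sketch; the sample rate, duration, and peak normalization are our choices, not prescribed by the paper.

```python
import torch

def harmonic_synth(f0, amps, sr=16000, duration=1.0):
    """Sum of sinusoids at integer multiples of f0; amps are the c_k."""
    t = torch.arange(int(sr * duration)) / sr
    k = torch.arange(1, len(amps) + 1).unsqueeze(1)        # harmonic numbers, (K, 1)
    partials = torch.sin(2 * torch.pi * f0 * k * t)        # all partials at once, (K, T)
    audio = (torch.tensor(amps).unsqueeze(1) * partials).sum(0)
    return audio / audio.abs().max()                       # normalize for playback

# Same amps at a different f0 keep the timbre while every partial moves in lockstep.
tone = harmonic_synth(261.63, [1.0, 0.5, 0.25])
```

Note that f0 enters every partial through the single factor `f0 * k`, which is exactly why one parameter repositions the whole spectrum.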

Filtered Noise

Natural sounds contain both harmonic and stochastic components (e.g., the breathy quality of a flute, the scrape of a bow on a string). The filtered noise synthesizer addresses the latter by starting with white noise, which has equal expected energy at all frequencies (in practice, i.i.d. Gaussian samples), and shaping it through a learned, time-varying filter. The neural network predicts filter magnitudes H_l for each frame, and we apply them via pointwise multiplication in the frequency domain, which is equivalent to time-domain convolution at a significantly lower computational cost. Drag the control points below to sculpt the noise spectrum and observe how the filter shapes the output.

Filtered Noise

Drag control points to shape the noise filter. Grey = raw noise, purple = filtered output.
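A single frame of this process can be sketched as follows. The frame length and the particular low-pass curve are hypothetical choices for illustration:

```python
import torch

def filtered_noise(magnitudes, frame_len=256):
    """Shape one frame of white noise with per-bin filter gains.
    magnitudes: (frame_len // 2 + 1,) nonnegative gains, one per rfft bin."""
    noise = torch.randn(frame_len)              # white (Gaussian) noise
    spectrum = torch.fft.rfft(noise)            # to the frequency domain
    shaped = spectrum * magnitudes              # pointwise multiply = circular conv in time
    return torch.fft.irfft(shaped, n=frame_len)

# A hypothetical low-pass curve: keep only the lowest quarter of the bins.
H = torch.zeros(129)
H[:32] = 1.0
frame = filtered_noise(H)
```

In the full model, a separate magnitude curve is predicted for every frame and the filtered frames are overlap-added, which is what makes the filter time-varying.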

Training: Watching the Optimizer Learn

Given that every operation in our synthesizer is differentiable (i.e., cumsum, sin, fft, pointwise multiplication), gradients can flow from the multi-scale spectral loss all the way back to the synthesis parameters. In the visualization below, we simulate this optimization process: the predicted waveform (orange) begins with random harmonic amplitudes and progressively converges toward the target (teal) as the loss decreases. In the real DDSP pipeline, a neural encoder-decoder predicts these parameters from audio, but the principle of gradient flow through the synthesizer remains the same.

Training Loop

Watch gradient descent converge: orange waveform approaches the teal target.
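The simulation in the widget corresponds to an optimization loop like the one below. This is a runnable sketch with our own simplifications: a single-scale spectral loss stands in for the multi-scale version, and the learning rate and step count are arbitrary.

```python
import torch

torch.manual_seed(0)
sr = 16000
t = torch.arange(sr) / sr                                   # one second
k = torch.arange(1, 4, dtype=torch.float32).unsqueeze(1)    # harmonics 1..3
basis = torch.sin(2 * torch.pi * 440.0 * k * t)             # fixed sinusoid bank, (3, T)

target = (torch.tensor([[1.0], [0.5], [0.25]]) * basis).sum(0)
amps = torch.rand(3, 1, requires_grad=True)                 # random starting amplitudes
opt = torch.optim.Adam([amps], lr=0.05)
window = torch.hann_window(512)

def spec(x):
    """Magnitude spectrogram (single scale, for brevity)."""
    return torch.stft(x, 512, window=window, return_complex=True).abs()

losses = []
for step in range(200):
    pred = (amps * basis).sum(0)          # differentiable synthesis
    loss = (spec(pred) - spec(target)).abs().mean()
    opt.zero_grad()
    loss.backward()                       # gradients flow through sin and stft
    opt.step()
    losses.append(loss.item())
```

Every operation between `amps` and `loss` is differentiable, so autograd delivers the gradient of a spectral comparison directly to the synthesis parameters.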

Signal Chain

Neural Network → (f0, A, c_k) → Oscillator Bank [cumsum + sin]
Neural Network → (H_l) → Filtered Noise [FFT × H → IFFT]
Oscillator Bank + Filtered Noise → Reverb [conv in freq domain] → Output
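Putting the pieces together, a minimal forward pass through the chain looks like this. Reverb is omitted for brevity, and the constant-f0 usage and tensor shapes are our simplifications:

```python
import torch

def ddsp_forward(f0, amps, noise_mags, sr=16000):
    """f0: (T,) per-sample fundamental in Hz; amps: (K,) harmonic amplitudes c_k;
    noise_mags: (T // 2 + 1,) filter gains for the noise branch."""
    phase = torch.cumsum(2 * torch.pi * f0 / sr, dim=0)              # accumulated phase, (T,)
    k = torch.arange(1, len(amps) + 1).unsqueeze(1)                  # harmonic numbers, (K, 1)
    harmonic = (amps.unsqueeze(1) * torch.sin(k * phase)).sum(0)     # oscillator bank
    noise = torch.fft.irfft(torch.fft.rfft(torch.randn(len(f0))) * noise_mags, n=len(f0))
    return harmonic + noise                                          # the "+" node in the chain

audio = ddsp_forward(torch.full((16000,), 440.0),
                     torch.tensor([1.0, 0.5, 0.25]),
                     torch.zeros(8001))
```

In the real pipeline the three argument tensors are predicted by the neural network from input audio; here they are supplied by hand.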

Audio Experiments

In each experiment below, we provide f0 as ground truth, mirroring the real DDSP pipeline where a pre-trained pitch tracker (CREPE; Kim et al., 2018) extracts fundamental frequency from the audio. We found through experimentation that optimizing f0 directly through sin(cumsum(...)) produces a highly non-convex loss landscape, which may explain the paper's design choice to condition on extracted pitch rather than learning it end-to-end. The optimizer learns only harmonic amplitudes and noise filter parameters. Click any cell to play and observe how the reconstruction improves over training steps.
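One way to see the difficulty is to evaluate the spectral loss over a grid of candidate f0 values against a fixed 440 Hz target: the basin around the true pitch is only a few FFT bins wide, so gradient descent started far away receives little useful signal. The grid, window size, and target below are our illustrative choices:

```python
import torch

sr = 16000
t = torch.arange(sr) / sr
target = torch.sin(2 * torch.pi * 440.0 * t)
window = torch.hann_window(1024)

def spec_loss_at(f0_hz):
    """Spectral loss of a constant-f0 oscillator (via sin of cumsum) vs. the target."""
    phase = torch.cumsum(2 * torch.pi * torch.full((sr,), float(f0_hz)) / sr, dim=0)
    pred = torch.sin(phase)
    mag = lambda x: torch.stft(x, 1024, window=window, return_complex=True).abs()
    return (mag(pred) - mag(target)).abs().mean().item()

# Sweep candidate pitches from 300 to 595 Hz in 5 Hz steps.
losses = [spec_loss_at(f) for f in range(300, 600, 5)]
```

The minimum sits at the true pitch, 440 Hz, but the surrounding landscape gives a descent-based optimizer little to work with, consistent with conditioning on a pitch tracker instead.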

Experiment 1: A440, 3 harmonics

Target: a 440 Hz fundamental plus its 2nd and 3rd harmonics, at relative amplitudes of 1.0, 0.5, and 0.25, producing a simple instrument-like timbre.


Experiment 2: C261, 6 harmonics

Target: 261.63 Hz (middle C) with six harmonics of decreasing amplitude. The richer harmonic content presents a more challenging optimization landscape for the synthesizer.


Experiment 3: E329, odd harmonics (clarinet-like)

Target: 329.63 Hz with only odd-numbered harmonics (1st, 3rd, 5th, and 7th); the even harmonics are zeroed out. This produces the characteristically hollow timbre of closed-pipe resonators (e.g., the clarinet).


Run It Yourself

# Install uv (Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and set up
git clone https://github.com/mclemcrew/music-ai-reading-group.git
cd music-ai-reading-group/experiments/ddsp

# Install dependencies (uv reads pyproject.toml automatically)
uv sync

# Run the experiments (generates audio snapshots in docs/audio/)
uv run ddsp_from_sratch.py

Papers

Demos & Tools