What is Music Generation?
Music generation is the task of producing new musical content, either as raw audio waveforms or as symbolic notation (MIDI). Where transcription maps audio to symbols, generation runs in the opposite direction, and the core challenge shifts from recognition to coherence: the model must produce output that is not only acoustically valid but musically sensible over time spans of seconds to minutes, maintaining structure, style, and harmonic consistency.
This session covered two papers that approach generation from opposite ends of the representation spectrum. VampNet (Garcia et al., 2023) operates in the audio domain, encoding raw waveforms into discrete codec tokens and using a bidirectional masked transformer to generate music through iterative parallel decoding. The Anticipatory Music Transformer (Thickstun et al., 2023) operates in the symbolic domain, modeling MIDI events as a temporal point process and introducing an interleaving strategy that enables controllable infilling. Both use transformers, but their conditioning strategies are fundamentally different: VampNet attends bidirectionally (seeing both past and future context), while AMT uses a causal architecture with a novel anticipation mechanism that inserts future control signals into the autoregressive stream.
Discrete Audio Tokens
Both VampNet and modern audio language models rely on neural audio codecs to convert continuous waveforms into discrete token sequences. The Descript Audio Codec (DAC) uses Residual Vector Quantization (RVQ) to compress 44.1 kHz audio into 9 codebook levels at ~90x compression. The first codebook captures coarse structure (pitch, rhythm), while deeper codebooks capture progressively finer spectral detail (timbre, texture). This gives the transformer a tractable discrete vocabulary to operate on, analogous to subword tokens in NLP.
Masked Token Modeling
VampNet borrows the masked language modeling objective from BERT, adapted for audio tokens. During training, a variable fraction of tokens are masked (replaced with a special [MASK] token), and the model predicts the original values. Critically, the transformer is bidirectional, attending to all unmasked tokens in both directions. This is a departure from autoregressive models (GPT-style), which can only condition on the past. Bidirectional context is essential for tasks like inpainting, where the model must respect both what came before and after a masked region.
Iterative Parallel Decoding
Rather than generating one token at a time (as in autoregressive models), VampNet generates all masked tokens simultaneously in each pass, then re-masks the least confident predictions and repeats. Over approximately 36 passes, the token grid transitions from fully masked to fully populated. This is significantly faster than autoregressive generation for equivalent sequence lengths, and the coarse-to-fine structure (predicting coarse codebooks first, then fine) ensures that high-level musical structure is established before fine spectral detail is filled in.
Anticipation
The Anticipatory Music Transformer introduces a conditioning mechanism where control tokens (e.g., melody notes provided as input) are interleaved with event tokens (accompaniment to be generated) in the sequence, placed δ=5 seconds before the events they condition. This preserves locality, so each control appears close to the events it should influence, unlike seq2seq approaches which place all input tokens before all output tokens. The result is a causal autoregressive model that can nevertheless perform infilling and accompaniment tasks.
VampNet: Masked Acoustic Token Modeling
VampNet (Garcia et al., 2023) is a non-autoregressive model for music audio generation. The key design choice is to work in the space of discrete audio tokens produced by the Descript Audio Codec (DAC), a neural audio compression model that uses Residual Vector Quantization (RVQ) to encode 44.1 kHz audio into a matrix of discrete codes across 9 codebook levels. The first codebook (c0) captures coarse spectral structure, pitch contour, and rhythmic information; deeper codebooks (c1 through c8) capture progressively finer spectral details such as timbre nuance and high-frequency texture. Together, the 9 levels achieve approximately 90x compression while maintaining high perceptual fidelity.
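The residual idea behind RVQ can be illustrated with a scalar toy version (the codebook values and helper names here are illustrative, not DAC's learned codebooks): each level quantizes whatever the previous levels failed to capture, so truncating to fewer levels yields a coarser but still meaningful reconstruction.

```python
def make_codebooks(n_levels):
    # Each level's codebook spans half the range of the previous one, so
    # level 0 captures coarse structure and deeper levels refine the residual.
    return [[-1.0 / 2**l, 0.0, 1.0 / 2**l] for l in range(n_levels)]

def rvq_encode(x, codebooks):
    codes, residual = [], x
    for cb in codebooks:
        idx = min(range(len(cb)), key=lambda i: abs(cb[i] - residual))
        codes.append(idx)
        residual -= cb[idx]          # the next level quantizes what is left
    return codes

def rvq_decode(codes, codebooks):
    # Reconstruction is the sum of the selected entries; truncating to the
    # first k levels is like decoding only c0..c(k-1).
    return sum(cb[i] for cb, i in zip(codebooks, codes))

codebooks = make_codebooks(9)
codes = rvq_encode(0.7, codebooks)
errors = [abs(0.7 - rvq_decode(codes[:k], codebooks)) for k in range(1, 10)]
```

The error sequence shrinks as levels are added and never increases, mirroring how c0 alone sounds flat and metallic while deeper codebooks restore fine detail.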
Residual Vector Quantization
Use the play buttons above to hear what each codebook level adds. With only c0, you get pitch and rhythm but the timbre is flat and metallic. By 4 levels the sound is recognizable. The final levels add the high-frequency detail that makes the reconstruction perceptually transparent. Try stepping through all 9 levels while listening.
The generation pipeline has two stages. A coarse transformer operates on the first codebook level (c0), predicting masked tokens using a bidirectional attention mechanism that can attend to all unmasked positions in the sequence, in both temporal directions. A fine transformer (c2f) then upsamples the coarse tokens to the remaining codebook levels (c1 through c8), adding the spectral detail needed for high-fidelity reconstruction. Both transformers are trained with a masked token modeling objective: during training, a variable fraction of tokens are masked according to a schedule (which can be linear, cosine, or randomly sampled), and the model learns to predict the original token values from the remaining context.
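The variable masking fraction can be sketched as follows; the cosine draw mirrors the MaskGIT-style schedule that VampNet's objective builds on (helper names are ours, and the real model masks a 2-D codebook-by-time token grid rather than a flat sequence):

```python
import math
import random

def sample_training_mask(n_tokens, rng):
    # Draw a masking ratio r = cos(u * pi/2) with u ~ Uniform(0, 1); this
    # distribution is biased toward heavily masked inputs, which is what
    # generation-from-scratch looks like at inference time.
    ratio = math.cos(rng.random() * math.pi / 2)
    n_masked = max(1, round(ratio * n_tokens))
    masked = set(rng.sample(range(n_tokens), n_masked))
    return [i in masked for i in range(n_tokens)]

rng = random.Random(0)
mask = sample_training_mask(100, rng)
# Training pair: input = tokens with [MASK] at the True positions,
# target = the original token values at those positions.
```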
At inference time, VampNet uses iterative parallel decoding. The generation begins with the entire token grid masked (except for any tokens provided as a prompt). In each of approximately 36 sampling passes, the model predicts all masked positions simultaneously. The predictions with the highest confidence are accepted, and the remaining positions are re-masked for the next pass. The number of tokens unmasked per pass follows a cosine schedule, so early passes establish coarse structure and later passes fill in details. This is substantially faster than autoregressive generation for equivalent sequence lengths, as each pass predicts many tokens at once.
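The confidence-based re-masking loop can be sketched as below. This is not VampNet's actual API: `predict` is a hypothetical stand-in for the transformer forward pass, returning a proposed token and a confidence for every position, and the real model additionally runs per codebook level.

```python
import math
import numpy as np

def parallel_decode(predict, n_tokens, n_steps=36):
    # Start fully masked (-1 plays the [MASK] role); any prompt tokens
    # would be filled in here instead of left at -1.
    tokens = np.full(n_tokens, -1)
    for step in range(1, n_steps + 1):
        proposal, confidence = predict(tokens)
        masked = np.flatnonzero(tokens == -1)
        # Cosine schedule: fraction of the grid still masked after this step,
        # so early passes commit few tokens and later passes fill in the rest.
        n_still_masked = int(math.cos(math.pi / 2 * step / n_steps) * n_tokens)
        n_accept = max(len(masked) - n_still_masked, 0)
        # Keep the most confident predictions; everything else stays masked.
        best_first = masked[np.argsort(-confidence[masked])]
        tokens[best_first[:n_accept]] = proposal[best_first[:n_accept]]
    return tokens

# Stand-in "model": always proposes the same target with random confidences.
rng = np.random.default_rng(1)
target = np.arange(64) % 16
decoded = parallel_decode(lambda t: (target, rng.random(64)), 64)
```

By the final pass the still-masked fraction reaches zero, so the grid is guaranteed to be fully populated.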
Iterative Parallel Decoding
The flexible masking interface enables several musical applications. Vamping (looping with variation) is achieved by providing a periodic prompt: every N-th token is kept as context and the rest are masked. The model generates a variation that maintains the style, genre, and instrumentation of the original while introducing new musical content. Inpainting masks a contiguous time region while keeping the surrounding context, allowing the model to fill in a musically coherent bridge. Compression encodes audio as a sparse set of prompt tokens plus the masking pattern. Continuation provides the beginning of a piece and masks the rest. All of these tasks use the same model and the same inference procedure, differing only in which tokens are masked.
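Since the tasks differ only in which tokens are masked, each one reduces to constructing a boolean mask over the token grid (the sizes and helper names below are illustrative):

```python
import numpy as np

def vamp_mask(n_tokens, period):
    # Periodic prompt: keep every `period`-th token as context, mask the rest.
    mask = np.ones(n_tokens, dtype=bool)       # True == masked / to generate
    mask[::period] = False
    return mask

def inpaint_mask(n_tokens, start, end):
    # Mask a contiguous region; the model must bridge the kept context.
    mask = np.zeros(n_tokens, dtype=bool)
    mask[start:end] = True
    return mask

def continuation_mask(n_tokens, n_prompt):
    # Keep the beginning of the piece, generate everything after it.
    mask = np.ones(n_tokens, dtype=bool)
    mask[:n_prompt] = False
    return mask
```

The same sampler consumes any of these; only the pattern of kept context changes.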
VampNet Audio Demos
Each 10-second clip is processed through four masking strategies using the same pre-trained model. The electronic loop uses high-temperature, sparse prompts (wild/creative), while the guitar loop uses low-temperature, dense prompts (faithful/polished) — same model, same strategies, different parameter regimes.
Electronic Loop (wild: temp=1.0–1.2, sparse prompts)
Guitar Loop (faithful: temp=0.6–0.7, dense prompts)
Compare the two approaches: the electronic loop with high temperature and sparse prompts produces creative, sometimes surprising variations — the model is given freedom to explore. The guitar loop with low temperature and dense prompts stays much closer to the original, generating subtle variations that preserve the character of the source material. Inpainting replaces seconds 3–7 with new content that bridges the surrounding context. Beat-driven masking preserves the rhythmic skeleton and regenerates everything between beats. Both examples use the same model and inference procedure — only the mask pattern and temperature differ.
Anticipatory Music Transformer
The Anticipatory Music Transformer (Thickstun et al., 2023) takes a fundamentally different approach. Rather than operating on audio tokens, it models symbolic music (MIDI) as a temporal point process, treating each musical event as a tuple of (arrival time, duration, note) where the note combines pitch and instrument identity. The vocabulary contains 27,512 tokens: 10,000 quantized time values (at 10 ms resolution), 1,000 duration values, and 16,512 instrument-pitch pairs. The model is a 360M-parameter causal (GPT-style) transformer trained on the Lakh MIDI Dataset, which contains 176,581 MIDI files totaling approximately 1.99 billion tokens.
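The tuple encoding can be sketched as three token ids drawn from disjoint ranges. The offsets and rounding below are our assumption about the layout, consistent with the vocabulary sizes above; the released model's vocabulary also contains special tokens not modeled here.

```python
TIME_RES = 0.01                      # 10 ms time quantization
N_TIME, N_DUR = 10_000, 1_000        # + 16,512 instrument-pitch pairs = 27,512

def encode_event(onset_sec, duration_sec, program, pitch):
    # One musical event becomes a (time, duration, note) token triple,
    # each field living in its own id range.
    t = min(int(round(onset_sec / TIME_RES)), N_TIME - 1)
    d = min(int(round(duration_sec / TIME_RES)), N_DUR - 1)
    note = program * 128 + pitch     # 128 MIDI pitches per instrument
    return (t, N_TIME + d, N_TIME + N_DUR + note)

triple = encode_event(1.5, 0.25, 0, 60)   # middle C at t=1.5 s, 250 ms long
```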
The central innovation is anticipation, best understood by comparing four approaches to conditioning. In autoregressive generation, the model produces events left-to-right with no control input. Seq2seq prepends all control tokens before events, enabling conditioning but creating long-range dependencies (a control for t = 15s may be hundreds of tokens away). Sorting by time interleaves controls with events at their actual timestamps, preserving locality but violating stopping time constraints: the model cannot determine during autoregressive sampling when to insert the next control. Anticipation resolves this by placing each control δ seconds before its target event, which preserves locality while also defining valid stopping times for sampling. Step through the four modes below, and drag the δ slider to see how the anticipation interval affects control placement.
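The anticipatory ordering amounts to merging two time-sorted streams while releasing each control δ seconds ahead of its own timestamp. The following is a toy sketch under that reading; the real model operates on token triples and handles ties and sequence boundaries more carefully.

```python
DELTA = 5.0   # anticipation interval, in seconds

def anticipatory_order(events, controls, delta=DELTA):
    # Both inputs are time-sorted lists of (time_sec, label). A control with
    # timestamp t is emitted once the event stream reaches time t - delta,
    # so it sits just ahead of the events it should influence.
    out, i, j = [], 0, 0
    while i < len(events) or j < len(controls):
        control_due = j < len(controls) and (
            i == len(events) or controls[j][0] - delta <= events[i][0])
        if control_due:
            out.append(("control", controls[j]))
            j += 1
        else:
            out.append(("event", events[i]))
            i += 1
    return out

events = [(float(t), "note") for t in range(10)]     # one event per second
stream = anticipatory_order(events, [(7.0, "melody")])
```

With δ = 5, the control timestamped at 7.0 s appears in the stream just before the event at 2.0 s: local to its target, yet available early enough for the causal model to plan toward it.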
Anticipatory Interleaving
Training uses a 30x data augmentation strategy with four control patterns: 10% span anticipation (consecutive event spans as controls), 40% instrument anticipation (all events from one instrument as controls, simulating accompaniment), 40% random anticipation (a random 10–90% subset of events), and 10% unconditional (no controls). This mixture ensures the model can handle diverse infilling scenarios at inference time without task-specific fine-tuning. At generation time, the model uses nucleus sampling (p = 0.95) and enforces monotonic time ordering plus proper token sequencing (time → duration → note) as structural constraints.
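Nucleus sampling itself is standard; a minimal version is sketched below. In this framing, the structural constraints amount to zeroing the probabilities of any token outside the field expected next (time, then duration, then note) before sampling.

```python
import numpy as np

def nucleus_sample(probs, p=0.95, rng=None):
    # Keep the smallest set of highest-probability tokens whose total mass
    # reaches p, renormalize, and sample from that set.
    rng = rng or np.random.default_rng(0)
    order = np.argsort(-probs)
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1
    keep = order[:cutoff]
    return int(rng.choice(keep, p=probs[keep] / probs[keep].sum()))

probs = np.array([0.90, 0.05, 0.03, 0.02])
token = nucleus_sample(probs, p=0.95)   # only tokens 0 and 1 are candidates
```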
Human evaluators found that generated accompaniments were comparable in musicality to human-composed passages over 20-second segments (18 wins, 31 ties, 11 losses vs. human; p = 0.194, not significantly different). This is a strong result given that the model was not fine-tuned for accompaniment specifically, but learned it as one of many infilling patterns during training. The model also matches or exceeds autoregressive baselines for unconditional prompt continuation, demonstrating that the anticipatory training objective does not degrade standard generation quality.
How They Connect
These two papers, published within a month of each other in 2023, represent two poles of the music generation design space: audio-domain vs. symbolic-domain, bidirectional masking vs. causal autoregressive, and parallel decoding vs. sequential sampling. The shared thread is controllable generation. Both papers treat the ability to condition generation on partial musical context (a time region, a melody, a style reference) as a first-class design goal rather than an afterthought. VampNet achieves this through flexible masking patterns; AMT achieves it through anticipatory interleaving.
A practical consideration that distinguishes the two: VampNet produces audio directly (via DAC decoding), so the output quality is bounded by the codec’s fidelity. AMT produces MIDI, which has perfect symbolic fidelity but requires a separate synthesizer to render audio, so the sonic quality depends entirely on the synthesis engine. We noted during the session that AMT’s MIDI output rendered through a basic synthesizer has a characteristic 90s-video-game quality, while VampNet’s audio output inherits the genre and production quality of its training data. Both approaches have merits, and they are in some sense complementary: symbolic models offer precise structural control, while audio models capture the nuances of timbre and production that symbolic representations discard.
Run It Yourself
Two scripts are provided. dac_experiments.py reconstructs audio at each of the 9 codebook levels so you can hear the RVQ hierarchy. vampnet_experiments.py runs VampNet with four masking strategies (vamping, dense vamping, inpainting, beat-driven). Both require Python 3.9–3.11 and about 2 GB of disk space for model weights.
# Install uv (Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone and set up
git clone https://github.com/mclemcrew/music-ai-reading-group.git
cd music-ai-reading-group/experiments/vampnet
# One-time setup (installs deps, downloads ~1.5 GB of models)
./setup.sh
# DAC codec quality comparison
uv run dac_experiments.py --input path/to/audio.wav
# VampNet masking experiments (wild = creative, faithful = polished)
uv run vampnet_experiments.py --input path/to/audio.wav --style wild
uv run vampnet_experiments.py --input path/to/audio.wav --style faithful