What is Automatic Music Transcription?

Automatic Music Transcription (AMT) is the task of converting raw audio into a symbolic representation, most commonly a piano roll: a binary matrix whose rows are pitches and whose columns are time frames, with a cell set to 1 if that pitch is active at that time and 0 otherwise. It can be understood as the audio analogue of speech recognition, in that both map a continuous acoustic signal to a discrete symbol sequence, but AMT additionally requires resolving polyphony (many pitches sounding simultaneously), precise sub-50 ms onset timing, and optionally instrument identity. Unlike speech, there is no single agreed-upon corpus or evaluation protocol, which is part of why the field moves in bursts.

This session covered three papers that together map a decade of progress on this problem:


Piano Roll

The canonical AMT target: a binary matrix with 88 rows (piano keys, or all MIDI pitches) and time on the X-axis at ~10–32 ms resolution. A 1 means that pitch is sounding. All three papers ultimately predict a piano roll, differing only in how they get there.
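To make the target concrete, the sketch below builds a small piano roll in NumPy; the pitch range, hop size, and the `note_on` helper are illustrative choices matching the description above, not any paper's code.

```python
import numpy as np

# Hypothetical piano roll: 88 piano keys (MIDI 21-108), 10 ms frames.
n_keys, n_frames = 88, 100  # one second of audio at a 10 ms hop
roll = np.zeros((n_keys, n_frames), dtype=np.uint8)

def note_on(roll, midi_pitch, start_frame, end_frame):
    """Mark a note active: row = MIDI pitch - 21 (A0 is the lowest key)."""
    roll[midi_pitch - 21, start_frame:end_frame] = 1

note_on(roll, 60, 0, 50)    # C4 for the first half second
note_on(roll, 64, 25, 75)   # E4 overlapping it (polyphony)
```

Each of the three papers predicts (a thresholded version of) exactly this kind of matrix.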


Onsets Are Special

A note's attack is acoustically distinctive, characterized by broadband energy, a sharp transient, and peak amplitude. All three papers take advantage of this property, whether via explicit dual-objective training (Onsets and Frames), implicit attention patterns (MT3), or a separate onset head (NMP). The onset is the most information-dense region of the signal.


Music as Tokens

MT3 recasts transcription as a sequence-to-sequence problem. Rather than predicting a dense piano roll matrix, MIDI events become a stream of discrete tokens drawn from a roughly 400-entry vocabulary (Instrument, Time, Note, On/Off, EOS) that a T5 Transformer, the same architecture used for machine translation, generates autoregressively.


Harmonic Stacking

NMP's efficiency comes from the input representation. Rather than giving the network a wide mel-spectrogram, the method constructs 8 CQT copies shifted by harmonic intervals. Harmonics that occupied different frequency bins now align in a vertical column, so a 3×3 convolution kernel can detect the full harmonic series without a large receptive field. This is what allows 16K parameters to be sufficient.

Onsets & Frames: The Dual Objective

Hawthorne et al. (2017/2018) established the dominant paradigm for piano transcription for several years after publication. The central observation is that note onsets (the first ~32 ms of a note) are acoustically distinctive and reliably detectable, whereas sustain frames are ambiguous (the spectrum flattens, notes decay, overtones interfere). Rather than training a single model to handle both, the paper trains two specialized detectors jointly and uses the onset detector’s output to gate the frame detector at inference time.

The architecture is a CNN acoustic frontend (shared) feeding into two bidirectional LSTM branches: one produces an 88-dimensional onset probability map per frame, the other produces an 88-dimensional frame probability map per frame. During inference, a frame pitch can only activate if the onset detector also fired at that pitch within a small temporal window. This removes spurious sustained note predictions, which are the most common failure mode for single-model approaches.
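The gating rule can be sketched in a few lines; the thresholds, window size, and the sustain-continuation detail below are assumptions for illustration, not the paper's exact inference code.

```python
import numpy as np

def onset_gate(frame_probs, onset_probs, thresh=0.5, window=2):
    """Keep a frame activation only if an onset fired at the same pitch
    within `window` frames before it, or if the note is already sounding.
    Inputs are (88, T) posteriors; values here are illustrative."""
    onsets = onset_probs > thresh
    frames = frame_probs > thresh
    gated = np.zeros_like(frames)
    # np.nonzero iterates pitch-major, time-ascending, so gated[p, t-1]
    # is already decided when we reach (p, t)
    for p, t in zip(*np.nonzero(frames)):
        recent_onset = onsets[p, max(0, t - window): t + 1].any()
        if recent_onset or (t > 0 and gated[p, t - 1]):
            gated[p, t] = True
    return gated
```

A "ghost tail" (high frame posterior with no onset anywhere nearby) never passes the gate, which is exactly the failure mode the figure below illustrates.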

Onset Gating

Left: frame detector fires ghost tails (red fade) after the note ends — acoustic decay keeps the frame posterior high. Right: onset gating eliminates them — a frame pitch only activates if the onset detector also fired. Ghost tails shown on three arbitrary notes for illustration; in practice any note with sufficient decay can produce them.

The total loss is L = L_onset + L_frame, both binary cross-entropy over the 88-pitch output at each frame. We note that onset ground truth labels are weighted 5× during the first 32 ms of each note (the onset region), decaying back to 1× as the note sustains, which concentrates the onset detector's supervision on the frames where its predictions matter most. The model achieves 82.3% note F1 on MAPS, a more than 100% relative improvement over prior CNN-based methods.
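The weighted loss can be sketched as follows; the exact shape of the decay from 5× back to 1× is an assumption (the paper describes the weighting, not this particular schedule).

```python
import numpy as np

def weighted_bce(pred, target, weights):
    """Per-cell binary cross-entropy scaled by a weight map."""
    eps = 1e-7
    pred = np.clip(pred, eps, 1 - eps)
    bce = -(target * np.log(pred) + (1 - target) * np.log(1 - pred))
    return float((weights * bce).mean())

# Illustrative onset-label weight ramp for one note starting at frame t0.
# At a 10 ms hop, ~3 frames cover the 32 ms onset region; the linear
# decay back to 1x afterward is an assumed schedule.
T, t0 = 20, 4
w = np.ones(T)
w[t0:t0 + 3] = 5.0
w[t0 + 3:t0 + 7] = np.linspace(4.0, 1.0, 4)
# Total loss: L = L_onset + L_frame, each a (weighted) BCE over the
# 88-pitch output at every frame.
```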

MT3: Music as Language

Gardner et al. (2021) pose a different question. Rather than designing a transcription-specific architecture, can a general-purpose sequence-to-sequence Transformer handle AMT across all instruments and all datasets simultaneously? Their answer is affirmative, and the salient ingredient is a carefully designed token vocabulary.

Audio is split into non-overlapping 2.048-second segments. Each segment is encoded as a log mel spectrogram and passed through a T5-small encoder (60M parameters total). The decoder then autoregressively generates a sequence of tokens that describe every musical event in that time window.

MT3 Token Vocabulary

Token sequence for a C–E–G major chord, built from Instrument, Time, On/Off, Note, and EOS tokens. ~400 tokens total in MT3's vocabulary encode every instrument, pitch, timing, and control event.

The 400-token vocabulary encodes: 128 instrument IDs (General MIDI programs), 205 time bins (10 ms resolution over 2.048 sec), 128 note pitches, Note-On/Note-Off toggles, 128 drum events, an End Tie Section marker (for notes spanning segment boundaries), and EOS. Velocity is deliberately excluded because no existing multi-instrument dataset annotates it consistently, and including it would break the mixed-dataset training. We note one further design choice: when a note spans a segment boundary, the next segment begins with explicit "tie declarations" listing which pitches are already active, so the model does not lose a note mid-sustain.
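To make the vocabulary concrete, here is a sketch of how a held C–E–G piano chord might be serialized into MT3-style events. The token names, ordering details, and the `encode_chord` helper are illustrative assumptions, not MT3's actual encoder.

```python
def encode_chord(pitches, on_time, off_time, program=0, dt=0.01):
    """Serialize one chord inside a 2.048 s segment as (type, value)
    event tuples: time bin (10 ms resolution), instrument (GM program),
    on/off toggle, then the affected pitches."""
    events = [("time", round(on_time / dt)),
              ("instrument", program),        # 0 = acoustic grand piano
              ("on_off", "on")]
    events += [("note", p) for p in pitches]  # C4, E4, G4
    events += [("time", round(off_time / dt)),
               ("on_off", "off")]
    events += [("note", p) for p in pitches]
    events.append(("eos", None))
    return events

tokens = encode_chord([60, 64, 67], on_time=0.0, off_time=0.5)
```

Note how the on/off toggle lets three simultaneous note tokens share one time token, which is what keeps polyphonic sequences short.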

The other salient contribution is temperature-weighted mixture training. Training on 6 datasets simultaneously is only useful if the small datasets (GuitarSet: 3 h, URMP: 1.3 h) get adequate gradient signal. Without reweighting, MAESTRO (200 h) would dominate every batch. MT3 borrows the temperature-weighted sampling formula from mT5 (multilingual NLP):

P(sample from dataset i) ∝ (n_i / Σ_j n_j)^τ

Temperature-Weighted Mixture Sampling

τ = 1.0 samples in direct proportion to dataset size, so tiny datasets nearly disappear. τ = 0.3 (MT3's setting) compresses the range, giving URMP's 1.3 h roughly the same representation as MAESTRO's 200 h.

With τ = 0.3, URMP's 1.3 hours is sampled at roughly 10% of total probability, a 30× boost over the τ = 1.0 (size-proportional) baseline. This is the principal driver of MT3's 260% relative improvement on low-resource datasets compared to single-dataset training. Mixture training does not degrade high-resource performance (MAESTRO scores are essentially unchanged) but raises the floor for small datasets substantially.
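The sampling rule is easy to reproduce. The sketch below applies it to the three dataset sizes named above; the real MT3 mixture spans six datasets, so the exact percentages differ.

```python
import numpy as np

def mixture_probs(hours, tau):
    """Temperature-weighted sampling: P(i) ∝ (n_i / Σ_j n_j)^τ,
    renormalized. τ = 1.0 is size-proportional; τ → 0 is uniform."""
    p = np.asarray(hours, dtype=float)
    p = p / p.sum()          # size-proportional shares
    p = p ** tau             # compress the dynamic range
    return p / p.sum()       # renormalize

hours = [200.0, 3.0, 1.3]    # MAESTRO, GuitarSet, URMP (from the text)
for tau in (1.0, 0.3):
    print(tau, mixture_probs(hours, tau).round(3))
```

At τ = 1.0, URMP's share is under 1%; at τ = 0.3 it rises by more than an order of magnitude, at the cost of a modest reduction in MAESTRO's share.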

NMP: The Efficiency Frontier

Bittner et al. (2022) address a practical constraint that neither O&F nor MT3 considers directly, namely that deployed models need to be small. Onsets and Frames requires 18M parameters and 5.4 GB memory to transcribe a 7-minute file. MT3’s T5 requires 60M parameters and autoregressive decoding. Neither runs on a phone. The NMP (Note and Multipitch Estimator) model achieves competitive results in 16,782 parameters, roughly 1,000× smaller, by exploiting a structural property of audio that larger models must learn from data.

The method relies on harmonic stacking. Rather than using a mel-spectrogram (which has uniform frequency resolution and forces the network to learn harmonic relationships from data), NMP uses a Constant-Q Transform (CQT, which has logarithmic frequency resolution) and constructs 8 copies of it, each shifted by a harmonic interval. This transforms harmonic relationships from a learned pattern into a structural property of the input.

Harmonic Stacking

A note at C4 (262 Hz) produces harmonics at 2f, 3f, … 7f, spread across 3.8 octaves of the CQT. In the raw spectrogram, detecting all harmonics requires a kernel spanning hundreds of bins. harmonic_stack() shifts each channel by log2(k) × bins_per_semitone × 12, aligning every harmonic at the fundamental's bin position. A 3×3 kernel now captures the full harmonic series.

After stacking, any pitch's harmonic series appears as a vertical column in a single time frame, visible to a 3×3 convolution kernel regardless of absolute frequency. The network no longer needs the capacity to discover this structure; it is encoded in the representation itself.
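A minimal `harmonic_stack()` sketch under these assumptions: 8 integer harmonics and 3 bins per semitone (the published model's exact harmonic list may differ).

```python
import numpy as np

def harmonic_stack(cqt, harmonics=(1, 2, 3, 4, 5, 6, 7, 8),
                   bins_per_semitone=3):
    """Stack shifted copies of a CQT (n_bins, T) so harmonic k of any
    pitch lands on that pitch's fundamental bin in channel k. The shift
    for harmonic k is log2(k) * 12 * bins_per_semitone bins, rounded."""
    n_bins, T = cqt.shape
    out = np.zeros((len(harmonics), n_bins, T), dtype=cqt.dtype)
    for c, k in enumerate(harmonics):
        shift = int(round(np.log2(k) * 12 * bins_per_semitone))
        if shift < n_bins:
            # pull the harmonic's energy down to the fundamental's row
            out[c, : n_bins - shift] = cqt[shift:]
    return out
```

After this transform, a single pitch appears as a vertical column across the 8 channels at one frequency row, which is the pattern a 3×3 kernel can see.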

The stacked input feeds into a lightweight 3-head CNN. Each head solves one sub-problem of transcription:

Audio waveform → CQT [264 × T] → harmonic_stack() [8 × 264 × T] → shared Conv 3×3 front-end → three heads (16,782 parameters total):

Contour (Yp): [264 × T], 3 bins/semitone — fine-grained pitch
Note (Yn): [88 × T], 1 bin/semitone — quantized piano roll
Onset (Yo): [88 × T], 1 bin/semitone — attack detection; gates the note head

Training all three heads jointly acts as a regularizer: the contour head preserves fine pitch structure, the note head quantizes to semitones, and the onset head detects attacks. The onset head's output gates the note predictions, the same principle as Onsets and Frames but in a model 1,000× smaller. NMP achieves 71% onset+offset F-score on GuitarSet, 24 points below O&F's 95% on MAPS (a simpler, largely synthetic piano benchmark). The tradeoff is explicit: NMP gives up some accuracy on any single instrument in exchange for a single deployable model that generalizes across instruments, and it is the direct precursor to Spotify's open-source Basic Pitch tool.
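The decoding side of this gating can be sketched as a two-pass rule: start a note wherever the onset posterior peaks, then trace the note posterior forward until it dies out. The thresholds and the greedy forward trace below are illustrative assumptions.

```python
import numpy as np

def onset_then_trace(onset_probs, note_probs,
                     on_thresh=0.5, frame_thresh=0.3):
    """Decode (pitch, start, end) note events from (88, T) posteriors:
    onsets open notes, the note posterior sustains them."""
    notes = []
    P, T = onset_probs.shape
    for p in range(P):
        t = 0
        while t < T:
            if onset_probs[p, t] > on_thresh:
                end = t + 1
                # trace forward while the note posterior stays high
                while end < T and note_probs[p, end] > frame_thresh:
                    end += 1
                notes.append((p, t, end))
                t = end
            else:
                t += 1
    return notes
```

Frames with a high note posterior but no opening onset are never emitted, mirroring the gating behavior described above.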

How They Connect

These three papers represent three distinct design philosophies applied to the same problem. Onsets and Frames encodes domain knowledge directly as architecture, hard-coding the insight that onsets and frames need different detectors; the gating rule is an explicit algorithmic choice, not a learned one. MT3 takes the opposite position: given enough data and a sufficiently general model (T5), the relevant structure will emerge from training, and the dual-objective insight appears implicitly in the attention weights without being specified. NMP’s position is more nuanced, encoding harmonic structure in the input representation (harmonic stacking) rather than in the architecture, keeping the network itself simple enough to fit in 16K parameters.

2017: Onsets & Frames → 2021: MT3 → 2022: NMP / Lightweight. Two shifts along the way: architecture → sequence, accuracy → efficiency.

One thread running through all three: the dataset problem. Even the best models in this comparison are trained on at most a few hundred hours of audio, compared to thousands for competitive ASR. MAESTRO, the largest single-instrument AMT dataset, has ~200 hours; URMP has 1.3 hours. MT3’s mixture training is an explicit response to this constraint. Until larger, higher-quality, multi-instrument datasets exist (with careful onset timing, velocity annotation, and multi-microphone conditions), there is a ceiling on what any architecture can achieve, and that ceiling may be lower than the numbers above suggest.

Run It Yourself

The script below implements the core NMP/Basic Pitch architecture from scratch in PyTorch, including harmonic stacking, the 3-head CNN, BCE loss with class balancing, and onset-then-trace decoding, and trains it on three synthetic audio experiments. The pattern mirrors the DDSP companion: ground-truth audio is fixed, the model learns to predict note events from it, and WAV snapshots are saved at regular intervals to make convergence audible.

# Install uv (Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and set up
git clone https://github.com/mclemcrew/music-ai-reading-group.git
cd music-ai-reading-group/experiments/transcription

# Install dependencies (uv reads pyproject.toml automatically)
uv sync

# Run all three experiments (~5 min on CPU, faster on GPU/MPS)
# Outputs go to static/audio/ with nmp_ prefix
uv run nmp_from_scratch.py

Experiment 1: Single note C4

Target: 261.63 Hz with 6 harmonics. The simplest case, with one pitch, one second, and one active semitone in the 88-bin output. The model should converge quickly; if it does not, something is wrong with the harmonic stacking.


Experiment 2: C major scale

Target: C4 D4 E4 F4 G4 A4 B4 C5 in sequence across 2.2 seconds. The model must generalize across 8 different pitches, as it cannot memorize a single frequency region. Convergence is slower; watch the note posterior at step 500 to see which pitches emerge first.


Experiment 3: C major chord (C4-E4-G4)

Target: three simultaneous notes. This is the most challenging case, since G4's 2nd harmonic (784 Hz) almost exactly coincides with C4's 3rd harmonic (785 Hz), and the three harmonic series share further overtones. Harmonic stacking separates these because each pitch creates a distinct vertical stripe in the stacked input. Note how many steps it takes for all three notes to emerge in the predictions.


Papers & Resources