Harmonic is a three-level hierarchical state space model for language modeling. It processes fast, medium, and slow temporal patterns in parallel — outperforming Transformers at long context while using constant memory at inference.
Each level operates at a different temporal resolution — capturing local syntax, phrase structure, and long-range discourse simultaneously. Inter-level signals pass compressed states upward and refined predictions downward.
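To make the idea of parallel timescales concrete, here is a minimal sketch. It is not the Harmonic architecture: it assumes each level is a plain leaky integrator with decay `exp(-1/tau)`, and it omits the inter-level signals entirely. The function names and the `taus` values are hypothetical, chosen only for illustration.

```python
import numpy as np

def leaky_scan(x, tau):
    """Run h_t = a * h_{t-1} + (1 - a) * x_t with a = exp(-1 / tau).

    x has shape (time, features). A larger tau gives a slower level.
    """
    a = np.exp(-1.0 / tau)
    h = np.zeros(x.shape[1])
    out = np.empty(x.shape, dtype=float)
    for t in range(x.shape[0]):
        h = a * h + (1.0 - a) * x[t]
        out[t] = h
    return out

def three_level_features(x, taus=(2.0, 16.0, 128.0)):
    """Concatenate fast, medium, and slow summaries of the same input.

    The tau values are illustrative assumptions, not taken from the source.
    """
    return np.concatenate([leaky_scan(x, tau) for tau in taus], axis=-1)

# A constant input: the fast level saturates quickly, the slow level lags.
x = np.ones((10, 4))
feats = three_level_features(x)
print(feats.shape)
print(feats[-1, 0], feats[-1, 4], feats[-1, 8])
```

With a constant input, the fast level is close to its steady state after ten steps while the slow level has barely moved, which is the sense in which each level sees a different temporal resolution.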
All models are trained on an equal token budget (65.5M tokens) at equal parameter count. The advantage does not vanish with more training: at 5× the headline budget the crossover holds.
| Seq | Transformer | Mamba | Harmonic | H–TF gap |
|---|---|---|---|---|
| 1 024 | 6.662 | 6.616 | 6.571 | +1.4% |
| 2 048 | 6.657 | 6.532 | 6.426 | +3.5% |
| 4 096 | 7.045 | 6.740 | 6.687 | +5.1% |
| 8 192 | 6.787 | 6.422 | 6.333 | +6.7% |
| 16 384 | 6.873 | 6.286 | 6.196 | +9.9% |
| 32 768 | 7.259 | 6.549 | 6.433 | +11.4% |
| 65 536 | OOM | OOM | 6.169 | — |
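
The gap column can be checked with a few lines of arithmetic. The sketch below assumes the gap is the relative reduction with respect to the Transformer, `(TF - H) / TF`. The post does not state the formula, but this definition reproduces the listed percentages.

```python
# (Transformer, Harmonic) values copied from the table, keyed by sequence length.
rows = {
    1024: (6.662, 6.571),
    2048: (6.657, 6.426),
    4096: (7.045, 6.687),
    8192: (6.787, 6.333),
    16384: (6.873, 6.196),
    32768: (7.259, 6.433),
}

def relative_gap(transformer, harmonic):
    """Percentage by which Harmonic is below the Transformer (assumed definition)."""
    return 100.0 * (transformer - harmonic) / transformer

for seq, (tf, h) in rows.items():
    print(f"{seq:>6}  {relative_gap(tf, h):+.1f}%")
```

Rounded to one decimal place, the printed values match the gap column for every row where both models ran.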
Results are over five independent seeds at seq=8192. Confidence intervals do not overlap, and the ranking Harmonic < Mamba < Transformer (lower is better) is consistent across all seeds.
At 20K steps (5× the headline budget), Harmonic still wins at seq=8K. The Transformer wins at short context, which is expected: attention is at its strongest on short sequences.
The timescale hierarchy is critical. Flat timescales (all τ equal) cost +0.50 bpt. Removing the inter-level prediction-error signals has a negligible effect (≤0.022 bpt), so the timescale structure itself drives performance.
Harmonic carries its raw SSM state across chunk boundaries — enabling truly unbounded context with a fixed memory footprint. Stateful fine-tuning consistently improves performance.
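The state-carrying idea can be illustrated with a toy linear recurrence. This is a sketch under stated assumptions, not Harmonic's actual layer: it uses a diagonal recurrence `h_t = a * h_{t-1} + x_t`, and the function names are hypothetical. It shows that carrying the final state of one chunk into the next gives the same outputs as processing the full sequence at once, while only a fixed-size state is kept between chunks.

```python
import numpy as np

def scan(x, a, h0):
    """Linear recurrence h_t = a * h_{t-1} + x_t; returns outputs and final state."""
    h = h0
    out = np.empty(x.shape, dtype=float)
    for t in range(x.shape[0]):
        h = a * h + x[t]
        out[t] = h
    return out, h

def scan_in_chunks(x, a, chunk_len):
    """Process x chunk by chunk, carrying only the state h across boundaries."""
    h = np.zeros(x.shape[1])
    outputs = []
    for start in range(0, x.shape[0], chunk_len):
        y, h = scan(x[start:start + chunk_len], a, h)
        outputs.append(y)
    return np.concatenate(outputs), h

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 8))
a = np.full(8, 0.9)

full, h_full = scan(x, a, np.zeros(8))
chunked, h_chunked = scan_in_chunks(x, a, chunk_len=16)
print(np.allclose(full, chunked), np.allclose(h_full, h_chunked))
```

The carried state has one entry per feature regardless of how many chunks have been consumed, which is what keeps the memory footprint fixed as the context grows.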
Sounds very interesting! I would suggest you try to get this peer reviewed directly, e.g. as a workshop submission.
Full architecture details, training procedure, ablation studies, and reproducible experiments. arXiv submission pending moderation.