NeurIPS 2025 Best Paper Runner-up: Explaining Scaling Laws via Superposition

Published: 2026-06-03 | Category: NLP | Reading time: ≈ 10 min

Neural scaling laws $L \propto N^{-\alpha}$ hold across model families, datasets, and tasks with strikingly consistent exponents. But why this power law, and where does the exponent come from? The mechanism has stayed unclear. Most existing explanations follow the same template: assume that feature/skill importance in the data follows a power-law distribution, and the loss naturally becomes a power law. The blame goes to the data.

The NeurIPS 2025 Best Paper Runner-up, “Superposition Yields Robust Neural Scaling”, traces the answer back to geometry. The hidden dimension $m$ of an LLM is much smaller than the number of features $n$ it must represent. The $n$ feature vectors are forced into an $m$-dimensional space, and the typical squared overlap (interference) between them scales as $1/m$, which means the loss also drops as $1/m$. The paper adds a knob to Anthropic’s toy model that independently tunes the strength of superposition. As long as superposition is strong enough, the $1/m$ law holds robustly, almost regardless of the feature frequency distribution. In short: the exponent and robustness of scaling laws come from the geometry of the $m$-dimensional sphere, not from the data.

Paper: arxiv.org/abs/2505.10465

Existing Explanations of Scaling Laws

Previous explanations fall roughly into two camps:

Manifold / function fitting view: larger models cover the data manifold better, and loss is dominated by what fails to get covered. The exponent depends on how “thick” the data is or how fast features decay.
Discrete skills / feature learning view (e.g. Quanta-style models): the network learns countable “skills” one by one, with skill importance following a power law, and larger models cover more skills.

Both share a common assumption: the data itself must have power-law feature frequencies or skill importance for scaling to be a power law. This is “power law in, power law out”.

But this doesn’t match reality. Real LLMs show nearly identical scaling across tasks and datasets, regardless of data shape. Old explanations don’t cover this robustness.

What is Superposition

Before following the paper’s argument, let’s clarify superposition. The concept comes from Anthropic’s 2022 toy model paper: a neural network’s hidden dimension $m$ is finite, but the number of features $n$ in language (tokens, concepts, compositions) is much larger. If each feature is bound to a hidden vector $W_i$, the $n$ vectors cannot be pairwise orthogonal. Trade-offs are forced.

Two ways to compromise:

Weak superposition: keep only the $m$ most important features and give them a near-orthogonal basis. Drop the remaining $n - m$ features (set $W_i = 0$) and rely on the global mean for output.
Strong superposition: assign a non-zero $W_i$ to all $n$ features, but the vectors necessarily have non-zero inner products and interfere with each other. The model uses ReLU + negative bias to do “error correction”, squashing small interference to zero.

When does strong superposition pay off? When features are sparse (only a few activate per sample), interference terms are unlikely to all fire at once, error correction works, and representing everything beats dropping most features. Natural language satisfies this (a sentence uses far fewer tokens than the vocabulary holds), so LLMs operate in the strong superposition regime.

The connection to scaling laws is the paper’s central claim: the model’s loss depends directly on how features are packed into the hidden space. How crowded the packing is, and what the vector geometry looks like, determine how loss scales with $m$.

Toy Model and the Sweep Setup

To make “superposition strength” a controllable variable, you need a model simple enough to analyze but capturing two key LLM features: (1) features far outnumber the hidden dimension, (2) feature frequencies are uneven. The paper uses Anthropic’s autoencoder toy model:

Input $x \in \mathbb{R}^n$, each component $x_i = u_i v_i$, where $u_i \sim \mathrm{Bernoulli}(p_i)$ controls activation and $v_i \sim U(0, 2)$ controls activation strength. The model compresses $x$ into a hidden vector $h = W^\top x$ of $m \ll n$ dimensions and reconstructs $y = \mathrm{ReLU}(W h + b)$, with parameters $W \in \mathbb{R}^{n \times m}$ plus bias $b$. The $i$-th row $W_i$ is feature $i$’s representation vector. Loss is the reconstruction error $\|y - x\|_2^2$. Feature frequencies follow a power law $p_i \propto 1/i^\alpha$, with exponent $\alpha$ controlling how steep the distribution is: large $\alpha$ means a few high-frequency features dominate; small $\alpha$ means more uniform frequencies.
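The setup above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code: the `density` parameter (expected number of active features per sample) is an assumed normalization for $p_i$, the function names are hypothetical, and $W$ is left at a random initialization instead of being trained.

```python
import numpy as np

def feature_probs(n, alpha, density=1.0):
    """p_i proportional to 1/i^alpha, scaled so that about `density` features
    fire per sample. The `density` normalization is an assumption of this sketch."""
    raw = np.arange(1, n + 1, dtype=float) ** (-alpha)
    return np.minimum(density * raw / raw.sum(), 1.0)

def sample_inputs(rng, p, batch):
    """x_i = u_i * v_i with u_i ~ Bernoulli(p_i) and v_i ~ U(0, 2)."""
    u = rng.random((batch, p.size)) < p        # activation mask
    v = rng.uniform(0.0, 2.0, size=u.shape)    # activation strength
    return u * v

def reconstruct(x, W, b):
    """Compress to m dimensions with W, then decode with the same W plus ReLU."""
    h = x @ W                                  # (batch, n) -> (batch, m)
    return np.maximum(h @ W.T + b, 0.0)        # ReLU(W h + b), back to (batch, n)

def loss(x, W, b):
    """Mean squared reconstruction error ||y - x||^2 over the batch."""
    return np.mean(np.sum((reconstruct(x, W, b) - x) ** 2, axis=1))

rng = np.random.default_rng(0)
n, m = 256, 16
p = feature_probs(n, alpha=1.0)
x = sample_inputs(rng, p, batch=1024)
W = rng.normal(size=(n, m)) / np.sqrt(m)       # untrained, for illustration only
b = np.zeros(n)
print(x.shape, loss(x, W, b))
```

Sweeping `m` and `alpha` in a trained version of this model is what produces the loss-versus-width curves discussed in the following sections.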

The paper’s key modification is introducing decoupled weight decay to independently control superposition strength. Positive weight decay squashes most $W_i$ to 0, keeping only a few features (weak superposition). Negative weight decay pulls all $\|W_i\|$ toward 1, so every feature gets represented (strong superposition). With this knob, “superposition strength” and “frequency distribution” become two independently tunable variables, and one can study how loss scales with $m$ separately under each.

Weak Superposition: Power Law In, Power Law Out

In weak superposition, the top $m$ most frequent features are represented (the paper defines “represented” as $\|W_i\|^2 > 1/2$, with the corresponding fraction $\phi_{1/2}$ measured to be $\approx m/n$), and the rest are dropped. Loss is then approximately the total frequency of the dropped features, weighted by the mean squared activation $\langle v^2 \rangle$:

$$L \approx \langle v^2 \rangle \sum_{i > \phi_{1/2} n} p_i$$

Plugging in $p_i \propto 1/i^\alpha$ (requires $\alpha > 1$ for the integral to converge) and integrating gives $L \propto m^{-(\alpha-1)}$, i.e. model exponent $\alpha_m = \alpha - 1$. This is exactly what previous explanations conclude, meaning they only cover the weak superposition case.
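The integration step is $\sum_{i > m} i^{-\alpha} \approx \int_m^\infty x^{-\alpha}\,dx = m^{-(\alpha-1)}/(\alpha-1)$. A quick numerical check, assuming $\alpha = 2$ and $n = 10^6$ (both arbitrary choices for illustration, as is the helper name `dropped_mass`), recovers the exponent from the tail sums directly. The constant $\langle v^2 \rangle$ is omitted because it does not affect the slope.

```python
import numpy as np

def dropped_mass(m, n, alpha):
    """Total frequency of features i > m, with p_i proportional to 1/i^alpha
    and normalized over n features."""
    raw = np.arange(1, n + 1, dtype=float) ** (-alpha)
    p = raw / raw.sum()
    return p[m:].sum()

n, alpha = 1_000_000, 2.0
ms = np.array([10, 20, 40, 80, 160, 320, 640, 1280])
tails = np.array([dropped_mass(m, n, alpha) for m in ms])

# Slope of log(loss) against log(m) is minus the model exponent alpha_m.
slope, _ = np.polyfit(np.log(ms), np.log(tails), 1)
print(f"fitted alpha_m = {-slope:.3f}, predicted alpha - 1 = {alpha - 1:.3f}")
```

The fit only matches $\alpha - 1$ while $m \ll n$; as $m$ approaches $n$ the finite tail bends the curve away from a pure power law.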

The conclusion is also clear: whether loss is a power law depends entirely on whether data frequencies are a power law.

Strong Superposition: The $1/m$ Law from Geometry

Under strong superposition, the source of loss changes. All features are represented, so there’s no “dropped” term; loss comes entirely from overlaps between representation vectors. Consider the simplest case where only feature $j$ activates: ideally the output $y_i$ should equal $x_i$ (non-zero only at $i = j$), but actually $y_i = \mathrm{ReLU}(W_i \cdot h + b_i) = \mathrm{ReLU}(x_j\, W_i \cdot W_j + b_i)$, because the hidden vector is $h = W^\top x = x_j W_j$. For non-activated $i \neq j$, the interference term $W_i \cdot W_j$ leaks into the output whenever it survives the ReLU, and loss scales as $(W_i \cdot W_j)^2$.

So the problem reduces to pure geometry: $n$ unit vectors packed into an $m$-dimensional space cannot be pairwise orthogonal, so how large is the typical squared cosine of pairwise angles? This quantity directly determines loss, independent of training dynamics or frequency distribution.

Two independent geometric facts sandwich the answer:

Random placement (upper bound, reached with no optimization): $n$ unit vectors placed uniformly at random on the sphere in $m$ dimensions have pairwise squared cosines following $\mathrm{Beta}(1/2, (m-1)/2)$, with mean exactly $1/m$ and variance $\sim 2/m^2$. This is the “near-orthogonality” of high-dimensional space: the larger $m$ is, the closer two random directions are to perpendicular. Even without any optimization, the magnitude is already $1/m$.
Optimal packing (lower bound): constructing $n$ vectors to minimize the “maximum pairwise overlap” is an optimization problem with an analytic solution, and the optimum is called an Equiangular Tight Frame (ETF; the paper writes “equal angle tight frame”, same meaning). For $n \gg m$ the maximum overlap cannot be smaller than $\approx 1/\sqrt{m}$, whose square is still $1/m$. Even if you optimize hard, packing vectors as uniformly as possible, you cannot push squared overlap below order $1/m$.

Both the upper bound (random placement) and the lower bound (optimal packing) land at $1/m$, so regardless of what the trained $W_i$ look like, random or ETF-like, squared cosines cannot escape this magnitude.
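Both geometric facts are easy to check numerically. The sketch below samples random unit vectors and compares their mean squared overlap with $1/m$ and with the Welch bound $(n-m)/(m(n-1))$, a standard floor on squared overlaps for any $n$ unit vectors in $m$ dimensions that ETFs attain. The choice $n = 512$ and the function names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def mean_squared_overlap(W):
    """Mean cos^2 of the angle between all distinct pairs of rows of W."""
    U = W / np.linalg.norm(W, axis=1, keepdims=True)   # normalize rows
    G = U @ U.T                                        # Gram matrix of cosines
    off_diag = G[~np.eye(G.shape[0], dtype=bool)]      # drop the i == j entries
    return np.mean(off_diag ** 2)

def welch_bound(n, m):
    """Floor on the mean (and max) squared overlap of n unit vectors in R^m."""
    return (n - m) / (m * (n - 1))

rng = np.random.default_rng(0)
n = 512
results = {}
for m in (16, 32, 64, 128):
    # Normalized Gaussian rows are uniformly distributed on the sphere.
    W = rng.normal(size=(n, m))
    results[m] = (mean_squared_overlap(W), 1.0 / m, welch_bound(n, m))
    measured, one_over_m, floor = results[m]
    print(f"m={m:4d}  random={measured:.5f}  1/m={one_over_m:.5f}  Welch={floor:.5f}")
```

For $n \gg m$ the Welch floor and the random-vector mean differ only by the factor $(n-m)/(n-1)$, which is why both sides of the sandwich scale as $1/m$.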

The paper confirms this empirically: the $W_i$ with norm greater than 1 (about $m^2/2$ of them, corresponding to the more important features) have squared overlaps whose variance is much smaller than for random vectors, with mean squared overlap landing precisely at $1/m$. The remaining less important features can’t be described by a single theory, but their measured squared overlaps still robustly follow $1/m$.

Both sides give $1/m$, so loss naturally drops as $1/m$, almost regardless of frequency distribution (assuming $\alpha$ isn’t too large; sufficiently skewed distributions enter the regime below). This is why scaling laws are robust: the exponent isn’t stolen from the data, it’s a geometric property of the $m$-dimensional sphere itself.

The only exception is when frequencies are highly skewed (large $\alpha$): the geometry of important features gets compressed to ETF-like with small contribution, the less important features start to dominate, and the exponent deviates from 1. The paper’s extreme-case estimate is $\alpha_m \approx 2(\alpha - 1)$.

The paper packs the weak vs. strong contrast into a single figure: the two left subplots are weak superposition, where loss curves take different slopes for different frequency exponents $\alpha$; the two right subplots are strong superposition, where loss curves under every $\alpha$ collapse onto the same $1/m$ line. The gray dots are real LLMs, sitting on that same line.

Comparing with Real LLMs

The paper studies four model families (OPT, GPT2, Qwen, Pythia), treating tokens as “atomic features” and analyzing the $W$ matrix of the language model head. Two checks:

Does the normalized row-vector mean squared overlap drop as $1/m$? Yes.
Is cross-entropy loss approximately linear in $1/m$? Also yes.

Direct fitting gives $\alpha_m = 0.91 \pm 0.04$. Inferring from Chinchilla, with model size $N \propto m^{2.52}$ and $\alpha_N = 0.35$, gives $\alpha_m \approx 0.88$. Both are close to 1, consistent with strong superposition theory.
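The Chinchilla-based estimate is a change of variables. Assuming the loss is a power law in both $N$ and $m$, substituting $N \propto m^{2.52}$ gives:

$$L \propto N^{-\alpha_N} \propto \left(m^{2.52}\right)^{-\alpha_N} = m^{-2.52\,\alpha_N} \quad\Rightarrow\quad \alpha_m = 2.52 \times 0.35 \approx 0.88$$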

Token frequencies are approximately Zipf ($\alpha \approx 1$), in the “flat” range, which falls exactly in the $\alpha_m \approx 1$ stable region.

The paper plots all four model families on one figure: left is the scatter of language-model-head squared overlaps vs. $1/m$, right is the cross-entropy loss vs. $1/m$ fit. Colors mark model families, shapes mark evaluation datasets, slopes are nearly identical across all of them, and the reported $\alpha_m = 0.91 \pm 0.04$ is the joint regression over every line on the right.

Figure: real LLM verification. (a) Language model head squared overlaps drop as $1/m$; (b) model-related loss fitting gives $\alpha_m = 0.91 \pm 0.04$, consistent across model families and evaluation sets.

A Few Falsifiable Predictions

Once robustness is grounded in geometry, three long-standing questions fall out naturally:

When does the scaling law break: as soon as representations are disentangled and you leave the strong superposition regime, the $1/m$ law collapses. Or when the model dimension $m$ approaches vocabulary size (under the extreme assumption features = tokens), the representational bottleneck disappears, and width-direction loss stops being a power law. Power laws are a product of “geometric crowding”; remove the crowding and they vanish.
Can the exponent $\alpha_m > 1$ be achieved: in the robust regime, theory pins it at $\alpha_m \approx 1$; a steeper exponent requires severely skewed feature frequencies. Natural language is Zipfian ($\alpha \approx 1$), so the exponent is pinned near 1. This explains why everyone’s measured scaling exponents look so similar; it’s not a coincidence.
How does depth enter (conjecture, not yet proven): the paper conjectures total loss splits as $f_m(m) + f_\ell(\ell)$, with the width part dominated by superposition. At Chinchilla-optimal allocation, the two parts must balance, so the measured $\alpha_m \approx 1$ reflects joint width+depth optimization, not the width-only limit.

Limitations

The toy model is an autoencoder, not a transformer. Treating tokens as atomic features is a first-order approximation; in real LLMs, “features” are more likely token combinations or abstract concepts, which the paper acknowledges as a simplification.
Under strong superposition, the configuration of important $W_i$ is only qualitatively described via ETF analogy, with no rigorous analytic solution, so training dynamics (how loss evolves over steps) cannot be explained.
Scaling along the data / training-step dimension is not covered. The paper conjectures that data scaling in the strong superposition regime depends on “how angle distributions evolve”, but more refined analytical tools are needed.
The empirical LLM loss extrapolated as a linear function of $1/m$ doesn’t pass through the origin; the residual is attributed to “intrinsic uncertainty of language itself”, but no independent evidence is provided.

Summary

This paper connects two research lines that had been running in parallel: Anthropic’s interpretability / superposition line, and Kaplan / Chinchilla’s empirical scaling laws line. “Why scaling laws are robust” moves from an empirical observation to a geometric mechanism, with a handful of falsifiable predictions: change the data distribution, disentangle representations, or break the vocabulary scale, and the power law warps. Most prior scaling-law work has been about fitting curves; few attempts explain the laws from microscopic mechanisms, and this is one of the worthwhile ones.