NeurIPS 2025 Best Paper Runner-up: Explaining Scaling Laws via Superposition
Neural scaling laws $L \propto N^{-\alpha}$, relating loss $L$ to model size $N$, hold across model families, datasets, and tasks with strikingly consistent exponents. But why a power law, and where does the exponent come from? The mechanism has remained unclear. Most existing explanations follow the same template: assume that the importance of features or skills in the data follows a power-law distribution, and a power-law loss follows. On this view, the scaling law is inherited from the data.
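As a rough illustration of that template (a simplified sketch under assumptions of our own, not a derivation taken from any particular paper): suppose feature $i$ occurs with frequency $p_i \propto i^{-(1+\alpha)}$ with $\alpha > 0$, and a model of size $N$ learns the $N$ most frequent features perfectly while paying a fixed loss for every feature it misses. The loss is then the tail mass of the frequency distribution:

$$
L(N) \;\propto\; \sum_{i > N} p_i \;\propto\; \sum_{i > N} i^{-(1+\alpha)} \;\approx\; \int_N^{\infty} x^{-(1+\alpha)}\,dx \;=\; \frac{N^{-\alpha}}{\alpha}.
$$

Under these assumptions the exponent $\alpha$ is set entirely by the data: change the frequency distribution and the scaling exponent changes with it.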
The NeurIPS 2025 Best Paper Runner-up, "Superposition Yields Robust Neural Scaling", traces the answer back to geometry instead. The hidden dimension $m$ of an LLM is much smaller than the number of features $n$ it must represent, so the $n$ feature vectors are forced into an $m$-dimensional space and cannot all be orthogonal. The typical squared overlap (interference) between them scales as $1/m$, which means the loss also falls as $1/m$. The paper adds a knob to Anthropic’s toy model of superposition that tunes the strength of superposition independently. As long as superposition is strong enough, the $1/m$ law holds robustly, almost regardless of the feature frequency distribution. In short: the exponent and robustness of scaling laws come from the geometry of the $m$-dimensional sphere, not from the data.
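The $1/m$ geometry can be checked numerically with a small sketch. It assumes random unit vectors as a stand-in for feature vectors; the vectors a trained model learns are not random, so this only illustrates the baseline scaling and does not reproduce the paper's experiments. The function name `mean_squared_overlap` is hypothetical.

```python
import numpy as np

def mean_squared_overlap(n, m, seed=0):
    """Mean squared dot product between n random unit vectors in m dimensions.

    Assumption: random directions stand in for learned feature vectors.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n, m))
    W /= np.linalg.norm(W, axis=1, keepdims=True)  # project rows onto the unit sphere
    G = W @ W.T                                    # Gram matrix of pairwise overlaps
    off_diag = G[~np.eye(n, dtype=bool)]           # drop self-overlaps (all equal to 1)
    return float(np.mean(off_diag ** 2))

n = 1000  # number of features, much larger than m
for m in (16, 32, 64, 128):
    msq = mean_squared_overlap(n, m)
    print(f"m={m:4d}  mean squared overlap={msq:.5f}  1/m={1/m:.5f}")
```

For random unit vectors the expected squared overlap is exactly $1/m$, so the two printed columns should agree closely. Nothing about feature frequencies enters the calculation, which is the sense in which this scaling is geometric and not data-driven.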