Gated Attention: Non-linearity, Sparsity, and Attention Sink (NeurIPS 2025 Best Paper)

The core idea of Gated Attention (Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free, Qwen Team, 2025, NeurIPS 2025 Best Paper) is to insert a head-specific, elementwise sigmoid gate after the Scaled Dot-Product Attention (SDPA) output. On a 1.7B dense model and a 15B MoE model (2.54B activated), each trained on 3.5T tokens, the method reduces PPL by roughly 0.05–0.27 (depending on model and setting), substantially suppresses training loss spikes, improves long-context extrapolation, and dramatically mitigates the attention sink (BOS-token attention drops from 46.7% to 4.8%).

The authors systematically compare five candidate positions and several gate forms (granularity, parameter sharing, multiplicative vs additive, activation function), totaling 30 variants. The configuration above performs best across all settings.

Paper: arxiv.org/abs/2505.06708

Why gating

Gating in neural networks has a long history. LSTMs use input/forget/output gates to control information flow, Highway Networks use gates to modulate residual flow, and Mamba-like state-space models and various linear-attention variants treat gating as a standard component. But existing work rarely compares gating in different positions and forms in a controlled fashion.

Take Switch Heads as an example: the method adds sigmoid gating across attention heads for top-K routing, and the gain is attributed to the routing mechanism. The authors of this paper reproduce the setting with the number of experts reduced to 1 (no routing), so the gate degenerates to a modulator on the value output, yet the gain remains. Similarly, DeepSeek's NSA (Native Sparse Attention, ACL 2025 Best Paper) also uses gating, but that paper attributes the gain mainly to the sparse-attention design without separately quantifying the gate's contribution.

The goal of this paper is to isolate gating from these confounding variables and to systematically compare the effects of its insertion position and form.

Five positions × several forms = 30 variants

Figure 1: five candidate gating positions (left); PPL and MMLU changes for each position on the 15B MoE model (middle); training loss curves (right)

Five positions where a gate can be inserted:

  • $G_1$: after SDPA output, before concat
  • $G_2$: after value projection
  • $G_3$: after key projection
  • $G_4$: after query projection
  • $G_5$: after the final dense output layer

Each position can further be combined with multiple gate forms: elementwise vs headwise (granularity), head-specific vs head-shared (parameter sharing), multiplicative vs additive (operation), and sigmoid vs SiLU (activation function).
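As a rough illustration of the granularity and sharing axes, a minimal sketch follows (the shapes, names, and initialization are assumptions for illustration, not the paper's code):

import torch

B, H, T, D = 2, 16, 8, 128  # batch, heads, tokens, head_dim
q = torch.randn(B, H, T, D)  # query states after projection

# Elementwise + head-specific: one D x D matrix per head -> gate shape [B, H, T, D]
W_es = torch.randn(H, D, D) * 0.02
gate_elementwise = torch.sigmoid(q @ W_es)

# Headwise + head-specific: one scalar per head per token -> gate shape [B, H, T, 1]
w_hw = torch.randn(H, D, 1) * 0.02
gate_headwise = torch.sigmoid(q @ w_hw)

# Elementwise + head-shared: a single D x D matrix reused by every head
W_sh = torch.randn(D, D) * 0.02
gate_shared = torch.sigmoid(q @ W_sh)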

The main experiments use two model groups: a 1.7B dense model and a 15B MoE model (2.54B activated), each trained from scratch on 3.5T tokens. The main grid covers 30 variants.

The optimal configuration: at position $G_1$ (after SDPA output), apply a head-specific, elementwise sigmoid gate produced from the current query projection. The code change is roughly ten lines:

import torch
import torch.nn.functional as F

# Standard attention output
attn_out = F.scaled_dot_product_attention(q, k, v)  # [B, H, T, D]
# New: head-specific elementwise sigmoid gate computed from the query projection
# W_gate: [H, D, D], one D x D matrix per head (broadcast matmul over heads)
gate = torch.sigmoid(q @ W_gate)  # [B, H, T, D]
attn_out = attn_out * gate
# Followed by the usual concat across heads and the dense output layer
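For concreteness, here is a minimal sketch of how this could sit inside a full attention module (the class and attribute names, e.g. GatedAttention and gate, are illustrative assumptions, not the paper's code):

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttention(nn.Module):
    """Multi-head attention with a head-specific sigmoid gate on the SDPA output."""
    def __init__(self, hidden_dim: int, num_heads: int):
        super().__init__()
        self.h, self.d = num_heads, hidden_dim // num_heads
        self.qkv = nn.Linear(hidden_dim, 3 * hidden_dim, bias=False)
        self.gate = nn.Parameter(torch.randn(self.h, self.d, self.d) * 0.02)
        self.out = nn.Linear(hidden_dim, hidden_dim, bias=False)

    def forward(self, x):  # x: [B, T, hidden_dim]
        B, T, _ = x.shape
        q, k, v = self.qkv(x).view(B, T, 3, self.h, self.d).permute(2, 0, 3, 1, 4)
        o = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        o = o * torch.sigmoid(q @ self.gate)  # the roughly one-line change
        return self.out(o.transpose(1, 2).reshape(B, T, -1))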

On the 1.7B dense model (28 layers, 3.5T tokens), PPL drops from 6.180 to 6.130 (about −0.05); on the 1.7B / 400B-token setting, MMLU rises from 50.21 to 51.15 (about +1 point). On the 15B MoE model, PPL drops from 6.026 to 5.761 (about −0.27) and MMLU rises from 58.79 to 60.82.

Why this position

The authors attribute $G_1$’s advantage to two factors: non-linearity and query-dependent sparsity.

Non-linearity. In standard attention, the value projection $W_V$ and the final dense projection $W_O$ are two consecutive linear maps that can be fused into a single low-rank linear map whose rank is bounded by head_dim. In other words, the only non-linearity attention introduces from input to output comes from softmax, and softmax acts on attention scores, not values. Adding a gate at $G_1$ or $G_2$ inserts a non-linear activation along this low-rank path, increasing expressiveness.
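Concretely, for a single head with attention matrix $A$ and input $X$ (a standard identity, spelled out here for clarity):

$$\mathrm{head}(X) = A\,(X W_V)\,W_O = A\,X\,(W_V W_O), \qquad \operatorname{rank}(W_V W_O) \le d_{\text{head}},$$

since $W_V \in \mathbb{R}^{d_{\text{model}} \times d_{\text{head}}}$ and $W_O \in \mathbb{R}^{d_{\text{head}} \times d_{\text{model}}}$. A gate inserted between the two maps, as at $G_1$ or $G_2$, prevents this fusion.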

This also explains why $G_3$ and $G_4$ (gates on query/key) are weaker: query and key are followed by softmax, which already provides non-linearity, so the marginal benefit of an extra gate is small. The paper notes that $G_5$ (after dense) is also limited because it does not address the missing non-linearity between $W_V$ and $W_O$.

Sparsity. The authors plot the distribution of gate activations: the learned sigmoid gate is highly sparse; most tokens have gate values close to 0, and only a few approach 1. This sparsity is not enforced explicitly; the model develops it on its own.

A further analysis shows the sparsity is query-dependent: each query token decides for itself how much value information passes through. In effect, the gate adds a hard token-level filter on top of the soft attention.
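A sketch of this kind of diagnostic, assuming gate activations captured with a forward hook (the 0.1 threshold and the dummy data below are arbitrary illustrations, not the paper's):

import torch

def gate_sparsity(gate, threshold=0.1):
    """Fraction of gate values effectively shut off (below threshold)."""
    return (gate < threshold).float().mean().item()

# Dummy gate activations, shape [B, H, T, D], skewed toward 0
gate = torch.sigmoid(torch.randn(2, 16, 8, 128) - 2.0)
print(f"near-zero fraction: {gate_sparsity(gate):.1%}")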

The two factors compound: non-linearity expands expressiveness, then a query-level filter suppresses noisy values.

The disappearance of attention sinks

Attention sink was identified by Xiao et al. in Efficient Streaming Language Models with Attention Sinks (2023): trained LLMs tend to allocate large amounts of attention weight to the first few tokens of the sequence (typically BOS), even when those tokens carry no semantic information. The phenomenon is universal across mainstream LLMs and is generally attributed to softmax’s requirement that probabilities sum to 1 — when no specific destination is needed, the model dumps weight at the start of the sequence.
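The sink is straightforward to measure given the attention weights; a minimal sketch (tensor names and shapes are assumptions):

import torch

def bos_attention_share(attn):
    """attn: [B, H, T_q, T_k] softmax attention weights.
    Average attention mass that queries place on token 0 (typically BOS)."""
    return attn[..., 0].mean().item()

# Dummy attention weights for illustration
attn = torch.softmax(torch.randn(2, 16, 64, 64), dim=-1)
print(f"BOS share: {bos_attention_share(attn):.1%}")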

Attention sinks have two costs: they consume attention capacity, and they constrain long-context extrapolation, since the model concentrates weight on the first few tokens during training but cannot meaningfully extend that pattern beyond the training length. Subsequent work (StreamingLLM, Sink Attention, etc.) has mostly worked around or exploited the sink rather than removing it.

Gated Attention substantially weakens the phenomenon without explicitly targeting it: with the sigmoid gate added, the bright “stripe” on BOS tokens in the attention map fades away.

Figure 2: top row, the baseline: first-token attention averages 46.7%, with a prominent stripe on the BOS token at layer 21; bottom row, with the SDPA output gate added, the average drops to 4.8% and the stripe is largely gone

A mechanistic interpretation: in standard attention, softmax forces the probabilities to sum to 1, so when a row needs no value information, the model has nowhere to dump the weight except at some innocuous token. With a sigmoid gate, softmax can distribute probabilities normally and the gate can multiply by 0 to shut off the entire head’s contribution. This extra release valve frees softmax, and the sink is no longer necessary.
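A toy numerical illustration of this release valve (the values below are made up): the softmax row must sum to 1, but a near-zero gate can null the head's contribution anyway.

import torch

scores = torch.tensor([[2.0, 0.1, 0.1, 0.1]])
probs = torch.softmax(scores, dim=-1)  # sums to 1; the mass must land somewhere
v = torch.randn(4, 8)
head_out = probs @ v  # [1, 8], nonzero by construction

gate = torch.sigmoid(torch.tensor(-6.0))  # ~0.0025, a gate that learned to shut off
print((gate * head_out).abs().max())  # contribution effectively zeroed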

The direct payoff appears in long-context extrapolation. On the RULER long-context benchmark, the paper compares standard attention against gated attention; the latter degrades more gracefully beyond the training length (e.g., 32K).

Training stability

Gated attention also significantly reduces training loss spikes.

Loss spikes are a recurring problem at large training scale. Models in the 15B+ range occasionally experience sudden loss surges in mid-to-late training, which can cause divergence or force rolling back thousands of steps. Common workarounds include embedding scaling, QK norm, tighter gradient clipping, and lower learning rates.

The authors observe that with the SDPA output gate added, the training curves of both the 1.7B and 15B models become noticeably smoother, and positions where spikes would otherwise appear no longer show them. This permits training with larger learning rates and yields more predictable scaling behavior.

The paper attributes this to massive activations in the hidden states: the gate substantially reduces massive activations in the SDPA output, which in turn reduces BF16 numerical error during training and suppresses spike formation (referencing Budzinskiy et al. 2025). The paper further traces the massive activations back to layer-5 FFN outputs.
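One simple way to quantify massive activations, in the spirit of this line of work but not necessarily the paper's exact metric, is the ratio of the largest activation magnitude to a typical one:

import torch

def massive_activation_ratio(h):
    """h: hidden states [B, T, hidden_dim]; max |activation| over median |activation|."""
    a = h.abs()
    return (a.max() / a.median()).item()

h = torch.randn(2, 64, 2048)
h[0, 0, 123] = 900.0  # inject a single outlier for illustration
print(f"ratio: {massive_activation_ratio(h):.0f}x")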

A few details

Why head-specific is necessary. A head-shared gate (one $W_{gate}$ shared across heads) is significantly weaker than head-specific gating. The authors attribute this to the fact that heads learn distinct semantic features; a uniform modulation would force their sparsity patterns to synchronize and erase head diversity. Head-specific gating adds only a small parameter overhead (one extra head_dim × hidden_dim worth of parameters per layer), making the trade-off favorable.
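The overhead is easy to check with typical shapes (the numbers below are illustrative, assuming hidden_dim = H × head_dim):

H, D = 16, 128  # heads, head_dim -> hidden_dim = 2048
hidden_dim = H * D

head_specific = H * D * D  # one D x D gate matrix per head
head_shared = D * D        # a single D x D matrix reused by all heads
print(head_specific)       # 262144 == head_dim * hidden_dim, as stated above
print(head_specific / (3 * hidden_dim * hidden_dim))  # vs. QKV projections: ~0.02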

Sigmoid beats SiLU. The paper mainly compares sigmoid and SiLU; SiLU comes close but is slightly weaker (marginally worse PPL, MMLU, and GSM8K). Sigmoid's advantage lies in its $[0, 1]$ output range and its natural tendency toward sparsity: many gate values are pushed near 0, realizing a hard filter.
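The range difference is easy to see numerically (a quick check, not from the paper): sigmoid saturates inside $[0, 1]$, while SiLU is unbounded above and dips slightly negative.

import torch
import torch.nn.functional as F

x = torch.linspace(-6, 6, 7)
print(torch.sigmoid(x))  # all values in (0, 1), saturating at both ends
print(F.silu(x))         # unbounded above (silu(6) ~= 5.99); global min ~= -0.28 near x ~= -1.28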

Multiplicative beats additive. Switching the gate to addition ($\text{out} = \text{attn} + \text{gate}$) reduces the gain. The paper's stated reason: the additive form at $G_1$ still passes its input through a SiLU and thus gains some non-linearity, but the benefit is smaller than with the multiplicative form.

Relation to NSA and SwiGLU. NSA already implicitly uses gating, but there it is coupled with the sparse-attention design and not separately ablated. SwiGLU in the FFN is also a form of gating, but it acts on the channel dimension; the gate in this paper sits inside attention and modulates each head's output per token. The two positions do not conflict, and the paper confirms that adding gated attention on top of a SwiGLU model still yields gains.
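For contrast, here is a minimal SwiGLU FFN sketch (the standard formulation; dimension names are assumed), whose gate multiplies along the channel dimension:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, hidden_dim, ffn_dim):
        super().__init__()
        self.w_gate = nn.Linear(hidden_dim, ffn_dim, bias=False)
        self.w_up = nn.Linear(hidden_dim, ffn_dim, bias=False)
        self.w_down = nn.Linear(ffn_dim, hidden_dim, bias=False)

    def forward(self, x):  # x: [B, T, hidden_dim]
        # Channel-wise gate: silu(x W_gate) multiplies (x W_up) elementwise
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))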

Mitigation vs elimination of attention sink. The paper uses different wording in different places: the abstract says “mitigates”, while the introduction, Section 5.2, and conclusion say “eliminates”. The BOS attention proportion drops from 46.7% to 4.8% (and at layer 21 from 83% to 4%) — the effect is close to elimination in practice, though not strictly zero.

Summary

Gated Attention adds a head-specific, elementwise sigmoid gate after the Scaled Dot-Product Attention output, with about ten lines of code changed. Across a systematic comparison of 30 variants, the configuration achieves a substantial PPL drop, a sharp reduction in training loss spikes, improved long-context extrapolation, and a dramatic weakening of attention sinks.

The methodological contribution is to isolate gating from the many confounding designs in prior work and evaluate its position and form independently. Switch Heads attributed gains to routing and NSA attributed gains to sparse-attention design; the experiments here show that gating itself contributes a substantial portion of the gain in both cases.