ICLR 2025 Outstanding Paper: Shallow Safety Alignment Only Modifies the First Few Tokens

Shallow Safety Alignment (Safety Alignment Should Be Made More Than Just a Few Tokens Deep, 2024, Princeton & Google DeepMind, ICLR 2025 Outstanding Paper) makes one core argument: the safety alignment of current LLMs primarily modifies the generative distribution of only the first few output tokens. The paper names this phenomenon shallow safety alignment, and argues it is the common root cause behind multiple jailbreaks: prefilling, adversarial suffix attacks, decoding-parameter attacks, and fine-tuning attacks.

The authors locate the phenomenon through three experiments — per-token KL divergence, prefix-prefilling, and per-token fine-tuning dynamics — and verify that “deepening alignment” mitigates multiple attacks via a simple data augmentation and a constrained fine-tuning loss. Experiments are conducted on Llama-2-7B-Chat and Gemma-1.1-7B-IT, evaluated on HEx-PHI, AdvBench, and MaliciousInstruct.

Paper: arxiv.org/abs/2406.05946

The phenomenon

Shallow safety alignment: after alignment, a model’s generative distribution differs from its base counterpart almost entirely in the first few output tokens, and barely at deeper positions.

Why this happens: mainstream alignment pipelines (SFT, RLHF, etc.) do not explicitly model the “depth” of alignment. In SFT, human annotators rarely write the unnatural pattern “produce a harmful prefix, then switch to refusal”; in RLHF, once a model learns to open with a refusal prefix, its sampled responses rarely go off track afterward, so the reward signal exerts almost no pressure on deeper positions. The result is a shortcut: alignment suppresses harmful distributions only over the first few tokens.

The paper reports that on the 330 harmful instructions in HEx-PHI, 96.1% of Llama-2-7B-Chat’s responses begin with “I cannot” or “I apologize”, and 96.7% of Gemma-1.1-7B-IT’s responses begin with “I am unable”. This kind of fixed-phrase opener is called a refusal prefix, the typical opening of an aligned model when faced with a harmful request. The high degree of templating is not a bug; it is direct evidence of how shallow alignment works.

Three pieces of evidence

Refusal-prefix shortcut. Forcing a base model (no alignment) to start decoding with a refusal prefix substantially lowers its harmfulness, bringing it close to an aligned model. On HEx-PHI, Llama-2-7B base has a 68.6% harmfulness rate without prefix, dropping to 16.4% when prefilled with “I cannot” and to 2.1% with “I apologize”. Gemma-7B base drops from 85.4% to 8.7% (“I cannot”) and 1.0% (“I apologize, but I cannot”). The refusal-prefix pattern itself is already learned during pretraining; alignment only needs to push the base model onto that track to look “safe”.
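
To make the mechanic concrete, here is a minimal prefilling sketch using Hugging Face transformers; the model name, prompt template, and prefix string are illustrative, not taken from the paper’s code. The same mechanic, with a harmful prefix instead of a refusal, is what the prefilling attacks discussed later exploit.

```python
# Minimal prefix-prefilling sketch (assumes Hugging Face transformers; names illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # base (unaligned) model, as in the experiment
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

def generate_with_prefill(prompt: str, prefill: str, max_new_tokens: int = 128) -> str:
    # Tokenize prompt + forced response prefix together; decoding then
    # continues *after* the prefilled tokens rather than from scratch.
    ids = tok(prompt + prefill, return_tensors="pt").input_ids.to(model.device)
    out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=False)
    return prefill + tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)

# Forcing the base model onto the refusal track:
print(generate_with_prefill("### Instruction: <harmful request>\n### Response: ", "I cannot"))
```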

KL divergence between aligned and base concentrates in the first few tokens. The authors construct a Harmful HEx-PHI dataset (the 330 harmful instructions paired with harmful answers generated by a jailbroken GPT-3.5) and compute, per token, $D_{KL}(\pi_{aligned}(\cdot|x, y_{<k}) \| \pi_{base}(\cdot|x, y_{<k}))$. For both Llama and Gemma, KL is significantly higher at the first few token positions than later. Almost all of alignment’s “KL budget” is spent on the first few tokens.

Figure 1: per-token KL divergence between aligned and base models on Harmful HEx-PHI, concentrated at the first few tokens
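
A sketch of this measurement, assuming the aligned and base models and tokenizer are loaded as in the earlier prefilling sketch; the full-vocabulary KL is computed at every response position in one forward pass per model.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def per_token_kl(aligned, base, tok, prompt: str, response: str) -> torch.Tensor:
    """KL(pi_aligned || pi_base) at each response position k, conditioned on (x, y_<k)."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tok(prompt + response, return_tensors="pt").input_ids
    logp_a = F.log_softmax(aligned(ids).logits.float(), dim=-1)
    logp_b = F.log_softmax(base(ids).logits.float(), dim=-1)
    kl = (logp_a.exp() * (logp_a - logp_b)).sum(-1)  # [1, seq_len]
    # Logits at index i predict token i+1, so this slice yields one KL
    # value per response token, from the first to the last.
    return kl[0, prompt_len - 1 : ids.shape[1] - 1]
```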

Per-token dynamics under fine-tuning attacks. Fine-tuning Llama-2-7B-Chat on 100 (harmful instruction, harmful answer) pairs raises ASR from an initial 1.5% to 87.9% in 6 gradient steps. Per-token analysis shows that initial cross-entropy loss is much higher at the first few tokens, gradient norms are correspondingly larger there, and the KL divergence between the fine-tuned model and the original aligned model also peaks at the earliest positions. This matches the picture exactly: shallow alignment puts all its effort on the first few tokens, and fine-tuning attacks dismantle that part first.

Figure 3: under fine-tuning attacks, per-token cross-entropy loss, gradient norm, and KL divergence all concentrate at the first few tokens
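
The per-token loss panel of Figure 3 can be reproduced with the same machinery; a sketch follows (per-position gradient norms need one backward pass per token and are omitted here).

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def per_token_ce(model, tok, prompt: str, response: str) -> torch.Tensor:
    """Cross-entropy of `model` at each response token, given the prompt."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tok(prompt + response, return_tensors="pt").input_ids
    logits = model(ids).logits.float()
    # Shift by one: logits at position i predict the token at position i + 1.
    ce = F.cross_entropy(logits[0, :-1], ids[0, 1:], reduction="none")
    return ce[prompt_len - 1 :]  # one loss value per response token
```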

Shallow alignment explains multiple attacks

Using the phenomenon as a unifying frame, the paper attributes four classes of known jailbreaks to shallow safety alignment:

Prefilling attacks: seed the response with a few harmful tokens before decoding continues, bypassing the refusal prefix (the same mechanic as the prefill sketch above, with a harmful prefix instead of a refusal). Figure 2 shows that as the number of prefilled tokens increases from 0 to 40, the ASR of aligned models on Harmful HEx-PHI rises from near zero to over 50%. Anthropic’s Claude API already exposes a prefilling interface (the paper notes its stated purpose is “better steerability”), and with open-weight models the attacker controls decoding entirely.

Adversarial suffix attacks (e.g., GCG): optimize a string appended to the prompt to force the model to start with an affirmative prefix. The optimization objective is typically to maximize the probability of an affirmative prefix like “Sure, here is…”, which is in effect indirect prefilling.
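
For concreteness, a sketch of that objective only, not the full GCG search (which backpropagates this loss to one-hot token embeddings and greedily swaps suffix tokens):

```python
import torch
import torch.nn.functional as F

def suffix_loss(model, prompt_ids, suffix_ids, target_ids):
    """-log p(target | prompt + suffix): what the attacker minimizes over the suffix."""
    ids = torch.cat([prompt_ids, suffix_ids, target_ids], dim=1)
    logits = model(ids).logits.float()
    ctx_len = prompt_ids.shape[1] + suffix_ids.shape[1]
    # Logits at positions ctx_len-1 .. end-2 predict the target tokens,
    # e.g. the affirmative prefix "Sure, here is ...".
    return F.cross_entropy(logits[0, ctx_len - 1 : -1], target_ids[0])
```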

Decoding-parameter attacks: just changing temperature, top-k, top-p and resampling repeatedly is enough to elicit harmful responses. Same reason: once a sampling trajectory deviates from the refusal prefix, nothing constrains what follows.
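
A sketch of such an attack loop, assuming a hypothetical `judge` harmfulness classifier and an illustrative grid of decoding settings:

```python
import itertools

def exploit_decoding(model, tok, prompt: str, judge, n_samples: int = 5):
    ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
    # Sweep decoding settings and resample; top_k=0 disables top-k filtering.
    for temp, top_k, top_p in itertools.product([0.7, 1.0, 1.5], [0, 50, 500], [0.9, 1.0]):
        outs = model.generate(ids, do_sample=True, temperature=temp, top_k=top_k,
                              top_p=top_p, max_new_tokens=128,
                              num_return_sequences=n_samples)
        for out in outs:
            text = tok.decode(out[ids.shape[1]:], skip_special_tokens=True)
            if judge(text):  # `judge` is an assumed harmfulness classifier
                return text  # one sample off the refusal track is enough
    return None
```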

Fine-tuning attacks: a few gradient steps suffice to break alignment, rooted in the large early-token loss and gradients.

Deepening alignment with data augmentation

The first proposed remedy is safety recovery data, in the form of triples $(x, h, r)$: $x$ is a harmful instruction, $h$ is a harmful response, and $r$ is a refusal response. Beyond the standard $\pi(r|x)$, training augments $\pi(r|x, h_{\leq k})$: with probability 50% the example is the plain refusal ($k = 0$); otherwise $k$ is sampled uniformly from $[1, 100]$. Semantically, this teaches the model: “even if the first $k$ tokens have already gone off the rails, switch back to refusal at token $k+1$.” Such examples are not coherent natural language (e.g., “Step 1: Gather phosphorus I cannot fulfill your request…”), but they push alignment’s influence deeper.
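
A minimal sketch of constructing one such example, assuming token-id lists for $x$, $h$, $r$; masking the harmful prefix out of the loss is an implementation choice consistent with training $\pi(r|x, h_{\leq k})$, not necessarily the paper’s exact code:

```python
import random

def make_recovery_example(x_ids, h_ids, r_ids, max_k=100):
    """Build one (x, h_<=k, r) training sequence with the paper's k-sampling."""
    k = 0 if random.random() < 0.5 else random.randint(1, max_k)
    k = min(k, len(h_ids))  # guard against harmful responses shorter than k
    input_ids = x_ids + h_ids[:k] + r_ids
    # Loss only on the recovery refusal r: prompt and harmful prefix are
    # labeled -100, which PyTorch's cross-entropy ignores.
    labels = [-100] * (len(x_ids) + k) + list(r_ids)
    return input_ids, labels
```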

Implementation: continue fine-tuning the already-aligned Llama-2-7B-Chat with 256 $(x, h, r)$ triples forming $D_H$. To avoid utility regression, benign instructions from Alpaca are paired with responses distilled from the original Llama-2-7B-Chat to form $D_B$; the two datasets are combined in the fine-tuning objective with a weight $\alpha = 0.2$. The resulting model is called Llama2-7B-Chat-Augmented.
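
For concreteness, one plausible reading of the mixed objective (assuming $\alpha$ scales the benign term; this is an interpretation of the setup, not the paper’s exact equation):

$$ \min_\theta \; \mathbb{E}_{(x, h, r) \sim D_H}\big[-\log \pi_\theta(r \mid x, h_{\leq k})\big] + \alpha \, \mathbb{E}_{(x, y) \sim D_B}\big[-\log \pi_\theta(y \mid x)\big] $$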

Effect (Table 2, vs. the original Llama-2-7B-Chat):

  • Prefilling attacks (5/10/20/40 tokens): ASR drops from 42.1% / 51.5% / 56.1% / 57.0% to 2.8% / 2.9% / 3.4% / 4.5%
  • GCG attack: ASR drops from 36.5% to 18.4% on HEx-PHI; from 65.6% to 19.0% on AdvBench
  • Decoding-parameter attack: from 54.9% to 11.3% on HEx-PHI; from 84.3% to 1.0% on MaliciousInstruct

Utility-wise, AlpacaEval winrate slightly decreases from 51.8% to 49.5%. The paper notes that durability against fine-tuning attacks improves but remains limited (Appendix C).

Protecting initial tokens with a constrained fine-tuning loss

For fine-tuning attacks, the paper designs a token-wise constrained loss:

$$ \min_\theta \mathbb{E}_{(x,y)\sim D} \left[ -\sum_{t=1}^{|y|} \frac{2}{\beta_t} \log \sigma\left(\beta_t \log \frac{\pi_\theta(y_t|x, y_{<t})}{\pi_{aligned}(y_t|x, y_{<t})}\right) \right] $$

Here $\sigma$ is the sigmoid and $\beta_t$ controls the strength of the penalty at each token position for deviating from the initial aligned model; larger $\beta_t$ pulls the generative distribution at that position back toward $\pi_{aligned}$. The interpretation: as $\beta_t \to 0$, the term reduces to standard cross-entropy; for larger $\beta_t$, the per-token gradient is the cross-entropy gradient scaled by an adaptive weight $w_t = 2\,\sigma\!\left(-\beta_t \log \frac{\pi_\theta(y_t|x, y_{<t})}{\pi_{aligned}(y_t|x, y_{<t})}\right)$, which decays toward zero once the fine-tuned model’s probability at that position rises above the aligned model’s, suppressing further drift.

In experiments, strong constraints are placed only on the first 5 token positions: $\beta_1 = 0.5$, $\beta_t = 2$ for $2 \leq t \leq 5$, and $\beta_t = 0.1$ for $t > 5$.
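
A minimal PyTorch sketch of this objective, assuming a frozen copy of the aligned model and prompts masked with $-100$ labels; batching details and the chat template are omitted:

```python
import torch
import torch.nn.functional as F

def constrained_loss(model, aligned, input_ids, labels):
    """Token-wise constrained SFT loss; `aligned` is the frozen initial model."""
    logp = F.log_softmax(model(input_ids).logits.float(), dim=-1)
    with torch.no_grad():
        logp_ref = F.log_softmax(aligned(input_ids).logits.float(), dim=-1)
    tgt = labels[:, 1:]                       # logits at i predict token i + 1
    mask = tgt != -100                        # loss only on response positions
    idx = tgt.clamp(min=0).unsqueeze(-1)
    # Per-token log-ratio log(pi_theta / pi_aligned) at the label tokens.
    ratio = (logp[:, :-1].gather(-1, idx) - logp_ref[:, :-1].gather(-1, idx)).squeeze(-1)
    # Position-dependent beta: strong constraint on the first 5 response tokens.
    t = torch.cumsum(mask.long(), dim=1)      # 1-based response-token position
    beta = torch.full_like(ratio, 0.1)
    beta = beta.masked_fill((t >= 2) & (t <= 5), 2.0).masked_fill(t == 1, 0.5)
    per_tok = -(2.0 / beta) * F.logsigmoid(beta * ratio)
    return (per_tok * mask).sum() / mask.sum()
```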

Effect (Table 3, Llama-2-7B-Chat):

  • Harmful Examples fine-tuning attack: ASR 88.9% (standard SFT) → 4.6% (constrained SFT)
  • Identity Shifting: 79.5% → 8.1%
  • Backdoor Poisoning (with trigger): 90.9% → 10.9%

On normal downstream fine-tuning (Samsum, SQL Create Context, GSM8k), utility stays close to standard SFT (Samsum ROUGE-1: 50.1 vs. 51.7; SQL: 98.5 vs. 99.1; GSM8k: 37.4 vs. 41.7), while the post-fine-tuning ASR is lower than with standard SFT (GSM8k: 3.3% → 2.0%; Samsum: 23.4% → 3.2%; SQL: 15.4% → 3.2%). The same conclusion holds for Gemma-1.1-7B-IT.

A few details

Relation to the Superficial Alignment Hypothesis. Zhou et al.’s SAH posits that alignment mainly adjusts style and input/output format on top of capabilities learned in pretraining. Shallow safety alignment is closely related to SAH in the safety setting, but grounds it in specific attacks and mitigation mechanisms. Lin et al.’s URIAL also observed that the difference between aligned and base models fades along token positions, but did not connect this to specific attacks.

Why the refusal prefix works on base models too. The paper attributes this to a pretraining-stage language prior: continuing a refusal prefix like “I cannot fulfill” with refusal content is a natural pattern in language, already learned during pretraining. Alignment largely exploits this existing pattern rather than teaching refusal from scratch.

Residual risks. The data-augmented deeper alignment is markedly more robust to prefilling, GCG, and decoding attacks, but remains vulnerable to adversarial fine-tuning attacks; the constrained SFT works against common fine-tuning attacks. The paper acknowledges that the robustness of both methods against future adaptive attacks needs further research — both are “better”, not “complete defense”.

Boundary the paper sets for itself. Section 5, “Other Notions of Safety Depth”, notes that “depth” has more dimensions than just token-wise depth — for instance, a model’s ability to retain safety after adaptation is another kind of depth. This paper focuses on token-based depth.

Summary

The paper provides a unifying mechanistic explanation for many existing jailbreak phenomena: current SFT/RLHF alignment mostly modifies the distribution of only the first few tokens, while later tokens stay nearly identical to the base model. Prefilling, adversarial suffix, and decoding-parameter attacks all bypass this few-token-thick defense, while fine-tuning attacks dismantle it directly. The two proposed remedies (safety recovery augmentation and a token-wise constrained fine-tuning loss) are initial strategies, but they point in a clear conceptual direction: alignment should explicitly model “depth” rather than defaulting to suppressing the prefix.