ACL 2025 Best Paper: LLMs Resist Alignment via Elasticity

Published: 2026-05-27 | Category: NLP | Read: ≈ 6 min

A model safety-aligned with SFT/RLHF can be turned back into a harmful one with a few hundred examples; even a benign customer-service SFT round can quietly lower the refusal rate. Why is alignment this brittle?

One of the ACL 2025 best papers, Language Models Resist Alignment: Evidence From Data Compression, attributes it to elasticity: alignment fine-tuning does not really rewrite the model’s internal representations; it only nudges the output distribution temporarily away from the pre-training distribution. Under a reverse fine-tune, the model rebounds toward pre-training much faster than it moved toward alignment. Treating an LM as a lossless compressor, the paper derives that, under perturbation, the change in compression rate on a dataset is inversely proportional to that dataset’s size; alignment data is orders of magnitude smaller than the pre-training corpus, so its hold on the model is correspondingly weaker.

Paper: arxiv.org/abs/2406.06144

The brittleness phenomenon

In theory, alignment fine-tuning (SFT/RLHF) flips a model from “willing to emit harmful content” to “refuses”, acting as a one-time safety rail. In practice we keep observing the rollback described above, which suggests that alignment is only a surface fit that leaves the deeper representations largely unchanged.

The paper formalizes undoing alignment as inverse alignment: given a base $p_{\theta_0}$, alignment data $D_a$ produces $p_{\theta_1}$. If there exists another dataset $D_b$ with $|D_b| \ll |D_a|$ such that a simple operator applied with $D_b$ pulls $p_{\theta_1}$ back into a neighborhood of $p_{\theta_0}$, alignment on this model is reversible. The paper proposes that undoing alignment with a perturbation much smaller than the alignment set itself is a structural consequence of the data scale gap, not necessarily a flaw in the alignment algorithm.

A unified metric from the compression view

To compare how well a model fits different datasets, we need a metric that is meaningful across datasets. The paper uses the language-modeling-is-compression equivalence: minimizing cross-entropy loss is equivalent to minimizing expected code length under arithmetic coding, so a language model can be viewed as a lossless compression protocol. The compression rate (compressed length / raw length) is the proxy for loss.
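The equivalence can be made concrete with a minimal sketch. It assumes we already have the probability the model assigned to each token of a text; the function names and the toy probabilities below are illustrative and not taken from the paper. The ideal code length ignores the small constant overhead a real arithmetic coder adds.

```python
import math

def code_length_bits(token_probs):
    """Ideal arithmetic-coding length: sum of -log2 p over the tokens."""
    return sum(-math.log2(p) for p in token_probs)

def compression_rate(token_probs, raw_bytes):
    """Compressed length divided by raw length, both measured in bits."""
    return code_length_bits(token_probs) / (8 * raw_bytes)

# Hypothetical per-token probabilities for a 16-byte string.
probs = [0.5, 0.25, 0.25, 0.125]
bits = code_length_bits(probs)                 # 1 + 2 + 2 + 3 bits
rate = compression_rate(probs, raw_bytes=16)   # 8 bits / 128 bits

# Mean cross-entropy in nats; converting to bits recovers the code length.
mean_ce_nats = sum(-math.log(p) for p in probs) / len(probs)
bits_from_ce = mean_ce_nats * len(probs) / math.log(2)
```

Because the code length is just the summed cross-entropy in bits, lowering the loss on a dataset and lowering its compression rate are the same thing up to a fixed factor.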

A dataset $D$ is encoded as a token tree $T_D$: each edge is a token, each root-to-leaf path is a complete sequence, and the leaf weight is the sequence probability. Due to parameter limits, the model only learns the first $d$ layers of the tree precisely (the assumption is that learnable depth grows monotonically with model scale); the rest is handled by Huffman coding on the pruned tree. Under this setup the compression rate $\gamma^{D_i}_{p_\theta}$ of model $p_\theta$ on dataset $D_i$ is well-defined, and the normalized rate $\gamma^{D_i/D}_{p_\theta}$ allows cross-dataset comparison.
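As a rough illustration of the Huffman step, the sketch below assigns code lengths to the leaves of a tiny, made-up token tree whose leaf weights are sequence probabilities. It is a generic Huffman construction, not the paper’s exact pruned-tree procedure; the leaf names and weights are hypothetical.

```python
import heapq
from itertools import count

def huffman_code_lengths(weights):
    """Return {symbol: code length in bits} for a dict of symbol -> probability."""
    if len(weights) == 1:
        return {sym: 1 for sym in weights}
    tie = count()  # unique tie-breaker so the heap never compares dicts
    heap = [(w, next(tie), {sym: 0}) for sym, w in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; every symbol inside goes one level deeper.
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next(tie), merged))
    return heap[0][2]

# Hypothetical leaves: each key is a root-to-leaf token path, each value its probability.
leaves = {"a b": 0.5, "a c": 0.25, "b a": 0.125, "b c": 0.125}
lengths = huffman_code_lengths(leaves)
expected_bits = sum(leaves[s] * lengths[s] for s in leaves)
```

With these dyadic weights the expected code length equals the entropy of the leaf distribution (1.75 bits), which is the best any lossless code can do.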

Inverse proportionality under perturbation

Putting pre-training data $D_p$, alignment data $D_a$, and perturbation data $D_t$ into the same token tree and compressing jointly, the paper’s elasticity theorem says: as $|D_t|$ varies, the normalized compression rates on $D_p$ and $D_a$ move in opposite directions and are inversely proportional to the respective dataset sizes,

$$\frac{d\gamma^{D_a/D}_{p_\theta}}{d l} = \Theta\!\left(-k \frac{d\gamma^{D_p/D}_{p_\theta}}{d l}\right), \quad l = \frac{|D_t|}{|D_a|}, \quad k = \frac{|D_p|}{|D_a|}$$

Pre-training corpora are typically several orders of magnitude larger than alignment data ($k \gg 1$), so the same perturbation magnitude applied to both distributions produces a $k$-fold larger swing on the alignment side, and the alignment fit is lost much faster.

The paper gives a Hooke’s-law analogy: treat the deviation on the two distributions as two springs in series, with dataset size as the spring constant. The pre-training set is the stiff spring, alignment is the soft spring; under an external force, the soft one deforms first.

This is more than an analogy. Integrating the elasticity theorem over $l$, $\Delta \gamma^{D_i/D}_{p_\theta}$ is equivalent to the change in KL divergence $\Delta D_{\mathrm{KL}}(P_{p_\theta} \,\|\, P_{D_i})$; substituting into the series-spring form $F \propto k \cdot \Delta l$ (where $k$ is the spring constant and $\Delta l$ the elongation, not the ratios defined in the theorem) gives the quantitative relation,

$$F \propto |D_i| \cdot \Delta D_{\mathrm{KL}}(P_{p_\theta} \,\|\, P_{D_i})$$

The restoring force on each distribution is dataset size times KL deformation. $|D_i|$ maps to spring constant, $\Delta D_{\mathrm{KL}}$ maps to deformation; stiff vs soft springs correspond mathematically to the different abilities of differently sized datasets to absorb perturbation.
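The series-spring picture can be checked with a few lines of arithmetic. The sketch below assumes ideal springs in series (both carry the same force) and uses made-up token counts as spring constants; none of the numbers come from the paper.

```python
def series_spring_deformations(force, stiffness):
    """Springs in series carry the same force, so each deforms by force / stiffness."""
    return {name: force / k for name, k in stiffness.items()}

# Hypothetical dataset sizes (in tokens) standing in for spring constants.
sizes = {"pretrain": 1e12, "align": 1e7}
deform = series_spring_deformations(force=1.0, stiffness=sizes)

# The soft (alignment) spring stretches |D_p| / |D_a| times more than the stiff one.
ratio = deform["align"] / deform["pretrain"]
size_ratio = sizes["pretrain"] / sizes["align"]
```

Under these assumptions the deformation ratio equals the size ratio, which is the same factor $|D_p|/|D_a|$ that appears in the elasticity theorem.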

Resistance: forward vs reverse training loss

To verify that the model prefers to keep the pre-training distribution, the paper sets up a controlled experiment. Fine-tune the base model $\theta_0$ for one epoch on an SFT dataset, saving checkpoints $\theta_1, \theta_2, \theta_3$ along the way (each further from pre-training). For checkpoint indices $k < l$ (unrelated to the ratios $k$ and $l$ in the theorem), use $\theta_k$ and $\theta_l$ to generate responses on held-out prompts, giving datasets $D_k$ and $D_l$.

Two paths: forward trains $\theta_k$ on $D_l$ (away from pre-training); reverse trains $\theta_l$ on $D_k$ (toward pre-training). If the model has no directional preference, the two training losses should be similar; if there is resistance, the reverse loss should be consistently lower. Across three datasets (Alpaca, TruthfulQA, Beavertails) and three models (Llama2-7B, Llama2-13B, Llama3-8B), all 27 paired comparisons show reverse loss below forward loss, without exception. The gaps range from sub-percent to about 0.1, but the absence of any counterexample in direction is itself the evidence of resistance.
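One plausible way to read the count of 27 is three datasets times three models times the three checkpoint pairs with $k < l$. The sketch below enumerates that grid and defines a hypothetical helper, `resistance_rate`, that reports the fraction of runs in which the reverse loss is lower. The loss values in `toy_losses` are placeholders for illustration, not results from the paper.

```python
from itertools import combinations, product

datasets = ["Alpaca", "TruthfulQA", "Beavertails"]
models = ["Llama2-7B", "Llama2-13B", "Llama3-8B"]
checkpoints = [1, 2, 3]

# Every (earlier, later) checkpoint pair: (1, 2), (1, 3), (2, 3).
pairs = list(combinations(checkpoints, 2))
runs = list(product(datasets, models, pairs))

def resistance_rate(losses):
    """Fraction of runs whose reverse loss is below the forward loss.

    `losses` maps a run to a (forward_loss, reverse_loss) tuple.
    """
    wins = sum(1 for fwd, rev in losses.values() if rev < fwd)
    return wins / len(losses)

# Made-up losses for two runs, only to show the call shape.
toy_losses = {runs[0]: (1.20, 1.10), runs[1]: (0.80, 0.79)}
toy_rate = resistance_rate(toy_losses)
```

A rate of 1.0 over all runs corresponds to the "no counterexample" outcome described above.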

Rebound: alignment depth vs reverse decay rate

Resistance only explains parameter inertia. But aligned models get pulled back by reverse fine-tuning disproportionately fast, and that is a separate phenomenon: rebound.

The setup: SFT a series of checkpoints with varying amounts of forward data (IMDb positive reviews, Beavertails safe conversations), so that more forward data yields a higher initial forward score. Then SFT each checkpoint with a small amount of reverse data and watch the score-decay curve. The question is whether the same reverse perturbation produces different decay rates on models at different alignment depths.

Result: the deeper the initial forward training, the faster the reverse decay. This holds on both Llama2-7B and Gemma-2B. It matches the elasticity theorem’s prediction: training further forward means being further from pre-training, a larger elastic force, and a faster snap-back when reversed; the decay only slows once the model re-enters the pre-training neighborhood.
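To make "faster decay" measurable, one simple choice is the average score drop per reverse example over the first interval of the curve. The sketch below uses that definition; the helper name `initial_decay_rate` and both score curves are invented for illustration and do not reproduce the paper’s numbers or its exact metric.

```python
def initial_decay_rate(reverse_sizes, scores):
    """Average score drop per reverse example over the first interval of the curve."""
    return (scores[0] - scores[1]) / (reverse_sizes[1] - reverse_sizes[0])

# Number of reverse examples used at each measurement point (hypothetical).
reverse_sizes = [0, 100, 200]

# Made-up forward scores after reverse fine-tuning, for two alignment depths.
shallow_curve = [0.60, 0.55, 0.52]   # little forward data
deep_curve = [0.90, 0.70, 0.58]      # much more forward data

shallow_rate = initial_decay_rate(reverse_sizes, shallow_curve)
deep_rate = initial_decay_rate(reverse_sizes, deep_curve)
rebounds_faster = deep_rate > shallow_rate
```

On these toy curves the deeply aligned checkpoint loses score about four times faster at the start, which is the qualitative pattern the rebound experiment looks for.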

Key detail in the paper’s figure: near the gray dashed line (labeled by the paper as the pre-training score), the curves slow markedly or flatten along it. This is the handoff between rebound and resistance: rebound dominates far from pre-training, resistance dominates inside the neighborhood.

Elasticity scales with size

If elasticity is structural, it should grow with model scale and pre-training data volume. Running Qwen 0.5B/4B/7B through the same rebound setup shows that the larger the model, the steeper the initial decay and the flatter the tail under reverse fine-tuning, i.e. stronger rebound.

For the pre-training data dimension, the TinyLlama public checkpoints at 2.0T / 2.5T / 3.0T tokens show the same trend: more pre-training data means stronger rebound. Both dimensions agree with the elasticity theorem: larger $k = |D_p|/|D_a|$ means stronger elasticity.

Engineering takeaway: alignment data growth cannot keep pace with pre-training token growth, so alignment brittleness scales up together with model size and pre-training volume.

Limitations

The core theoretical assumption is that the datasets follow a Pareto mass distribution; the actual shape is not empirically validated.

Empirically, elasticity is only observed on a handful of checkpoint slices; how it evolves over the full pre-training plus alignment lifecycle is not quantified. The paper also notes the hope of eventually tying elasticity to scaling laws and answering when it emerges and how its strength grows with scale.

Wrap-up

If this story holds, alignment cannot rely on a thin post-training layer of SFT/RLHF; it needs training-time constraints or a continuous anti-rollback mechanism. As pre-training tokens keep growing, the alignment-to-pre-training ratio only gets smaller.