ACL 2025 Best Paper: LLMs Resist Alignment via Elasticity
A model that has been safety-aligned through SFT or RLHF can be pushed back into harmful behavior with a few hundred examples; even a benign round of customer-service SFT can quietly lower its refusal rate. Why is alignment this brittle?
One of the ACL 2025 best papers, Language Models Resist Alignment: Evidence From Data Compression, attributes this brittleness to elasticity. Alignment fine-tuning does not really rewrite the model's internal representations; it only nudges the output distribution away from the pre-training distribution, and only temporarily. Under a reverse fine-tune, the model rebounds toward its pre-training distribution much faster than it originally moved toward alignment. The argument treats a language model as a lossless compressor: the paper derives that, when the model is perturbed, the change in its compression rate on a dataset is inversely proportional to that dataset's size. Alignment data is orders of magnitude smaller than the pre-training corpus, so the constraint it places on the model is correspondingly weaker.
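To make the compression view concrete, here is a minimal Python sketch. It is not the paper's derivation or code: it assumes a toy smoothed unigram model over a two-token vocabulary, and the function names (`fit_unigram`, `code_length_bits`, `compression_rate`) and the two toy corpora are hypothetical. It only illustrates how a probabilistic model's ideal code length defines a compression rate, and how a model fit on pooled data ends up compressing the much larger corpus well at the expense of the small one.

```python
import math
from collections import Counter

def fit_unigram(tokens, vocab, alpha=1.0):
    """Fit an add-alpha smoothed unigram model (a toy stand-in for an LM)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {t: (counts[t] + alpha) / total for t in vocab}

def code_length_bits(model, tokens):
    """Ideal lossless code length: the sum of -log2 p(token)."""
    return -sum(math.log2(model[t]) for t in tokens)

def compression_rate(model, tokens, vocab_size):
    """Compressed bits divided by the bits of a fixed-width raw encoding."""
    raw_bits = len(tokens) * math.log2(vocab_size)
    return code_length_bits(model, tokens) / raw_bits

vocab = ["a", "b"]
# Hypothetical corpora: a large "pre-training" set and a small "alignment"
# set with the opposite token preference.
pretrain = ["a"] * 9000 + ["b"] * 1000
align = ["b"] * 90 + ["a"] * 10

# One model fit on the pooled data; the large corpus dominates the counts.
pooled = fit_unigram(pretrain + align, vocab)
rate_pre = compression_rate(pooled, pretrain, len(vocab))
rate_align = compression_rate(pooled, align, len(vocab))

print(f"compression rate on pre-training data: {rate_pre:.3f}")
print(f"compression rate on alignment data:    {rate_align:.3f}")
```

In this toy setting the pooled model compresses the large corpus to below its raw size, while the small corpus comes out larger than its raw encoding. This matches the intuition in the text that the smaller dataset exerts the weaker pull, although the paper's actual result concerns how compression rates change under fine-tuning, which this sketch does not model.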