Finisky Garden

NLP, Software Engineering, Product Design

A safety-aligned model from SFT/RLHF can be turned back into a harmful one with a few hundred examples; even a benign customer-service SFT round will quietly drop the refusal rate. Why is alignment this brittle?

One of the ACL 2025 best papers, Language Models Resist Alignment: Evidence From Data Compression, attributes it to elasticity: alignment fine-tuning does not really rewrite the model’s internal representations; it only nudges the output distribution away from the pre-training distribution temporarily. Under a reverse fine-tune, the model rebounds toward pre-training much faster than it moved toward alignment. Treating an LM as a lossless compressor, the paper derives that the change in compression rate is inversely proportional to dataset size. Alignment data is orders of magnitude smaller than the pre-training corpus, so the constraint it imposes is naturally weaker.

After DeepSeek-R1, RLVR (Reinforcement Learning with Verifiable Rewards) has more or less become the standard recipe for “growing reasoning ability out of a small model on its own,” and the ever-rising pass@1 curves make it easy to feel that RL keeps teaching the model new tricks. The Tsinghua LeapLab and SJTU paper Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? (NeurIPS 2025 Oral; also ICML 2025 AI4MATH workshop best paper) asks a direct question: is RLVR adding new reasoning ability to the base model, or is it just sampling its existing reasoning paths more reliably?

The authors’ core claim: an RLVR-trained model beats base on pass@k at small k, but once k is large enough (128 to 1024) base consistently overtakes it. Further coverage and perplexity analyses show that the reasoning paths RLVR outputs already live inside the base model’s sampling distribution. RLVR only sharpens the distribution onto the problems base can already solve; it does not add problems that base cannot solve. Only distillation actually expands the set of problems the model can solve.
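To make the crossover concrete, here is a minimal sketch using the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples and c the number of correct ones. Whether the paper uses exactly this estimator is an assumption on my part, and the per-problem counts are invented purely for illustration: the hypothetical base model solves both problems rarely, while the hypothetical RLVR model solves one very reliably and the other never.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over a list of per-problem correct counts."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

# Invented counts (correct samples out of 1024) for two problems.
N_SAMPLES = 1024
base_correct = [100, 4]   # rarely right, but never at zero
rlvr_correct = [900, 0]   # sharpened onto problem 1, problem 2 is lost

for k in (1, 256):
    base = mean_pass_at_k(base_correct, N_SAMPLES, k)
    rlvr = mean_pass_at_k(rlvr_correct, N_SAMPLES, k)
    print(f"k={k}: base={base:.3f} rlvr={rlvr:.3f}")
```

With these made-up counts the RLVR model wins at k=1 and loses at k=256, which is the shape of the curves the paper describes: a problem with zero correct samples contributes nothing no matter how large k gets.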

ANCE (Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval, 2020, Microsoft, ICLR 2021) is an older paper, but the standard practice of hard negative mining in dense retrieval training basically took shape here, and DPR variants, E5, and BGE all inherit the same framework. The core idea: maintain an ANN index throughout training, sample the negatives that the current model finds hardest to distinguish from the whole corpus, and refresh the index asynchronously on a fixed schedule.

The paper has two contributions. First, an importance-sampling variance analysis showing that the in-batch and BM25 negatives commonly used in dense retrieval training have near-zero gradient norms and are the convergence bottleneck. Second, a concrete recipe for sampling hard negatives from the full corpus via ANN, together with a solution to the engineering bottleneck of “the index has to stay in sync with the model.” The evaluation covers TREC 2019 DL, MS MARCO, NQ, and TQA. A BERT-Siamese backbone trained with ANCE reaches NDCG@10 0.628 (MaxP) on document retrieval and MRR@10 0.330 on passage retrieval, beats DPR on Top-20 Coverage on both NQ and TQA, and improves offline retrieval quality by roughly 14% to 15% in an 8B-corpus production setting.
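The sampling step can be sketched in a few lines. This is a toy version under stated assumptions, not the paper's implementation: a brute-force dot-product search stands in for the ANN index, the function name `mine_hard_negatives` and its parameters are hypothetical, and the vectors are random. In a real training loop the document vectors would come from a recent checkpoint and be refreshed periodically.

```python
import numpy as np

def mine_hard_negatives(query_vecs, doc_vecs, positives, top_k=5, n_neg=2, seed=0):
    """For each query, sample negatives from its top-ranked non-positive documents.

    Brute-force scoring is used here as a stand-in for an ANN index lookup.
    """
    rng = np.random.default_rng(seed)
    scores = query_vecs @ doc_vecs.T            # (num_queries, num_docs)
    ranked = np.argsort(-scores, axis=1)        # best-scoring documents first
    negatives = []
    for qi, order in enumerate(ranked):
        # Keep the top_k highest-scoring documents that are not labeled positive.
        candidates = [int(d) for d in order if int(d) not in positives[qi]][:top_k]
        picked = rng.choice(candidates, size=min(n_neg, len(candidates)), replace=False)
        negatives.append([int(d) for d in picked])
    return negatives

# Random toy embeddings; in practice these come from the current encoder.
rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(50, 8))
query_vecs = rng.normal(size=(3, 8))
positives = [{0}, {1, 2}, {3}]                  # hypothetical relevance labels

negs = mine_hard_negatives(query_vecs, doc_vecs, positives)
print(negs)
```

The contrast with in-batch negatives is that every sampled document here already scores highly under the current model, so it is one the model still confuses with the positive.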

ICML 2025 named 8 Outstanding Papers, and Roll the Dice & Look Before You Leap: Going Beyond the Creative Limits of Next-Token Prediction is one of them. The authors are from CMU and Google Research. The paper asks a blunt question: why do LLMs keep producing bland, repetitive output on open-ended tasks like writing puns, designing Olympiad problems, or brainstorming research ideas?

The authors’ core argument: on these tasks humans first commit to an abstract idea and then generate content around it; next-token prediction (NTP) cannot learn this pattern. To fix it, you first have to swap out the training objective so the model can actually learn that hidden idea, and then move the inference-time randomness from the output side to the input side, so the idea is not torn apart by per-position noise during sampling.

The core idea of NV-Embed (NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models, 2024, NVIDIA, ICLR 2025): start directly from Mistral 7B, drop the causal attention mask, attach a latent attention layer on top of the last hidden state for pooling, then run two-stage contrastive instruction tuning (retrieval data with in-batch negatives first, then a mix that adds non-retrieval data with in-batch negatives turned off). On the 56-task MTEB, NV-Embed-v1 averages 69.32, and v2 pushes the score to 72.31 with hard-negative mining, synthetic data, and example-based multi-class labeling; the two versions took the No.1 spot on MTEB in May and August 2024, respectively.

Shallow Safety Alignment (Safety Alignment Should Be Made More Than Just a Few Tokens Deep, 2024, Princeton & Google DeepMind, ICLR 2025 Outstanding Paper) makes one core argument: the safety alignment of current LLMs primarily modifies the generative distribution of only the first few output tokens. The paper names this phenomenon shallow safety alignment, and shows it explains the common cause behind multiple jailbreaks: prefilling, adversarial suffix attacks, decoding-parameter attacks, and fine-tuning attacks.

The authors locate the phenomenon through three experiments — per-token KL divergence, prefix-prefilling, and per-token fine-tuning dynamics — and verify that “deepening alignment” mitigates multiple attacks via a simple data augmentation and a constrained fine-tuning loss. Experiments are conducted on Llama-2-7B-Chat and Gemma-1.1-7B-IT, evaluated on HEx-PHI, AdvBench, and MaliciousInstruct.
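The per-token KL measurement is easy to sketch. The version below is a toy under stated assumptions: the logits are random rather than taken from Llama-2 or Gemma, the KL direction (aligned relative to base) is my assumption about the setup, and the "aligned" model is constructed by hand so that it differs from base only at the first two positions.

```python
import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def per_token_kl(logits_aligned, logits_base):
    """KL(aligned || base) at each position; inputs have shape (seq_len, vocab)."""
    lp = log_softmax(logits_aligned)
    lq = log_softmax(logits_base)
    return (np.exp(lp) * (lp - lq)).sum(axis=-1)

rng = np.random.default_rng(0)
seq_len, vocab = 6, 10
logits_base = rng.normal(size=(seq_len, vocab))

# Hand-built "shallow" alignment: perturb only the first two positions.
strength = np.array([4.0, 2.0, 0.0, 0.0, 0.0, 0.0])[:, None]
logits_aligned = logits_base + strength * rng.normal(size=(seq_len, vocab))

kl = per_token_kl(logits_aligned, logits_base)
print(np.round(kl, 3))
```

A profile like this, with the divergence concentrated at the first few positions and near zero afterwards, is what "shallow" means in the paper's sense.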

The core idea of Gated Attention (Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free, 2025, Qwen Team, NeurIPS 2025 Best Paper): insert a head-specific, elementwise sigmoid gate after the Scaled Dot-Product Attention output. Trained on 1.7B dense and 15B MoE (A2.54B activated) models for 3.5T tokens each, the method reduces PPL by roughly 0.05–0.27 (depending on model and setting), substantially suppresses training loss spikes, improves long-context extrapolation, and dramatically mitigates the attention sink (BOS token attention drops from 46.7% to 4.8%).

The authors systematically compare 5 candidate positions and several gate forms (granularity, parameter sharing, multiplicative/additive, activation function), totaling 30 variants. The above configuration is the best across all settings.
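A minimal NumPy sketch of that placement follows, for a single causal attention layer. It assumes the gate is computed from the layer input `x` through a hypothetical projection `Wg`; the weights are random and all names are illustrative, so treat it as a reading aid rather than the Qwen implementation.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention(x, Wq, Wk, Wv, Wo, n_heads, Wg=None):
    """Causal multi-head attention; if Wg is given, gate the SDPA output."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split_heads(t):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)       # causal mask
    out = softmax(scores) @ v                        # SDPA output per head
    if Wg is not None:
        # Head-specific, elementwise sigmoid gate (gate input x is an assumption).
        out = out * sigmoid(split_heads(x @ Wg))
    merged = out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ Wo

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 8, 2
x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv, Wo, Wg = (rng.normal(size=(d_model, d_model)) * 0.3 for _ in range(5))

gated = attention(x, Wq, Wk, Wv, Wo, n_heads, Wg=Wg)
plain = attention(x, Wq, Wk, Wv, Wo, n_heads)
print(gated.shape, plain.shape)
```

Because the gate sits between the softmax-weighted sum and the output projection, it adds a non-linearity there and lets each head scale its own output down toward zero.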

The core idea of MRL (Matryoshka Representation Learning, 2022): during training, force the first $m$ dimensions of the same $d$-dimensional vector (with $m$ chosen on a logarithmic ladder $\{8, 16, 32, \dots, d\}$) to independently carry the classification loss, producing a coarse-to-fine nested representation. At inference, just take the first $m$ dimensions according to the compute budget; accuracy matches a separately trained $m$-dim model.
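As a rough sketch of the nested objective, the snippet below sums one cross-entropy loss per prefix length. It assumes a separate linear head per prefix and equal loss weights, both of which are illustrative choices of mine, and it uses random features instead of a trained backbone.

```python
import numpy as np

def nested_dims(d, start=8):
    """Prefix lengths on a doubling ladder, always ending at d."""
    dims, m = [], start
    while m < d:
        dims.append(m)
        m *= 2
    dims.append(d)
    return dims

def cross_entropy(logits, labels):
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def matryoshka_loss(z, labels, heads):
    """Sum of classification losses, one per prefix z[:, :m]."""
    return sum(cross_entropy(z[:, :m] @ W, labels) for m, W in heads.items())

rng = np.random.default_rng(0)
batch, d, n_classes = 16, 64, 10
z = rng.normal(size=(batch, d))                  # stand-in for backbone features
labels = rng.integers(0, n_classes, size=batch)
heads = {m: rng.normal(size=(m, n_classes)) * 0.1 for m in nested_dims(d)}

print(nested_dims(d), float(matryoshka_loss(z, labels, heads)))

# At inference, a smaller budget simply means slicing the same vector.
z_small = z[:, :16]
```

Every prefix shares the same leading coordinates, so the gradient from the 8-dim loss pushes the most useful information into the first 8 dimensions, the 16-dim loss into the first 16, and so on.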

GritLM (Generative Representational Instruction Tuning, 2024) makes one core proposal: a single LLM handles both embedding and generation, distinguishing the two streams via the instruction format, trained jointly with a contrastive loss and a language modeling loss summed together. Where HyDE showed that “LLM handles relevance, encoder handles similarity” can be decoupled, GritLM goes the other way and merges them back into one model.

HyDE (Precise Zero-Shot Dense Retrieval without Relevance Labels, 2022) makes one core proposal: in zero-shot retrieval, instead of asking an unsupervised encoder to model query-document relevance directly, first let an LLM generate a hypothetical document for the query, then encode that document and use it for retrieval. The training objective of LLM2Vec-Gen is essentially this two-step pipeline internalized into the encoder.
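The two-step pipeline can be shown with a toy example. Everything here is a stand-in: `fake_llm` returns a canned answer instead of calling a real model, `encode` is a bag-of-words vector rather than an unsupervised dense encoder, and the three-document corpus is invented. The point is only the control flow of generating a hypothetical document, encoding it, and searching with that vector.

```python
import numpy as np

def encode(text, vocab):
    """Toy bag-of-words encoder (L2-normalized); stands in for a dense encoder."""
    tokens = text.lower().split()
    vec = np.array([tokens.count(w) for w in vocab], dtype=float)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def fake_llm(query):
    """Hypothetical stand-in for an LLM that writes a plausible answer document."""
    canned = {
        "why is the sky blue":
            "air molecules scatter short blue wavelengths of sunlight more strongly",
    }
    return canned.get(query, query)

def hyde_search(query, corpus, vocab, generate=fake_llm):
    """Encode a generated hypothetical document instead of the raw query."""
    hypothetical_doc = generate(query)
    q_vec = encode(hypothetical_doc, vocab)
    doc_vecs = np.stack([encode(d, vocab) for d in corpus])
    return int(np.argmax(doc_vecs @ q_vec))

corpus = [
    "rayleigh scattering means air molecules scatter blue wavelengths of sunlight",
    "the sky at night is dark and full of stars",
    "blue paint is made from pigments",
]
vocab = sorted({w for d in corpus for w in d.lower().split()})

query = "why is the sky blue"
direct_hit = hyde_search(query, corpus, vocab, generate=lambda q: q)  # raw query
hyde_hit = hyde_search(query, corpus, vocab)                          # via hypothetical doc
print(direct_hit, hyde_hit)
```

In this toy setup the raw query matches the night-sky document on surface words, while the hypothetical answer lands on the scattering document. The comparison becomes document-to-document similarity, which is what an unsupervised encoder handles best.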

Last year’s LLM2Vec showed that decoder-only LLMs can be turned into solid embedding models. This year the same McGill NLP group released LLM2Vec-Gen with the opposite intuition: an embedding should represent not the query itself, but the LLM’s potential response to that query.

A concrete example: a user types “how to commit fraud”. A standard embedder encodes the query and retrieves fraud-related passages. LLM2Vec-Gen encodes the response the model would have produced, “I’m sorry, but I can’t assist with that”, and retrieves refusal-style text instead. Safety alignment isn’t retrained at the embedding stage; it’s inherited directly from the generation side.

Another ICLR 2026 outstanding paper, from Microsoft Research and Salesforce Research. The title is direct: LLMs Get Lost In Multi-Turn Conversation. The conclusion is also direct: take any single-turn task, split the instruction across multiple turns, and 15 frontier LLMs (including GPT-4.1, Gemini 2.5 Pro, Claude 3.7 Sonnet, o3, DeepSeek-R1) all show degradation, with an average drop of 39%. The paper specifically emphasizes that the increase in unreliability has roughly the same magnitude across all models, regardless of scale, reasoning capability, or open versus closed weights.

What’s more interesting is the breakdown: aptitude only drops about 15%, while unreliability surges 112%. In other words, models in multi-turn settings aren’t dumber, they’re inconsistent. The same task run ten times produces best-case and worst-case results separated by 50 points.
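One way to compute such a breakdown from repeated runs is sketched below. It assumes aptitude is the 90th-percentile score and unreliability is the gap between the 90th and 10th percentiles; that is my reading of the decomposition and should be checked against the paper. The scores are invented.

```python
import numpy as np

def aptitude_and_unreliability(scores):
    """Return (90th percentile, 90th minus 10th percentile) of repeated-run scores."""
    p10, p90 = np.percentile(scores, [10, 90])
    return float(p90), float(p90 - p10)

# Invented scores from ten runs of the same task (0 to 100 scale).
single_turn = [90, 88, 92, 85, 91, 89, 87, 93, 90, 86]
multi_turn = [85, 40, 78, 35, 82, 60, 30, 80, 55, 75]

a_single, u_single = aptitude_and_unreliability(single_turn)
a_multi, u_multi = aptitude_and_unreliability(multi_turn)
print(f"single-turn: aptitude={a_single:.1f} unreliability={u_single:.1f}")
print(f"multi-turn:  aptitude={a_multi:.1f} unreliability={u_multi:.1f}")
```

In this made-up data the best runs are only slightly worse in the multi-turn setting, but the spread between good and bad runs is several times larger, which is the pattern described above.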

DSI (Differentiable Search Index) is one of the earlier representative papers on generative retrieval, published at NeurIPS 2022. Its core approach: encode the entire document corpus into the parameters of a Transformer, and at retrieval time, decode the document ID directly with a seq2seq model, removing the inverted index, vector store, and nearest-neighbor search as separate components.

The earlier GENRE (De Cao et al., 2020) had already used a seq2seq model to autoregressively decode Wikipedia entity titles, and DSI cites it as related work. DSI’s further contribution is extending the decoding target from semantically meaningful entity names to arbitrary forms of docid (including random integers and hierarchical semantic IDs), and systematically comparing document representations, ID representations, and training strategies. This reframes retrieval from a systems engineering problem into an end-to-end machine learning problem: indexing is training, retrieval is inference.

One of the two Outstanding Paper awards at ICLR 2026 went to a purely theoretical work: Transformers are Inherently Succinct. A paper with no experiments, no benchmarks, nothing but mathematical proofs, and it won best paper. The review committee’s rationale: “it offers a new perspective for explaining the power of the Transformer architecture.” The original paper is dense, drawing heavily on formal language theory and complexity theory. This post attempts to lay out the core results and construction ideas in a more intuitive way.

What is this “new perspective”? Past work compared expressiveness, i.e. which model can recognize a broader class of languages. This paper asks a different question: for the same language, which model can describe it using less space?

Embedding models have long been BERT’s territory. Semantic search, RAG, clustering — all dominated by encoder-only models. Decoder-only models like GPT and LLaMA crush generation tasks, but the community has assumed they’re not suitable for embeddings because causal attention only sees preceding tokens and can’t build complete sentence representations.

LLM2Vec (COLM 2024) argued this assumption is wrong. Three steps, no labeled data, no GPT-4 synthetic data, and any decoder-only LLM becomes a SOTA embedding model on MTEB. Two years later, the approach has been validated and adopted by many subsequent works — worth revisiting what it actually did.

There used to be a popular claim: human-generated data on the internet would be exhausted within a few years, and LLM training data was running out. Epoch AI predicted in 2022 that high-quality text data might be depleted around 2026. The discussion was intense at the time, but it seems to come up less often now.

The internet is indeed flooded with AI-generated content. Search for almost anything and the first few results likely include AI-written text. According to earlier fears, this AI-generated text would be crawled back as training data, models eating their own output, getting worse with each iteration — the so-called model collapse. A Nature paper rigorously demonstrated this: recursively training on a model’s own output causes tail distributions to gradually vanish, making outputs increasingly homogeneous.

I’ve written about Claude Code’s memory management, context compaction, RAG, security classifier, Edit tool, and sub-agent cache sharing — each time reading the source code line by line. But after reading this paper, I realized I’d been looking at individual parts without seeing the whole machine.

The paper is Dive into Claude Code: The Design Space of Today’s and Future AI Agent Systems, 46 pages, a complete architectural dissection of Claude Code from source code. Not a usage tutorial, not a benchmark evaluation, but an answer to an engineering question: what is a production AI agent system’s code actually doing?

RAG re-retrieves, re-assembles, and re-reasons on every query. Ask something that requires synthesizing five documents and the model has to find all five, stitch them together, and derive the answer from scratch. Ask ten times, retrieve ten times. Nothing accumulates.

Karpathy recently posted a gist called LLM Wiki proposing a different approach: instead of retrieving at query time, have the LLM pre-compile knowledge into a structured wiki and query the compiled result.

Dense retrieval has become the default first stage in RAG pipelines. Encode documents into vectors, encode queries into vectors, compute cosine similarity, done. But a basic question rarely gets asked: for a d-dimensional embedding, how many distinct top-k retrieval results can it actually represent?

An ICLR 2026 paper from Google DeepMind and JHU, “On the Theoretical Limitations of Embedding-Based Retrieval”, gives a mathematical answer: not enough. Not even close.
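The flavor of the limitation shows up already in the smallest possible case. The sketch below is a toy for d = 1 with dot-product scoring, not the paper's construction or bound: with scalar embeddings, a positive query always ranks the largest documents first and a negative query the smallest, so only two of the six possible top-2 sets are ever returned.

```python
from itertools import combinations
import numpy as np

def reachable_top_k_sets(doc_vecs, queries, k):
    """Collect every distinct top-k document set produced by the given queries."""
    reachable = set()
    for q in queries:
        scores = doc_vecs @ q                      # dot-product relevance
        top = np.argsort(-scores)[:k]
        reachable.add(frozenset(int(i) for i in top))
    return reachable

# Four documents embedded in one dimension.
doc_vecs = np.array([[1.0], [2.0], [3.0], [4.0]])
# Twenty evenly spaced query embeddings in [-1, 1]; none is exactly zero.
queries = np.linspace(-1.0, 1.0, 20).reshape(-1, 1)

reachable = reachable_top_k_sets(doc_vecs, queries, k=2)
all_pairs = len(list(combinations(range(len(doc_vecs)), 2)))
print(f"{len(reachable)} of {all_pairs} top-2 sets are reachable")
```

Higher dimensions make more combinations reachable, but the number of possible top-k sets grows combinatorially with corpus size, which is the gap the paper quantifies.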