Finisky Garden

NLP, Software Engineering, Product Design


The core idea of NV-Embed (NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models, 2024, NVIDIA, ICLR 2025): start directly from Mistral 7B, drop the causal attention mask, attach a latent attention layer on top of the last hidden state for pooling, then run two-stage contrastive instruction tuning, first on retrieval data with in-batch negatives, then on a mix that adds non-retrieval data with in-batch negatives turned off. On the 56-task MTEB, NV-Embed-v1 averages 69.32, and v2 pushes the score to 72.31 with hard-negative mining, synthetic data, and example-based multi-class labeling, taking the No.1 spot on MTEB in May and August 2024 respectively.
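The latent attention pooling is the least standard piece, so here is a minimal sketch of how I read it (names and defaults are mine, not NVIDIA's): the last-layer hidden states act as queries against a small trainable latent array serving as keys and values, the cross-attention output passes through an MLP, and the result is mean-pooled into one embedding.

```python
# Sketch of NV-Embed-style latent attention pooling (my reading of the paper,
# not the released code). Hidden states query a trainable latent "dictionary";
# the cross-attention output goes through an MLP and is mean-pooled.
import torch
import torch.nn as nn

class LatentAttentionPooling(nn.Module):
    def __init__(self, hidden_dim: int, num_latents: int = 512, num_heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, hidden_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, 4 * hidden_dim),
            nn.GELU(),
            nn.Linear(4 * hidden_dim, hidden_dim),
        )

    def forward(self, last_hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # last_hidden: (batch, seq, dim) from the decoder with the causal mask dropped
        kv = self.latents.unsqueeze(0).expand(last_hidden.size(0), -1, -1)
        out, _ = self.cross_attn(query=last_hidden, key=kv, value=kv)
        out = self.mlp(out)
        # Mean-pool over non-padding positions to get one embedding per sequence.
        m = attention_mask.unsqueeze(-1).float()
        return (out * m).sum(dim=1) / m.sum(dim=1).clamp(min=1)
```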

Shallow Safety Alignment (Safety Alignment Should Be Made More Than Just a Few Tokens Deep, 2024, Princeton & Google DeepMind, ICLR 2025 Outstanding Paper) makes one core argument: the safety alignment of current LLMs primarily modifies the generative distribution of only the first few output tokens. The paper names this phenomenon shallow safety alignment, and shows it explains the common cause behind multiple jailbreaks: prefilling, adversarial suffix attacks, decoding-parameter attacks, and fine-tuning attacks.

The authors locate the phenomenon through three experiments — per-token KL divergence, prefix-prefilling, and per-token fine-tuning dynamics — and verify that “deepening alignment” mitigates multiple attacks via a simple data augmentation and a constrained fine-tuning loss. Experiments are conducted on Llama-2-7B-Chat and Gemma-1.1-7B-IT, evaluated on HEx-PHI, AdvBench, and MaliciousInstruct.
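For the first of those probes, a minimal sketch of how such a per-token KL measurement can be set up (my reconstruction, not the paper's code): `aligned` and `base` are two Hugging Face causal LMs sharing a tokenizer, and the KL between their next-token distributions is read off position by position along a harmful response.

```python
# Sketch of a per-token KL probe: where do the aligned and base models' next-token
# distributions actually differ along a response? Shallow alignment shows up as KL
# mass concentrated on the first few response tokens.
import torch
import torch.nn.functional as F

@torch.no_grad()
def per_token_kl(aligned, base, tokenizer, prompt: str, response: str) -> torch.Tensor:
    ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    start = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    logp_aligned = F.log_softmax(aligned(ids).logits, dim=-1)
    logp_base = F.log_softmax(base(ids).logits, dim=-1)
    # KL(aligned || base) of the distribution that predicts each response token.
    kl = F.kl_div(logp_base, logp_aligned, log_target=True, reduction="none").sum(-1)
    return kl[0, start - 1 : ids.shape[1] - 1]  # one value per response token
```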

The core idea of Gated Attention (Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free, 2025, Qwen Team, NeurIPS 2025 Best Paper): insert a head-specific, elementwise sigmoid gate after the Scaled Dot-Product Attention output. Trained on a 1.7B dense model and a 15B MoE (2.54B activated) for 3.5T tokens each, the method reduces PPL by roughly 0.05–0.27 (depending on model and setting), substantially suppresses training loss spikes, improves long-context extrapolation, and dramatically mitigates the attention sink (BOS-token attention drops from 46.7% to 4.8%).

The authors systematically compare 5 candidate positions and several gate forms (granularity, parameter sharing, multiplicative/additive, activation function), totaling 30 variants. The above configuration is the best across all settings.
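A minimal sketch of that winning configuration as I understand it: an elementwise sigmoid gate, computed from the same hidden state that feeds the attention layer, multiplied onto the SDPA output before the output projection. The module below is illustrative, not the Qwen implementation.

```python
# Sketch of SDPA-output gating: sigmoid(W_g x) applied elementwise to the attention output.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.gate = nn.Linear(dim, dim, bias=False)  # gate logits, one per channel of every head
        self.out = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
                   for z in (q, k, v))
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(1, 2).reshape(b, t, d)
        # Elementwise sigmoid gate on the SDPA output, conditioned on the layer input.
        return self.out(torch.sigmoid(self.gate(x)) * attn)
```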

The core idea of MRL (Matryoshka Representation Learning, 2022): during training, force the first $m$ dimensions of the same $d$-dimensional vector (with $m$ chosen on a logarithmic ladder $\{8, 16, 32, \dots, d\}$) to independently carry the classification loss, producing a coarse-to-fine nested representation. At inference, just take the first $m$ dimensions according to the compute budget; accuracy matches a separately trained $m$-dim model.
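A minimal training-side sketch of the idea (illustrative dimension ladder and class count, mine rather than the paper's): every nested prefix of the representation gets its own classifier and its own cross-entropy term, and the terms are summed.

```python
# Sketch of Matryoshka-style nested losses: each prefix z[:, :m] must classify on its own.
import torch
import torch.nn as nn

class MatryoshkaHead(nn.Module):
    def __init__(self, dims=(8, 16, 32, 64, 128, 256, 512), num_classes=1000):
        super().__init__()
        self.dims = dims
        # One linear classifier per prefix length (the MRL variant; MRL-E instead
        # shares a single weight matrix and slices it).
        self.heads = nn.ModuleList(nn.Linear(m, num_classes) for m in dims)

    def loss(self, z: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        ce = nn.CrossEntropyLoss()
        return sum(ce(head(z[:, :m]), labels) for m, head in zip(self.dims, self.heads))
```

At inference, nothing about this head is needed: you keep only `z[:, :m]` for whatever `m` the compute budget allows.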

GritLM (Generative Representational Instruction Tuning, 2024) makes one core proposal: a single LLM handles both embedding and generation, distinguishing the two streams via the instruction format, trained jointly with a contrastive loss and a language modeling loss summed together. Where HyDE showed that “LLM handles relevance, encoder handles similarity” can be decoupled, GritLM goes the other way and merges them back into one model.
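A sketch of that joint objective in its simplest form (my stand-in code: `model.encode` hides the bidirectional-attention-plus-mean-pooling forward pass, and the loss weighting is omitted):

```python
# Sketch of GritLM-style joint training: one model, one contrastive loss on
# embedding data plus one language-modeling loss on generation data, summed.
import torch
import torch.nn.functional as F

def grit_step(model, emb_batch, gen_batch, temperature: float = 0.05):
    # Representational stream: InfoNCE over (query, positive) pairs,
    # with the rest of the batch acting as in-batch negatives.
    q = F.normalize(model.encode(emb_batch["query"]), dim=-1)
    p = F.normalize(model.encode(emb_batch["positive"]), dim=-1)
    logits = q @ p.T / temperature
    contrastive = F.cross_entropy(logits, torch.arange(len(q), device=q.device))

    # Generative stream: ordinary next-token prediction on instruction data.
    lm = model(input_ids=gen_batch["input_ids"], labels=gen_batch["labels"]).loss

    return contrastive + lm
```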

HyDE (Precise Zero-Shot Dense Retrieval without Relevance Labels, 2022) makes one core proposal: in zero-shot retrieval, instead of asking an unsupervised encoder to model query-document relevance directly, first let an LLM generate a hypothetical document for the query, then encode that document and use it for retrieval. The training objective of LLM2Vec-Gen is essentially this two-step pipeline internalized into the encoder.
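The pipeline fits in a few lines. A sketch with placeholder `generate` and `embed` functions (the paper pairs an instruction-following LLM with an unsupervised Contriever, but any generator and encoder slot in):

```python
# Sketch of the HyDE two-step retrieval: generate a hypothetical answer document,
# embed that document instead of the query, then do ordinary dense retrieval.
import numpy as np

def hyde_search(query: str, generate, embed, doc_vectors: np.ndarray, k: int = 10):
    # Step 1: the LLM handles "relevance" by writing a document that would answer the query.
    hypothetical_doc = generate(f"Write a passage that answers the question: {query}")
    # Step 2: the encoder handles "similarity" between that document and the corpus.
    v = embed(hypothetical_doc)
    v = v / np.linalg.norm(v)
    scores = doc_vectors @ v  # doc_vectors: (num_docs, dim), pre-encoded and normalized
    return np.argsort(-scores)[:k]
```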

Last year’s LLM2Vec showed that decoder-only LLMs can be turned into solid embedding models. This year the same McGill NLP group released LLM2Vec-Gen with the opposite intuition: an embedding should represent not the query itself, but the LLM’s potential response to that query.

A concrete example: a user types “how to commit fraud”. A standard embedder encodes the query and retrieves fraud-related passages. LLM2Vec-Gen encodes the response the model would have produced, “I’m sorry, but I can’t assist with that”, and retrieves refusal-style text instead. Safety alignment isn’t retrained at the embedding stage; it’s inherited directly from the generation side.

Another ICLR 2026 outstanding paper, from Microsoft Research and Salesforce Research. The title is direct: LLMs Get Lost In Multi-Turn Conversation. The conclusion is just as direct: take any single-turn task, split the instruction across multiple turns, and 15 frontier LLMs (including GPT-4.1, Gemini 2.5 Pro, Claude 3.7 Sonnet, o3, DeepSeek-R1) all degrade, by 39% on average. The paper specifically emphasizes that the rise in unreliability has roughly the same magnitude across all models, regardless of scale, reasoning capability, or open versus closed weights.

What’s more interesting is the breakdown: aptitude only drops about 15%, while unreliability surges 112%. In other words, models in multi-turn settings aren’t dumber, they’re inconsistent. The same task run ten times produces best-case and worst-case results separated by 50 points.

DSI (Differentiable Search Index) is one of the earlier representative papers on generative retrieval, published at NeurIPS 2022. Its core approach: encode the entire document corpus into the parameters of a Transformer, and at retrieval time, decode the document ID directly with a seq2seq model, removing the inverted index, vector store, and nearest-neighbor search as separate components.

The earlier GENRE (De Cao et al., 2020) had already used a seq2seq model to autoregressively decode Wikipedia entity titles, and DSI cites it as related work. DSI’s further contribution is extending the decoding target from semantically meaningful entity names to arbitrary forms of docid (including random integers and hierarchical semantic IDs), and systematically comparing document representations, ID representations, and training strategies. This reframes retrieval from a systems engineering problem into an end-to-end machine learning problem: indexing is training, retrieval is inference.
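A minimal sketch of what that reframing means in practice (illustrative documents and docids, standard T5 fine-tuning, optimizer step and the paper's constrained decoding over valid docids omitted): indexing examples map document text to a docid string, retrieval examples map a query to the same docid string, and both are trained as ordinary seq2seq pairs.

```python
# Sketch of DSI-style training pairs: "indexing is training, retrieval is inference".
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

corpus = {"17": "The Eiffel Tower is a wrought-iron lattice tower in Paris ...",
          "42": "Self-attention lets a model weigh all tokens in a sequence ..."}
queries = [("where is the eiffel tower", "17")]

# Indexing task: document text in, docid out. Retrieval task: query in, docid out.
examples = [(text, docid) for docid, text in corpus.items()] + queries

for src, tgt in examples:
    inputs = tokenizer(src, return_tensors="pt", truncation=True)
    labels = tokenizer(tgt, return_tensors="pt").input_ids
    loss = model(**inputs, labels=labels).loss  # cross-entropy on the docid tokens
    loss.backward()

# Retrieval is just decoding a docid string for the query.
query_ids = tokenizer("where is the eiffel tower", return_tensors="pt").input_ids
print(tokenizer.decode(model.generate(query_ids, max_new_tokens=8)[0], skip_special_tokens=True))
```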

One of the two Outstanding Paper awards at ICLR 2026 went to a purely theoretical work: Transformers are Inherently Succinct. A paper with no experiments, no benchmarks, nothing but mathematical proofs, and it took one of the top awards. The review committee’s rationale: “it offers a new perspective for explaining the power of the Transformer architecture.” The original paper is dense, drawing heavily on formal language theory and complexity theory. This post attempts to lay out the core results and construction ideas in a more intuitive way.

What is this “new perspective”? Past work compared expressiveness, i.e. which model can recognize a broader class of languages. This paper asks a different question: for the same language, which model can describe it using less space?

Embedding models have long been BERT’s territory. Semantic search, RAG, clustering — all dominated by encoder-only models. Decoder-only models like GPT and LLaMA crush generation tasks, but the community has assumed they’re not suitable for embeddings because causal attention only sees preceding tokens and can’t build complete sentence representations.

LLM2Vec (COLM 2024) argued this assumption is wrong. Three steps, no labeled data, no GPT-4 synthetic data, and any decoder-only LLM becomes a SOTA embedding model on MTEB. Two years later, the approach has been validated and adopted by many subsequent works — worth revisiting what it actually did.
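As I recall the recipe, the three steps are enabling bidirectional attention, masked next token prediction, and unsupervised SimCSE. Below is a sketch of the third step only, on top of a placeholder `model` whose attention has already been made bidirectional; the official repo adds LoRA and proper masking, so treat this as the skeleton.

```python
# Sketch of the unsupervised SimCSE step: the same sentence encoded twice with
# different dropout masks forms a positive pair; other sentences are negatives.
import torch
import torch.nn.functional as F

def mean_pool(hidden, mask):
    m = mask.unsqueeze(-1).float()
    return (hidden * m).sum(1) / m.sum(1).clamp(min=1)

def simcse_loss(model, tokenizer, sentences, temperature=0.05):
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    h1 = mean_pool(model(**batch).last_hidden_state, batch["attention_mask"])
    h2 = mean_pool(model(**batch).last_hidden_state, batch["attention_mask"])  # second dropout mask
    z1, z2 = F.normalize(h1, dim=-1), F.normalize(h2, dim=-1)
    logits = z1 @ z2.T / temperature
    return F.cross_entropy(logits, torch.arange(len(sentences)))
```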

There used to be a popular claim: human-generated data on the internet would be exhausted within a few years, and LLM training data was running out. Epoch AI predicted in 2022 that high-quality text data might be depleted around 2026. The discussion was intense at the time, but it seems to come up less often now.

The internet is indeed flooded with AI-generated content. Search for almost anything and the first few results likely include AI-written text. According to earlier fears, this AI-generated text would be crawled back as training data, models eating their own output, getting worse with each iteration — the so-called model collapse. A Nature paper rigorously demonstrated this: training recursively on a model’s own output causes the tails of the data distribution to gradually vanish, making outputs increasingly homogeneous.
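The mechanism is easy to reproduce in a toy setting. The sketch below is my own illustration, not the Nature paper's experiment: a "model" that is just a fitted Gaussian is retrained generation after generation on a finite sample of its own output, and finite-sample estimation gives the variance a downward-biased random walk, so the tails thin out over time.

```python
# Toy illustration of recursive training on self-generated data: the fitted
# variance drifts toward zero, i.e. the distribution's tails vanish.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0
for generation in range(1, 1001):
    sample = rng.normal(mu, sigma, size=100)       # "train" on the previous model's output
    mu, sigma = sample.mean(), sample.std(ddof=1)  # refit the "model" to that finite sample
    if generation % 200 == 0:
        print(f"generation {generation}: sigma = {sigma:.4f}")
```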

I’ve written about Claude Code’s memory management, context compaction, RAG, security classifier, Edit tool, and sub-agent cache sharing — each time reading the source code line by line. But after reading this paper, I realized I’d been looking at individual parts without seeing the whole machine.

The paper is Dive into Claude Code: The Design Space of Today’s and Future AI Agent Systems, 46 pages, a complete architectural dissection of Claude Code from source code. Not a usage tutorial, not a benchmark evaluation, but an answer to an engineering question: what is a production AI agent system’s code actually doing?

RAG re-retrieves, re-assembles, and re-reasons on every query. Ask something that requires synthesizing five documents and the model has to find all five, stitch them together, and derive the answer from scratch. Ask ten times, retrieve ten times. Nothing accumulates.

Karpathy recently posted a gist called LLM Wiki proposing a different approach: instead of retrieving at query time, have the LLM pre-compile knowledge into a structured wiki and query the compiled result.
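The gist is a proposal rather than a library, so the sketch below is only my reading of the shape of the idea, with `llm` as a placeholder completion function: an offline pass compiles sources into wiki pages once, and query time becomes a lookup over the compiled page instead of retrieval plus re-assembly.

```python
# Sketch of "compile once, query the compiled artifact" versus per-query RAG.
def compile_wiki(corpus: dict[str, str], llm) -> dict[str, str]:
    # Offline, once per topic: distill the raw sources into a structured page.
    return {topic: llm(f"Write a structured wiki page on '{topic}' from these sources:\n{sources}")
            for topic, sources in corpus.items()}

def answer(question: str, topic: str, wiki: dict[str, str], llm) -> str:
    # Online: no retrieval, no stitching; read the already-compiled page.
    return llm(f"Using this wiki page:\n{wiki[topic]}\n\nAnswer concisely: {question}")
```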

Dense retrieval has become the default first stage in RAG pipelines. Encode documents into vectors, encode queries into vectors, compute cosine similarity, done. But a basic question rarely gets asked: for a d-dimensional embedding, how many distinct top-k retrieval results can it actually represent?

An ICLR 2026 paper from Google DeepMind and JHU, “On the Theoretical Limitations of Embedding-Based Retrieval”, gives a mathematical answer: not enough. Not even close.
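Before the theory, the counting question can be poked at numerically. The sketch below is my own illustration, not the paper's construction: embed n documents in d dimensions, probe with many random query directions, and count how many of the C(n, k) possible top-k sets ever show up; for small d only a fraction is reachable, which is the flavor of the limitation the paper makes precise.

```python
# Count the top-k sets reachable by inner-product scoring in a low-dimensional space.
from math import comb
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 30, 3, 2
docs = rng.normal(size=(n, d))               # random document embeddings

queries = rng.normal(size=(200_000, d))      # many random query directions
topk = np.argsort(queries @ docs.T, axis=1)[:, -k:]
reachable = {tuple(sorted(row)) for row in topk}

print(f"reachable top-{k} sets: {len(reachable)} / {comb(n, k)}")
```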

Justin Sun (the crypto guy) recently dropped a hot take: “It’s 2026 already — if you can talk to AI, stop talking to humans.” He also said something about deleting contacts born before 1990 and WeChat being for old people. Classic Justin Sun. Take it with a grain of salt.

But strip away the absurd parts, and “talk to AI more” is something I actually agree with as a heavy user. Here’s why, and where it falls apart.

There’s a module inside Claude Code called autoDream. Its prompt title reads “Dream: Memory Consolidation.”

This isn’t a metaphor. Claude Code actually spins up a background sub-agent that reviews transcripts from past sessions, consolidates scattered memories — merging, deduplicating, correcting — and writes them back to disk. The whole thing is invisible unless you dig into the background task list.

Tech job boards in 2025 are schizophrenic. Traditional software engineering roles are shrinking. “AI”-prefixed positions are expanding. Same company, same quarter — cutting junior devs and project managers on one side, opening Agent orchestration engineers and AI application architects on the other.

Last June, Shopify CEO Tobi Lütke posted that he preferred “context engineering” over “prompt engineering.” Karpathy retweeted with a +1. Simon Willison wrote a blog post saying the term might actually stick. Phil Schmid published a full definition. Half the AI community switched terminology within a week.

By early 2026, Phil Schmid introduced another term: Agent Harness. It didn’t generate the same buzz, but anyone building Coding Agents quietly nodded along.