LLM2Vec-Gen: Embeddings That Encode the Model's Answer, Not the Query

Last year’s LLM2Vec showed that decoder-only LLMs can be turned into solid embedding models. This year the same McGill NLP group released LLM2Vec-Gen with the opposite intuition: an embedding should represent not the query itself, but the LLM’s potential response to that query.

A concrete example: a user types “how to commit fraud”. A standard embedder encodes the query and retrieves fraud-related passages. LLM2Vec-Gen encodes the response the model would have produced, “I’m sorry, but I can’t assist with that”, and retrieves refusal-style text instead. Safety alignment isn’t retrained at the embedding stage; it’s inherited directly from the generation side.

Input-centric vs output-centric

[Figure: Input-centric vs output-centric embedding]

The dominant embedding paradigm is contrastive learning: pull queries and positive documents together, push negatives away, and project everything into a new shared space. That space has no correspondence to the LLM's pretrained output representations, so whatever safety alignment, reasoning ability, or instruction-following the LLM acquired during pretraining and alignment is stripped away as the contrastive objective reshapes the representations.

LLM2Vec-Gen’s angle: rather than reshape a query-document relevance space, keep embeddings inside the LLM’s own output representation space. Two queries that look very different on the surface should produce close embeddings if the LLM’s responses to them are semantically close. HyDE proposed this intuition earlier by generating multiple hypothetical answers at inference and averaging their embeddings. The cost is one generation pass per query plus retrieval variance from sampling. LLM2Vec-Gen internalizes the step into training, so inference is a single deterministic forward pass.
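A minimal sketch of the contrast, with `generate`, `embed`, and `encode` as stand-in callables (hypothetical names, not either paper's API). HyDE pays a sampling cost at query time; the trained model answers with one deterministic pass:

```python
import numpy as np

def hyde_embed(query, generate, embed, n_samples=4):
    # HyDE: sample hypothetical answers at inference time and average their
    # embeddings -- one generation pass per query, plus variance from sampling
    hypothetical = [generate(query) for _ in range(n_samples)]
    vecs = np.stack([embed(doc) for doc in hypothetical])
    return vecs.mean(axis=0)

def llm2vec_gen_embed(query, encode):
    # LLM2Vec-Gen: the "predict the response" step is internalized into
    # training, so inference is a single deterministic encoder call
    return encode(query)
```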

Training: special tokens that compress the LLM’s response

[Figure: LLM2Vec-Gen architecture]

Only two parameter groups update during training: a set of new compression tokens, and two lightweight MLPs. The LLM backbone stays frozen end to end.

Setup: pull 160K unlabeled queries from Tulu and let the LLM generate its own responses. The “ground truth” responses come from the model itself with no human labels involved, which makes the pipeline strictly self-supervised. These responses then pass through an unsupervised embedding teacher to obtain target vectors; the teacher is the corresponding unsupervised LLM2Vec model sharing the same backbone. This pairing is critical: the teacher must share the underlying representation space and must be trained without supervision, otherwise relevance bias contaminates the target and undermines faithful representation of the response content.
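The data pipeline, reduced to a hedged sketch; `llm_generate` and `teacher_encode` are illustrative stand-ins for the backbone's generation call and the unsupervised LLM2Vec teacher:

```python
def build_training_triples(queries, llm_generate, teacher_encode):
    """Self-supervised data prep: no human labels anywhere in the loop."""
    triples = []
    for q in queries:
        response = llm_generate(q)             # the model's own answer
        target_vec = teacher_encode(response)  # teacher embedding of that answer
        triples.append((q, response, target_vec))  # response text kept for L_recon
    return triples
```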

The training input is straightforward: the query followed by n new compression tokens (default 10), passed once through the frozen LLM. The hidden states of those tokens are projected through the MLPs and pooled into the embedding. Two losses run together:

  • Alignment loss L_align: MSE between the produced embedding and the teacher’s embedding of the response
  • Reconstruction loss L_recon: feed the compression token hidden states back to the frozen LLM as soft prompts and require it to autoregressively regenerate the original response

Alignment pulls the embedding toward the teacher’s response vector. Reconstruction forces those compression tokens to remain decodable by the frozen LLM, constraining the embedding to lie within the vector subspace the LLM’s decoder can actually process. One external anchor, one internal constraint, both squeezing the same solution space. An 8B model trains in about 3.5 hours on 2 H100s with batch size 32, single epoch.
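A single training step, sketched in PyTorch under stated assumptions: a HuggingFace-style frozen causal LM, mean pooling over the compression-token states, and one MLP per loss. The module shapes and pooling are guesses, not the paper's code; the two losses follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompressionHead(nn.Module):
    """The only trainable parameters: n compression tokens plus two MLPs."""
    def __init__(self, d, n_tokens=10):
        super().__init__()
        self.comp = nn.Parameter(torch.randn(n_tokens, d) * 0.02)
        self.align_mlp = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))
        self.recon_mlp = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))

def training_step(llm, head, query_ids, response_ids, teacher_vec):
    # assumes llm.requires_grad_(False) was called once: the backbone stays
    # frozen, gradients flow only into head.comp and the two MLPs
    emb = llm.get_input_embeddings()
    n = head.comp.shape[0]

    # one forward pass: query tokens followed by the n compression tokens
    inputs = torch.cat([emb(query_ids), head.comp.unsqueeze(0)], dim=1)
    hidden = llm(inputs_embeds=inputs, output_hidden_states=True).hidden_states[-1]
    comp_states = hidden[:, -n:, :]          # states of the compression tokens

    # L_align: MSE against the teacher's embedding of the response
    student_vec = head.align_mlp(comp_states).mean(dim=1)
    l_align = F.mse_loss(student_vec, teacher_vec)

    # L_recon: compression states become soft prompts; the frozen LM must
    # regenerate the original response autoregressively from them
    soft = head.recon_mlp(comp_states)
    dec_inputs = torch.cat([soft, emb(response_ids)], dim=1)
    logits = llm(inputs_embeds=dec_inputs).logits
    # standard next-token shift: position n-1 predicts response token 0, etc.
    shifted = logits[:, n - 1 : -1, :]
    l_recon = F.cross_entropy(shifted.reshape(-1, shifted.size(-1)),
                              response_ids.reshape(-1))
    return l_align + l_recon
```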

Results: self-supervised SOTA on MTEB, with safety and reasoning as freebies

[Figure: MTEB scaling]

On MTEB(eng, v2), the Qwen-3-8B variant scores 61.9, an 8.8% gain over the unsupervised LLM2Vec teacher and within 3.8 points of supervised SOTA (65.7). Across Llama-3.1/3.2, Qwen-2.5, and Qwen-3, every evaluated size from 0.5B to 8B beats its respective teacher.

Looking at categories, the largest gains land on clustering (+22.7%), STS (+9.8%), and classification (+7.0%). What these tasks share: input variance is high but expected outputs converge. Different news articles about the same event should cluster together; two paraphrases with very different wording should score high on similarity. This is exactly where output-centric embeddings have a structural advantage. Standard retrieval gains are smaller, and Qwen-3-4B even drops 1.3 points. The paper attributes this to retrieval's heavy reliance on lexical overlap: words in the query are expected to appear in the document, a regime where output-centric encoding takes a detour.

Once the benchmark requires deeper semantic understanding, the picture flips. On BRIGHT, a reasoning-intensive retrieval benchmark, the LLM’s reasoning ability transfers directly into the embedding, and the larger the model the more thoroughly it transfers: 11.7% at 1.7B, 19.7% at 4B, 35.6% at 8B over the LLM2Vec teacher. This scaling line is the single chart most worth remembering: bigger backbone, bigger inherited dividend from the output-centric paradigm.

Safety follows the same pattern. On AdvBench-IR, which measures whether a retriever serves malicious queries, every model size shows lower unsafe retrieval rates: the 1.7B model drops 22.6%, the 8B drops 17%. The mechanism is straightforward: embeddings represent refusal text, refusals cluster near each other and away from the actual harmful passages in the corpus.

Why reconstruction loss can’t be dropped

The most interesting detail hides in the ablation. With only the alignment loss, MTEB-Lite barely moves from 67.9 (to 67.5), making reconstruction look optional. But the paper runs a separate test: try to decode each variant’s embedding back into text via the LLM. Without reconstruction the output is gibberish; with it, the decoded text is semantically related to the original response.

This isn’t only about interpretability. The alignment loss only pushes the embedding to a position numerically close to the teacher’s target, and nothing guarantees that position is one the LLM can process: a vector that decodes to gibberish has landed in a region the decoder barely covered during training. Reconstruction is precisely the constraint that keeps the compression tokens decodable by the frozen LLM. The two objectives look redundant on the benchmark but squeeze the same solution space from two sides.
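The decodability probe can be sketched with the standard `generate(inputs_embeds=...)` path available in recent transformers versions; `soft_prompts` would be the projected compression-token states from the training sketch above:

```python
import torch

@torch.no_grad()
def decode_embedding(llm, tokenizer, soft_prompts, max_new_tokens=64):
    # soft_prompts: (1, n, d) -- trained with L_recon this decodes to text
    # related to the original response; with alignment only, gibberish
    mask = torch.ones(soft_prompts.shape[:2], dtype=torch.long,
                      device=soft_prompts.device)
    out = llm.generate(inputs_embeds=soft_prompts, attention_mask=mask,
                       max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```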

A side benefit is Logit Lens analysis: project the compression token states onto the vocabulary and read off the top tokens directly. In the paper’s examples, “where do polar bears live” surfaces “Arctic”, “ice”, “snow”; “what does disk cleanup mean” surfaces “space”, “temporary”, “files”. The embedding is essentially a latent set of answer keywords, mirroring the way humans hear a question and immediately think of the answer.
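The probe itself is a few lines. A sketch assuming a Llama/Qwen-style stack where the final norm sits at `llm.model.norm` (the attribute path varies by architecture):

```python
import torch

@torch.no_grad()
def logit_lens(llm, tokenizer, state, k=5):
    # state: (d,) hidden state of a single compression token
    h = llm.model.norm(state)      # apply the final norm before unembedding
    logits = llm.lm_head(h)        # project onto the vocabulary
    top_ids = logits.topk(k).indices
    return [tokenizer.decode(i) for i in top_ids]
```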

An overlooked detail: the teacher must be unsupervised

The ablation also swaps in a supervised embedding teacher. The student does not automatically inherit the gains; recovering them requires adding LoRA. This is counterintuitive: shouldn’t a stronger teacher lift the student to a higher tier?

The paper’s reading: supervised teachers have spaces already reshaped around query-document relevance, packed with judgments about “what should be considered related” but no longer faithful to “what the content itself looks like.” The whole theoretical foundation of LLM2Vec-Gen is that embeddings should faithfully represent the LLM’s response content; letting a relevance-biased teacher define the target reintroduces the very paradigm problem the method was designed to escape.

A weak unsupervised teacher in the SimCSE style turns out to be the right fit. Its objective barely disturbs the LLM’s original representation geometry, just pulling dropout-noised views of the same text together and pushing random samples apart. That light touch leaves enough room for the LLM’s own semantic structure to come through.
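For reference, the unsupervised SimCSE objective in its usual implemented form: two dropout passes of the same batch serve as positive views, everything else in the batch as negatives. This is the full extent of the "light touch":

```python
import torch
import torch.nn.functional as F

def simcse_loss(z1, z2, temp=0.05):
    # z1, z2: (B, d) embeddings of the same sentences under two dropout masks
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    sim = z1 @ z2.T / temp                       # (B, B) cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(sim, labels)          # diagonal pairs are positives
```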

Practical tradeoffs

A few deployment details worth attention. Keeping the LLM frozen is critical: the same model weights serve both generation and embedding, with no need to maintain a separate checkpoint for the embedding service. The LoRA variant (r=8) gains another 0.4 points but requires switching between two sets of weights, and the paper recommends sticking with the frozen version.
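What this buys operationally, as a sketch: one frozen checkpoint behind two entry points, with only the small `CompressionHead` from the training sketch as extra state. The class and its wiring here are illustrative, not the paper's serving code.

```python
import torch

class DualService:
    """One frozen LM serves both chat generation and query embedding."""
    def __init__(self, llm, tokenizer, head):
        self.llm, self.tok, self.head = llm, tokenizer, head

    def generate(self, prompt, **kw):
        ids = self.tok(prompt, return_tensors="pt").input_ids
        return self.tok.decode(self.llm.generate(ids, **kw)[0])

    @torch.no_grad()
    def embed(self, query):
        ids = self.tok(query, return_tensors="pt").input_ids
        x = torch.cat([self.llm.get_input_embeddings()(ids),
                       self.head.comp.unsqueeze(0)], dim=1)
        h = self.llm(inputs_embeds=x, output_hidden_states=True).hidden_states[-1]
        n = self.head.comp.shape[0]
        return self.head.align_mlp(h[:, -n:, :]).mean(dim=1)
```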

Response generator choice barely affects MTEB; mainstream generator sizes land within a point or two of each other. Generator choice does affect safety though: the smallest models produce responses with more unsafe content, and the resulting embedding inherits that weakness. So LLM2Vec-Gen ties embedding safety to generator safety, which is both a feature and a fragility.

Ten compression tokens is the sweet spot: even a single token reaches 66.1, ten tokens get to 68.5, and beyond that the curve flattens. The information bottleneck for compressing one response into a fixed-length vector doesn’t need to be very wide, consistent with prior findings in RAG compression work.

Closing thought

The real appeal of output-centric embeddings isn’t the benchmark numbers. It’s that embedding models rejoin the main capability curve of the LLM. Training embeddings used to be a separate engineering track, parallel to LLM pretraining and alignment: as LLMs got stronger, embedding models did not automatically get stronger, and the contrastive learning data had to be reconstructed each time. LLM2Vec-Gen has embeddings directly consume the LLM’s output capability; every inch the backbone gains, the embedder gains with it. Two tracks merge into one.

The cost is that the embedding model inherits every upstream problem. The most lethal one: this paradigm assumes the LLM’s response is a reasonable proxy for the retrieval target. The moment the LLM is outdated or hallucinating in some domain, the embedding drags the wrong answer into the matching space, without even the naive lexical sanity check that matching on the query’s own words provides. On top of that, when the LLM’s refusal style shifts, retriever behavior shifts in lockstep; whatever bias the LLM carries goes along for the ride. It’s a bet on the foundation model continuing to improve, and improving in directions that happen to align with what retrieval needs.