NV-Embed: Training a Decoder-only Embedding Model with Latent Attention Pooling
The core idea of NV-Embed (NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models, 2024, NVIDIA, ICLR 2025): start directly from Mistral 7B, drop the causal attention mask, attach a latent attention layer on top of the last hidden state for pooling, then run two-stage contrastive instruction tuning, with retrieval data and in-batch negatives first, followed by a mix that adds non-retrieval data and turns in-batch negatives off. On the 56-task MTEB, NV-Embed-v1 averages 69.32, and v2 pushes the score to 72.31 with hard-negative mining, synthetic data, and example-based multi-class labeling; v1 and v2 took the No.1 spot on MTEB in May and August 2024 respectively.
Background
LLM2Vec and GritLM already showed that decoder-only LLMs can serve as embedding models once turned bidirectional, but each route introduces extra machinery: LLM2Vec adds an MNTP warm-up stage, GritLM adds a generation loss trained jointly. NV-Embed asks whether you can drop all of that and rely on just “remove the causal mask + change the pooling layer + split training into two stages” to get a better score.
The other question is pooling. The two mainstream choices for decoder-LLM embeddings both have issues: last-token (EOS) pooling suffers from recency bias, leaning too heavily on the final token's output, while mean pooling averages all tokens and dilutes information from key phrases. The paper proposes latent attention as a third option.
Method

Bidirectional attention. The causal mask is removed at the contrastive training stage. Unlike LLM2Vec, there is no MNTP warm-up; the paper’s justification is empirical (“it just works”), so no extra training phase is added.
Latent attention layer. Denote the decoder’s last hidden state as $Q \in \mathbb{R}^{l \times d}$, where $l$ is sequence length and $d$ is hidden dim. Introduce a learnable “dictionary” $K = V \in \mathbb{R}^{r \times d}$, where $r$ is the dictionary size. One cross-attention pass:
$$O = \text{softmax}(QK^T)V$$
The output $O \in \mathbb{R}^{l \times d}$ goes through a two-layer MLP with GELU, then mean pooling to obtain the sentence vector. NV-Embed uses $r = 512$ and 8 attention heads. The paper frames the design as “dictionary learning”, distinct from Perceiver IO (which uses the latent array as Q to compress sequence length; NV-Embed is the other way around, sequence as Q, latent array as K/V, output length remains $l$).
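A minimal PyTorch sketch of what such a pooling head could look like, using the paper's stated hyperparameters (r = 512 latents, 8 heads, hidden dim 4096 for Mistral 7B); the exact projection layout, initialization, and any residual connections in the released model are assumptions here, not the official implementation:

```python
import torch
import torch.nn as nn

class LatentAttentionPooling(nn.Module):
    """Sketch of NV-Embed-style latent attention pooling."""

    def __init__(self, hidden_dim: int = 4096, num_latents: int = 512, num_heads: int = 8):
        super().__init__()
        # Learnable "dictionary": serves as both K and V in the cross-attention.
        self.latents = nn.Parameter(torch.randn(num_latents, hidden_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, d) from the decoder's last layer
        # attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
        bsz = hidden_states.size(0)
        kv = self.latents.unsqueeze(0).expand(bsz, -1, -1)       # (batch, r, d)
        # The sequence is Q, the latent dictionary is K/V, so the output keeps length seq_len
        # (the opposite of Perceiver IO, which compresses the sequence into the latents).
        out, _ = self.cross_attn(query=hidden_states, key=kv, value=kv)
        out = self.mlp(out)
        # Mask padding, then mean-pool over the sequence to get one sentence vector.
        mask = attention_mask.unsqueeze(-1).float()
        return (out * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6)  # (batch, d)
```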
Two-stage instruction tuning. Stage one uses only retrieval datasets (MSMARCO, HotpotQA, NQ, SQuAD, FEVER, MIRACL, etc., 16 in total) with in-batch negatives plus mined hard negatives, so the model learns retrieval first. Stage two mixes in non-retrieval data covering classification, clustering and STS, and turns off in-batch negatives. The reason for disabling in-batch negatives is that classification tasks have small label sets (e.g., binary True/False), so other samples in a mini-batch are often positives, and mistaking positives for hard negatives pollutes the contrastive signal.
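To make the stage-one/stage-two difference concrete, a hedged sketch of a contrastive loss where in-batch negatives can be switched off; the temperature value and the exact bookkeeping (e.g. whether other queries' hard negatives are also shared across the batch) are assumptions, not the paper's training code:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, pos, neg, use_in_batch_negatives: bool, temperature: float = 0.02):
    # q: (B, d) query embeddings; pos: (B, d) positives; neg: (B, K, d) mined hard negatives
    q, pos, neg = F.normalize(q, dim=-1), F.normalize(pos, dim=-1), F.normalize(neg, dim=-1)
    pos_scores = (q * pos).sum(-1, keepdim=True)                  # (B, 1)
    hard_scores = torch.einsum("bd,bkd->bk", q, neg)              # (B, K)
    logits = [pos_scores, hard_scores]
    if use_in_batch_negatives:
        # Stage one: every other positive in the batch also serves as a negative.
        inbatch = q @ pos.T                                       # (B, B)
        mask = torch.eye(len(q), dtype=torch.bool, device=q.device)
        logits.append(inbatch.masked_fill(mask, float("-inf")))   # own positive already in column 0
    logits = torch.cat(logits, dim=-1) / temperature
    labels = torch.zeros(len(q), dtype=torch.long, device=q.device)  # positive sits in column 0
    return F.cross_entropy(logits, labels)
```

Stage two simply calls this with `use_in_batch_negatives=False`, so classification samples sharing a label no longer poison each other as false negatives.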
Instruction template: `Instruct: {task_definition} Query: q+`. The paper notes that instruction tokens are masked out of the output embedding during both training and evaluation (they still influence the representation via self-attention but are excluded from the final pooling), and documents get no instruction.
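A small sketch of the query-side template and of excluding the instruction prefix from pooling; the exact line breaks in the template and the way `instruction_lengths` is obtained from the tokenizer are assumptions:

```python
import torch

def build_query_input(task_definition: str, query: str) -> str:
    # Query-side template; documents are encoded without any instruction prefix.
    return f"Instruct: {task_definition}\nQuery: {query}"

def pooling_mask(attention_mask: torch.Tensor, instruction_lengths: list) -> torch.Tensor:
    # Zero out the instruction prefix: those tokens still attend normally inside
    # the model, but are dropped when the output embeddings are pooled.
    mask = attention_mask.clone()
    for i, n in enumerate(instruction_lengths):
        mask[i, :n] = 0
    return mask
```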
Results
MTEB. NV-Embed-v1 (public data only, no synthetic data or hard-negative mining) averages 69.32, beating the then-current E5-mistral-7b-instruct (66.63) and SFR-Embedding (67.56). NV-Embed-v2 adds positive-aware hard-negative mining, 120,000 synthetic examples generated by Mixtral-8x22B, and example-based multi-class labels, reaching 72.31; v1 and v2 took No.1 on MTEB in May and August 2024 respectively.
By subtask, v2 hits 62.65 nDCG@10 on 15 retrieval tasks, 58.46 V-Measure on 11 clustering tasks, 90.37 accuracy on 12 classification tasks, and 84.31 on 10 STS tasks, all top of the leaderboard.
Compared to SFR-Embedding-2R (also Mistral 7B based, 70.31): SFR continues fine-tuning from E5-mistral-7b-instruct and inherits the causal mask and last-token pooling. NV-Embed trains directly from Mistral 7B with bidirectional attention plus latent attention pooling. The paper attributes v2’s edge mainly to these two architectural differences and the mixed-task batch strategy (SFR uses task-homogeneous batching, which the authors argue produces “zigzag” gradient updates).
AIR-Bench. An out-of-domain retrieval benchmark outside MTEB. NV-Embed is No.1 on the Long-Doc section and No.2 on QA, suggesting the gains are not merely overfitting to MTEB's training splits.
Key Ablations
Pooling × attention mask. A full 4 × 2 grid after stage-one training:
| Pooling | Bidirectional | Causal |
|---|---|---|
| EOS | 62.68 | 60.06 |
| Mean | 64.00 | 62.32 |
| Latent-attention | 64.18 | 63.39 |
| Self-attention | 63.27 | 63.11 |
Two observations:
- Bidirectional attention helps every pooling choice. EOS pooling gains the most (+2.62), latent attention gains the least (+0.79), suggesting latent attention partially mitigates the limitation of the causal mask.
- Self-attention pooling (an extra plain self-attention layer over the LLM output) brings essentially no gain. The paper’s explanation: the LLM already stacks many self-attention layers, one more is redundant. Latent attention’s dictionary mechanism, with “external key-values”, is what provides a genuinely different inductive bias.
The final configuration (latent attention + bidirectional + two-stage training + full data augmentation) reaches 72.31.
Two-stage vs single-stage. Single-stage training that applies in-batch negatives to classification data clearly hurts (MTEB 70.83). Turning off in-batch negatives lifts single-stage to 71.94; two-stage reaches 72.31. The gain of two-stage over single-stage without in-batch negatives is concentrated on retrieval (BEIR 62.65 vs 61.37). Reversing the order to "non-retrieval first, retrieval second" drops to 71.85, so order matters. The paper attributes this to retrieval being the hardest subtask: it must be trained first with in-batch negatives, and non-retrieval data is then layered on for the other tasks.
Per-step data augmentation gains. From S0 (no augmentation) to S3 (everything):
- S0 (baseline): MTEB 70.73, BEIR 59.22
- S1 (+ hard-negative mining): BEIR 61.52 (+2.3)
- S2 (+ more public retrieval sets: HoVer, SciFact, NFCorpus, MIRACL, Mr.TyDi): BEIR 62.28
- S3 (+ 120k Mixtral synthetic samples): MTEB 72.31, BEIR 62.65
The biggest single-step gain comes from hard-negative mining. The marginal gain from synthetic data is small (~0.24 points); the paper offers no deeper analysis, possibly because the public data already covers MTEB’s task distribution well enough.
Positive/negative construction for multi-class tasks. Label-based (use the class label text itself as positive) vs example-based (sample another example from the same class as positive). Example-based averages 69.27 on 16 classification and clustering datasets, label-based 64.80. Clustering benefits the most; for instance, Reddit-Clustering goes from 59.83 to 71.10. The paper attributes this to label text being too short and information-sparse, while a same-class example gives the model richer intra-class similarity to learn from.
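To make the two constructions concrete, a toy sketch of building (anchor, positive, negatives) triples in label-based vs example-based mode; the sampling details and negative selection here are illustrative, not the paper's exact recipe:

```python
import random
from collections import defaultdict

def build_pairs(examples, mode: str = "example"):
    # examples: list of (text, label) tuples from a multi-class dataset
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append(text)
    triples = []
    for text, label in examples:
        if mode == "label":
            positive = label                          # label-based: the class name itself
        else:
            candidates = [t for t in by_label[label] if t != text]
            if not candidates:
                continue
            positive = random.choice(candidates)      # example-based: a same-class example
        negatives = [random.choice(by_label[l]) for l in by_label if l != label]
        triples.append((text, positive, negatives))
    return triples
```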
Some Details
Hard-negative mining threshold. The paper uses E5-mistral-7b-instruct as a teacher to score candidate negatives and sets `max_negative_score_threshold = 0.95 × pos_score`. Only candidates scoring below this threshold enter the negative pool, which avoids treating potential false negatives (actually other positives) as hard negatives. The technique comes from the same group's NV-Retriever.
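A sketch of that filtering rule, with `teacher_score` standing in for scoring by E5-mistral-7b-instruct; the top-k and sorting behavior are assumptions:

```python
def mine_hard_negatives(query, positive, candidates, teacher_score, top_k=8, margin=0.95):
    # Positive-aware filtering: any candidate scoring above 95% of the positive's
    # score is treated as a likely false negative and discarded.
    pos_score = teacher_score(query, positive)
    threshold = pos_score * margin
    scored = [(c, teacher_score(query, c)) for c in candidates if c != positive]
    kept = [(c, s) for c, s in scored if s < threshold]
    kept.sort(key=lambda cs: cs[1], reverse=True)     # hardest (highest-scoring) first
    return [c for c, _ in kept[:top_k]]
```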
Mixed batch instead of task-homogeneous batch. SFR-Embedding-Mistral packs a single task per batch; NV-Embed mixes all tasks in a batch. The paper attributes the advantage to avoiding “zigzag” gradient directions, but offers no deeper theoretical analysis, leaning on the ablation score gap.
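For contrast, a toy sketch of a mixed-task sampler; the paper does not spell out its sampling scheme, so the uniform shuffle here is purely an assumption:

```python
import random

def mixed_task_batches(datasets, batch_size):
    # datasets: dict mapping task name -> list of examples.
    # Blend all tasks into a single pool, so every batch mixes tasks
    # (as opposed to packing one task per batch).
    pool = [(task, ex) for task, examples in datasets.items() for ex in examples]
    random.shuffle(pool)
    for i in range(0, len(pool), batch_size):
        yield pool[i:i + batch_size]
```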
Latent array size. $r = 512$, 8 heads. The paper does not report a sensitivity ablation on $r$, and the rationale for this value is not elaborated.
Model compression. The appendix gives a recipe to compress from 8B to 3.5B: SparseGPT pruning (4:8 semi-structured, 70.90) + INT8 quantization + LoRA fine-tuning recovery + knowledge distillation. The pruning step removes the latent attention block entirely, with distillation recovering accuracy from the uncompressed teacher. The compressed version reaches 71.48 on MTEB, still beating same-scale models trained directly on Llama3.2-3B, Qwen2.5-3B, and Minitron-4B. INT8 quantization is almost lossless (variations within ±0.01%), which the paper credits as an inherent property of LLM-based embedding models.
Relation to LLM2Vec/GritLM. NV-Embed’s claim is that the extra training phase (MNTP) and the mixed training objective (GRIT) are not necessary; just removing the causal mask and using a better pooling layer is enough. The cost is a few extra parameters in the latent attention layer (latent array $512 \times 4096$ plus Q/K/V projections plus a two-layer MLP), still negligible against the 7B backbone.
Wrap-up
NV-Embed pushes the "decoder-LLM as embedding model" route to a relatively clean form: drop the causal mask, pool with latent attention, train in two stages, and curate the data carefully. None of these steps is complex on its own, but together they reach 72.31 on MTEB. The contribution is not a new paradigm but trimming the architectural changes and the training pipeline down to near-essentials, giving follow-up work a clean baseline. A caveat worth noting: v2's 72.31 includes the advantage of training on MTEB's training splits (MSMARCO and others are MTEB training subsets), and the relative advantage on AIR-Bench's out-of-domain retrieval is smaller than on MTEB itself. Real deployments still need to consider whether their task distribution aligns with MTEB's training set.