GritLM: A Single LLM for Both Embedding and Generation
GritLM (Generative Representational Instruction Tuning, 2024) makes one core proposal: a single LLM handles both embedding and generation, distinguishing the two streams via the instruction format, trained jointly with a contrastive loss and a language modeling loss summed together. Where HyDE showed that “LLM handles relevance, encoder handles similarity” can be decoupled, GritLM goes the other way and merges them back into one model.
Problem background
Embedding models and generative models have long lived on separate tracks. BERT-style bidirectional encoders are good at representation, decoder-only LLMs are good at generation, and pulling hidden states out of an LLM directly for embedding generally underperforms. The paper’s reference point: Llama 2 70B with weighted-mean pooling scores 35.6 on MTEB, while BGE Large 0.34B scores 64.2. Two orders of magnitude more parameters, beaten on the task.
The reverse direction also fails. Fine-tune the LLM with only the embedding objective and then ask it to generate through its LM head: the paper measures this Emb.-only 7B variant at 23.5 on MMLU (random chance is 25.0), meaning embedding-only training has fully washed out the generative capability.
In deployment this forces RAG systems to run two models: an embedding model encodes the query and the documents for retrieval, then a generative model reads the query plus retrieved documents and produces an answer. Query and document each pass through both models, for a total of 4 forward passes. GritLM is built to remove this split.

The GRIT method

One pretrained LLM, two training datasets, two forward paths, two losses summed together with weights.
Embedding stream. Input format <s><|user|>{instruction}<|embed|>{sample}, attention switched to bidirectional, mean pooling over the final hidden states (averaged only over the sample portion; the instruction and format tokens don’t count toward the average but still influence the representation through self-attention). The loss is in-batch-negative InfoNCE:
$$\mathcal{L}_{\text{Rep}} = -\frac{1}{M} \sum_{i=1}^{M} \log \frac{\exp(\sigma(f(q_i), f(d_i)) / \tau)}{\sum_{j=1}^{M} \exp(\sigma(f(q_i), f(d_j)) / \tau)}$$
$\sigma$ is cosine similarity, $\tau$ is the temperature.
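A minimal sketch of the embedding path (not the paper's code; names, shapes, and the temperature value are illustrative): the backbone is assumed to run with bidirectional attention, pooling averages only the {sample} tokens, and every other in-batch document serves as a negative.

```python
import torch
import torch.nn.functional as F

def pool_sample_tokens(hidden_states, sample_mask):
    """Mean-pool final hidden states over the {sample} tokens only.
    hidden_states: (batch, seq, dim); sample_mask: (batch, seq), 1 on sample tokens."""
    mask = sample_mask.unsqueeze(-1).type_as(hidden_states)
    return (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6)

def info_nce(query_emb, doc_emb, tau=0.02):
    """In-batch-negative InfoNCE: the i-th document is the positive for the i-th
    query; every other document in the batch is a negative."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = (q @ d.T) / tau                           # cosine similarities / temperature
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)             # = -log softmax of the positive pair
```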
Generation stream. Input format <s><|user|>{instruction}<|assistant|>{response}</s>, causal attention preserved, next-token prediction through the LM head, loss computed only over the response tokens.
Total loss:
$$\mathcal{L}_{\text{GRIT}} = \lambda_{\text{Rep}} \mathcal{L}_{\text{Rep}} + \lambda_{\text{Gen}} \mathcal{L}_{\text{Gen}}$$
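A hedged sketch of one joint training step, reusing the two helpers above. The model interface is hypothetical: `bidirectional=True` stands in for however the attention mask is toggled (it is not a standard library argument), and the generation call is assumed to follow the common convention of ignoring label positions set to -100, which is how the loss stays confined to response tokens.

```python
def grit_step(model, emb_batch, gen_batch, lambda_rep, lambda_gen):
    """One joint GRIT step: contrastive loss on the embedding batch plus
    next-token loss on the generation batch, summed with weights."""
    # Embedding stream: bidirectional attention, pool only the {sample} tokens.
    q_h = model(emb_batch["query_ids"], bidirectional=True,
                output_hidden_states=True).hidden_states[-1]
    d_h = model(emb_batch["doc_ids"], bidirectional=True,
                output_hidden_states=True).hidden_states[-1]
    loss_rep = info_nce(pool_sample_tokens(q_h, emb_batch["query_sample_mask"]),
                        pool_sample_tokens(d_h, emb_batch["doc_sample_mask"]))

    # Generation stream: causal attention; labels are -100 everywhere except the
    # response, so the cross-entropy only counts response tokens.
    loss_gen = model(gen_batch["input_ids"], labels=gen_batch["labels"]).loss

    return lambda_rep * loss_rep + lambda_gen * loss_gen
```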
The <|embed|> special token is the key switch. The E5 dataset has no fixed instruction prefix, so the model has to rely on this token to know whether the current sample is going through the contrastive loss or the LM loss.
GritLM 7B initializes from Mistral 7B, embedding batch size 2048, generation batch size 256, trained for 1253 steps total (1.36 epochs on embedding data, 1 epoch on generation). GritLM 8x7B drops the embedding batch to 256 due to compute limits.
Why the two objectives coexist
Intuitively, the contrastive loss pushes the model toward a single sentence-level representation, while the LM loss pushes per-token conditional probabilities; since the objectives differ, one might expect joint training to hurt both. The measured result says otherwise: GritLM 7B scores 66.8 on MTEB, matching the Emb.-only variant (66.8), and its generation average is 55.5 vs. 55.2 for the Gen.-only variant. The two objectives barely fight.
The paper’s explanation is that both task families demand deep language understanding, differing only in how that understanding is expressed. The paper further conjectures that some “small number of parameters act as a switch” inside the model so the final representation is either suitable for mean pooling or primed for the LM head, but the paper explicitly flags this as speculation (“Possibly…”) and does no localization experiment.
Worth noting: under the MEDI2 dataset, adding the generative objective actually improves embedding performance over an embedding-only baseline, but this disappears once they switch to E5 and the two are tied.
Experimental results
MTEB (embedding). GritLM 7B averages 66.8 across 56 datasets, beating E5 Mistral 7B (66.6) and BGE Large (64.2), making it the open-source SOTA at the time. GritLM 8x7B is 65.7, slightly behind the 7B; the paper attributes this to the embedding batch being dropped from 2048 to 256 due to compute constraints.
Generation. GritLM 7B averages 55.5 across MMLU/GSM8K/BBH/TyDi QA/HumanEval/AlpacaEval, ahead of Tülu 2 7B (46.3) and Mistral 7B Instruct (44.1), and even beats Llama 2 70B (46.4). GritLM 8x7B averages 65.7, the highest among open generative models the paper compares, ahead of Mixtral 8x7B Instruct (60.3) and Tülu 2 70B (65.1).
Reranking. GritLM works as both a bi-encoder and a cross-encoder. The paper uses Sun et al.'s permutation-generation prompt to rerank the top-10 retrieved results, lifting the MTEB retrieval average from 57.4 to 57.9. The Conclusion claims improvements on 15 of 16 retrieval datasets (Table 3 lists 16, with QuoraRetrieval, down from 89.47 to 88.67, the only regression).
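The reranking pass is pure prompting on the generative side. The sketch below shows the general shape of a permutation-generation (listwise) prompt and one way to parse its output; the wording is illustrative, not Sun et al.'s exact template.

```python
import re

def rerank_prompt(query, passages):
    """Build a listwise (permutation-generation) reranking prompt; the model is
    asked to answer with an ordering like '[2] > [1] > [3]'."""
    lines = ["Rank the following passages by relevance to the query.",
             f"Query: {query}", ""]
    lines += [f"[{i}] {p}" for i, p in enumerate(passages, 1)]
    lines.append("Answer with the ranking only, e.g. [2] > [1] > [3].")
    return "\n".join(lines)

def parse_ranking(text, n):
    """Recover a permutation of range(n) from the model's '[i] > [j] > ...' output."""
    order = []
    for m in re.findall(r"\[(\d+)\]", text):
        idx = int(m) - 1
        if 0 <= idx < n and idx not in order:
            order.append(idx)
    return order + [i for i in range(n) if i not in order]  # fall back to original order
```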
RAG caching speedups. This is one of the paper’s selling points. Traditional RAG runs two separate models: the embedding model encodes the query for retrieval, and the generation model reads “query + retrieved documents” to produce the answer. Both query and document pass through both models, for 4 forwards total. With GritLM, retrieval and generation share weights, so the transformer’s internal KV states computed in the embedding pass can be fed directly into the generation pass, skipping a redundant forward. This isn’t possible across different models because KV states are model-internal representations.

The paper proposes three caching strategies:
- Query Caching: cache the query’s KV states during the embedding pass; the generation pass no longer re-forwards the query
- Doc Caching: at index build time, store each document’s KV states alongside its embedding; on retrieval hit, feed them straight into generation
- Query-Doc / Doc-Query Caching: cache both, but since query and doc never get a chance to attend to each other when cached separately, this diverges from the RAG semantics
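A conceptual sketch of Query Caching against a HuggingFace-style interface (hypothetical wiring, not the paper's implementation): the KV cache produced by the embedding forward over the query is handed to the generation forward, so the query is never re-encoded. Note the caveat discussed next: those cached states were computed with bidirectional attention.

```python
import torch

@torch.no_grad()
def generate_with_query_cache(model, tokenizer, query_ids, doc_text, max_new_tokens=64):
    # 1) Embedding pass over the query: keep the pooled vector (for retrieval)
    #    and the layer-wise KV states (for generation).
    out = model(input_ids=query_ids, use_cache=True, output_hidden_states=True)
    query_embedding = out.hidden_states[-1].mean(dim=1)  # simplified pooling over all tokens
    past = out.past_key_values

    # 2) Generation pass: only the retrieved document (plus any prompt suffix)
    #    is forwarded; the query tokens are replayed from `past`, never re-encoded.
    input_ids = tokenizer(doc_text, return_tensors="pt").input_ids.to(query_ids.device)
    generated = []
    for _ in range(max_new_tokens):
        out = model(input_ids=input_ids, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1:].argmax(dim=-1)       # greedy decoding for brevity
        generated.append(next_id)
        input_ids = next_id
    return query_embedding, torch.cat(generated, dim=1)
```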
On Natural Questions, sample A (query 1 token, doc 4000 tokens) Doc Caching takes 5.25s on CPU vs. 14.18s for traditional RAG, a 63% speedup; sample B (doc 1 token, query 4000 tokens) Query Caching is 6.87s vs. 14.88s, 54% faster. GPU speedups are smaller (around 30%); the paper explains this as GPUs already processing the entire sequence in parallel, so caching parts of it gains less.
But Query Caching changes the query’s attention pattern (bidirectional during embedding, while generation expects causal). The paper measures Query Caching match score dropping from RAG’s 30.50 to 25.46. Doc Caching actually rises slightly to 33.38; the paper’s explanation is that documents don’t need to be understood as thoroughly as queries, so the “slightly corrupted” KV states don’t hurt generation. Query-Doc and Doc-Query Caching, suffering double attention mismatch, fall to 21.63 and 18.39, near the No-RAG baseline of 21.00. The paper’s takeaway: Query-Doc Caching is of limited practical use; one-sided caching is the cost-effective choice.
Key ablations
Attention. Switching the causal LLM to bidirectional attention during fine-tuning, then mean-pooling, gains 1.8 points on embedding (causal+wmean 60.0 → bidirectional+mean 61.8). The paper confirms the conventional wisdom that causal LLMs used for embedding should be made bidirectional. PrefixLM (instruction bidirectional, response causal) instead loses points.
Initialization. Mistral 7B > Llama 2 7B > GPT-J 6B for both embedding and generation. An interesting finding: pre-fine-tuning, GPT-J’s embedding is stronger than Mistral’s, but post-fine-tuning Mistral takes the lead. The conclusion: pretrained embedding ability does not predict fine-tuned embedding ability; pretrained generative ability is the better indicator.
Embedding dataset. E5 (66.0) > MEDI2 (64.7) > MEDI (64.0). The paper attributes E5’s edge to higher-quality hard negatives and task diversity from GPT-4 generation.
Generative loss granularity. Token-level vs. sample-level loss directly affects generation length, which in turn drives AlpacaEval (known to prefer long responses). The final choice is a mix: token level across 32 samples, then sample level across 8 sub-batches. This mix scores 74.7 on AlpacaEval, vs. 67.6 for the more sample-level “Mix (4 -> 64)”, a 7-point gap that tracks median generation length 941 → 865.
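The distinction is easiest to see in code. In this illustrative sketch (inputs are tensors of shape (batch, seq)), token-level aggregation gives every response token in the batch equal weight, so long responses dominate the gradient, while sample-level aggregation averages within each response first, so every sample counts once.

```python
def token_level_loss(per_token_nll, response_mask):
    """Every response token in the batch counts equally; long responses dominate."""
    return (per_token_nll * response_mask).sum() / response_mask.sum()

def sample_level_loss(per_token_nll, response_mask):
    """Average within each response first, then across samples; length stops mattering."""
    per_sample = (per_token_nll * response_mask).sum(dim=1) / \
                 response_mask.sum(dim=1).clamp(min=1)
    return per_sample.mean()
```

Per the paper's final recipe, the token-level average is taken within sub-batches of 32 samples and the resulting 8 sub-batch losses are then averaged the sample-level way.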
In-batch negative origin. Negatives from the same dataset vs. any dataset score the same on average (66.0), but the Retrieval subset gains 1.3. The paper attributes this to same-dataset negatives being harder to distinguish, forcing the model to learn finer-grained differences.
Embedding batch size. Going from 256 to 4096 lifts the embedding average by 1.0, mostly from the 15 retrieval datasets; generation performance is unchanged.
Precision. BF16 mixed precision is fine overall, but pooling and similarity computation must be cast to FP32 or embedding performance dips. The paper offers no deeper theoretical justification, only an empirical recommendation.
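In code this amounts to an explicit upcast around pooling (and, analogously, around the cosine-similarity computation); a sketch under the same illustrative names as above:

```python
def pool_fp32(hidden_states, sample_mask):
    """The forward pass can stay in BF16; pooling (and the subsequent cosine
    similarity) is done after an explicit cast to FP32."""
    h = hidden_states.float()
    mask = sample_mask.unsqueeze(-1).float()
    return (h * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6)
```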
Few-shot embedding doesn’t work. Adding a single example after the instruction degrades performance. Even with 5% few-shot samples in MEDI2 training, the model fails to learn how to use them. The paper simply notes “the model seems not to have learned to make good use of the few-shot examples” without further analysis.
A few easy-to-miss implementation details
One-sided instructions for asymmetric tasks. The E5 dataset attaches instructions only to queries on retrieval-style tasks, not documents. This way documents are encoded once and can be reused across tasks, which is cache-friendly. Symmetric tasks train one-sided too but evaluate two-sided; the paper argues this is fine because cosine-similarity transitivity preserves A↔B↔C.
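Concretely, for a retrieval-style pair only the query carries an instruction, so a document's embedding can be computed once and cached. Illustrative strings only, following the format quoted earlier rather than the paper's exact templates or whitespace:

```python
# Illustrative inputs only; the separators follow the
# <s><|user|>{instruction}<|embed|>{sample} pattern quoted earlier.
EMBED_FORMAT = "<s><|user|>\n{instruction}\n<|embed|>\n{sample}"

# Query side of a retrieval task: carries the task instruction.
query_input = EMBED_FORMAT.format(
    instruction="Given a question, retrieve passages that answer it",
    sample="Who wrote The Selfish Gene?")

# Document side: no instruction, so the same document embedding can be computed
# once at index time and reused across tasks.
doc_input = EMBED_FORMAT.format(
    instruction="",
    sample="The Selfish Gene is a 1976 book by Richard Dawkins ...")
```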
KV cache storage cost. For 2,681,468 documents and a 7B model, the KV states total around 30TB. The paper notes this can be fully offloaded to disk and loaded on demand, around 12.5MB per sample. The original index is just 43GB; the KV cache is three orders of magnitude larger.
KTO alignment trade-off. KTO (Kahneman-Tversky Optimization) is a preference-alignment method that, unlike DPO, doesn't require paired preference data, only a binary good/bad label per sample. The paper appends a KTO stage after GRIT, on binarized UltraFeedback, training only generation, not embedding. MTEB drops slightly from 66.8 to 66.7, while AlpacaEval gains over 10 points. The implication is that generation-only alignment gradually erodes embedding capability, so embedding training would need to run alongside alignment to preserve it.
The $\lambda_{\text{Rep}} > \lambda_{\text{Gen}}$ setup. The paper insists on $\lambda_{\text{Rep}} / \lambda_{\text{Gen}} > 1$, reasoning that the LM objective is already covered by pretraining while the contrastive objective is new and needs more weight. In practice, the embedding loss drops fast and both losses settle around 1.0 late in training, so the initial weight difference smooths itself out.
Embedding-head trade-off. An optional 4096→1024 down-projection linear layer cuts storage 4× at the cost of 1 point on embedding. GritLM ultimately doesn’t use it, leaving downstream users to apply PCA or similar post-hoc dimension reduction.
Conceptual influence
GritLM’s claim is that embedding and generation are two sides of the same coin and can be handled by one LLM differentiated by instruction. This continues HyDE’s “outsource relevance to the LLM” idea: HyDE assembled two separate models at inference time, GritLM merges them at training time. The price is doubled fine-tuning compute (running both embedding and generation objectives in parallel), but model deployment, caching strategies, and reranker reuse all simplify.
Follow-ups like LLM2Vec and Echo Embedding inherit the “causal LLM → bidirectional + mean pooling for embedding” recipe but skip the unified generation objective. GritLM’s two-objectives-in-one design is more aggressive engineering-wise; whether it’s worth the extra training cost depends on how strongly a deployment scenario benefits from a unified model.
Wrap-up
At 7B, GritLM hits MTEB SOTA and near-best generation simultaneously, showing embedding and generation can coexist in one model without hurting each other, and yielding reranker reuse and RAG caching speedups as direct engineering wins. The costs are real: training runs forward/backward on both objectives, roughly doubling single-objective cost; 7B is heavy for production embedding workloads where 0.1-1B models still hold a price-performance edge; and on RAG caching, Doc Caching demands ~30TB of KV storage (vs. 43GB for the index) while Query Caching speeds things up at a visible match-score cost, so each one-sided option carries a real trade-off even though both remain more practical than caching both sides. The training tricks the paper's ablations consolidate (bidirectional attention + mean pooling, BF16 with FP32 pooling cast, in-batch negatives from the same dataset) have been widely reused since.