HyDE: Retrieval via LLM-Generated Hypothetical Documents
HyDE (Precise Zero-Shot Dense Retrieval without Relevance Labels, 2022) makes one core proposal: in zero-shot retrieval, instead of asking an unsupervised encoder to model query-document relevance directly, first let an LLM generate a hypothetical document for the query, then encode that document and use it for retrieval. The training objective of LLM2Vec-Gen is essentially this two-step pipeline internalized into the encoder.
Problem background

The standard dense-retrieval recipe trains two encoders (enc_q for queries, enc_d for documents) and uses their inner product as the relevance score. The "relevance" being optimized is defined entirely by labeled data, and large-scale labeled sets like MS-MARCO are the exception, not the norm; most settings have nothing comparable. Unsupervised contrastive methods like Contriever can learn document-document similarity, but query-document relevance has no supervision signal to fit, and as a result Contriever often loses to BM25 in zero-shot settings.
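For concreteness, the score this recipe learns is just an inner product between the two encoders' outputs. A minimal sketch (the names enc_q / enc_d follow the text above; everything else is illustrative, not any particular system's API):

```python
import torch

def relevance_score(enc_q, enc_d, query: str, doc: str) -> torch.Tensor:
    # score(q, d) = <enc_q(q), enc_d(d)>; the notion of "relevance" fit this way
    # exists only where labeled (query, document) pairs exist.
    return torch.dot(enc_q(query), enc_d(doc))
```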
HyDE breaks retrieval into two steps:
- Generation: an instruction-tuned LLM (the paper uses InstructGPT / text-davinci-003) generates a hypothetical document from the query
- Encoding: an unsupervised contrastive encoder (Contriever) encodes that document into a vector and runs ANN search over the corpus
Query-document “relevance” is implicitly carried by the generation step, and what remains is document-document similarity, exactly what unsupervised contrastive learning handles well. As the paper puts it: “the query-document similarity score is no longer explicitly modeled nor computed.”
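A minimal sketch of the two steps, assuming the facebook/contriever checkpoint on Hugging Face, a precomputed matrix of corpus embeddings, and an `llm` callable standing in for InstructGPT (helper names and the exact-search shortcut are illustrative, not the paper's code):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
encoder = AutoModel.from_pretrained("facebook/contriever")

def embed(texts):
    # Contriever-style mean pooling over token embeddings.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(1) / mask.sum(1)

def hyde_search(query, llm, corpus_embeddings, k=10):
    # Step 1: generation - the LLM writes a hypothetical answer passage.
    hypo_doc = llm(f"Please write a passage to answer the question.\nQuestion: {query}\nPassage:")
    # Step 2: encoding - embed the hypothetical document (not the query) and
    # search the corpus; exact inner-product search here, ANN at scale.
    q_vec = embed([hypo_doc])[0]
    scores = corpus_embeddings @ q_vec
    return scores.topk(k).indices
```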
The generated content does not need to be factually correct, only distributionally close to a “relevant document”. The paper’s explanation is that Contriever’s dense bottleneck acts as a lossy compressor and discards fabricated details while preserving the semantic skeleton. This is an assumption rather than something proven; the actual support comes from downstream retrieval scores.
Why a generated hypothetical document works
The generated answer can be wrong on facts (numbers, names, dates can all be hallucinated). The paper offers two reasons it still works:
Lossy-compression nature. A few-hundred-dimensional dense vector cannot hold every detail of a document; what survives is semantics and topic. Even if the LLM gets specific numbers, names, or dates wrong, the semantic skeleton in the embedding space is unchanged and still matches the truly relevant documents. The paper only argues this qualitatively from the lossiness of the dense bottleneck and gives no concrete example.
Examples are easier to learn than abstract concepts. Letting the LLM exemplify “relevance” is easier than asking the encoder to learn it abstractly. By 2022, instruction-tuned LLMs already had strong instruction-following generalization; HyDE outsources relevance modeling to NLG and lets the encoder do only document-document similarity.
LLM2Vec-Gen’s core hypothesis (the embedding should represent the LLM’s response, not the query itself) is a direct continuation of this idea.
Experimental results
Web search (TREC DL19/20). Contriever alone scores nDCG@10 of 44.5/42.1, BM25 50.6/48.0, so Contriever loses to lexical matching. HyDE (InstructGPT generation + Contriever encoding) jumps to 61.3/57.9, about 10 points above BM25. ContrieverFT, fine-tuned on MS-MARCO, scores 62.1/63.2: HyDE essentially ties on DL19 but is still about 5 points behind on DL20 (the paper writes “around 10% lower map and ndcg@10 than ContrieverFT”).
Low-resource BEIR. On SciFact, ArguAna, TREC-Covid, FiQA, DBPedia, and TREC-NEWS, HyDE substantially outperforms unsupervised baselines and beats fine-tuned DPR and ANCE on most of them. The exceptions are TREC-Covid (HyDE 59.3 vs BM25 59.5) and FiQA / DBPedia, where HyDE underperforms ContrieverFT. FiQA is financial-forum posts and DBPedia is entity retrieval; the paper attributes the gap to instructions not adequately conveying the domain. The implication is that HyDE's relevance modeling depends on the LLM's domain knowledge, and its ceiling drops in domains the LLM is less familiar with.
Multilingual (Mr.TyDi). On Swahili, Korean, Japanese, and Bengali, HyDE improves over mContriever but still trails fine-tuned mContrieverFT by visible margins. The paper attributes this to non-English languages being under-trained in both LLM pretraining and instruction tuning, leaving generation quality lower.
Two ablation findings
Bigger LLM, stronger HyDE. With Contriever fixed: FLAN-T5 11B gives nDCG@10 of 48.9/52.9 on DL19/20, Cohere 52B gives 53.8/53.8, and InstructGPT 175B gives 61.3/57.9. The trend is monotone in model size, matching the "8B > 4B > 1.7B" trend reported in LLM2Vec-Gen.
Fine-tuned encoder is not always worth it. Replacing Contriever with the MS-MARCO fine-tuned ContrieverFT, paired with weaker instruction LMs (FLAN-T5 11B / Cohere 52B), yields scores slightly below ContrieverFT alone; only with InstructGPT 175B does the combination push beyond ContrieverFT (DL19 nDCG@10 of 67.4, +5 over ContrieverFT alone). The paper’s takeaway: HyDE assumes no labels are available; once in-domain supervision exists, weak LM + fine-tuned encoder is a bad trade, and only a sufficiently strong LM can squeeze further gains out of HyDE.
This delineates HyDE's applicability: strong in cold-start or no-in-domain-label settings, but it gives way to a fine-tuned dense retriever once enough labels are available. The paper closes with a phased deployment recommendation: a search system can run entirely on HyDE at launch, then gradually shift to a fine-tuned retriever as logs accumulate, sending high-frequency, mature queries to the supervised model and leaving long-tail and emerging queries to HyDE.
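A toy sketch of that routing policy (the frequency threshold and counter are placeholders, not anything from the paper):

```python
def route(query, query_freq, finetuned_retriever, hyde_retriever, min_freq=100):
    # High-frequency, mature queries go to the fine-tuned dense retriever;
    # long-tail and emerging queries fall back to HyDE.
    if query_freq.get(query, 0) >= min_freq:
        return finetuned_retriever(query)
    return hyde_retriever(query)
```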
Influence on subsequent work
Engineering-wise, HyDE trains nothing; it just glues InstructGPT and Contriever together with a few prompts. Its conceptual contribution is decoupling “relevance modeling” from representation learning and outsourcing it to NLG. The implicit assumption in earlier dense retrieval was that the encoder must learn relevance; HyDE’s claim is that relevance can be demonstrated by an LLM writing one passage, leaving the encoder to learn only document-document similarity. The paper closes with a literal question: “is numerical relevance just a statistical artifact of language understanding?”
Several follow-ups continue the same proposition:
- LLM2Vec-Gen internalizes HyDE’s “generate at inference + encode” two steps into training, removing the per-query generation cost
- Query2doc, Promptagator apply the same idea to query expansion and data augmentation
- GritLM and similar generative + representational joint training directions share the same proposition
A few easy-to-miss implementation details
Query itself as an extra hypothesis. Equation (8) in the paper adds the query's own embedding into the average pool with the same weight as one hypothetical document: v_q = (1/(N+1)) [Σ_k f(d_k) + f(q)]. The paper only says it "also considers the query as a possible hypothesis" without elaborating on the motivation.
Averaging over multiple generations. By default HyDE samples multiple hypothetical documents from InstructGPT at temperature=0.7 for the same query, then averages their embeddings to smooth out the variance of any single generation. Part of the motivation for LLM2Vec-Gen internalizing this step lies here: running multiple generations per query carries a threefold cost of latency, money, and sampling variance.
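Putting these two details together, the query vector ends up looking roughly like the sketch below (reusing the embed() helper from the earlier pipeline sketch; temperature=0.7 is the paper's stated setting, while the prompt wording and n=8 default here are illustrative):

```python
def hyde_query_vector(query, llm, n=8):
    # Sample n hypothetical documents, then average their embeddings together
    # with the query's own embedding, each with equal weight:
    #   v_q = (1/(n+1)) * [ sum_k f(d_k) + f(q) ]
    prompt = f"Please write a passage to answer the question.\nQuestion: {query}\nPassage:"
    hypo_docs = [llm(prompt, temperature=0.7) for _ in range(n)]
    return embed(hypo_docs + [query]).mean(dim=0)
```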
Per-dataset instruction. Appendix A.1 lists 8 prompts: web search uses "write a passage to answer the question", SciFact uses "write a scientific paper passage to support/refute the claim", and FiQA uses "write a financial article passage". HyDE is therefore not strictly zero-shot; the notion of "what counts as a relevant document" is injected through the prompt. The paper broadly attributes the gaps on FiQA and DBPedia to instructions that are not specific enough.
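To make the role of the instruction concrete, the three prompts quoted above could be arranged per dataset roughly like this (wording reconstructed from the quotes in this note, not copied from Appendix A.1; the dataset keys and the full-prompt template are illustrative):

```python
# Hypothetical per-dataset instruction table; only the three prompts quoted in
# this note are included, and their wording is paraphrased, not the appendix text.
INSTRUCTIONS = {
    "web-search": "Please write a passage to answer the question.",
    "scifact":    "Please write a scientific paper passage to support or refute the claim.",
    "fiqa":       "Please write a financial article passage to answer the question.",
}

def build_prompt(dataset: str, query: str) -> str:
    return f"{INSTRUCTIONS[dataset]}\n{query}\nPassage:"
```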
Wrap-up
HyDE is a simple and effective baseline in unlabeled settings: a single prompt yields performance close to a fine-tuned dense retriever. Methodologically, it splits retrieval into two decoupled parts, “LLM handles relevance + encoder handles similarity”, and this split has been reused repeatedly in later work.