ANCE: Training Dense Retrievers with Full-Corpus ANN Hard Negatives
ANCE (Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval; Microsoft, ICLR 2021) is an old paper from 2020, but the standard practice of hard negative mining in dense retrieval training basically took shape here, and DPR variants, E5, and BGE all inherit the same framework. The core idea: maintain an ANN index throughout training, sample the negatives that the current model finds hardest to distinguish from the whole corpus, and refresh the index asynchronously on a fixed schedule.
The paper has two contributions. First, an importance-sampling variance analysis showing that the in-batch and BM25 negatives commonly used in dense retrieval training have near-zero gradient norms and are the convergence bottleneck. Second, a concrete recipe for sampling hard negatives from the full corpus via ANN, together with a solution to the engineering bottleneck of “the index has to stay in sync with the model.” On TREC 2019 DL, MS MARCO, NQ, and TQA, a BERT-Siamese backbone trained with ANCE reaches NDCG@10 0.628 (MaxP) on document retrieval, MRR@10 0.330 on passage retrieval, beats DPR on Top-20 Coverage on both NQ and TQA, and improves offline retrieval quality by roughly 14%~15% on an 8B-corpus production setting.
Paper: arxiv.org/abs/2007.00808
Why dense retrieval struggles against BM25
Dense retrieval (DR) encodes queries and documents into dense vectors and retrieves via ANN in vector space. In principle DR can capture semantic matches beyond surface terms, but around 2020 the reality was that end-to-end DR often lost to BM25 on document retrieval.
The paper attributes the gap to negative sampling. The peculiarity of first-stage retrieval is that “irrelevant documents” means the entire corpus $C$ minus the relevant set $D^+$, on the order of millions to billions, and only a small batch can be sampled at a time. Three common recipes exist: BM25 top documents as negatives, in-batch positives of other queries as negatives, or a mix of both. None of these work particularly well in DR, and the paper asks why.
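To make the in-batch recipe concrete, here is a minimal sketch of how a dual encoder treats other queries’ positives as negatives (assuming PyTorch; the function name and tensor shapes are illustrative, not taken from any of the cited papers):

```python
import torch
import torch.nn.functional as F

def in_batch_negative_loss(q_emb: torch.Tensor, d_emb: torch.Tensor) -> torch.Tensor:
    """Illustrative in-batch negative loss for a dual encoder.

    q_emb: (B, H) query embeddings; d_emb: (B, H) embeddings of each query's
    positive document. Every other query's positive serves as a negative.
    """
    scores = q_emb @ d_emb.T                                     # (B, B) similarity matrix
    labels = torch.arange(scores.size(0), device=scores.device)  # positives on the diagonal
    return F.cross_entropy(scores, labels)                       # softmax over the batch
```

The negatives a query sees are therefore whatever happens to be in its batch, which is exactly the locality the next section blames.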
Why local negatives fail, from a gradient-variance angle
The authors write the one-step SGD convergence rate via importance sampling:
$$E_t = 2 E_{P_{D^-}}(g_{d^-})^T(\theta_t - \theta_*) - 2 E_{P_{D^-}}(g_{d^-})^T E_{P_{D^-}}(g_{d^-}) - 2 \text{Tr}(V_{P_{D^-}}(g_{d^-}))$$
where $g_{d^-}$ is the weighted gradient on negative $d^-$. The key term is the last one: the variance of the gradient estimator is subtracted, so higher variance means less progress per step. To keep the variance small and convergence fast, the optimal sampling distribution is
$$p_{d^-} \propto \left\| \nabla_{\theta_t} l(d^+, d^-) \right\|_2$$
i.e., sample negatives in proportion to their gradient norm. Negatives with large gradient norms move the loss the most; they are the informative ones.
For common BCE and pairwise hinge losses you can verify that $l(d^+, d^-) \to 0$ implies the gradient norm $\to 0$, so negatives whose loss is already near zero contribute almost nothing to training. Let $D^-$ be the set of truly hard-to-distinguish negatives and let $b$ be the batch size. DR training satisfies two empirical facts: $b \ll |C|$ and $|D^-| \ll |C|$. A uniformly sampled negative is informative with probability $\frac{|D^-|}{|C|}$, so even a full batch of $b$ negatives contains an informative one with probability only about $\frac{b \cdot |D^-|}{|C|}$, which is still near zero. The conclusion is clear: the DR training space is too large and hard negatives too sparse, so looking inside the batch is hopeless.
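A one-line check of the hinge-loss claim (my own sketch, not an equation from the paper): with margin $m$ and scoring function $f$,
$$\nabla_{\theta} l(d^+, d^-) = \begin{cases} \nabla_{\theta} f(q, d^-) - \nabla_{\theta} f(q, d^+), & \text{if } m - f(q, d^+) + f(q, d^-) > 0 \\ 0, & \text{otherwise}, \end{cases}$$
so a negative that is already separated from the positive by more than the margin contributes exactly zero gradient; for BCE the gradient decays smoothly toward zero as the loss does.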
Full-corpus ANN hard negatives with asynchronous index refresh
ANCE’s training objective is
$$\theta^* = \arg\min_{\theta} \sum_q \sum_{d^+ \in D^+} \sum_{d^- \in D^-_{\text{ANCE}}} l(f(q, d^+), f(q, d^-))$$
$D^-_{\text{ANCE}}$ is drawn from the top ANN retrieval results of the current DR model over the whole corpus, with true positives removed. By the previous section, these are exactly the samples with the largest gradient norm and the highest training contribution.
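A minimal sketch of how $D^-_{\text{ANCE}}$ could be built from the current model’s embeddings (faiss is used here purely for illustration; the paper does not prescribe a library, and `build_ance_negatives`, `top_k`, and `num_negs` are assumed names and parameters):

```python
import faiss
import numpy as np

def build_ance_negatives(query_embs: np.ndarray,     # (num_queries, dim), float32
                         doc_embs: np.ndarray,       # (num_docs, dim), float32
                         positives: list[set[int]],  # positives[i]: relevant doc ids for query i
                         top_k: int = 200,
                         num_negs: int = 4) -> list[list[int]]:
    """Sample hard negatives for each query from its top ANN results."""
    index = faiss.IndexFlatIP(doc_embs.shape[1])   # exact inner-product search
    index.add(doc_embs)                            # "re-encode the corpus, rebuild the index"
    _, ids = index.search(query_embs, top_k)       # the current model's hardest candidates

    negatives = []
    for i, cand in enumerate(ids):
        hard = [int(d) for d in cand if int(d) not in positives[i]]  # remove true positives
        picked = np.random.choice(hard, size=min(num_negs, len(hard)), replace=False)
        negatives.append([int(d) for d in picked])
    return negatives
```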
The engineering challenge is that the model updates every mini-batch, so in principle the ANN index should also update every step, which is completely infeasible if you have to re-encode the whole corpus and rebuild the index. ANCE borrows REALM’s approach: two parallel processes, a Trainer that keeps training and an Inferencer that takes the most recent checkpoint $f_k$ to re-encode the whole corpus and rebuild ANN. While waiting, the Trainer keeps training on negatives sampled from the older index $\text{ANN}_{f_{k-1}}$, and only switches to the new index every $m$ batches.

The cost is that the Trainer’s negatives are always slightly “stale.” The paper uses a 1:1 GPU split (4 for training, 4 for inference), and empirically a 10k-batch refresh interval introduces a level of staleness that has very little effect on training.
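Boiled down to a single loop, the refresh schedule looks roughly like the sketch below. This is a schematic only: in the paper the Trainer and Inferencer run as parallel processes on separate GPUs, whereas here the refresh is folded into the training loop, and every helper (`encode_corpus`, `build_index`, `sample_negatives`, `nll_loss`) is an injected placeholder rather than ANCE’s released code.

```python
def ance_training_loop(model, optimizer, train_batches, corpus,
                       encode_corpus, build_index, sample_negatives, nll_loss,
                       refresh_interval=10_000):
    """Schematic single-process version of ANCE's asynchronous index refresh."""
    # Start from the BM25-warmed-up model: a cold-start index is near random.
    ann_index = build_index(encode_corpus(model, corpus))
    for step, batch in enumerate(train_batches):
        # Trainer: negatives come from the most recently finished index, so they
        # are up to `refresh_interval` steps stale relative to the current model.
        negs = sample_negatives(ann_index, batch.queries, exclude=batch.positives)
        loss = nll_loss(model, batch.queries, batch.positives, negs)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if step > 0 and step % refresh_interval == 0:
            # Inferencer's job: re-encode the whole corpus with the latest
            # checkpoint and swap in the fresh ANN index.
            ann_index = build_index(encode_corpus(model, corpus))
    return model
```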
The implementation uses a BERT-Siamese dual-encoder, dot-product similarity, and NLL loss. Two variants handle long documents: FirstP takes the first 512 tokens; MaxP splits a document into up to four 512-token segments and takes the max segment score. All DR models are first warmed up with BM25 hard negatives before switching to ANCE; otherwise the cold-start ANN is near random and cannot produce useful hard negatives.
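A sketch of the scoring and loss just described, plus the MaxP aggregation (assuming PyTorch; `ance_nll_loss` and `maxp_score` are my own illustrative names and shapes, not the released code):

```python
import torch
import torch.nn.functional as F

def ance_nll_loss(q_emb, pos_emb, neg_embs):
    """Dot-product similarity + NLL loss over one positive and K ANCE negatives.

    q_emb: (B, H), pos_emb: (B, H), neg_embs: (B, K, H).
    """
    pos_score = (q_emb * pos_emb).sum(-1, keepdim=True)       # (B, 1)
    neg_scores = torch.einsum("bh,bkh->bk", q_emb, neg_embs)  # (B, K)
    scores = torch.cat([pos_score, neg_scores], dim=1)        # positive sits in column 0
    labels = torch.zeros(scores.size(0), dtype=torch.long, device=scores.device)
    return F.cross_entropy(scores, labels)                    # negative log-likelihood

def maxp_score(q_emb, segment_embs):
    """MaxP: score each <=512-token segment (up to four per document), take the max.

    q_emb: (B, H), segment_embs: (B, S, H); FirstP would just use segment 0.
    """
    seg_scores = torch.einsum("bh,bsh->bs", q_emb, segment_embs)
    return seg_scores.max(dim=1).values
```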
Experimental results
TREC 2019 DL Track document retrieval NDCG@10: BM25=0.519, the strongest sparse method DeepCT (uses BERT to learn token importance and bakes it into the inverted index)=0.554, the mixed-negative DR baseline BM25+Rand=0.557, ANCE FirstP=0.615, MaxP=0.628. The key observation the paper stresses is that across all DR negative-construction schemes, only ANCE lets BERT-Siamese consistently beat sparse methods on document retrieval.
Passage retrieval MRR@10: BM25=0.240, BM25 Neg=0.299, the DPR-style BM25+Rand=0.311, ANCE FirstP=0.330.
OpenQA single-task Top-20 Coverage on NQ and TQA: ANCE 81.9 and 80.3, vs DPR 78.4 and 79.4. With a reader attached (RAG-Token on NQ, DPR Reader on TQA), end-to-end answer accuracy goes from 41.5 to 46.0 on NQ and from 56.8 to 57.5 on TQA.
Production search: 250M corpus, 768 dim, exact KNN gives a relative improvement of about 18.4%; 8B corpus, 64 dim, KNN gives 14.2% and ANN 15.5%.
Efficiency: ANCE offline inference (encoding the full corpus + building ANN) takes about 10h, but online retrieval with 100 documents per query takes just 11.6ms, about 100x faster than BM25 + BERT rerank’s 1.42s.
Representation-space consistency and convergence behavior
The most compelling evidence is that the gap between reranking and retrieval nearly closes. The same model can be used as a reranker (rerank BM25 top-100) or as a retriever (full-corpus ANN), and for most DR models the two scores differ substantially. ANCE pulls them almost together (FirstP doc: rerank 0.641 vs retrieval 0.615; MaxP: 0.671 vs 0.628), while the other DR baselines show much larger gaps. In other words, the representation space ANCE learns finds relevant documents during reranking and also finds them during full-corpus ANN retrieval. The paper uses this to argue that “interaction-based BERT is not required,” a correction to the dominant intuition in neural IR at the time.
On convergence, the ANCE training loss stays high throughout training, while local negatives like NCE Neg and Rand Neg quickly drop to near zero. Gradient norms with ANCE are orders of magnitude larger than with local negatives, matching the theory above. The paper also measures the overlap of local negatives with “truly hard negatives at test time”: NCE Neg and Rand Neg both 0%, BM25 Neg 15%, ANCE 63% at the start and by definition 100% at convergence.
Refresh frequency, warm-up, and divergence from sparse retrieval
The async refresh interval is a key hyperparameter. The paper uses every 10k batches with a 1:1 GPU split. 5k is more stable; 20k can fluctuate under a high learning rate but still works with a reasonable LR. Skewing the GPU split too far toward the Trainer makes the index visibly stale.
BM25 warm-up is mandatory. A cold-start ANN is near random and cannot produce hard negatives, so ANCE is not a standalone training paradigm but a two-stage pipeline of sparse-negative warm-up + dense hard-negative fine-tuning.
DR and sparse retrieval results overlap less than one might expect. ANCE and BM25 Top-100 overlap is only 17%~25%, suggesting dense retrieval learns a different similarity, not an approximation of BM25. The case studies show that ANCE’s irrelevant retrievals tend to be “semantically related but off-topic,” a different failure mode from sparse retrieval.
Summary
ANCE does two things. First, diagnosis: an importance-sampling variance analysis shows that in-batch and BM25 local negatives in DR training have vanishing gradient norms, which is a convergence bottleneck rather than an engineering detail. Second, the recipe: sample hard negatives from the full corpus via an asynchronously refreshed ANN index, paired with BM25 warm-up to solve the cold start.