Differentiable Search Index: A Brief Reading
DSI (Differentiable Search Index) is one of the earlier representative papers on generative retrieval, published at NeurIPS 2022. Its core approach: encode the entire document corpus into the parameters of a Transformer, and at retrieval time, decode the document ID directly with a seq2seq model, removing the inverted index, vector store, and nearest-neighbor search as separate components.
The earlier GENRE (De Cao et al., 2020) had already used a seq2seq model to autoregressively decode Wikipedia entity titles, and DSI cites it as related work. DSI’s further contribution is extending the decoding target from semantically meaningful entity names to arbitrary forms of docid (including random integers and hierarchical semantic IDs), and systematically comparing document representations, ID representations, and training strategies. This reframes retrieval from a systems engineering problem into an end-to-end machine learning problem: indexing is training, retrieval is inference.
Paper: arxiv.org/abs/2202.06991
Basic setup
Conventional retrieval is two-stage: encode the query and each document into vectors separately, then run MIPS (Maximum Inner Product Search) to find nearest neighbors. The Dual Encoder embodies this approach and remains the industry default.
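A minimal sketch of that two-stage pipeline, with a random stub standing in for a trained encoder (the `encode` helper is illustrative, not from the paper):

```python
import numpy as np

# Stub encoder: in practice a trained BERT/T5-style model producing
# dense vectors; a fixed random projection just keeps the sketch runnable.
def encode(texts, dim=128, seed=0):
    rng = np.random.default_rng(seed)
    return rng.normal(size=(len(texts), dim)).astype(np.float32)

docs = ["doc one ...", "doc two ...", "doc three ..."]
doc_vecs = encode(docs)                     # stage 1 (offline): index the corpus

query_vec = encode(["who wrote doc two?"])  # stage 2 (online): encode the query
scores = doc_vecs @ query_vec.T             # MIPS: inner product with every doc
top1 = int(scores.argmax())                 # nearest neighbor = retrieved doc
```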
DSI collapses both stages into one: train a seq2seq model that takes a query as input and emits a docid directly.

Document content is stored in the model weights. Training has two tasks: indexing (doc → docid) makes the model memorize the mapping between each document and its ID, while retrieval (query → docid) maps a query to the corresponding document. The two tasks share parameters and are trained jointly in a multi-task setup.
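Both tasks collapse into a single seq2seq data format; a hypothetical sketch (the helper name and dict layout are mine, not the paper's):

```python
# Hypothetical helper: cast both DSI tasks into one seq2seq format.
# `corpus` maps docid -> document text; `queries` holds (query, docid) pairs.
def build_dsi_examples(corpus, queries):
    examples = []
    # Indexing task: document text -> docid (memorize the mapping).
    for docid, text in corpus.items():
        examples.append({"input": text, "target": docid})
    # Retrieval task: query -> docid (same format, shared parameters).
    for query, docid in queries:
        examples.append({"input": query, "target": docid})
    return examples

corpus = {"233": "The Eiffel Tower, built in 1889, stands in Paris ...",
          "118": "Python is a programming language ..."}
queries = [("where is the eiffel tower", "233")]
train_examples = build_dsi_examples(corpus, queries)
```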
On NQ320K, an 11B-parameter DSI with semantic docids reaches 40.4% Hits@1, compared with 24.3% for a same-size Dual Encoder.
Docid representation
Atomic docid: assign each document its own unique token and expand the output softmax accordingly. Direct, but on NQ320K the docid vocabulary grows to roughly 320K entries, which scales poorly and makes training unstable.
Naive string docid: treat the integer ID as a string (e.g., “237491”) and decode it character by character. Workable, but the docids bear no semantic relationship to one another, so the model can only memorize them by rote.
Semantic structured docid: encode all documents into vectors with BERT, then k-means cluster them into 10 groups. Each document takes the digit of its top-level cluster (0-9) as the first character of its docid. Any cluster with more than 100 documents is recursively split into 10 sub-clusters. Each document ends up with a hierarchical string like “233”.
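A sketch of that procedure using scikit-learn's KMeans, assuming `embeddings` maps a document key to its BERT vector; the constants (10 clusters, 100-document leaves) follow the paper, while the function itself is illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def semantic_docids(embeddings, prefix="", k=10, max_leaf=100):
    """Recursively cluster documents; the path of cluster digits
    becomes each document's hierarchical docid."""
    if len(embeddings) <= max_leaf:
        # Leaf cluster: the paper assigns an arbitrary number within it;
        # a position index does the job here.
        return {doc: prefix + str(i) for i, doc in enumerate(embeddings)}
    ids = {}
    keys = list(embeddings)
    X = np.stack([embeddings[d] for d in keys])
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    for c in range(k):
        sub = {d: embeddings[d] for d, lab in zip(keys, labels) if lab == c}
        if sub:
            ids.update(semantic_docids(sub, prefix + str(c), k, max_leaf))
    return ids
```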

The entire corpus is organized as a trie, and beam search walks down the tree structure. Each decoded digit shrinks the search space by a factor of ten.
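A toy version of the constraint (the same idea as the `prefix_allowed_tokens_fn` hook in Hugging Face's `generate`, used for GENRE-style constrained decoding): build a trie over all docids, and at each step only allow digits that extend a real docid.

```python
# Toy trie over docid strings; beam search consults it so that only
# prefixes of real docids can ever be decoded.
def build_trie(docids):
    root = {}
    for d in docids:
        node = root
        for ch in d:
            node = node.setdefault(ch, {})
        node["<eos>"] = {}            # marks a complete docid
    return root

def allowed_next(trie, prefix):
    node = trie
    for ch in prefix:
        node = node[ch]
    return list(node)                 # legal next tokens after this prefix

trie = build_trie(["233", "237", "118"])
print(allowed_next(trie, ""))         # ['2', '1']
print(allowed_next(trie, "23"))       # ['3', '7']
```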
The difficulty of retrieval is closely tied to how the target object is named. Random integer IDs amount to making the model learn a lookup algorithm from scratch. Hierarchical semantic IDs hand the model a pre-clustered decision tree. Under the same model capacity, the latter is significantly easier to optimize.
The scaling curve makes this difference especially clear:

Dual Encoder hardly improves with scale, gaining less than 4 points from 220M to 11B. DSI Naive benefits from scale, going from 6.7 at 220M to 23.8 at 11B, but its absolute number only catches up to DE at 11B. DSI Semantic starts on par with DE but climbs more steeply, ending about 66% higher than DE at 11B (40.4 vs 24.3).
The architecture didn’t change. Only the docid naming did, and the scaling behavior diverged completely.
Indexing as a training objective
In a traditional retrieval system, indexing is offline preprocessing and retrieval is online runtime. The two are completely separate.
DSI treats both as seq2seq training objectives. The authors found that two-stage training (indexing first, then retrieval) is mediocre, and mixed multi-task training is significantly better, with an optimal indexing-to-retrieval sample ratio of about 32:1.
The reason is that the two tasks aren’t on equal footing. Retrieval is parasitic on indexing: without docids being assigned meaning first, the retrieval stage just predicts meaningless tokens.
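The mixing itself is simple to sketch. Assuming the 32:1 ratio above, one hypothetical way to assemble a batch (the paper reports the ratio, not this exact batching scheme):

```python
import random

def sample_batch(indexing_examples, retrieval_examples,
                 batch_size=66, ratio=32):
    """Sample indexing and retrieval examples at roughly ratio:1 per batch.
    Assumes both pools are at least as large as the requested counts."""
    n_index = round(batch_size * ratio / (ratio + 1))   # 64 of 66 here
    batch = random.sample(indexing_examples, n_index)
    batch += random.sample(retrieval_examples, batch_size - n_index)
    random.shuffle(batch)
    return batch
```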
The appendix documents another phenomenon: catastrophic forgetting during DSI training is quite visible. Right after a batch of documents is indexed, retrieval accuracy on that batch peaks; by the time training cycles back to the same batch an epoch later, accuracy has dropped sharply. Training shows regular peaks and troughs, and the final checkpoint is in fact chosen at one of the peaks.
The storage semantics of model weights and a database are not the same. Inserting a row in a database doesn’t affect other rows, while updates to model weights are global, and adding a new document may displace something already memorized.
Effect of document length
Another finding concerns how much of each document is fed in during indexing. The authors compared three truncation lengths: 32, 64, and 128 tokens.

L=32 significantly outperforms L=128, L=64 is comparable to L=32, and the drop happens beyond 64 tokens. The result seems counterintuitive at first, since more tokens usually mean more information. The paper attributes it to optimization and memorization becoming harder as the token count grows.
One plausible intuition: what the model actually memorizes is not document content itself but the doc → docid mapping. Shorter text yields a tighter “fingerprint” that binds more cleanly to a single docid; longer text introduces extra noise.
A similar “shorter units align better” lesson recurs in later retrieval work. RAG systems generally index chunks rather than whole documents, mainly for context-length and recall-granularity reasons, but the effect of indexing granularity itself is part of the picture too.
Zero-shot retrieval
In the zero-shot setup, the model is trained only on indexing, not on retrieval, and never sees any (query, docid) labels.
Result: on NQ320K, DSI with atomic docids reaches 25.1% Hits@1, versus 11.6% for BM25 and 16.9% for SentenceT5.
With no signal telling it which query corresponds to which document, indexing alone (memorizing document content) is enough for the model to map natural-language questions to the correct document at test time.
One possible explanation is that indexing implicitly learned some semantic alignment: the encoder produces similar representations for “questions” and “answer documents”, so the docid probability distribution triggered by the decoder is also close, equivalent to a dual encoder without an explicitly constructed vector space. The paper does not spell out this interpretation.
Wrap-up
When DSI was proposed, RAG hadn’t yet become mainstream. Subsequent development took a different path: RAG became the default, vector databases entered the infrastructure layer, and DSI as a line of work never really took hold in search. The reasons come down to three:
Incremental updates. Production corpora change daily. DSI requires retraining to add even a single document; a vector store just inserts a row. The paper’s own Conclusion flags dynamic corpora as important future work.
Interpretability. Vector retrieval can give a reason for recall (high inner product); DSI just decodes a docid directly, making debugging harder.
Scale. The paper’s experiments cap at 320K documents and T5-11B. Internet-scale corpora are difficult to support under that parameter budget.
But the question DSI raised hasn’t aged. Are retrieval and generation the same thing? An LLM answering closed-book QA, and DSI mapping the same question to a docid: where does the actual difference live?
The analogy isn’t airtight. The paper’s Related Work draws a clear line: CBQA outputs the answer itself, while DSI outputs an ID that locates a document; docids are arbitrary symbols that need indexing to acquire meaning, while answer tokens already carry semantics from pretraining. Even so, both rely on some form of “parameterized memory”: corpus knowledge is compressed into weights and pulled back out by decoding.
DSI hasn’t seen large-scale adoption in search, but in recommendation, the paradigm of encoding items into semantic IDs and decoding recommendations directly with a seq2seq model has reached several real-world deployments in recent years. The thread “indexing is training, retrieval is inference” hasn’t gone away. Discussions of an LLM’s knowledge cutoff can, in a sense, be read as discussions of what the model has indexed, while RAG uses an external store to patch the parts the model indexed poorly.
DSI hasn’t become the mainstream solution for now, but generative retrieval may well be the future of retrieval.