# DPR

DPR was published at EMNLP 2020. It is a dual-encoder (two-tower) model whose main idea is to use two independent BERTs, one per tower. Seen from today this looks unremarkable, but it is somewhat surprising that in the two years after BERT came out, nobody had done it :-).

The similarity between query and doc is defined as:

$sim(q, p) = E_Q(q)^TE_P(p)$
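The scoring step can be sketched as a plain inner product between the two tower outputs. Below, `encode_query` and `encode_passage` are hypothetical stand-ins for the two independent BERT encoders $E_Q$ and $E_P$ (here they just return random vectors):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the two independent BERT towers E_Q and E_P:
# each maps a text to a fixed-size dense vector (random 768-d here).
def encode_query(q):
    return rng.standard_normal(768)

def encode_passage(p):
    return rng.standard_normal(768)

def sim(q_vec, p_vec):
    # DPR similarity: plain inner product of the two tower outputs.
    return float(q_vec @ p_vec)

q = encode_query("who wrote hamlet")
passages = [encode_passage(p) for p in ["doc a", "doc b", "doc c"]]
scores = [sim(q, p) for p in passages]
best = int(np.argmax(scores))  # retrieve the highest-scoring passage
```

Because the score is a single dot product, passage vectors can be precomputed offline and searched with standard maximum-inner-product index structures.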

# Poly-Encoders

The motivating observation in Poly-Encoders is that the input context is typically much longer than a candidate, so the context is represented with several vectors rather than a single one.

The query and the doc are encoded by two independent Transformers. The query (context) encoding is computed as:

$y_{ctxt} = \sum_i w_i \sum_j w_{j}^{c_i}h_j$
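The formula above is two stacked attention steps: $m$ learned codes $c_i$ each attend over the token states $h_j$ (weights $w_j^{c_i}$) to produce $m$ global vectors, and the candidate then attends over those vectors (weights $w_i$) to produce $y_{ctxt}$. A minimal numpy sketch, assuming randomly initialized codes and states:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, n_tokens, m = 64, 10, 4        # hidden size, context length, number of codes

h = rng.standard_normal((n_tokens, d))  # context token states h_j from the encoder
codes = rng.standard_normal((m, d))     # m learned codes c_i (random here)
y_cand = rng.standard_normal(d)         # candidate embedding

# First attention: each code c_i attends over the token states h_j.
# w_j^{c_i} = softmax_j(c_i . h_j);  y_i = sum_j w_j^{c_i} h_j
w_code = softmax(codes @ h.T, axis=1)   # (m, n_tokens)
y_i = w_code @ h                        # (m, d) global context vectors

# Second attention: the candidate attends over the m global vectors.
# w_i = softmax_i(y_cand . y_i);  y_ctxt = sum_i w_i y_i
w = softmax(y_cand @ y_i.T)             # (m,)
y_ctxt = w @ y_i                        # (d,)

score = float(y_ctxt @ y_cand)          # final matching score
```

Note the asymmetry: the expensive per-token attention depends only on the context, while the cheap second attention is the only part that depends on the candidate.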

# DC-BERT

DC-BERT is a SIGIR 2020 short paper whose approach is a typical late-interaction design. Its main contribution is speed, along with a few small optimizations such as encoding the query only once. Query-doc interaction is handled by k Transformer layers on top, and tuning the hyperparameter k trades off quality against speed.

DC-BERT uses two independent BERT models to encode the query and the doc:

DC-BERT contains two BERT models to independently encode the question and each retrieved document.
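The overall flow can be sketched as follows. The "layers" here are toy stand-ins (random linear maps with ReLU) for real Transformer layers; the point is the structure: two independent lower stacks, the query side encoded once and reused, and k interaction layers on top of the concatenated representations:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32

def make_layer():
    # Toy stand-in for a Transformer layer: random linear map + ReLU.
    W = rng.standard_normal((d, d)) / np.sqrt(d)
    return lambda x: np.maximum(x @ W, 0.0)

def encode(tower, x):
    for layer in tower:
        x = layer(x)
    return x

# Two independent lower stacks (the dual-BERT part).
query_tower = [make_layer() for _ in range(3)]
doc_tower = [make_layer() for _ in range(3)]
# k interaction layers on top: the tunable speed/quality knob.
k = 2
interaction = [make_layer() for _ in range(k)]

q_repr = encode(query_tower, rng.standard_normal((5, d)))  # computed ONCE
for doc_tokens in [rng.standard_normal((8, d)) for _ in range(3)]:
    d_repr = encode(doc_tower, doc_tokens)                 # per document
    joint = np.concatenate([q_repr, d_repr], axis=0)       # late interaction input
    joint = encode(interaction, joint)                     # k interaction layers
    score = float(joint.mean())                            # toy scoring head
```

Caching `q_repr` and keeping k small is exactly where the speedup comes from: the per-document cost is the doc tower (precomputable) plus only k joint layers.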

# ColBERT

ColBERT is a SIGIR 2020 full paper that also gives a fairly comprehensive summary of prior work; the comparison figure at the beginning of this article is taken from the ColBERT paper.

ColBERT shares a single BERT between the query and doc encoders, but prepends the special tokens [Q] and [D] to distinguish the two kinds of input:

We share a single BERT model among our query and document encoders but distinguish input sequences that correspond to queries and documents by prepending a special token [Q] to queries and another token [D] to documents.

ColBERT's main idea is to match query and doc at the token level: for each query token embedding, the MaxSim operator takes the maximum similarity over all doc token embeddings, and these maxima are summed to give the final score. This is easier to follow from the figure and the formulas than from prose.

$E_q = Normalize(CNN(BERT("[Q]q_0q_1 \ldots q_l [mask][mask]...[mask]")))$

$E_d = Filter(Normalize(CNN(BERT("[D]d_0d_1 \ldots d_n"))))$

Given BERT’s representation of each token, our encoder passes the contextualized output representations through a linear layer with no activations. This layer serves to control the dimension of ColBERT’s embeddings, producing m-dimensional embeddings for the layer’s output size m.

Filter removes the token embeddings that correspond to punctuation:

After passing this input sequence through BERT and the subsequent linear layer, the document encoder filters out the embeddings corresponding to punctuation symbols, determined via a pre-defined list.

$S_{q,d}=\sum_{i \in [|E_q|]} \max_{j \in [|E_d|]} E_{q_i} \cdot E_{d_j}^T$
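The MaxSim scoring formula is a few lines of matrix code: compute all pairwise similarities, take the row-wise max (one max per query token), and sum. A minimal sketch with random normalized embeddings standing in for the encoder outputs:

```python
import numpy as np

def colbert_score(E_q, E_d):
    # E_q: (n_q, d) normalized query token embeddings
    # E_d: (n_d, d) normalized, punctuation-filtered doc token embeddings
    sims = E_q @ E_d.T                    # all pairwise cosine similarities
    return float(sims.max(axis=1).sum())  # MaxSim per query token, then sum

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
E_q = normalize(rng.standard_normal((4, 128)))  # 4 query tokens
E_d = normalize(rng.standard_normal((9, 128)))  # 9 doc tokens
s = colbert_score(E_q, E_d)
```

Since the embeddings are L2-normalized, each pairwise similarity lies in [-1, 1], so the score is bounded by the number of query tokens; this also matches the quote below that the interaction mechanism itself has no trainable parameters.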

We fine-tune the BERT encoders and train from scratch the additional parameters (i.e., the linear layer and the [Q] and [D] markers’ embeddings). Notice that our interaction mechanism has no trainable parameters.

Even though ColBERT’s late-interaction framework can be applied to a wide variety of architectures (e.g., CNNs, RNNs, transformers, etc.), we choose to focus this work on bi-directional transformer-based encoders (i.e., BERT) owing to their state-of-the-art effectiveness yet very high computational cost.