Hugging Face's Model Summary docs give a good overview of these model types.

# Language Model

A statistical language model is a probability distribution over sequences of words. Given such a sequence, say of length m, it assigns a probability $P(w_{1},\ldots ,w_{m})$ to the whole sequence.

$P(w_{1},\ldots ,w_{m})=\prod_{i=1}^{m}P(w_i|w_1,w_2, \dots, w_{i-1})$
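The chain rule above can be sketched with a toy bigram model, where each conditional only looks at the previous token. The probabilities below are made-up illustration values, not a trained model:

```python
import math

# Hand-set conditional probabilities P(next | previous) for illustration only.
BIGRAM_PROBS = {
    ("<s>", "the"): 0.5,
    ("the", "cat"): 0.2,
    ("cat", "sat"): 0.3,
}

def sequence_log_prob(tokens):
    """Chain rule: log P(w_1..w_m) = sum_i log P(w_i | w_{i-1})."""
    log_p = 0.0
    prev = "<s>"  # sentence-start symbol
    for tok in tokens:
        log_p += math.log(BIGRAM_PROBS[(prev, tok)])
        prev = tok
    return log_p

p = math.exp(sequence_log_prob(["the", "cat", "sat"]))
# 0.5 * 0.2 * 0.3 ≈ 0.03
```

A real language model conditions on the full history $w_1, \dots, w_{i-1}$ rather than just the previous token; the bigram case is the simplest instance of the same factorization.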

# Autoregressive Model

Autoregressive models are pretrained on the classic language modeling task: guess the next token having read all the previous ones.
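In practice, "guess the next token" means the training inputs and labels are the same sequence shifted by one position. A minimal sketch of that data preparation (token ids here are arbitrary examples):

```python
def make_causal_lm_example(token_ids):
    """Next-token prediction: at position i the model reads tokens 0..i
    and must predict token i+1, so labels are inputs shifted left by one."""
    inputs = token_ids[:-1]  # model reads tokens 0..m-2
    labels = token_ids[1:]   # and must predict tokens 1..m-1
    return inputs, labels

inputs, labels = make_causal_lm_example([101, 7, 42, 9, 102])
# inputs = [101, 7, 42, 9]
# labels = [7, 42, 9, 102]
```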

# Autoencoding Model

Autoencoding models are pretrained by corrupting the input tokens in some way and trying to reconstruct the original sentence.

Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) are the most typical autoencoding pretraining tasks. BERT's MLM training masks part of the input and then learns to reconstruct the original token sequence.
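A minimal sketch of BERT-style MLM corruption: roughly 15% of positions are selected, and of those, 80% are replaced with `[MASK]`, 10% with a random token, and 10% left unchanged. The `MASK_ID` and `VOCAB_SIZE` values below follow the BERT-base convention but are assumptions here:

```python
import random

MASK_ID = 103       # assumed [MASK] token id (BERT-base convention)
VOCAB_SIZE = 30522  # assumed vocabulary size (BERT-base convention)

def mask_tokens(token_ids, mask_prob=0.15, rng=None):
    """Corrupt a token sequence for MLM pretraining.
    Returns (corrupted inputs, labels) where labels hold the original
    id at corrupted positions and -100 (the usual 'ignore' value)
    everywhere else, so the loss is only computed on masked positions."""
    rng = rng or random.Random(0)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK_ID            # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
            # else: 10%: keep the original token
    return inputs, labels
```

The model then predicts the original token at every labeled position from the full (corrupted) context, which is the "reconstruct the original sentence" step.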

# Model Differences and Improvements

Note that the only difference between autoregressive models and autoencoding models is in the way the model is pretrained. Therefore, the same architecture can be used for both autoregressive and autoencoding models.
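One concrete way the pretraining difference shows up is in the attention mask: an autoregressive model uses a causal (lower-triangular) mask so position i only attends to positions ≤ i, while an autoencoding model lets every position attend to the whole corrupted input. A sketch:

```python
def attention_mask(seq_len, causal):
    """Build a seq_len x seq_len attention mask.
    causal=True  -> lower-triangular mask (autoregressive: no peeking ahead)
    causal=False -> all-ones mask (autoencoding: full bidirectional context)
    """
    return [[1 if (not causal or j <= i) else 0 for j in range(seq_len)]
            for i in range(seq_len)]

ar_mask = attention_mask(3, causal=True)    # [[1,0,0],[1,1,0],[1,1,1]]
ae_mask = attention_mask(3, causal=False)   # [[1,1,1],[1,1,1],[1,1,1]]
```

Everything else (embeddings, self-attention layers, feed-forward blocks) can stay identical, which is why the same Transformer architecture serves both pretraining styles.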

• https://www.machinecurve.com/index.php/2020/12/29/differences-between-autoregressive-autoencoding-and-sequence-to-sequence-models-in-machine-learning/
• https://en.wikipedia.org/wiki/Language_model