# The Curious Case of Neural Text Degeneration

# Nucleus Sampling (Top-p Sampling)

The key intuition of Nucleus Sampling is that the vast majority of probability mass at each time step is concentrated in the nucleus, a small subset of the vocabulary that tends to range between one and a thousand candidates.

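Formally, given a threshold $p$, the nucleus $V_p \subset V$ is the smallest set of tokens whose cumulative probability reaches $p$:

$\sum_{x \in V_p}P(x|x_{1:i-1}) \ge p$

The next token is then sampled from the original distribution restricted to $V_p$ and renormalized.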
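The nucleus condition above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's reference implementation: the helper names `top_p_filter` and `nucleus_sample` are hypothetical, and the input is assumed to be an already-normalized next-token distribution.

```python
import numpy as np

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative mass is >= p, then renormalize.

    `probs` is assumed to be a 1-D, already-normalized next-token distribution.
    """
    probs = np.asarray(probs, dtype=float)
    # Sort token indices from most to least probable.
    order = np.argsort(-probs, kind="stable")
    cumulative = np.cumsum(probs[order])
    # First position where the running mass reaches p (inclusive).
    cutoff = int(np.searchsorted(cumulative, p, side="left")) + 1
    cutoff = min(cutoff, probs.size)
    keep = order[:cutoff]
    # Zero out everything outside the nucleus and renormalize.
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

def nucleus_sample(probs, p, rng):
    """Draw one token index from the renormalized nucleus."""
    return int(rng.choice(len(probs), p=top_p_filter(probs, p)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dist = [0.5, 0.3, 0.15, 0.05]
    print(top_p_filter(dist, 0.9))      # the least likely token is dropped
    print(nucleus_sample(dist, 0.9, rng))
```

Because the cutoff is recomputed at every step, the nucleus grows when the distribution is flat and shrinks when it is peaked, which is the main difference from a fixed top-k truncation.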

# Temperature Sampling

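The name Temperature Sampling is even more puzzling at first glance, but the idea is very simple: directly re-scale the original decoding distribution over tokens: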

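$p(x=V_l|x_{1:i-1})=\frac{\exp(u_l/t)}{\sum_{l'}\exp(u_{l'}/t)}$

Here $u_{1:|V|}$ are the logits, $V_l$ is the $l$-th token in the vocabulary, and $t$ is the temperature. Setting $t<1$ sharpens the distribution toward high-probability tokens, $t>1$ flattens it, and $t=1$ recovers the ordinary softmax.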
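As a rough illustration of the rescaling formula above, the sketch below divides the logits by a temperature before applying softmax. The function name `temperature_softmax` and the example logits are made up for illustration; subtracting the maximum is a standard numerical-stability trick that does not change the result.

```python
import numpy as np

def temperature_softmax(logits, t):
    """Softmax over logits divided by temperature t (t must be positive)."""
    if t <= 0:
        raise ValueError("temperature must be positive")
    scaled = np.asarray(logits, dtype=float) / t
    # Shift by the max so exp() cannot overflow; softmax is shift-invariant.
    scaled = scaled - scaled.max()
    weights = np.exp(scaled)
    return weights / weights.sum()

if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0]  # hypothetical logits for a 3-token vocabulary
    for t in (0.5, 1.0, 2.0):
        print(t, temperature_softmax(logits, t))
```

Running this shows the probability of the top token rising as `t` drops below 1 and falling as `t` rises above 1, which matches the sharpening and flattening behavior of the formula.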