


A statistical language model is a probability distribution over sequences of words.

Modeling language this way is fairly intuitive. The probability of a sequence \(w_1, w_2, \ldots, w_m\) can be written as:

\[P(w_1, w_2, \ldots, w_m) = \prod^{m}_{i = 1}{P(w_i \mid w_1, w_2, \ldots, w_{i-1})}\]
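To make the chain-rule factorization concrete, here is a minimal sketch that estimates each conditional \(P(w_i \mid w_1, \ldots, w_{i-1})\) from prefix counts and multiplies them together. The four-sentence `corpus` and the helper names `conditional` and `sequence_probability` are made up for illustration; a real language model would not store full-history counts like this.

```python
from collections import Counter

# Toy corpus (hypothetical data, for illustration only).
corpus = [
    ("the", "cat", "sat"),
    ("the", "cat", "ran"),
    ("the", "dog", "sat"),
    ("a", "dog", "ran"),
]

# Count how often each prefix, including the empty one, starts a sentence.
prefix_counts = Counter()
for sentence in corpus:
    for i in range(len(sentence) + 1):
        prefix_counts[sentence[:i]] += 1


def conditional(word, history):
    """Estimate P(word | history) as count(history + word) / count(history)."""
    history = tuple(history)
    if prefix_counts[history] == 0:
        return 0.0
    return prefix_counts[history + (word,)] / prefix_counts[history]


def sequence_probability(words):
    """Multiply the conditionals P(w_i | w_1 .. w_{i-1}) for i = 1 .. m."""
    prob = 1.0
    for i, word in enumerate(words):
        prob *= conditional(word, words[:i])
    return prob


# 3/4 * 2/3 * 1/2, which matches the 1-in-4 frequency of this sentence.
print(sequence_probability(("the", "cat", "sat")))
```

With count-based estimates the product telescopes to the relative frequency of the whole sequence, so any sequence not seen in the corpus gets probability zero. That sparsity is one reason to look for a model that generalizes instead of counting.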

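The question then becomes whether a neural network can fit an arbitrary probability distribution. An article I read earlier touches on this point: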

Neural networks are universal function approximators that can approximate any functions to arbitrary precisions

Digging further, I came across the Universal approximation theorem:

In the mathematical theory of artificial neural networks, universal approximation theorems are results that establish the density of an algorithmically generated class of functions within a given function space of interest. Typically, these results concern the approximation capabilities of the feedforward architecture on the space of continuous functions between two Euclidean spaces, and the approximation is with respect to the compact convergence topology.
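As a rough illustration of what the theorem says in one dimension, the sketch below builds a one-hidden-layer ReLU network by hand (no training) that linearly interpolates a continuous function on a compact interval, and shows that the worst-case error shrinks as hidden units are added. The function names `build_relu_net`, `predict`, and `max_error`, and the choice of `math.sin` on \([0, \pi]\), are my own assumptions for the example. It is a constructive special case, not the general statement of the theorem.

```python
import math


def relu(z):
    return max(0.0, z)


def build_relu_net(f, a, b, n_hidden):
    """Return (bias, knots, weights) for g(x) = bias + sum_k w_k * relu(x - t_k).

    g linearly interpolates f at n_hidden + 1 equally spaced points on [a, b].
    """
    h = (b - a) / n_hidden
    knots = [a + k * h for k in range(n_hidden)]
    # Slope of the interpolant on each segment [t_k, t_k + h].
    slopes = [(f(t + h) - f(t)) / h for t in knots]
    # Each hidden unit contributes the change in slope at its knot.
    weights = [slopes[0]] + [slopes[k] - slopes[k - 1] for k in range(1, n_hidden)]
    return f(a), knots, weights


def predict(net, x):
    bias, knots, weights = net
    return bias + sum(w * relu(x - t) for w, t in zip(weights, knots))


def max_error(f, net, a, b, n_samples=1000):
    """Largest absolute error on an evenly spaced grid over [a, b]."""
    xs = (a + (b - a) * i / n_samples for i in range(n_samples + 1))
    return max(abs(f(x) - predict(net, x)) for x in xs)


for n_hidden in (4, 8, 16, 32, 64):
    net = build_relu_net(math.sin, 0.0, math.pi, n_hidden)
    print(n_hidden, max_error(math.sin, net, 0.0, math.pi))
```

The error falls roughly by a factor of four each time the number of hidden units doubles, which is the behavior expected from piecewise-linear interpolation of a smooth function. Note that this only shows that suitable weights exist; it says nothing about whether gradient-based training would find them.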

