Why Neural Networks Can Represent Language Models

My first understanding of language models came from n-grams. When I learned about RNNLM, a question came to mind: why can a neural network represent a language model?

After some research, I found the answer. Essentially, a language model is a probability distribution:

A statistical language model is a probability distribution over sequences of words.

In a language model, the probability of a sequence \(w_1, w_2, \ldots, w_m\) can be written as:

\[P(w_1, w_2, \ldots, w_m) = \prod^{m}_{i = 1}{P(w_i \mid w_1, w_2, \ldots, w_{i-1})}\]
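As a concrete illustration, here is a minimal Python sketch of this chain-rule factorization. The conditional probabilities are made-up toy values, not estimates from any real corpus:

```python
# Toy conditional probabilities P(w_i | w_1, ..., w_{i-1}); the values are made up.
cond_probs = {
    ("the",): 0.2,                   # P("the")
    ("the", "cat"): 0.05,            # P("cat" | "the")
    ("the", "cat", "sleeps"): 0.1,   # P("sleeps" | "the", "cat")
}

def sequence_probability(words):
    """Multiply the conditional probabilities P(w_i | w_1, ..., w_{i-1})."""
    prob = 1.0
    for i in range(len(words)):
        prob *= cond_probs[tuple(words[: i + 1])]
    return prob

print(sequence_probability(["the", "cat", "sleeps"]))  # 0.2 * 0.05 * 0.1 ≈ 0.001
```

Any model that can assign the conditional probability \(P(w_i \mid w_1, \ldots, w_{i-1})\) for every context therefore defines the probability of the whole sequence.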

The problem then reduces to: can a neural network represent a probability distribution? This article mentions exactly this point:

Neural networks are universal function approximators that can approximate any functions to arbitrary precisions
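To connect this back to language modeling: a network whose output layer is a softmax over the vocabulary already yields a valid probability distribution for every context, so it can play the role of \(P(w_i \mid w_1, \ldots, w_{i-1})\). Below is a minimal NumPy sketch with random, untrained weights and a hypothetical tiny vocabulary, just to show that the output is a proper distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<s>", "the", "cat", "sleeps", "</s>"]   # hypothetical tiny vocabulary
embed_dim, hidden_dim = 8, 16

# Random (untrained) parameters: embedding table, hidden layer, output layer.
E = rng.normal(size=(len(vocab), embed_dim))
W1 = rng.normal(size=(embed_dim, hidden_dim))
W2 = rng.normal(size=(hidden_dim, len(vocab)))

def next_word_distribution(context_word):
    """P(w | context): pass the context embedding through the network, then softmax."""
    h = np.tanh(E[vocab.index(context_word)] @ W1)
    logits = h @ W2
    exp = np.exp(logits - logits.max())           # numerically stable softmax
    return exp / exp.sum()

p = next_word_distribution("the")
print(dict(zip(vocab, p.round(3))))
print(p.sum())                                    # sums to 1: a valid distribution
```

Training would adjust the weights so that these distributions match the corpus statistics, but the point here is only that the network's output is a probability distribution by construction.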

Moreover, I found the Universal approximation theorem:

In the mathematical theory of artificial neural networks, universal approximation theorems are results that establish the density of an algorithmically generated class of functions within a given function space of interest. Typically, these results concern the approximation capabilities of the feedforward architecture on the space of continuous functions between two Euclidean spaces, and the approximation is with respect to the compact convergence topology.

The theorem supports the statement above: a neural network can approximate any continuous function to arbitrary precision. Unfortunately, the theorem only establishes the existence of such a network; it does not provide a way to construct one.
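As an informal illustration (not a proof), the sketch below fits a single hidden layer of tanh units to \(\sin(x)\) on an interval. The hidden weights are random and only the output layer is fitted by least squares; widening the hidden layer drives the error down, which is the flavor of the existence claim, while the theorem itself says nothing about how to find good weights in general:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
y = np.sin(x).ravel()                   # target continuous function

n_hidden = 50                           # width; more units -> better approximation
W = rng.normal(size=(1, n_hidden))      # random hidden-layer weights (kept fixed)
b = rng.normal(size=n_hidden)
H = np.tanh(x @ W + b)                  # hidden activations: shifted/scaled tanh of x

# Fit only the output-layer weights by least squares.
coef, *_ = np.linalg.lstsq(H, y, rcond=None)
y_hat = H @ coef

print("max abs error:", np.abs(y - y_hat).max())
```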

The universal approximation theorem is the foundation of why neural networks can represent language models.