Transformer Succinctness: The Other Side of Expressiveness
One of the two Outstanding Paper awards at ICLR 2026 went to a purely theoretical work: Transformers are Inherently Succinct. A paper with no experiments, no benchmarks, nothing but mathematical proofs, and it won best paper. The review committee’s rationale: “it offers a new perspective for explaining the power of the Transformer architecture.” The original paper is dense, drawing heavily on formal language theory and complexity theory. This post attempts to lay out the core results and construction ideas in a more intuitive way.
What is this “new perspective”? Past work compared expressiveness, i.e. which model can recognize a broader class of languages. This paper asks a different question: for the same language, which model can describe it using less space?
The paper summarizes its main result in one line:
Transformers can describe concepts extremely succinctly.
Not a little shorter. Absurdly shorter. For the same language, the description size required by a finite automaton can be doubly exponential in the size of the Transformer. If a Transformer's description uses 10 symbols, a finite automaton may need on the order of 2^(2^10), roughly 10^308, states to express the same thing. That gap dwarfs the number of atoms in the observable universe.
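To put the magnitudes side by side:

$$2^{2^{10}} = 2^{1024} \approx 1.8 \times 10^{308}, \qquad \text{atoms in the observable universe} \approx 10^{80}.$$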
Why This Matters
A well-known theoretical result says that Transformers actually recognize a narrower class of languages than RNNs. Fixed-precision Transformers can only handle a class called the star-free regular languages, while fixed-precision RNNs handle all regular languages. In terms of expressive range, RNNs are strictly more powerful. Yet in practice Transformers dominate RNNs across the board, a mismatch that has long lacked a good theoretical explanation.
Previous attempts all followed the expressiveness line, trying to prove under various conditions that Transformers can do this or that. This paper argues we may have been evaluating architectures along the wrong dimension. Expressiveness (what you can do) and succinctness (how large a model you need to do the same thing) are two independent dimensions. RNNs win on the first. Transformers win on the second by a crushing margin.
The review committee valued precisely this point. The reviews mention that “the sharp conceptual viewpoint drew interest from the committee and other experts,” despite “some criticisms.” A pure theory paper winning Outstanding Paper comes down to asking a sufficiently good question.
What “Describing the Same Language” Means
Some setup. In formal language theory, a “language” is a set of strings. “All strings ending in ab” is a language. “All strings of even length” is a language.
There are many ways to describe the same language. You can draw a finite automaton (a state-transition diagram), write a logic formula (LTL, linear temporal logic), or use a Transformer. Their expressive ranges may differ, but among the languages they can all describe, which one uses fewer “words”? That is the question succinctness answers.
The specific model studied in the paper is UHAT (Unique Hard Attention Transformer), the theoretically simplest variant of a Transformer: attention is hard and deterministic (each head attends to exactly one position, e.g. the leftmost score maximizer) and all values are fixed precision. Proving succinctness with the weakest Transformer means the conclusions only get stronger for more powerful models.
Three Levels of Succinctness Gaps
The paper establishes three levels of succinctness gaps:
1. Transformer to finite automata: doubly exponential, two exponential layers stacked. These gaps are tight; they cannot be improved further.
2. Transformer to LTL: at most an exponential blowup, improving the doubly exponential translation from prior work.
3. LTL to Transformer: only a polynomial blowup.
The position of RNNs is subtle. Although RNNs recognize more languages, on the star-free languages that both can recognize, Transformers are exponentially more succinct. The reason is straightforward: a fixed-precision RNN is essentially a finite automaton. A d-dimensional state vector with k bits per dimension yields 2^(kd) possible states. The doubly exponential gap from Transformer to finite automata directly implies an exponential gap to RNNs.
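Spelled out, with n the size of the Transformer's description: a fixed-precision RNN is itself an automaton with 2^(kd) states, and any automaton for the language needs at least 2^(2^n) states, so

$$2^{kd} \;\ge\; 2^{2^{n}} \;\Longrightarrow\; kd \;\ge\; 2^{n},$$

meaning the RNN needs exponentially many bits of state, hence an exponentially larger description.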
The Direct Addressing Power of Attention
A finite automaton scans a string left to right, one symbol at a time, seeing only the current symbol and its internal state. If it needs to check whether the value at the current position matches some specific earlier position, it must have carried that earlier value along in its state. The number of states it needs therefore grows exponentially in the number of bits it has to remember.
Attention can “jump” directly to a position satisfying some condition and retrieve the information there. No need to walk through the string step by step, no need to store intermediate information in state.
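A minimal sketch of the contrast, using toy helpers of my own (not the paper's construction): the automaton can only fold information into its running state, while unique hard attention picks, in a single step, the one position with the highest score and reads its value.

```python
import numpy as np

def automaton_scan(tokens, transition, start_state=0):
    """Finite-automaton style: a single left-to-right pass. Anything the
    machine might need later must be folded into its current state."""
    state = start_state
    for tok in tokens:
        state = transition[state][tok]
    return state

def unique_hard_attention(queries, keys, values):
    """Unique hard attention: each query jumps straight to the one
    position with the highest score (ties: leftmost) and copies its value."""
    scores = queries @ keys.T          # (n, n) score matrix
    picked = scores.argmax(axis=-1)    # exactly one source position per query
    return values[picked]              # direct read, no step-by-step scan

# Toy usage: position i directly addresses position n-1-i in one step.
n = 6
keys = np.eye(n)                       # key of position j is the j-th basis vector
queries = keys[::-1]                   # query of position i asks for position n-1-i
values = np.arange(n, dtype=float)
print(unique_hard_attention(queries, keys, values))   # [5. 4. 3. 2. 1. 0.]
```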
The paper’s concrete construction has the Transformer verify that a binary counter embedded in a string is incrementing correctly. Each segment of the string contains a binary number and a symbol, separated by #:
#000 a  #001 b  #010 a  #011 b  …  #111 b
To verify the counter increments correctly from 000 to 111, attention just needs to find the previous # for each # position, then compare the two binary numbers bit by bit to check they differ by 1. An n-bit counter has 2^n segments, but checking it requires only a polynomial-size Transformer. A finite automaton doing the same thing must remember the current counter value, needing 2^n states.
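Here is a rough plain-Python stand-in for that check (my own illustration, not the paper's UHAT construction, which compares the bits via attention rather than converting to integers): for each # we locate the previous #, read both binary blocks, and confirm the counter went up by exactly 1.

```python
def counter_increments_correctly(s: str) -> bool:
    """Check that the binary blocks between consecutive '#' markers
    form a counter that increases by exactly 1 at every step."""
    # Positions of all '#' separators; each is followed by one segment.
    marks = [i for i, c in enumerate(s) if c == "#"]
    blocks = []
    for i, m in enumerate(marks):
        end = marks[i + 1] if i + 1 < len(marks) else len(s)
        # Keep only the 0/1 characters of the segment (other symbols ignored).
        bits = "".join(c for c in s[m + 1 : end] if c in "01")
        blocks.append(int(bits, 2))
    # The per-position test: current block = previous block + 1.
    return all(cur == prev + 1 for prev, cur in zip(blocks, blocks[1:]))

print(counter_increments_correctly("#000a#001b#010a#011b"))  # True
print(counter_increments_correctly("#000a#001b#011a"))       # False (skips 010)
```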
The paper pushes this construction to the limit. Through nesting (stacking multiple counters into rows and columns, using vertical constraints to connect positions in the same column across different rows), a polynomial-size Transformer can verify patterns so complex that the shortest valid string is doubly exponential in length. A finite automaton accepting a non-empty language always accepts some string no longer than its number of states, so any equivalent automaton must itself have doubly exponentially many states. That is where the gap comes from.
This nested construction effectively encodes a problem called 2^n-tiling: given a set of tiles with colored edges, can you tile a grid with 2^n columns and arbitrarily many rows such that adjacent tiles’ colors match? Attention checks horizontal constraints (finding adjacent # positions) and vertical constraints (finding the # in the same column of the previous row by requiring an exact counter-value match). This is an EXPSPACE-complete problem, encodable by a polynomial-size Transformer.
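The vertical constraint can be sketched in the same spirit (again a hypothetical plain-Python analogue of what attention computes): every cell carries its column index as a counter value, so "the cell directly above" is just "the nearest earlier cell with exactly the same counter value", which attention finds by exact match.

```python
def cell_above(cells, i):
    """cells: list of (counter_value, tile) pairs laid out row after row.
    The cell above cells[i] is the nearest earlier cell whose counter
    value (i.e. column index) matches exactly."""
    col = cells[i][0]
    for j in range(i - 1, -1, -1):
        if cells[j][0] == col:
            return cells[j]
    return None

# Two rows of a width-4 grid; columns are identified by counter values 0..3.
row1 = [(0, "A"), (1, "B"), (2, "A"), (3, "C")]
row2 = [(0, "B"), (1, "A"), (2, "C"), (3, "A")]
cells = row1 + row2
print(cell_above(cells, 5))   # (1, 'B'): the tile above column 1 of row 2
```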
Succinctness and Verification Complexity
The more compact the description, the harder the analysis. Can a finite automaton accept at least one string? Solvable in polynomial time. An LTL formula? One level harder. A Transformer? EXPSPACE-complete, meaning the memory required grows exponentially with the size of the model being checked, placing it among the theoretically hardest classes of problems. Equivalence checking (whether two Transformers recognize the same language) is also EXPSPACE-complete.
Same principle as file compression: the higher the compression ratio, the more resources decompression requires. Transformers are an extremely high-compression language description method, and the cost is that analyzing them becomes extremely difficult. Even answering the most basic question, “does this model accept anything at all?”, cannot escape this complexity.
One interesting subtlety: restricting the Transformer to unidirectional attention (looking only left, not right) reduces the complexity. The EXPSPACE-completeness comes from mixing attention directions, and this flexibility is what enables such extreme compression.
Distance from Real LLMs
The paper studies fixed-precision hard-attention Transformers, which differ from the softmax attention and floating-point precision used in actual LLMs. Recent work (Sälzer et al., 2026) shows that verification of non-fixed finite-precision softmax Transformers is undecidable, hinting that the softmax version’s succinctness may be even more extreme, though no rigorous corresponding results exist yet.
Between worst-case succinctness and practical parameter efficiency lie two further barriers: optimization and generalization. The paper establishes a structural advantage at the representation level, which does not automatically translate into "Transformers train better." But it does pin down one thing clearly: among all models capable of expressing star-free languages, Transformers are the most space-efficient. A purely mathematical statement, independent of any experimental assumptions.
The paper closes with a linguistic analogy: by Zipf’s law of abbreviation, frequently used concepts tend to acquire shorter descriptions. The Arabic numeral system is exponentially more succinct than Roman numerals, and it is arguably this succinctness that made modern mathematics and computer science possible. Is the succinctness of Transformers similarly enabling LLMs in some fundamental sense? No answer yet.