ICML 2025 Outstanding Paper: Going Beyond the Creative Limits of Next-Token Prediction

ICML 2025 named 8 Outstanding Papers, and Roll the Dice & Look Before You Leap: Going Beyond the Creative Limits of Next-Token Prediction is one of them. The authors are from CMU and Google Research. The paper asks a blunt question: why do LLMs keep producing bland, repetitive output on open-ended tasks like writing puns, designing Olympiad problems, or brainstorming research ideas?

The authors’ core argument: on these tasks humans first commit to an abstract idea and then generate content around it; next-token prediction (NTP) cannot learn this pattern. To fix it, you first have to swap out the training objective so the model can actually learn that hidden idea, and then move the inference-time randomness from the output side to the input side, so the idea is not torn apart by per-position noise during sampling.

Paper: arxiv.org/abs/2504.15266

Problem and task design

To write a pun, you first have to think of “two words that sound alike but mean unrelated things,” and then craft a sentence around them. To design an Olympiad problem, you first have to pick “an obscure geometric property as the key trick,” and then dress it up as a problem statement. To pitch a research idea, you first have to randomly land on an interesting angle, and then write the motivation and method around it. Two features are common to these open-ended tasks. First, there is no single correct answer; quality is judged on coherence, originality, and diversity. Second, before producing anything you need an invisible “roll the dice” step: pick an abstract goal or core idea (a leap of thought), then generate content around it. The leap is invisible at the surface of the final text; the model can only infer it from many samples.

LLMs tend to produce bland, repetitive output on these tasks. The authors suspect the root cause is that NTP cannot learn the “first commit a leap, then generate around it” pattern. NTP fits a left-to-right per-token conditional distribution, and whether it has the capacity to internally compress “first sample a latent variable” into its weights is an open question.

But there is no quantifiable way to verify this on real writing (what counts as “a creative Olympiad problem”? Even consistency is hard to judge automatically). So the authors take a step back: keep the essential property, that the output requires an invisible leap upstream of generation; strip everything else (language, semantics, coherence) away; and leave only the bare algorithmic skeleton. That is the design rationale for the four tasks. Each task has an explicitly hidden leap, and the legality, originality, and diversity of generations can all be evaluated precisely.

The tasks are organized along the cognitive science taxonomy of creativity (Boden, 2003), with two tasks per category:

  • Combinational creativity (find new connections among known concepts; analogies, puns): the model first memorizes some “concept-to-concept associations,” then generates new legal combinations not seen in training. The paper uses Sibling Discovery and Triangle Discovery as proxies.
  • Exploratory creativity (construct new patterns under rules; problem design, plotting): there is no pre-existing concept pool; the model has to construct a legal structure from scratch under some constraint. The paper uses Circle Construction and Line Construction as proxies.

Two tasks per category cover a difficulty gradient. There is also a deeper design choice: the last three tasks are insensitive to token order (permutation-invariant), and the next section explains why that is fatal for NTP.

The evaluation metric is algorithmic creativity: sample a batch of generations from the model and count the fraction that simultaneously satisfy the task constraint, do not appear in the training set, and are not duplicated within the sample. All three conditions are required. Satisfying the constraint alone is easy (just memorize the training set); requiring originality and diversity on top of that is what shows the model has actually learned the underlying generative rule.
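Stated as code, the metric is a simple filter-and-count. The sketch below is my own illustration, not the paper's implementation; it assumes generations are hashable tuples and that a task-specific `is_legal` predicate is available:

```python
def algorithmic_creativity(samples, is_legal, train_set):
    """Fraction of sampled generations that are simultaneously legal
    (satisfy the task constraint), novel (not in the training set),
    and unique (not duplicated within the sampled batch)."""
    seen = set()
    hits = 0
    for s in samples:
        if is_legal(s) and s not in train_set and s not in seen:
            hits += 1
        seen.add(s)
    return hits / len(samples)
```

Note that a model that merely memorizes scores near zero here: its outputs are legal but fail the novelty test, and repeated outputs fail the uniqueness test.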

The four minimal tasks

The four tasks share a common template: at training time the model memorizes the structure of a graph into its weights (in-weights graph), and at generation time it has to emit a tuple satisfying some relation. “Creativity” here means emitting new legal tuples never seen during training.

Sibling Discovery. The model’s weights store a bipartite graph (a set of parents, each with several children). Training samples are (child A, child B, parent), e.g. (dog, cat, mammal). Task: emit a new legal triplet. The hidden leap is the parent, but training data deliberately puts the parent last, forcing NTP to spit out two children before filling in the parent, the exact opposite of “first pick a topic, then give examples.”
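A toy version of this setup fits in a few lines. The sketch below is my own (function names and the integer token layout are illustrative, not the paper's exact construction); the key point is where the hidden leap, the parent choice, sits in the sampling procedure:

```python
import random

def make_sibling_task(num_parents, children_per_parent, seed=0):
    """Toy Sibling Discovery: a bipartite parent->children graph.
    Tokens are integers; child tokens start after parent tokens."""
    rng = random.Random(seed)
    children_of = {}
    tok = num_parents
    for p in range(num_parents):
        children_of[p] = list(range(tok, tok + children_per_parent))
        tok += children_per_parent

    def is_legal(triplet):
        a, b, parent = triplet  # parent comes LAST, as in the paper's data
        kids = children_of.get(parent, [])
        return a in kids and b in kids and a != b

    def sample_triplet():
        p = rng.randrange(num_parents)        # the hidden "leap": pick a parent
        a, b = rng.sample(children_of[p], 2)  # then two of its children
        return (a, b, p)

    return is_legal, sample_triplet
```

The ideal generator samples `p` first; the training data, by design, forces an NTP model to emit `a` and `b` before the parent is ever visible.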

Triangle Discovery. One step harder than Sibling: the graph is a general graph (no longer bipartite), and samples are triangles $(v_1, v_2, v_3)$ in the graph. The three vertices mutually constrain each other; there is no single “primary latent variable,” and the model has to coordinate the existence of three edges at once.

[Figure: Sibling Discovery and Triangle Discovery]

Circle Construction. Given $n$ anonymous vertices, the model emits a set of edges that can be reordered into an $n$-vertex cycle. For example, on 8 vertices the output (3,5),(5,2),(2,7),(7,1),(1,6),(6,4),(4,8),(8,3) reorders into the cycle 3-5-2-7-1-6-4-8-3. The hidden leap is that the model must have a sketch of “what the cycle looks like” in its head; otherwise edges can only be assembled one at a time, and late additions tend to break the cycle.
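The legality check here is exact, which is the point of the task design: an edge set forms a single n-cycle iff every vertex has degree exactly 2 and the edges make one connected loop. A minimal checker (my own sketch, not the paper's code):

```python
def is_cycle(edges, n):
    """Check that an unordered edge list can be reordered into one
    n-vertex cycle over vertices 1..n."""
    if len(edges) != n:
        return False
    adj = {v: [] for v in range(1, n + 1)}
    for u, v in edges:
        if u not in adj or v not in adj:
            return False
        adj[u].append(v)
        adj[v].append(u)
    if any(len(nbrs) != 2 for nbrs in adj.values()):
        return False
    # walk the cycle from vertex 1; it must return there after n steps
    prev, cur, steps = None, 1, 0
    while steps < n:
        nxt = adj[cur][0] if adj[cur][0] != prev else adj[cur][1]
        prev, cur, steps = cur, nxt, steps + 1
    return cur == 1
```

Running it on the paper's 8-vertex example accepts the edge list, while swapping a single endpoint breaks the degree condition and gets rejected.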

Line Construction. A simplified Circle: the goal is a chain instead of a cycle.

[Figure: Circle Construction and Line Construction]

The permutation-invariance of the last three tasks comes back to bite NTP exactly as foreshadowed above: there simply is no token that “naturally should appear first,” so the trick of “reorder the training data so the leap appears earlier” cannot be used to rescue NTP. Sibling can still be argued away with “just put the parent first,” but these three tasks block off even that escape route.

Why NTP fails

Take Sibling as the example. The ideal generative procedure is to first sample a latent $z := \gamma$ (the parent), then sample both children at once via $p(\alpha, \beta | z)$, requiring roughly $O(m \cdot n)$ training samples ($m$ parents, $n$ children each).

But NTP learns token by token in data order. The first token $\alpha$ has no context, so it can only fit the marginal. The second token $\beta$, conditioned on $\alpha$, has two ways to be learned:

  • The proper way: internally represent which parent $\alpha$ belongs to, then pick $\beta$ from that parent’s set of children
  • The shortcut: skip the parent layer entirely and learn “given $\alpha$, find a token that often co-occurs with it as $\beta$”

The shortcut is simpler, and the simplicity bias of neural networks grabs it first. The paper calls this the Clever Hans cheat, named after the horse “Clever Hans” who was supposedly doing arithmetic but was actually reading micro-expressions on his trainer’s face for the answer.

Once the shortcut is locked in, the parent at the third position becomes useless: the model can already predict it without the parent, and the gradient signal for actually learning the parent is essentially zero. The result is that the model only learns surface co-occurrence rules, never the latent plan (the leap), and data efficiency degrades from $O(m \cdot n)$ to $O(m \cdot n^2)$.
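The gap between the two routes can be made concrete by counting what each one has to store, in the spirit of the paper's sample-complexity argument (the counting below is my own back-of-the-envelope illustration, not the paper's formal bound):

```python
def memorization_cost(m, n):
    """For m parents with n children each:
    - the latent route stores one parent label per child: m*n facts;
    - the Clever Hans route stores ordered co-occurring sibling
      pairs (alpha, beta) directly: m*n*(n-1) facts."""
    latent = m * n
    shortcut = m * n * (n - 1)
    return latent, shortcut
```

For 10 parents with 5 children each, the latent route needs 50 facts versus 200 for the shortcut, and the ratio grows linearly with n, matching the $O(m \cdot n)$ versus $O(m \cdot n^2)$ degradation.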

The crucial point is that on Triangle / Circle / Line, the three permutation-invariant tasks, no token reordering can save NTP. The “just put the leap first” defense, which is barely defensible on Sibling, completely collapses on these three.

Training side: replace NTP so the model can learn a latent plan

The root cause of NTP’s failure is “learning token by token, where shortcuts at earlier positions wreck the later ones.” The fix is to make the model predict the entire output in one shot, instead of generating one token at a time conditioned on previous ones. Earlier positions no longer have a “ground-truth value already written down” to lean on; the model has to plan the whole output during prompt encoding. The paper compares two approaches.

Teacherless training. Standard NTP training relies on teacher forcing: when training the 5th token, the previous 4 positions are fed ground-truth values, and the model only has to “guess the next token given the known prefix.” Teacherless yanks this crutch away: the previous 4 positions are fed dummy masks, and the model must predict every token correctly using just the prompt. This forces the model to “think through” the entire output during prompt encoding.
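The change to the training inputs is small but decisive. A schematic sketch (not the paper's implementation; the `MASK` token id and the `(input, labels)` layout are illustrative):

```python
MASK = -1  # dummy mask token id (illustrative choice)

def teacherless_inputs(prompt, target):
    """Build (model_input, labels) for teacherless training.

    Under teacher forcing, target position t would see the true tokens
    target[:t]. Here every target position is fed MASK instead, so the
    model must predict the entire output from the prompt alone."""
    model_input = list(prompt) + [MASK] * len(target)
    labels = [None] * len(prompt) + list(target)  # loss only on target slots
    return model_input, labels
```

With this layout there is no ground-truth prefix to lean on: all the information needed to get the target right has to be computed while encoding the prompt.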

Diffusion (the paper uses SEDD, Score Entropy Discrete Diffusion). At training time the target sequence is corrupted to varying degrees (lightly: replace a few tokens; heavily: mask everything), and the model learns to reconstruct the original from the corrupted version. At inference it starts from a fully masked sequence and iteratively repairs it. Teacherless is the extreme special case of diffusion (training only on the “fully masked” corruption level); diffusion adds intermediate corruption levels, which act as a graded sequence of sub-tasks and provide smoother gradients, making it more stable than teacherless at small scale.
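The relationship between the two objectives shows up in the corruption step. A schematic, mask-only sketch (my own simplification; SEDD's actual corruption is a continuous-time Markov chain over tokens and can also substitute random tokens):

```python
import random

def corrupt(tokens, mask_rate, rng, mask_token=-1):
    """Corrupt a target sequence for discrete-diffusion-style training:
    each position is independently masked with probability mask_rate.
    mask_rate=1.0 recovers the teacherless special case (everything
    masked); intermediate rates give the graded sub-tasks."""
    return [mask_token if rng.random() < mask_rate else t for t in tokens]
```

Training across a spectrum of `mask_rate` values is what gives diffusion its smoother optimization landscape compared to training only at the fully masked extreme.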

The shared insight: replace “per-token conditional prediction” with “joint prediction of the whole sequence,” forcing the latent plan to be learned explicitly.

Two experimental insights are worth pulling out. First, memorization and creativity are dual: on Gemma v1 2B, teacherless raises algorithmic creativity while memorization drops sharply. This in turn confirms the mechanism. When NTP cannot learn a plan, the only way to fit the training data is to memorize examples; multi-token training forces the model to learn global patterns, naturally leaving fewer memorization traces. Second, scale determines which method wins: on Gemma 2B, teacherless performs best, but on a small model like GPT-2 86M, teacherless is hard to optimize, and diffusion is more stable. The reason is that the teacherless target distribution has much higher variance than NTP, and small models don’t have the capacity to optimize through it.

[Figure: On Gemma v1 2B, teacherless raises creativity while pushing memorization down]

Transfer to real data is still weak. On XSUM summarization, teacherless gives a small but consistent boost on the self-BLEU diversity metric, but the gain is not visible on CNN/DailyMail. The strong signal on algorithmic tasks has not yet been reproduced on real text generation, as the paper itself acknowledges.

Inference side: move the randomness from the output to the input

After swapping the training objective the model has the capacity to encode a latent plan into its weights, but a second problem remains at inference time: how do you make sure the plan actually gets “sampled out” during decoding? This section continues from the previous one; it is not a separate track.

Standard temperature sampling raises the entropy of the output distribution at every token position, giving the model a chance to deviate from the argmax. The problem is that this randomness is distributed across every step: the model cannot, at the first step, lock in “which plan I’m going with this time”; it can only roll a small die at every position as it decodes, and the random decisions across positions are uncoordinated. For tasks that depend on “first commit a leap, then generate around it,” this kind of per-position noise easily tears the plan apart, and the output either collapses into a single mode or becomes internally inconsistent.

Seed-conditioning moves the randomness to the input side: at training time, prepend a meaningless random string as a seed to each sample; at inference time, swap in a fresh seed and decode greedily (no output noise, taking the argmax at every step). One throw of the dice settles it for good.
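The mechanics are easy to sketch. In the toy below, a deterministic hash stands in for the argmax of a trained model (everything here is illustrative, not the paper's code); the point is that once the seed is fixed, greedy decoding makes the entire output a deterministic function of (seed + prompt), so all diversity must come from the seed:

```python
import hashlib
import random

def greedy_decode(model, inp, steps):
    """Fully deterministic decoding: take the argmax at every step."""
    out = []
    for _ in range(steps):
        out.append(model(inp + out))
    return out

def toy_model(tokens):
    """Stand-in for a trained model's argmax: a deterministic hash of
    the context. Different seeds steer the decode down different paths;
    the same seed always reproduces the same output."""
    digest = hashlib.md5(",".join(map(str, tokens)).encode()).hexdigest()
    return int(digest, 16) % 100

def seed_conditioned_sample(prompt, rng, seed_len=4):
    """All randomness lives in the prepended seed; decoding is greedy."""
    seed = [rng.randrange(100) for _ in range(seed_len)]
    return greedy_decode(toy_model, seed + list(prompt), steps=3)
```

Re-running with the same seed reproduces the output exactly, which is precisely the "one throw of the dice" property: the seed picks the plan, and nothing downstream can tear it apart.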

The core insight is to decouple diversity from determinism: the seed decides which plan to follow, and decoding itself is fully deterministic. This sidesteps the fundamental difficulty of temperature sampling, which forces the model to maintain marginal distributions over multiple plans at every step, the equivalent of asking it to “think about several things at once.”

The paper concedes the mechanism is not fully understood (“We do not understand this presently”), but experimentally seed-conditioning combined with greedy decoding matches or beats temperature sampling on Gemma 2B. With this step on top of the training-objective fix, the full path is complete: the training stage lets the model learn the plan, the sampling stage keeps the plan from being torn apart by noise.

Some open questions

Transfer to real data is the most important. The insights from the algorithmic tasks have so far shown only a faint signal on XSUM, with no replication on CNN/DailyMail. If the effect size stays this small, more evidence will be needed before the method's practical headroom on real text generation can be judged.

Relation to reasoning-style methods is the most interesting open question the paper leaves behind. The authors explicitly limit their scope to the supervised setting and do not pass judgment on RL, CoT, or test-time compute. But they do note in passing a logic worth chasing: if post-training only elicits capabilities already in the base model, then the creativity ceiling of the base model is the ceiling of the entire pipeline, and stacking more reasoning tricks on top will not save it. They leave this point to future work.

Summary

The paper does two things. First, problem diagnosis: it uses four minimal algorithmic tasks to strip “open-ended generation requires sampling a latent variable” down to the quantifiable algorithmic creativity metric, gives a controllable counterexample for NTP, and identifies the mechanism as simplicity bias driving the model into a Clever Hans shortcut that skips the latent plan. Second, a two-step solution: first use teacherless / diffusion to change the training objective, giving the model the capacity to learn a plan; then use seed-conditioning to change the sampling mechanism, so the plan is not torn apart by per-position noise at decoding time. The two steps fit together as one path, and neither step alone is enough.