ICLR 2026 Outstanding Paper: LLMs Get Lost in Multi-Turn Conversation
Another ICLR 2026 outstanding paper, this one from Microsoft Research and Salesforce Research. The title is direct: LLMs Get Lost In Multi-Turn Conversation. The conclusion is just as direct: take any single-turn task, split the instruction across multiple turns, and 15 frontier LLMs (including GPT-4.1, Gemini 2.5 Pro, Claude 3.7 Sonnet, o3, DeepSeek-R1) all degrade, by 39% on average. The paper specifically emphasizes that the increase in unreliability has roughly the same magnitude across all models, regardless of scale, reasoning capability, or whether the weights are open or closed.
What’s more interesting is the breakdown: aptitude only drops about 15%, while unreliability surges 112%. In other words, models in multi-turn settings aren’t dumber, they’re inconsistent. The same task run ten times produces best-case and worst-case results separated by 50 points.
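To make the decomposition concrete, here is a minimal sketch of how the two quantities can be computed from repeated runs. It assumes the paper's percentile-style definitions, with aptitude taken as a high (best-case) percentile of the per-run scores and unreliability as the gap between a high and a low percentile; treat the exact percentiles and the sample scores as illustrative, not as the paper's numbers.

```python
import numpy as np

def aptitude_and_unreliability(scores, hi=90, lo=10):
    """Percentile-style decomposition over N repeated runs of the same
    (model, instruction) pair: aptitude ~ best-case behavior (high percentile),
    unreliability ~ spread between best and worst case."""
    scores = np.asarray(scores, dtype=float)
    aptitude = np.percentile(scores, hi)
    unreliability = np.percentile(scores, hi) - np.percentile(scores, lo)
    return aptitude, unreliability

# Illustrative (made-up) scores for one instruction, N=10 runs each.
full_runs    = [92, 90, 88, 91, 89, 93, 90, 88, 92, 91]   # tight spread
sharded_runs = [85, 40, 72, 35, 80, 55, 30, 78, 60, 45]   # huge spread

for name, runs in [("FULL", full_runs), ("SHARDED", sharded_runs)]:
    a, u = aptitude_and_unreliability(runs)
    print(f"{name:8s} aptitude={a:.1f}  unreliability={u:.1f}")
```

On numbers like these, aptitude barely moves between the two settings while unreliability explodes, which is the shape of the reported -15% / +112% split.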
Paper: arxiv.org/abs/2505.06120

Limitations of single-turn benchmarks
Almost all current LLM evaluations are single-turn: hand the model a complete, fully-specified instruction and let it produce an answer in one shot. MMLU, HumanEval, GSM8K, and MT-Bench all work this way. Even the so-called multi-turn evaluations (like the second turn in MT-Bench) are what the paper calls “episodic”: each turn is effectively an isolated subtask that can be graded on its own.
But that’s not how real users behave. Herlihy et al. 2024’s analysis of conversation logs notes that underspecification is the norm in real interaction: users often start with half a thought to test the waters, then add details based on the model’s response. In this regime, the model has to fuse clues scattered across turns to solve the problem at all.
The paper takes the gap between these two paradigms as its starting point: if you take the same complete instruction and split it into “shards” of information, with the user revealing one per turn, can the model preserve its single-turn capability?
Sharded simulation: experimental design
The hardest design problem here isn't "how do we simulate a user?" It's "how do we make single-turn and multi-turn scores directly comparable?" Prior multi-turn benchmarks were episodic: each turn was its own subtask, so single-turn and multi-turn scores weren't measuring the same thing, and any drop was hard to attribute. The authors' core idea, in one line: keep the original instruction and evaluator unchanged, vary only how the information is revealed, then watch how scores move.
Three settings spin out from that core: FULL gives the entire instruction at once (the single-turn upper bound); SHARDED splits the same instruction into shards revealed one per turn (the real multi-turn condition); CONCAT concatenates all shards into one bullet list and feeds it in one shot (a control that rules out "sharding loses information" and "rephrasing changes comprehension"). All three share the same underlying task and the same evaluator, so the FULL−SHARDED gap can be cleanly attributed to a single variable: underspecified, multi-turn delivery.
The pipeline has four steps (a minimal sketch of the resulting simulation loop follows the list):
- Sharding: a semi-automated process splits single-turn instructions from existing benchmarks (HumanEval, LiveCodeBench, Spider, GSM8K, ToTTo, BFCL, Summary of a Haystack) into shards, each an atomic unit of information; concatenating all shards is equivalent to the original instruction. The CONCAT setting validates this post-hoc: scores stay within ~5% of FULL on average, confirming sharding doesn’t lose information.
- Three LLM roles: the assistant is the model under test; the user is GPT-4o-mini, which holds the full sharded instruction, decides which shard to reveal next based on the conversation history, and rephrases it lightly; the system tags each assistant reply (clarification, hedge, answer attempt, etc.; seven categories in total) and extracts and scores answer attempts. The assistant model doesn't know it's talking to a simulator and isn't told the conversation will be underspecified.
- Hard constraint: ≤1 shard per turn. Models must wait for the next turn to receive the next piece of information, with no way around the gradual-reveal setup.
- Scoring: the conversation’s final score is the max across all answer attempts. This actually gives SHARDED a “bonus”: single-turn allows one shot, multi-turn allows N. Even so, SHARDED still trails FULL significantly.
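Putting the four steps together, the simulation loop looks roughly like the sketch below. The helper names (simulate_user_turn, call_assistant, classify_reply, score_attempt) are hypothetical stubs, not the paper's code; the point is the protocol: one shard per user turn, every reply tagged, and the conversation scored as the max over answer attempts.

```python
# Sketch of the sharded-conversation protocol; all model calls are stubbed.

def simulate_user_turn(shard: str) -> str:
    # In the paper, GPT-4o-mini lightly rephrases the shard in context.
    return shard  # placeholder: reveal the shard verbatim

def call_assistant(history: list[dict]) -> str:
    return "assistant reply"  # placeholder for the model under test

def classify_reply(reply: str) -> str:
    # One of seven categories (clarification, hedge, answer attempt, ...).
    return "answer_attempt"   # placeholder

def score_attempt(reply: str) -> float:
    return 0.0                # placeholder: task-specific evaluator

def run_sharded_conversation(shards: list[str]) -> float:
    history, attempt_scores = [], []
    for shard in shards:                      # hard constraint: <=1 shard per turn
        history.append({"role": "user", "content": simulate_user_turn(shard)})
        reply = call_assistant(history)
        history.append({"role": "assistant", "content": reply})
        if classify_reply(reply) == "answer_attempt":
            attempt_scores.append(score_attempt(reply))
    # Final score = best attempt, which if anything favors SHARDED.
    return max(attempt_scores, default=0.0)
```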

Figure 4 in the paper puts the original (Fully-Specified) and sharded versions of all six tasks side by side.

Take the Math column (GSM8K). The FULL prompt is:
Josh decides to try flipping a house. He buys a house for $80k and then puts in $50k in repairs. This increased the value of the house by 150%. How much profit did he make?
After sharding, this becomes 5 shards revealed one per turn: first “my friend Josh sold his home, I want to know how much profit he made,” then “he bought it for $80,000,” then “he spent $50k on repairs,” then “the house value increased by 150%,” and finally “that’s all I know, what’s his profit?”
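In code form, the same example under the three settings looks like this (the shard wording is lightly paraphrased from the list above; the exact phrasing in the paper's dataset may differ):

```python
shards = [
    "My friend Josh sold his home and I want to know how much profit he made.",
    "He bought the house for $80,000.",
    "He spent $50,000 on repairs.",
    "The house value increased by 150%.",
    "That's all I know. What's his profit?",
]

# FULL: the original, fully-specified single-turn instruction.
full_prompt = (
    "Josh decides to try flipping a house. He buys a house for $80k and then "
    "puts in $50k in repairs. This increased the value of the house by 150%. "
    "How much profit did he make?"
)

# CONCAT: all shards concatenated into one bullet list (single-turn control).
concat_prompt = "Solve the following:\n" + "\n".join(f"- {s}" for s in shards)

# SHARDED: one shard per user turn; the model never sees the rest up front.
sharded_turns = [{"role": "user", "content": s} for s in shards]
```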
At turn 1 the assistant only sees that the user wants to compute Josh's flipping profit; even the purchase price is missing. The paper observes that most models produce a complete answer attempt at turn 1 ("assuming purchase price X and repair cost Y, the profit is Z"), filling in everything from later turns with assumptions. When subsequent shards actually arrive, models don't tear up the turn-1 framing; they patch it. Even with max-over-attempts scoring, multi-turn loses to single-turn by a wide margin.
The elegance of this setup is that the original instruction and evaluator never change. Every drop between FULL and SHARDED maps to one variable: how information is revealed.
The full experiment runs 15 LLMs on 6 tasks, with every (model, instruction, setting) combination simulated N=10 times, totaling 200K+ conversations and roughly $5,000 in cost.
Magnitude and composition of the drop
Every model drops on every task, with an average of -39%. Aptitude drops 15% on average; unreliability rises 112%. The paper's box plots for the 15 models across the three settings (FULL / CONCAT / SHARDED) break this down per model.

A few details worth flagging:
CONCAT (concatenating all shards into one prompt) scores essentially on par with FULL (95% on average), which rules out two alternative explanations: that sharding loses information, or that rephrasing affects comprehension. The performance drop really comes from the multi-turn underspecified structure.
Stronger models aren’t more resistant. In Fig 5b, Gemini 2.5 Pro, GPT-4.1, and Claude 3.7 Sonnet, all averaging around 80 in FULL, settle into the 65-75 range under SHARDED, much closer to Llama3.1-8B's 59 than they were in FULL. The paper's partial explanation: stronger models start higher in FULL, so the absolute drop looks larger; but the increase in unreliability is universal regardless of strength.
More test-time compute doesn’t save you. The two reasoning models (o3, DeepSeek-R1) drop just as hard as non-reasoning ones, sometimes more (reasoning models produce 33% longer responses on average, and longer responses carry more assumptions that turn into liabilities).
Four root causes of getting lost
Appendix F gives four observable root causes, each tied to a specific failure mode.
Premature answer attempts. Under SHARDED, most models attempt a complete answer at turn 1, when information is scarcest. The paper bins conversations by where the first answer attempt falls: attempts in the first 20% of the conversation score 30.9 on average; attempts in the final 20% (the 80-100% bucket) score 64.4 (this analysis covers only Code and Math, since on the other tasks models almost always answer at turn 1, leaving no buckets to compare). The paper attributes the drop to incorrect assumptions made on underspecified inputs persisting and corrupting later turns; it does not explicitly blame RLHF.
Answer bloat. Later answer attempts are longer than earlier ones. On the Code task, restricting to attempts that actually reached a correct solution, FULL averages 668 characters vs SHARDED’s 850 (about 27% longer). Models build on early attempts based on incomplete information, and even when those attempts are invalidated by later turns, they’re patched rather than discarded. The “final correct answer” is often a Frankenstein of accumulated patches.
Loss-of-middle-turns. In the summary task (each turn introduces new documents, the final summary must cite all of them), the authors measured the share of citations to documents introduced at each turn. The result is a U-shape: documents from turn 1 and the last turn dominate citations, middle turns are neglected. This is the same phenomenon as Liu et al.’s “Lost in the Middle” attention U-shape over long context, just transposed onto turns (the paper only ran this analysis on the summary task; whether it extends to other tasks is unverified).
Verbosity. SHARDED responses are 20-300% longer than FULL. The extra length is mostly self-generated assumptions. Once those assumptions are written down, they become “context anchors” for subsequent turns, often more salient than what the user actually said.
The four phenomena often co-occur in a single conversation and amplify each other: an early answer anchors a wrong assumption, models patch the assumption repeatedly which becomes bloat, and bloat pushes the actual user turns into the low end of the attention U-shape. This causal chain is a reasonable interpretation after seeing all four phenomena together, not an experimental conclusion the paper explicitly draws.
Why intuitive remediations fail
The authors tried several intuitive remediations:
RECAP: add one final turn that consolidates all shards into a bullet list. This gives the model a "second chance," and is essentially what current agent frameworks (LangChain, AutoGen) do by default. It improves things but falls short of FULL: the bad early answer attempts are still sitting in context, interfering.
SNOWBALL: each turn re-presents all historical shards. Improvement is 15-20%, the most practical of the bunch, at the cost of growing context length per turn.
Lower temperature: if T=1 is unreliable, surely T=0 fixes it? In fact, sharded unreliability remains around 30% at T=0. The reason is that small early-turn stochastic divergence cascades; pick a slightly different clarification phrasing in turn 1 and the entire downstream conversation forks. Temperature can’t kill cascading effects.
Prompt hint: tell the model in the system prompt that the conversation may be multi-turn and underspecified, please clarify patiently. Gain: +1%. Negligible.
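For the two context-level remediations, the mechanics are simple enough to sketch directly. The helpers below are my reconstruction, not the paper's implementation: RECAP appends one extra user turn that restates every shard as a bullet list; SNOWBALL has every user turn re-present all previously revealed shards plus the new one.

```python
def recap_turn(shards_so_far: list[str]) -> dict:
    """RECAP: one final user turn that restates everything as a bullet list."""
    recap = "To recap, here is everything so far:\n" + "\n".join(
        f"- {s}" for s in shards_so_far
    )
    return {"role": "user", "content": recap}

def snowball_turn(shards_so_far: list[str], new_shard: str) -> dict:
    """SNOWBALL: each user turn re-presents all earlier shards plus the new one."""
    content = "\n".join(f"- {s}" for s in shards_so_far + [new_shard])
    return {"role": "user", "content": content}
```

Note that both leave the assistant's earlier answer attempts, right or wrong, in context, which fits the observation that even RECAP falls short of FULL.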
The failed fixes are more informative than the successful ones. They show that multi-turn unreliability isn't a surface-level prompt-engineering issue: either change training (teach models a multi-turn strategy in SFT/RLHF), or change the inference protocol (restart the conversation each turn and reconsolidate).
A few extensions
The paper offers recommendations for various audiences, but the more interesting takeaway is the misalignments it exposes.
The first misalignment is between benchmarks and real use. We deploy 90-point models to production, but the 90 was measured under lab conditions; the same task in a real underspecified multi-turn setting might land at 50-60. This gap doesn’t appear on leaderboards and isn’t easy to attribute; when something fails, the user usually blames their own prompt.
The second misalignment is in the responsibility split between agent frameworks and the LLM itself. RECAP and SNOWBALL are essentially “external memory” plugins that let the LLM pretend it’s in single-turn each round. Agent frameworks are popular partly because base LLMs are bad at multi-turn, so the application layer compensates. But the experiment shows agent-style stitching doesn’t fully recover FULL-level performance; the gap is rooted in training objectives, not patchable from above.
The third misalignment is around the reasoning-model promise. o3 and R1 add lots of test-time compute and indeed lead in single-turn, but multi-turn unreliability is just as high. This says reasoning-model “thinking longer” mostly happens after enough information is on the table; it doesn’t help at the meta level of how to handle information being revealed gradually.
For end users, the paper’s actual advice is the most prosaic: if the current conversation feels off, start a new one and write the requirements out completely (“if time allows, try again”); or have the LLM consolidate the existing conversation into a full instruction and reopen with that (“consolidate before retrying”).
These two workarounds are themselves the diagnosis: what the model should do, the user still has to do for now.
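The "consolidate before retrying" workaround can even be scripted. A minimal sketch, assuming a generic chat_fn that takes a message list and returns the reply text (the prompt wording is mine, not the paper's):

```python
CONSOLIDATE_PROMPT = (
    "Read the conversation above and rewrite everything I have asked for "
    "as a single, complete, fully-specified instruction. Include every "
    "requirement and constraint; do not include your earlier answers."
)

def consolidate_and_restart(history: list[dict], chat_fn) -> list[dict]:
    """Ask the model to fold the messy conversation into one full instruction,
    then open a fresh conversation containing only that instruction."""
    full_instruction = chat_fn(
        history + [{"role": "user", "content": CONSOLIDATE_PROMPT}]
    )
    return [{"role": "user", "content": full_instruction}]
```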