Does RLVR Really Teach LLMs New Reasoning Abilities?

Published: 2026-05-17 · Category: NLP · Reading time: about 9 min

After DeepSeek-R1, RLVR (Reinforcement Learning with Verifiable Rewards) has more or less become the standard recipe for “growing reasoning ability out of a small model on its own,” and the ever-rising pass@1 curves make it easy to feel that RL keeps teaching the model new tricks. The Tsinghua LeapLab and SJTU paper Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? (NeurIPS 2025 Oral; also ICML 2025 AI4MATH workshop best paper) asks a direct question: is RLVR adding new reasoning ability to the base model, or is it just sampling its existing reasoning paths more reliably?

The authors’ core claim: the pass@k of an RLVR-trained model beats base at small k, but once k is large enough (128 to 1024) base consistently overtakes it. Further coverage and perplexity analyses show that the reasoning paths RLVR outputs already live inside the base model’s sampling distribution. RLVR sharpens the distribution onto problems base can already solve; it does not add problems that base cannot solve. In the paper’s comparison, only distillation actually expands the set of problems the model can solve.

Paper: arxiv.org/abs/2504.13837

Measuring reasoning capacity with pass@k instead of pass@1

The usual evaluation looks at the average score under greedy decoding or nucleus sampling, which reflects the kind of answer the model most often produces. But the question of interest is “does the model have the ability to solve this problem at all,” and average scores underestimate rare correct paths that only show up once every few hundred samples. pass@k samples each question k times and counts it solved if any of them passes the verifier; average pass@k is the fraction of questions covered under a budget of k samples, which reflects the upper bound of reasoning capacity rather than the average. To estimate pass@k for any k ≤ n from n samples, the paper uses the low-variance unbiased estimator introduced alongside HumanEval, with n typically 128, 256, or 1024.
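As a rough sketch of that estimator, assuming the standard closed form $1 - \binom{n-c}{k} / \binom{n}{k}$ for a question with c verified-correct answers among n samples. The function names and the example counts are illustrative and are not taken from the paper's code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one question.

    n: samples drawn, c: samples that passed the verifier, k: budget (k <= n).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so every size-k subset contains a success.
        return 1.0
    # Complement of the chance that a random size-k subset has no success.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over questions: the fraction covered at budget k."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Illustrative counts only: a question solved in just 2 of 1024 samples.
rare = {k: pass_at_k(1024, 2, k) for k in (1, 16, 256)}
for k, value in rare.items():
    print(f"pass@{k} = {value:.4f}")
```

A question like this barely registers at k=1 but contributes substantially at k=256, which is the sense in which large-k pass@k picks up rare correct paths that average scores hide.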

Why not best-of-N or majority voting? Those are practical schemes for picking a correct answer, and a verifier or vote can miss correct solutions the model did in fact generate. pass@k does not care about picking, only whether the answer was ever generated, which is what an upper bound on capability really is.
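The difference between picking and covering can be seen on a toy example. This is a minimal sketch that assumes answers are compared by exact string match; the sampled answers and the reference are made up for illustration.

```python
from collections import Counter


def majority_vote_correct(answers: list[str], reference: str) -> bool:
    """True if the most frequent sampled answer equals the reference."""
    top_answer, _ = Counter(answers).most_common(1)[0]
    return top_answer == reference


def covered(answers: list[str], reference: str) -> bool:
    """True if any sampled answer equals the reference (the pass@k criterion)."""
    return any(answer == reference for answer in answers)


# Hypothetical samples: the correct answer "15" appears once among five.
samples = ["12", "12", "15", "12", "7"]
reference = "15"

vote_ok = majority_vote_correct(samples, reference)
cover_ok = covered(samples, reference)
print(f"majority vote correct: {vote_ok}, covered by samples: {cover_ok}")
```

Here the vote settles on the wrong answer even though the model did generate the right one, so a selection-based metric would undercount what the model can produce.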

Math problems have a catch: at large k, the model may produce a wrong CoT but accidentally guess the right answer. The authors did a manual CoT inspection on GSM8K and AIME24 questions with accuracy below 5% (more below), and conclude that base model CoTs are genuinely valid, not lucky guesses.

Base overtakes RLVR at large k

Experiments cover Qwen2.5-7B/14B/32B-Base, LLaMA-3.1-8B, Qwen2.5-Math-7B and other bases, paired with RL frameworks like SimpleRLZoo, Oat-Zero, DAPO, Code-R1, DeepCoder, EasyR1, and benchmarks across math (GSM8K, MATH500, Minerva, Olympiad, AIME24, AMC23), code (HumanEval+, MBPP+, LiveCodeBench), and visual reasoning (MathVista, MathVision).

Every curve has the same shape: at small k, RL is above base; once k climbs to tens or hundreds, base catches up and overtakes. On Minerva 32B, base is about 9% higher than RL at k=128. Oat-Zero and DAPO, which beat base by nearly 30% on AIME24 pass@1, are likewise overtaken at large k. Code tasks have no “guess the answer” issue and the result holds. Visual reasoning shows the same pattern.

The training-step experiment is even more direct: as GRPO trains longer, training-set pass@1 climbs from 26.1 to 42.5, while pass@256 keeps falling. RL is lifting the average while the solvable set keeps shrinking.

The problems RLVR solves are a subset of base’s

Why the overtake? The authors partition AIME24 and MATH500 questions into four buckets by whether base and RL can solve them (both, base only, RL only, neither). On AIME24 (k=1024), base-only is 13.3% while RL-only is 0%, so the solvable set under RL is essentially a strict subset of base’s. On MATH500 (k=128) the numbers are smaller but the shape is the same: base-only 3.6%, RL-only 1.0%, and the few RL-only problems are all solved by base once k is raised to 1024.
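The four-way split is plain set arithmetic over the questions each model solves at least once within k samples. The sketch below uses hypothetical question IDs and solved sets; the helper name `partition_solvable` is an assumption for illustration, not something from the paper.

```python
def partition_solvable(all_ids, base_solved, rl_solved) -> dict[str, float]:
    """Fraction of questions in each bucket: both, base_only, rl_only, neither."""
    all_ids, base_solved, rl_solved = set(all_ids), set(base_solved), set(rl_solved)
    buckets = {
        "both": base_solved & rl_solved,
        "base_only": base_solved - rl_solved,
        "rl_only": rl_solved - base_solved,
        "neither": all_ids - base_solved - rl_solved,
    }
    return {name: len(ids) / len(all_ids) for name, ids in buckets.items()}


# Hypothetical data: 10 questions, RL's solved set is a subset of base's.
question_ids = range(10)
base_solved = {0, 1, 2, 3, 4, 5, 6}
rl_solved = {0, 1, 2, 3, 4}

fractions = partition_solvable(question_ids, base_solved, rl_solved)
for name, share in fractions.items():
    print(f"{name}: {share:.0%}")
```

An empty `rl_only` bucket together with a non-empty `base_only` bucket is the pattern the AIME24 numbers describe.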

The per-question accuracy distribution tells the same story. On Minerva, questions with accuracy near 1.0 become more common and the low-accuracy buckets around 0.1 to 0.2 shrink, consistent with “push the almost-solvable problems all the way to solved.” But the 0-accuracy bucket also grows. So there is a class of problems base could occasionally stumble onto that RLVR completely loses, which corroborates the coverage shrinkage seen in pass@k rather than being a sampling artifact.

Same-distribution evidence in perplexity

To nail down “RLVR did not introduce new paths,” one needs to check that the paths themselves still live in base’s distribution. The authors compute perplexity under the base model on three sources: base’s own generations $Y_{\text{Base}}$, RL’s generations $Y_{\text{RL}}$, and OpenAI-o1’s generations $Y_{\text{GT}}$.

The key observation is that the distribution of $\text{PPL}_{\text{Base}}(Y_{\text{RL}})$ almost overlaps the lower tail of $\text{PPL}_{\text{Base}}(Y_{\text{Base}})$ (the portion base would already generate with relatively high probability), and is clearly lower than $\text{PPL}_{\text{Base}}(Y_{\text{GT}})$. In other words, RL’s answers are exactly the kind of answer the base model would assign high probability to on its own; o1’s answers are foreign to base. Tracking through training, $\text{PPL}_{\text{Base}}(Y_{\text{RL}})$ falls monotonically with steps, meaning RL keeps concentrating distribution mass onto the high-probability region of base’s prior.
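For reference, here is a minimal sketch of the quantity being compared, assuming the standard definition of perplexity as the exponential of the negative mean token log-probability. In practice the log-probabilities would come from scoring each response under the base model given the prompt; the values below are made up.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """exp(-mean log p) over the tokens of one response."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical per-token log-probs under the base model.
familiar_response = [math.log(0.5)] * 4   # tokens base finds likely
foreign_response = [math.log(0.05)] * 4   # tokens base finds unlikely

print(perplexity(familiar_response))
print(perplexity(foreign_response))
```

A response that base itself would plausibly generate scores a low perplexity, while one written in a style or along a path that base rarely produces scores a high one, which is the contrast drawn between $Y_{\text{RL}}$ and $Y_{\text{GT}}$.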

The manual CoT inspection provides independent corroboration. On GSM8K’s sub-5% accuracy questions, base produced at least one correct answer with a valid CoT on 24 of 25 questions, and RL did so on 23 of 25. Base is not guessing. On AIME24, base produces long CoTs with self-reflection within 2048 samples; the reflective behavior was already in base, not taught by RL.

Sampling efficiency gap is nearly identical across RL algorithms

If results varied by orders of magnitude across RL algorithms, “no capability expansion” could be blamed on specific algorithms. The authors use the VeRL framework to reproduce PPO, GRPO, Reinforce++, RLOO, ReMax, and DAPO on equal footing, evaluated on Omni-MATH-Rule’s train, in-domain, and out-of-domain splits.

To quantify “how far from the ceiling,” the authors define a metric called the sampling efficiency gap (SE for short): $\text{SE} = \text{pass@256}_{\text{Base}} - \text{pass@1}_{\text{RL}}$. It treats base’s pass@256 as the reachable ceiling, so a smaller gap is better. The six algorithms have in-domain SE ranging from GRPO’s 43.9 to RLOO’s 42.6, a spread of just over one point, negligible compared to the 40+ point gap to the ceiling. Out-of-domain SE is around 20, smaller in absolute terms but the same pattern. The conclusion is that the RLVR bottleneck is at the paradigm level, not the algorithm level, and cannot be fixed by swapping losses or baselines.
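The arithmetic behind the metric is simple. In the sketch below, the two reported gaps are the in-domain figures quoted above, while the inputs to `example_gap` are hypothetical scores chosen only to show the definition; they are not results from the paper.

```python
def sampling_efficiency_gap(base_pass_at_256: float, rl_pass_at_1: float) -> float:
    """Gap between the assumed ceiling (base pass@256) and RL's pass@1, in points."""
    return base_pass_at_256 - rl_pass_at_1


# In-domain gaps quoted in the text (the two extremes among six algorithms).
reported_gaps = {"GRPO": 43.9, "RLOO": 42.6}
spread = max(reported_gaps.values()) - min(reported_gaps.values())

# Hypothetical scores, for illustration of the definition only.
example_gap = sampling_efficiency_gap(base_pass_at_256=70.0, rl_pass_at_1=27.0)

print(f"spread across algorithms: {spread:.1f} points")
print(f"example gap: {example_gap:.1f} points")
```

The spread between algorithms is about 1.3 points, small next to gaps of more than 40 points, which is why the choice of algorithm looks secondary.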

KL coefficient and rollout count ablations point the same way. Adding a KL penalty (coefficient 0.001) leaves pass@1 flat but visibly lowers pass@128. Raising rollouts from 8 to 32 marginally improves pass@k, but the model still gets overtaken by base. Raising the RL model’s sampling temperature until its output entropy matches base at T=0.6 makes its pass@k slightly better than at the default temperature but still below base, meaning entropy collapse is one component of coverage shrinkage but not the whole story.
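The entropy-matching step can be illustrated on a single next-token distribution. This is a toy sketch, not the paper's procedure: it assumes made-up logits, models the sharper RL distribution as the base logits scaled by 3, and finds the temperature by bisection, which works because the entropy of a softmax grows with temperature when the logits are not all equal.

```python
import math


def softmax(logits: list[float], temperature: float) -> list[float]:
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs: list[float]) -> float:
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def match_temperature(logits, target_entropy, low=0.05, high=50.0, iterations=100):
    """Bisection search for the temperature whose softmax entropy hits the target."""
    for _ in range(iterations):
        mid = (low + high) / 2
        if entropy(softmax(logits, mid)) < target_entropy:
            low = mid   # too sharp: raise the temperature
        else:
            high = mid  # too flat: lower the temperature
    return (low + high) / 2


# Hypothetical logits; the "RL" distribution is an assumed 3x sharpening of base.
base_logits = [2.0, 1.0, 0.5, 0.0]
rl_logits = [3 * x for x in base_logits]

target = entropy(softmax(base_logits, 0.6))
matched_t = match_temperature(rl_logits, target)
print(f"temperature needed for the sharper distribution: {matched_t:.3f}")
```

In this toy case the match is exact because the sharper logits are a pure rescaling; the ablation result suggests real RL models differ from base in more than temperature, since matching entropy does not recover base's pass@k.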

Distillation can expand the ceiling, RL cannot

The paper deliberately pulls in distillation for contrast. DeepSeek-R1-Distill-Qwen-7B distills R1’s long-CoT data into Qwen2.5-Math-7B, and the authors compare it against the base, Instruct, and Oat-Zero RL versions. The distilled model’s pass@k curve sits clearly above base across the entire range of k, including large k, while the RL version only leads at small k. So distillation can inject reasoning patterns from a stronger teacher that base does not have, genuinely expanding the solvable set; RL cannot.

The conclusion still holds on near-frontier models

Does the conclusion survive at scale? The authors use Magistral-Medium, a near-frontier pure-RL model (base is Mistral-Medium-3), on AIME24 and AIME25. At k=1, RL solves about 7 to 8 more problems than base, but the gap keeps shrinking as k grows, so the same trend shows up on this near-frontier model.

Why it gets capped by base

The authors attribute the root cause to two structural differences between RLVR and traditional RL (AlphaGo, Atari). First, the action space in language is orders of magnitude larger than in Go, and a policy trained from scratch would almost never reach a path with positive reward. Second, precisely because of this, RLVR must start from a pretrained base, whose prior keeps most samples in the “reasonable answer” region; otherwise the reward signal is uniformly negative and learning never gets off the ground.

The prior is a double-edged sword. Policy gradient amplifies samples in the prior that earn positive reward and suppresses samples that earn negative reward, so the entire optimization trajectory stays framed by the prior. Any sample noticeably off-prior most likely gets negative reward and gets pushed further down, and the resulting policy only contracts within base’s existing distribution. This explains the “distribution sharpening but not expansion” seen in the perplexity experiment.
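A toy illustration of this sharpening effect, offered as a sketch rather than a reproduction of anything in the paper: a four-action softmax bandit trained with the exact gradient of expected reward, not a sampled GRPO or PPO update. The prior and the rewards are made up. Two actions are correct, one common under the prior and one rare.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


def train(prior, rewards, learning_rate=1.0, steps=500):
    """Gradient ascent on expected reward, starting from the prior's logits."""
    logits = [math.log(p) for p in prior]
    for _ in range(steps):
        probs = softmax(logits)
        expected_reward = sum(p * r for p, r in zip(probs, rewards))
        # Exact gradient of expected reward w.r.t. logit j: p_j * (r_j - E[r]).
        logits = [
            theta + learning_rate * p * (r - expected_reward)
            for theta, p, r in zip(logits, probs, rewards)
        ]
    return softmax(logits)


# Hypothetical prior: action 0 is a common correct path, action 3 a rare one.
prior = [0.60, 0.30, 0.09, 0.01]
rewards = [1.0, 0.0, 0.0, 1.0]

final = train(prior, rewards)
for action, (before, after) in enumerate(zip(prior, final)):
    print(f"action {action}: prior {before:.3f} -> trained {after:.4f}")
```

In this toy setting the rare correct action receives a positive gradient on its logit, yet its probability still falls, because the update scales with the action's current probability and the common correct action absorbs nearly all the mass. Expected reward rises while the rare path becomes harder to sample.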

The authors suggest the way out lies in higher-level exploration mechanisms (program-level evolution like AlphaEvolve), curriculum coverage of training data to ensure meta-skill transitions, and finer-grained process rewards with credit assignment. The paper flags multi-turn agent RL as another path worth exploring, on the grounds that IMO-level reasoning requires feedback-driven iterative correction, not single-turn output. These are open directions, not claimed solutions.

Experimental setup and a few limitations

The default temperature is 0.6, top-p is 0.95, and the maximum generation length is 16k tokens. Base models are evaluated zero-shot with the same prompt as RL models to avoid extra reasoning leaked in through in-context examples. The paper notes that base often produces ill-formatted answers in the zero-shot setting, but enough sampling finds properly formatted correct solutions.

Math tasks use zero-RL (RL straight on base, no SFT); code and visual reasoning start from instruction-tuned models per community convention for training stability. All comparisons are “training starting point vs RL model” to isolate RLVR’s own effect.

The paper gives two limitations: many of the strongest RL models and training pipelines are closed-source and cannot be included, and RL for LLMs is evolving fast, so a future RL paradigm may invalidate these conclusions.

Summary

The paper does three things. First, a diagnostic tool: switching from average score to pass@k at large k turns “how many problems can the model actually solve” into something measurable. Second, the phenomenon: across families, algorithms, tasks, and sizes, base consistently overtakes RL at large k, and the RL solvable set is nearly a subset of base’s. Third, the mechanism: perplexity analysis shows RL outputs already live in base’s distribution; the controlled comparison shows distillation can push the solvable set beyond base while RL cannot; the root cause is the double-edged sword of a pretrained prior in a vast action space. The conclusion is not “RLVR is useless,” but “what RLVR currently does is sample base’s existing paths more reliably; genuinely expanding the solvable set requires a new paradigm.”