[ICLR 2025] RegMix: Data Mixture as Regression

Published: 2026-06-16 · Category: NLP · Reading time: ≈ 8 min

In LLM training, the pretraining data mixture used to be treated as core know-how, set by gut feel: upsample Wikipedia because it’s clean, downweight Common Crawl because it’s noisy. Once data sources grow from a handful to a few hundred and total tokens pass the trillion mark, that approach breaks down. Most existing automatic methods (DoReMi, DoGE, Online Data Mixing, etc.) train a not-so-small proxy model and adjust weights based on training dynamics, and the proxy itself often burns hundreds of billions of tokens.

RegMix (Data Mixture as Regression for Language Model Pre-training, ICLR 2025) proposes a different route: train a few hundred 1M-parameter proxy models, each on a random mixture; treat the (mixture, validation loss) pairs as regression data; fit a LightGBM model; then search the mixture space for the optimum. 512 1M-parameter models trained on 1B tokens each consume roughly 2% of the FLOPs of a single 1B model, yet the regression fitted on them accurately picks out the best mixture among 64 candidate 1B-parameter models trained on 25B tokens. Pushed to 7B/100B tokens, the chosen mixture beats the Pile’s hand-tuned mixture by about 2 points averaged across 13 downstream tasks. On the Pile, the single-task gap between mixtures can reach 14.6 points, which the paper uses to make the case that picking the wrong mixture is expensive.
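A quick back-of-the-envelope check of the 2% figure. This is a sketch that assumes the common rule of thumb of about 6 FLOPs per parameter per training token, and assumes the 1B reference model is the 1B/25B-token setting used elsewhere in this post; neither assumption is spelled out in the text above.

```python
# Rough training cost via the common 6 * N * D approximation
# (N = parameters, D = training tokens). The rule of thumb is an
# assumption of this sketch, not a number taken from the paper.
def train_flops(params: float, tokens: float) -> float:
    return 6.0 * params * tokens

proxy_total = 512 * train_flops(1e6, 1e9)  # 512 proxies, 1M params, 1B tokens each
target = train_flops(1e9, 25e9)            # assumed reference: one 1B model on 25B tokens
ratio = proxy_total / target

print(f"proxies: {proxy_total:.2e} FLOPs")
print(f"target:  {target:.2e} FLOPs")
print(f"ratio:   {ratio:.1%}")
```

Under these assumptions the proxies cost about 3e18 FLOPs, the same order of magnitude as the 3.5e18 the post quotes for RegMix later on.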

Paper: arxiv.org/abs/2407.01492

The core assumption: rank invariance

The whole pipeline rests on one empirical assumption: the relative ranking of mixtures stays stable as model size and training tokens grow. The paper calls this the rank invariance hypothesis. Put another way, if mixture A beats mixture B on a 1M model, A most likely still beats B on a 1B or larger model.

This assumption is worth thinking about on its own. When you use a small model as a stand-in for a large one, methods differ in how much they require the small-model observations to carry over. The strictest demand is that loss values extrapolate directly: fit a scaling law on N → L and predict large-model loss from the curve. RegMix asks for something weaker: the relative ranking among mixtures should not change. If A beats B on small models and still beats B on large ones, that is enough; how much smaller the loss is does not matter.

The paper quantifies ranking stability with the Spearman rank correlation $\rho$. The setup: on 17 Pile subsets (the original Pile has 22, the rest were dropped for copyright reasons), sample a few hundred mixtures at random, train 512 1M-parameter proxies on 1B tokens each, and record each one’s Pile-CC loss. Use mixture as input and loss as output to train a LightGBM. Then sample another set of unseen mixtures, train 1M, 60M, and 1B models on each, and compute the Spearman correlation between the actual ranking and the LightGBM-predicted ranking:

Test model (parameters / tokens)	Linear $\rho$ (%)	LightGBM $\rho$ (%)
1M / 1B tokens	90.08	98.45
60M / 1B tokens	89.26	98.64
1B / 25B tokens	88.01	97.12

A LightGBM fitted on 1M models predicts the ranking of 1B/25B-token models with 97% Spearman correlation. The model grew 1000×, tokens grew 25×, and the prediction still holds. This is the most direct evidence for the rank invariance assumption and what makes the whole pipeline viable. The supplementary experiments extend this table to the 280M / 5B-token midpoint, showing that no pair of cells in the matrix drops below 0.9.
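To make the metric concrete, here is a minimal pure-Python sketch of the Spearman rank correlation: rank both lists, then take the Pearson correlation of the ranks. The loss values are invented toy numbers, not results from the paper; they are chosen so that the absolute loss levels differ a lot between "proxy" and "large model" while the ordering mostly agrees.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg_rank
        pos = end + 1
    return result

def spearman(x, y):
    """Spearman rho: the Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Toy numbers, invented for illustration: losses predicted from small proxies
# vs. losses measured on a larger model, for five hypothetical mixtures.
predicted = [3.90, 3.75, 3.82, 3.60, 3.70]
measured = [3.10, 2.95, 3.05, 2.80, 2.97]
print(spearman(predicted, measured))  # about 0.9: one adjacent pair is swapped
```

Only the ordering enters the score, which is why RegMix can get away with proxies whose absolute losses are far from the large model's.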

The RegMix pipeline

Four steps:

Sample mixtures: draw a few hundred mixtures from a Dirichlet distribution. The $\alpha$ parameter is “the normalized vector of available tokens per domain, multiplied by a scaling factor between 0.1 and 5.0.” This covers both extreme-sparse mixtures (one domain dominant) and near-uniform ones, while keeping the expected mixture aligned with actual token availability so that something like “give a domain with 1% of the tokens a 50% weight” stays improbable.

Train proxies: train a 1M-parameter TinyLlama on each mixture for 1B tokens. All 512 models can run in parallel.

Fit the regression: use the mixture vector as input features and the proxy’s validation loss on a chosen target domain as the label. The paper compares ridge regression and LightGBM; on 1M / 1B-token prediction, LightGBM pushes Spearman $\rho$ from 90.08 to 98.45. The choice of target determines which distribution the large model will lean toward, discussed below.

Simulate and train the large model: run a large-scale simulation in the mixture space using the fitted regression (a million candidates take under 10 CPU seconds), average the top-100 predicted mixtures, and use that as the final mixture for large-scale training.
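The four steps above can be sketched end to end in plain Python. This is a toy under several loudly stated assumptions, not the authors' code: `toy_proxy_loss` is a hypothetical stand-in for actually training a 1M proxy, `token_counts` is invented, the Dirichlet scaling factor is assumed to be drawn uniformly from [0.1, 5.0], the regressor is a small hand-written ridge regression (the linear baseline the paper compares against) rather than LightGBM, and the simulation uses 20,000 candidates instead of a million to keep it fast.

```python
import random

def sample_mixture(token_counts, rng):
    """Draw one mixture from Dirichlet(alpha), alpha = token share * scale."""
    total = sum(token_counts)
    scale = rng.uniform(0.1, 5.0)  # assumption: scale is drawn uniformly
    while True:
        draws = [rng.gammavariate(scale * c / total, 1.0) for c in token_counts]
        s = sum(draws)
        if s > 0:  # tiny alphas can underflow to 0.0; redraw if all of them do
            return [d / s for d in draws]

def toy_proxy_loss(mix):
    """Hypothetical stand-in for 'train a proxy, read its validation loss'."""
    return 4.0 - 1.2 * mix[0] ** 0.5 - 0.4 * mix[1] ** 0.5 + 0.6 * mix[2]

def fit_ridge(X, y, lam=1e-3):
    """Ridge regression with intercept, solved by Gaussian elimination.
    For simplicity the intercept is regularized too."""
    rows = [[1.0] + list(x) for x in X]
    d = len(rows[0])
    # Normal equations: (A^T A + lam * I) w = A^T y
    M = [[sum(r[i] * r[j] for r in rows) + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(d)]
    for col in range(d):
        piv = max(range(col, d), key=lambda k: abs(M[k][col]))
        M[col], M[piv] = M[piv], M[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, d):
            f = M[r][col] / M[col][col]
            for c in range(col, d):
                M[r][c] -= f * M[col][c]
            b[r] -= f * b[col]
    w = [0.0] * d
    for i in reversed(range(d)):
        w[i] = (b[i] - sum(M[i][j] * w[j] for j in range(i + 1, d))) / M[i][i]
    return w

def predict(w, x):
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

rng = random.Random(0)
token_counts = [60, 25, 15]  # invented token counts (billions) for 3 domains

# Steps 1-3: sample mixtures, "train" proxies, fit the regression.
train_mixes = [sample_mixture(token_counts, rng) for _ in range(512)]
train_losses = [toy_proxy_loss(m) for m in train_mixes]
w = fit_ridge(train_mixes, train_losses)

# Step 4: score fresh candidates, then average the 100 best predictions.
candidates = [sample_mixture(token_counts, rng) for _ in range(20000)]
top = sorted(candidates, key=lambda m: predict(w, m))[:100]
final_mix = [sum(m[i] for m in top) / len(top) for i in range(len(token_counts))]
print([round(v, 3) for v in final_mix])
```

One design note: a linear model's minimum over the simplex always sits at a corner, so this toy ends up very close to a single-domain mixture. A non-linear regressor such as LightGBM can represent interior optima, which the linear baseline cannot.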

Choosing the target domain

Which domain’s loss you pick as the regression target directly determines the final mixture. Pushing down all domain losses at once turns into a multi-objective fight. The paper settles on Pile-CC (a Common Crawl subset) validation loss for one reason: across 64 1B/25B-token models and 14 downstream tasks, Pile-CC validation loss has the strongest negative correlation with downstream performance among all domains, with a magnitude close to 1.0 on HellaSwag. Wikipedia’s correlation is noticeably weaker, despite Wikipedia being treated as the “high quality” benchmark domain throughout the GPT-3 era.
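A minimal sketch of that selection rule: for each candidate domain, correlate its validation loss with the downstream score across a set of trained models, then pick the domain with the most negative correlation. All numbers are invented, the domain names are placeholders, and using Pearson correlation here is an assumption of this sketch.

```python
def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Invented toy data: five models trained on different mixtures.
# val_loss[d][k] is model k's validation loss on placeholder domain d.
val_loss = {
    "domain_a": [3.2, 3.0, 3.1, 2.9, 3.3],
    "domain_b": [2.1, 2.3, 2.0, 2.2, 2.15],
}
downstream = [44.0, 46.5, 45.2, 47.1, 43.5]  # invented average task scores

corr = {d: pearson(losses, downstream) for d, losses in val_loss.items()}
target_domain = min(corr, key=corr.get)  # most negative correlation wins
print(corr, target_domain)
```

In this toy, lower loss on `domain_a` tracks higher downstream scores almost perfectly, so it would be chosen as the regression target.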

This lines up with what some recent work has found: a diverse web mixture reflects overall model ability better than a single high-quality encyclopedia. The paper takes Pile-CC down to the URL level, and on C4-100Domain finds that more than 85% of URL domains correlate strongly with overall Pile-CC downstream behavior, suggesting Pile-CC’s strong correlation is not a coincidence but a consequence of its topical breadth.

In practice RegMix assigns Pile-CC a weight of about 0.87, with the remaining 16 domains splitting the other 13%. The paper runs a direct comparison: training on Pile-CC only, RegMix’s automatic mixture, the hand-tuned Pile mixture, PPL filtering, ODM, and DoReMi. Pile-CC Only averages 46.8 against RegMix’s 47.3, so it is already close; DoReMi also reaches 46.8 but takes more than 10× the FLOPs (3.7e19 vs 3.5e18). So that 13% of non-Pile-CC mass in the mixture still buys about 0.5 points.

To check whether RegMix works in out-of-distribution settings, the paper runs another experiment that drops Pile-CC entirely from training and uses the remaining 16 domains to optimize Pile-CC validation loss. RegMix still beats every baseline in this out-of-distribution setup.

Number of proxies vs tokens per proxy

Under a fixed FLOPs budget, do you train more proxies or train each one for more tokens? The paper sweeps this at fine granularity within 1B tokens:

In the paper’s plot, the x-axis is tokens per proxy and the three curves are 64, 128, and 512 proxies. The finding is sharp: past about 0.25B tokens per proxy, more tokens stop boosting rank correlation, while more proxies (64 → 512) keep helping. The counterintuitive comparison: 512 proxies at 0.2B tokens each beat 128 proxies at 0.8B tokens each, even though both configurations consume the same 102.4B tokens in total and hence similar total FLOPs.

The two directions hit different bottlenecks. Giving a proxy more tokens makes its loss estimate more precise; but RegMix only uses rankings, so however finely you sharpen the loss, the ranking doesn’t change. Sampling another mixture and training a new proxy adds another training point for the regression, and the mixture space here has 17 dimensions (one weight per Pile subset), so more samples make the LightGBM fit better. That’s why more proxies keep paying off. This directly shapes the practical recipe: you can parallelize more aggressively and cut wall-clock time.
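The budget comparison is easy to verify by hand. A tiny sketch, assuming all proxies have the same size so that cost is proportional to total tokens:

```python
def total_tokens(num_proxies: int, tokens_per_proxy: int) -> int:
    """Total tokens consumed by a set of equally sized proxies."""
    return num_proxies * tokens_per_proxy

many_short = total_tokens(512, 200_000_000)  # 512 proxies at 0.2B tokens each
few_long = total_tokens(128, 800_000_000)    # 128 proxies at 0.8B tokens each

print(many_short, few_long, many_short == few_long)
```

Same spend either way, but the first configuration hands the regression 512 training points instead of 128.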

The paper also compares 512 1M proxies against 128 1B proxies (orders of magnitude more FLOPs). The resulting 7B models tie on 13 tasks (56.5 vs 56.4 average), so the paper recommends starting with ultra-small 1M-scale proxies.

Cross-domain interactions

Replace LightGBM with linear regression and you lose some accuracy, but each regression coefficient gets a clean interpretation: bump the weight of the $i$-th training domain a bit, and the $j$-th validation domain’s loss goes up or down depending on the sign. The paper visualizes the full coefficient matrix as a heatmap and finds plenty of results that contradict intuition. PhilPapers is one striking example: increasing its weight pushes down loss on every other domain. That’s hard to spot by hand, since PhilPapers is just a small philosophy-papers subset of the Pile.
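Reading such a coefficient matrix is mechanical. In the sketch below, `coef[j][i]` is the fitted change in validation loss on domain `j` per unit of extra weight on training domain `i`; the matrix values and the three domain names are invented placeholders, not the paper's heatmap.

```python
domains = ["A", "B", "C"]

# coef[j][i]: effect of training-domain i's weight on validation-domain j's loss.
# Negative means "more of i lowers loss on j". Values are invented.
coef = [
    [-0.9, 0.2, -0.1],   # validation domain A
    [0.3, -0.8, -0.2],   # validation domain B
    [0.1, 0.4, -0.5],    # validation domain C
]

def helps_all_others(i: int) -> bool:
    """True if raising training domain i lowers loss on every other domain."""
    return all(coef[j][i] < 0 for j in range(len(domains)) if j != i)

universal_helpers = [domains[i] for i in range(len(domains)) if helps_all_others(i)]
print(universal_helpers)
```

In this toy matrix only `C` has a negative coefficient on both other domains, which is the pattern the post describes for PhilPapers.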

More broadly, the scatter plots of the 1M training logs show that apart from DM Mathematics, whose distribution is unusually isolated and whose weight-vs-loss relationship is roughly log-log linear, most domains don’t trace clean curves: more weight does not always mean less loss, and different mixtures produce non-monotonic jumps. That is what the paper’s claim that data mixture effects “transcend scaling laws” actually means: a single-domain power-law form is not enough to describe the joint effect of a mixture; you need all domains as joint input.
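For the one well-behaved case, "log-log linear" means a straight line after taking logs of both weight and loss, i.e. a power law. A minimal sketch of that fit, using synthetic points generated from an exact power law (the constants 2.0 and -0.3 are invented, not the paper's values for DM Mathematics):

```python
import math

def fit_loglog(weights, losses):
    """Least-squares line through (log weight, log loss); returns (slope, intercept)."""
    xs = [math.log(w) for w in weights]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return slope, intercept

# Synthetic data from loss = 2.0 * weight ** -0.3 (invented constants).
weights = [0.01, 0.02, 0.05, 0.1, 0.2]
losses = [2.0 * w ** -0.3 for w in weights]

slope, intercept = fit_loglog(weights, losses)
print(round(slope, 6), round(math.exp(intercept), 6))  # recovers the exponent and prefactor
```

For domains whose scatter is non-monotonic, no single slope like this describes the data, which is the point the paragraph above makes.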

Downstream results

Applying RegMix’s chosen mixture at 1B/25B tokens and 7B/100B tokens:

At 1B/25B tokens, the 14-benchmark average is 47.3, against 45.1 (Pile hand-tuned), 46.8 (DoReMi), 46.2 (PPL filtering), 45.0 (ODM). On HellaSwag specifically, RegMix beats the hand-tuned mixture by 6.8 points.

At 7B/100B tokens, the 13-benchmark average is 56.5 vs the hand-tuned 54.5, a ~2-point gap. On tasks that scale visibly with tokens (HellaSwag, PiQA, Lambada, etc.), RegMix reaches the hand-tuned mixture’s full 100B-token score by the 25B-50B mark; the paper reports about a 2× speedup on most benchmarks and close to 4× on PiQA. On tasks that don’t track token count, like MultiRC, no mixture pulls clearly ahead.

Limitations

The rank invariance assumption is only validated up to 1B parameters. Pushing to 3B or above would require training 64 models on 50B tokens each to get a statistically meaningful correlation check, equivalent to training one 3B model on 3.2T tokens, which is beyond the authors’ compute. Microsoft’s MAI-Thinking-1 report gives a 23B-scale counterexample: stem-heavy-mix and code-heavy-mix curves cross midway through a 20T-token run, showing this assumption doesn’t always hold at scale.

Domains must be known. If the data has no clear domain label, RegMix can’t be applied directly. This is most awkward on web data, where there is no native domain tag; FineWeb’s URL-based domain definition is essentially a workaround.

The proxy and the large model must share a tokenizer. Once you swap tokenizers, domain weights don’t transfer cleanly, which makes reusing mixtures across projects messy.

Infinite-data assumption. RegMix implicitly assumes every domain has unlimited tokens, which is what lets it assign a 0.87 weight to Pile-CC. At MAI-Thinking-1-scale runs, most domains exhaust their available tokens before 0.5 epoch, and the actual mixture ceiling is bounded by token availability.
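A small sketch of why token availability caps the mixture. It computes how many passes over a domain a given weight implies, and the largest weight allowed under an epoch limit. The 200B-token domain size and 1T-token run length are invented, and the one-epoch default is an assumption of this sketch rather than anything RegMix prescribes.

```python
def epochs_needed(weight: float, total_training_tokens: float, available_tokens: float) -> float:
    """How many passes over a domain a given mixture weight implies."""
    return weight * total_training_tokens / available_tokens

def max_weight(total_training_tokens: float, available_tokens: float, max_epochs: float = 1.0) -> float:
    """Largest weight a domain can take without exceeding max_epochs passes."""
    return min(1.0, max_epochs * available_tokens / total_training_tokens)

run_tokens = 1_000_000_000_000    # invented: a 1T-token training run
domain_tokens = 200_000_000_000   # invented: a domain with 200B unique tokens

print(epochs_needed(0.87, run_tokens, domain_tokens))  # passes implied by a 0.87 weight
print(max_weight(run_tokens, domain_tokens))           # weight ceiling at one epoch
```

With these toy numbers a 0.87 weight would mean more than four passes over the domain, while a one-epoch limit caps its weight at 0.2.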