MAI-Thinking-1: Pre-training Data Processing and Mixture Optimization

Published on: 2026-06-17 | Category: NLP | Reading time: ≈ 10 min

MAI-Thinking-1 (MAI-Thinking-1: Building a Hill-Climbing Machine, 2026, Microsoft AI) is a reasoning model trained from scratch by Microsoft, with a 35B active / 1T total parameter MoE architecture, pre-trained on 30T tokens. The data section of the technical report is unusually thorough: it covers collection, cleaning, mixture optimization, and mid-training data strategy, documenting nearly every decision in a full pre-training data pipeline.

Three design principles run through the entire report:

Capabilities should be learned, not inherited: no distillation, since imitated capabilities lack the steerability and robustness needed for long RL climbs.
Simplicity is sustainable: simple, scalable recipes; clean, trustworthy data; transparent infrastructure.
Scientific rigor avoids shortcuts: every decision must be validated through scaling ladders, ablations, and evaluations.

This post focuses on the data collection, cleaning, mixture selection, and mid-training data strategy for the pre-training base model MAI-Base-1.

Data Sources and HTML Extraction

The 30T pre-training tokens come entirely from publicly available and licensed human-generated data: web HTML, web PDFs, public GitHub code, books and journals, and third-party commercially acquired datasets. No LM-generated synthetic data is used, and AI-generated content within collected sources is actively detected and removed. No open-source training datasets are used either, with huggingface.co and mirror sites excluded entirely. Microsoft's proprietary crawler respects robots.txt and excludes sources that violate Microsoft Responsible AI policies or appear on the USTR Notorious Markets list.

Knowledge cutoff dates vary by source: Web HTML September 2025, Web PDF December 2025, GitHub June 2025, books and journals March 2026.

The central challenge in HTML extraction is that no single method handles all web pages well. Math and code are represented inconsistently across the web, and many off-the-shelf extraction tools drop formulas, code blocks, and tables, retaining only the surrounding plain text. The report lists four strategies, applied per domain:

Source-specific structured parsers (HTML/XML schema-aware)
Hand-crafted extractors (BeautifulSoup, for domains with consistent structure that general heuristics handle poorly)
LLM/agent-based extraction (hard constraint: can only keep or remove original text, cannot add synthetic content)
Training on raw content directly

The fourth strategy has a notable example: Wikipedia is trained directly on wikitext markup, which is roughly 3x more verbose than rendered HTML, but elements like infoboxes are frequently corrupted by existing parsers. For a high-value, low-volume source like Wikipedia, spending 3x the token budget to preserve structural integrity is a worthwhile trade-off.
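The keep-or-remove constraint on LLM/agent-based extraction lends itself to a mechanical check. The following is a minimal sketch, not the report's implementation: it assumes extraction works at line granularity and that a valid output is an in-order subsequence of the original lines. The function name `is_keep_or_remove_only` is hypothetical.

```python
def is_keep_or_remove_only(original: str, extracted: str) -> bool:
    """Return True if every non-empty extracted line appears in the original, in order.

    Assumption: the extractor keeps or drops whole lines. Any line that is
    rewritten, reordered, or newly generated makes the check fail.
    """
    original_lines = iter(line.strip() for line in original.splitlines())
    for line in extracted.splitlines():
        target = line.strip()
        if not target:
            continue
        # Membership testing on an iterator consumes it up to the match,
        # which enforces that kept lines preserve their original order.
        if target not in original_lines:
            return False
    return True


page = "Home | About\nTitle\nE = mc^2\nCopyright"
print(is_keep_or_remove_only(page, "Title\nE = mc^2"))
print(is_keep_or_remove_only(page, "Title\nSummary: energy"))
```

A check like this only guards against added or rewritten text; it says nothing about whether the right lines were kept.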

Five-Layer Deduplication

Deduplication at this scale directly affects model quality. The 35B active / 1T total parameter MoE with high sparsity has exceptional memorization capacity, and duplicate samples exacerbate overfitting, compressing the space available for generalization. Meanwhile, predictive scaling behavior is sensitive to the effective number of unique tokens: larger models show degraded scaling on low-novelty corpora because they exhaust the supply of learnable new information sooner.

The five deduplication layers target different granularities of redundancy:

Boilerplate removal: line-frequency statistics to strip headers, footers, and navigation bars
Exact duplicates: byte-level and hash-level exact matching
Fuzzy duplicates: MinHash LSH with a similarity threshold of 0.8
Templated web pages: pages generated from shared templates with minor lexical variation (e.g., “calculator” pages with raw arithmetic tables) are skeletonized to their most frequent tokens and fuzzy-deduplicated as clusters
Semantic duplication: Qwen3-Embedding-0.6B identifies semantically similar documents, retaining only a limited number of representatives per cluster. This type of redundancy is especially common in code, where canonical programming exercises like BST traversal recur across homework sets, exams, interviews, and competitions with diverse implementations but convergent solutions
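To make the fuzzy-duplicate layer concrete, here is a small pure-Python sketch of MinHash similarity estimation with the 0.8 threshold mentioned above. It is illustrative only: the shingle size, the number of hash functions, and the use of salted BLAKE2b are assumptions, and the LSH banding step that makes this scale to billions of documents is omitted in favor of a direct pairwise comparison.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set[str]:
    """Word n-grams of the lowercased text (n=3 is an assumed setting)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash_signature(items: set[str], num_perm: int = 128) -> list[int]:
    """One minimum per salted hash function; the salts stand in for permutations."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(a: str, b: str, num_perm: int = 128) -> float:
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    sig_a = minhash_signature(shingles(a), num_perm)
    sig_b = minhash_signature(shingles(b), num_perm)
    return sum(x == y for x, y in zip(sig_a, sig_b)) / num_perm


def is_fuzzy_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    return estimated_jaccard(a, b) >= threshold
```

In a real pipeline the signatures would be computed once per document and bucketed with LSH so that only candidate pairs are compared.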

Cross-dataset deduplication relies on a global drop-order: duplicates are retained only in the highest-priority dataset. Historically, overlap was managed through explicit filters (e.g., excluding the Wikipedia domain from web-crawl data), but as the number of sources and pipeline complexity grew, unintended overlaps became difficult to manage manually, prompting the shift to the drop-order mechanism. An easily overlooked side effect: modifying a single dataset can change which dataset retains a duplicate, so the drop-order effectively introduces a hidden variable. This needs to be accounted for when evaluating the marginal contribution of any individual dataset.
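The hidden-variable effect is easy to reproduce in a toy setting. The sketch below assumes documents are identified by a duplicate-cluster key (here just a string) and that `drop_order` lists datasets from highest to lowest priority; the function and dataset names are hypothetical.

```python
def apply_drop_order(
    datasets: dict[str, list[str]], drop_order: list[str]
) -> dict[str, list[str]]:
    """Keep each document only in the highest-priority dataset that contains it."""
    seen: set[str] = set()
    kept: dict[str, list[str]] = {}
    for name in drop_order:  # highest priority first
        kept[name] = []
        for doc_key in datasets.get(name, []):
            if doc_key not in seen:
                seen.add(doc_key)
                kept[name].append(doc_key)
    return kept


order = ["wiki", "web"]
before = apply_drop_order({"wiki": ["a", "b"], "web": ["b", "c"]}, order)
# Filtering "b" out of the wiki dataset silently hands it to the web dataset.
after = apply_drop_order({"wiki": ["a"], "web": ["b", "c"]}, order)
print(before)
print(after)
```

Editing only the high-priority dataset changes the contents of the low-priority one, which is why a per-dataset ablation has to be rerun through the full drop-order rather than evaluated in isolation.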

Filtering, Categorization, and Data Ablation

The goal of filtering and categorization is not merely to improve quality; more precisely, it is to convert heterogeneous raw corpora into structured data that mixture optimization algorithms can operate on. The aim is to enable subsequent controlled ablations along dimensions such as source family, quality tier, topic, educational level, language, and format.

The process has two steps: first, remove content unlikely to contribute positively (spam, policy-sensitive material, noise), then bucket the remainder. Bucketing dimensions include quality tiers, language groups, topics, educational value, educational level, source type, and domain-specific subcorpora.

The technical approach combines five classes of methods: metadata signals, source-specific heuristics, learned classifiers, prompted LLMs, and manual labeling.

After bucketing, every dataset undergoes ablation experiments. The report uses two approaches:

Single-source ablation upsamples the target data to 50% of the mixture and trains from scratch, quantifying the data source’s marginal utility across held-out NLL evaluations.

Scaling-ladder ablation ablates within the full mixture and uses a ladder of models with increasing parameter counts to predict performance at the target scale. Due to multi-epoch effects, data is downsampled to simulate the repetition patterns of the final training run.

The figure above shows a data mixing ablation example: 183 models trained from scratch at 3 scales across 61 Web HTML / Code / other mixtures. The solid lines mark the Pareto frontier along the Web HTML and Code validation loss dimensions. Points on the frontier correspond to mixtures where Web and Code dominate; points off the frontier reflect an outsized share of the “other” subset.
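Two pieces of this ablation workflow can be sketched in a few lines. The first function builds a single-source ablation mixture by fixing the target at 50% and rescaling the remaining sources proportionally; the proportional rescaling is an assumption, since the report summary does not say how the other half is distributed. The second extracts a Pareto frontier over two validation losses, where lower is better on both axes. All numbers are hypothetical.

```python
def single_source_ablation_mix(
    base: dict[str, float], target: str, target_share: float = 0.5
) -> dict[str, float]:
    """Upsample `target` to `target_share`; other sources keep their relative ratios."""
    rest = {name: w for name, w in base.items() if name != target}
    rest_total = sum(rest.values())
    mix = {name: (1.0 - target_share) * w / rest_total for name, w in rest.items()}
    mix[target] = target_share
    return mix


def pareto_frontier(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Return the (web_loss, code_loss) points that no other point beats on both axes."""
    frontier = []
    best_code = float("inf")
    for web_loss, code_loss in sorted(points):
        # Sorted by web loss, so a point survives only if it improves code loss.
        if code_loss < best_code:
            frontier.append((web_loss, code_loss))
            best_code = code_loss
    return frontier


print(single_source_ablation_mix({"web": 0.6, "code": 0.3, "math": 0.1}, "math"))
print(pareto_frontier([(2.0, 1.5), (2.1, 1.2), (2.2, 1.3), (2.3, 1.0), (2.05, 1.6)]))
```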

Mixture Selection

Given hundreds of heterogeneous data sources and a fixed compute budget, determining the weight for each source is the core mixture optimization problem. MAI uses a weighted NLL as the optimization target:

$$\text{Target} = 0.5 \times \text{Coding} + 0.175 \times \text{STEM} + 0.175 \times \text{Math} + 0.1 \times \text{General} + 0.05 \times \text{Multilingual}$$

Code and STEM/Math together account for 0.85 of the total weight, making the reasoning-first priority explicit.
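As a small worked example, the target can be computed from per-category NLL values as follows. The weights come from the formula above; the dictionary keys and the sample NLL values are hypothetical, and the sketch assumes each category's NLL has already been aggregated to a single number.

```python
TARGET_WEIGHTS = {
    "coding": 0.5,
    "stem": 0.175,
    "math": 0.175,
    "general": 0.1,
    "multilingual": 0.05,
}


def weighted_nll(nll_by_category: dict[str, float]) -> float:
    """Weighted sum of per-category NLL; lower is better."""
    return sum(w * nll_by_category[name] for name, w in TARGET_WEIGHTS.items())


# Hypothetical per-category NLL values for one candidate mixture.
candidate = {"coding": 0.9, "stem": 1.4, "math": 1.1, "general": 2.0, "multilingual": 2.2}
print(weighted_nll(candidate))

reasoning_weight = sum(TARGET_WEIGHTS[k] for k in ("coding", "stem", "math"))
print(reasoning_weight)
```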

The report identifies five challenges in mixture optimization: defining utility, a vast search space, scale-dependent effects, cross-dataset interactions, and multi-epoch effects. Scale dependence means that sources useful at small model sizes may not be useful at large ones. Multi-epoch effects arise when small, high-quality datasets receive high weight: over a long training horizon they are consumed repeatedly, which leads to overfitting.

One class of solutions trains many small models (760M to 4B, i.e., L12 to L36 on the ladder) to build predictive models, with RegMix as a representative method. MAI also experimented with variants of this approach. These methods rely on a core assumption: that the relative performance ordering of two mixtures at small scale is preserved at large scale, known as rank invariance.

But MAI found in practice that this assumption does not always hold. They compared code-heavy-mix (approximately 50% code) and stem-heavy-mix (significantly upsampled STEM). At 5B scale, stem-heavy-mix outperformed on STEM evaluations, but after scaling to 23B active / 20T tokens, the two STEM evaluation curves crossed midway through training, with code-heavy-mix ultimately winning.

The cause was two data sources in stem-heavy-mix that were high-quality but had substantial fuzzy duplication and limited content diversity, accounting for 11.8% of stem-heavy-mix (versus just 0.3% in code-heavy-mix). Smaller models benefited from the repeated exposure, but larger models, having fully absorbed the repeated content, were dragged down by the lack of diversity.

This finding led MAI to stop relying solely on conclusions from a single scale, instead using the ladder approach to validate mixture scaling properties across multiple scales.
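A rank-invariance check across two scales can be expressed as a simple ordering comparison. The sketch below is an illustration, not the report's procedure, and the loss values are invented to mirror the direction of the crossover described above rather than taken from the report.

```python
def rank_invariant(
    losses_small: dict[str, float], losses_large: dict[str, float]
) -> bool:
    """True if mixtures are ordered the same way (by loss) at both scales."""
    names = sorted(losses_small)
    order_small = sorted(names, key=lambda n: losses_small[n])
    order_large = sorted(names, key=lambda n: losses_large[n])
    return order_small == order_large


# Hypothetical STEM evaluation losses (lower is better).
small_scale = {"stem-heavy-mix": 1.90, "code-heavy-mix": 1.95}
large_scale = {"stem-heavy-mix": 1.52, "code-heavy-mix": 1.50}
print(rank_invariant(small_scale, large_scale))
```

With more than two rungs on the ladder, the same comparison would be applied between each pair of adjacent scales to see where the ordering first flips.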

The final mixture was determined through hierarchical search: data was divided into approximately 10 categories, with alternating local search over within-category weights and global search over between-category weights. The maximum number of epochs for any single dataset was capped at 8. After selecting candidates, a scale-up validation using approximately 2.8x the compute of global mixing confirmed that the optimal candidate no longer changed with scale.

Source Family	Unique Tokens (T)	Training Tokens (T)	Share	Avg. Epochs
Code	7.4	16.4	54.6%	2.22
STEM	2.2	4.7	15.8%	2.17
Math	0.3	1.6	5.4%	5.28
Books	0.6	0.9	3.1%	1.65
PDFs	2.7	1.4	4.7%	0.53
Web text	8.1	4.5	14.9%	0.55
Multilingual	8.1	0.5	1.6%	0.06
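As a rough consistency check, average epochs can be recomputed as training tokens divided by unique tokens. The sketch below does this from the rounded table values, so small deviations from the reported column are expected. It also shows how an epoch cap translates into a maximum mixture weight for a single dataset; the 0.1T dataset in the last line is hypothetical, and applying the cap through a weight bound is an assumption about how such a constraint could be enforced.

```python
# (unique tokens in T, training tokens in T, reported avg. epochs), from the table.
TABLE = {
    "Code": (7.4, 16.4, 2.22),
    "STEM": (2.2, 4.7, 2.17),
    "Math": (0.3, 1.6, 5.28),
    "Books": (0.6, 0.9, 1.65),
    "PDFs": (2.7, 1.4, 0.53),
    "Web text": (8.1, 4.5, 0.55),
    "Multilingual": (8.1, 0.5, 0.06),
}


def recomputed_epochs(unique_t: float, training_t: float) -> float:
    return training_t / unique_t


def max_weight_under_cap(unique_t: float, budget_t: float, max_epochs: float = 8.0) -> float:
    """Largest mixture weight that keeps a dataset at or below `max_epochs`."""
    return min(1.0, max_epochs * unique_t / budget_t)


total_training_t = sum(training_t for _, training_t, _ in TABLE.values())
for name, (unique_t, training_t, reported) in TABLE.items():
    epochs = recomputed_epochs(unique_t, training_t)
    print(f"{name:<13} recomputed={epochs:.2f} reported={reported:.2f}")
print(f"total training tokens: {total_training_t:.1f}T")

# Hypothetical dataset with 0.1T unique tokens in a 30T-token run.
print(max_weight_under_cap(0.1, 30.0))
```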

Several extreme values illustrate the trade-offs in mixture strategy. Math has only about 300B unique tokens but is sampled an average of 5.28 times, the highest reuse rate of any source family. Web text and PDFs each have far more unique tokens than were actually consumed, with average epochs below 1, meaning the full available corpus of these sources was not exhausted even across a 30T-token run. Multilingual is the most aggressively downsampled: only 0.5T of the 8.1T available tokens were used.

The figure above shows the contribution of different data sources to a graduate-level physics NLL evaluation. PDF and web math/STEM data are positively correlated with physics NLL performance, general web is largely neutral, and increasing the code share does not improve physics performance. This type of visualization is one of the tools used during the global search phase to assess per-source contributions.

Mid-training Data

The mid-training phase consists of two steps (3.4T + 150B tokens), with one key design decision: all data comes from the pre-training corpus, with no new or synthetic sources introduced. The only changes are filtering, re-weighting, and repacking at longer sequence lengths.

The overall mixture shifts further toward reasoning: STEM/Math increases to 35% (up from about 21% combined in the pre-training table), Code stays at roughly 55%, and the remaining 10% goes to background sources. Within-category weights are tuned via single-source microanneals, optimizing the same NLL objective used during pre-training mixture search, augmented with long-context NLL tasks.

STEM reasoning data filtering introduces Bloom’s taxonomy. The report defines a “Bloom Analyze heuristic”: retain documents with high technical correctness, at least intermediate reasoning depth, and Bloom cognitive processing level at or above Analyze, while removing documents that contain only simple factual statements. This is more granular than pure quality scoring, distinguishing “technical documents with reasoning chains” from “correct but fact-listing-only documents.”
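The heuristic can be read as a conjunction of three conditions over per-document labels. The sketch below is one possible encoding under stated assumptions: the label schema (`DocLabels`), the depth scale, and the 0.8 correctness threshold are all hypothetical. Only the ordering of the revised Bloom levels and the "at or above Analyze" rule come from established usage and the text above.

```python
from dataclasses import dataclass

# Revised Bloom's taxonomy, lowest to highest cognitive level.
BLOOM_LEVELS = ["remember", "understand", "apply", "analyze", "evaluate", "create"]
# Hypothetical reasoning-depth scale.
DEPTH_LEVELS = ["shallow", "intermediate", "deep"]


@dataclass
class DocLabels:
    technical_correctness: float  # assumed classifier score in [0, 1]
    reasoning_depth: str          # one of DEPTH_LEVELS
    bloom_level: str              # one of BLOOM_LEVELS


def passes_bloom_analyze(doc: DocLabels, min_correctness: float = 0.8) -> bool:
    """Keep documents that are correct, reason at some depth, and reach Analyze or above."""
    return (
        doc.technical_correctness >= min_correctness
        and DEPTH_LEVELS.index(doc.reasoning_depth) >= DEPTH_LEVELS.index("intermediate")
        and BLOOM_LEVELS.index(doc.bloom_level) >= BLOOM_LEVELS.index("analyze")
    )


derivation = DocLabels(0.95, "deep", "analyze")
fact_list = DocLabels(0.99, "shallow", "remember")
print(passes_bloom_analyze(derivation), passes_bloom_analyze(fact_list))
```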

Code data undergoes two additional processing steps. First, file extension filtering on a per-quality-bin basis: HTML/CSS/SVG files are retained in top-tier repositories (where they typically belong to larger frontend projects) but removed from lower-tier ones (where they are predominantly low-quality standalone pages). Second, file-level document formatting is introduced alongside the existing repo-level format from pre-training, treating repository understanding and individual file understanding as complementary tasks.
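The per-quality-bin extension rule might look like the following sketch. The bin names and the function `keep_file` are hypothetical; the only behavior taken from the text is that HTML/CSS/SVG files survive in top-tier repositories and are dropped elsewhere.

```python
import os

FRONTEND_EXTENSIONS = {".html", ".css", ".svg"}


def keep_file(path: str, quality_bin: str, top_bins: frozenset = frozenset({"top"})) -> bool:
    """Drop HTML/CSS/SVG files unless the repository is in a top-tier quality bin."""
    extension = os.path.splitext(path)[1].lower()
    if extension in FRONTEND_EXTENSIONS:
        return quality_bin in top_bins
    return True  # other file types are not affected by this rule


print(keep_file("app/static/index.html", "top"))
print(keep_file("homepage/index.html", "low"))
print(keep_file("src/main.py", "low"))
```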

Memorization-aware epoch capping is a mechanism specific to mid-training. It estimates which sources have been heavily memorized by the end of pre-training using a per-source validation proxy: the fraction of validation loss improvement between two checkpoints that comes from near-certainty token predictions (NLL < 0.01). A high fraction indicates that NLL reduction is primarily driven by memorization or highly repetitive structure rather than acquisition of new capabilities. Such sources receive stricter epoch caps during mid-training, while others are allowed more exposure.
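One way to read this proxy is sketched below. It assumes per-token NLL is available for the same validation tokens at both checkpoints, and it counts a token as near-certain when its NLL at the later checkpoint is below 0.01; which checkpoint the threshold applies to is an interpretation, not something the text specifies. The function name and the sample values are hypothetical.

```python
def near_certainty_fraction(
    nll_before: list[float], nll_after: list[float], threshold: float = 0.01
) -> float:
    """Share of total NLL improvement contributed by tokens that end up near-certain."""
    improvements = [before - after for before, after in zip(nll_before, nll_after)]
    total = sum(improvements)
    if total <= 0:
        return 0.0  # no net improvement between the two checkpoints
    near_certain = sum(
        gain for gain, after in zip(improvements, nll_after) if after < threshold
    )
    return near_certain / total


# Hypothetical per-token NLL at an earlier and a later checkpoint.
memorized = near_certainty_fraction([2.0, 1.0, 0.5], [0.005, 0.9, 0.4])
learning = near_certainty_fraction([2.0, 1.0], [1.0, 0.5])
print(memorized, learning)
```

A source whose fraction is close to 1 would be a candidate for a stricter epoch cap under this reading.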

Long context extension proceeds in two stages, expanding the context window from 16k to 64k and then to 256k. Both stages only repack data at the longer sequence length without modifying mixture weights. This minimizes distribution shift: changing weights and sequence length simultaneously would make it difficult to attribute performance changes to either factor. Repacking also has the benefit of reducing truncation of high-quality long documents.
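The truncation benefit can be illustrated with a toy calculation. The sketch assumes that any document longer than the sequence length is simply cut at that length (real packers may split or chunk instead) and that 16k, 64k, and 256k mean 16,384, 65,536, and 262,144 tokens. The document lengths are hypothetical.

```python
def truncated_tokens(doc_lengths: list[int], seq_len: int) -> int:
    """Tokens lost if every document is cut off at `seq_len` tokens."""
    return sum(max(0, length - seq_len) for length in doc_lengths)


doc_lengths = [8_000, 40_000, 200_000]  # hypothetical document lengths in tokens
for seq_len in (16_384, 65_536, 262_144):
    print(seq_len, truncated_tokens(doc_lengths, seq_len))
```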

Summary

The defining characteristic of MAI’s data pipeline is that every step is tied to evaluation: bucketing exists so that ablation experiments can operate along controlled dimensions, and mixture selection uses alternating local/global search with cross-scale validation rather than relying on conclusions fixed at a single small scale. The rank invariance counterexample is a finding with real practical value: high-quality data that helps at small model sizes can fail at larger ones due to insufficient diversity, and mixture decisions require cross-scale validation.