MAI-Thinking-1: Pre-training Data Processing and Mixture Optimization
MAI-Thinking-1 (MAI-Thinking-1: Building a Hill-Climbing Machine, 2026, Microsoft AI) is a reasoning model trained from scratch by Microsoft. It uses a mixture-of-experts (MoE) architecture with 35B active and 1T total parameters, and was pre-trained on 30T tokens. The data section of the technical report is unusually thorough: it covers collection, cleaning, mixture optimization, and mid-training data strategy, documenting nearly every major decision in a full pre-training data pipeline.
Three design principles run through the entire report:
- Capabilities should be learned, not inherited: no distillation, since imitated capabilities lack the steerability and robustness needed for long RL climbs.
- Simplicity is sustainable: simple, scalable recipes; clean, trustworthy data; transparent infrastructure.
- Scientific rigor avoids shortcuts: every decision must be validated through scaling ladders, ablations, and evaluations.
This post focuses on data collection, cleaning, mixture selection, and mid-training data strategy for the pre-trained base model, MAI-Base-1.