Thoughts on Training Data Depletion and Quality Degradation

There used to be a popular claim: human-generated data on the internet would be exhausted within a few years, and LLM training data was running out. Epoch AI predicted in 2022 that high-quality text data might be depleted around 2026. The discussion was intense at the time, but it seems to come up less often now.

The internet is indeed flooded with AI-generated content. Search for almost anything and the first few results likely include AI-written text. The earlier fear was that this AI-generated text would be crawled back into training data: models eating their own output and getting worse with each iteration, the so-called model collapse. A Nature paper (Shumailov et al., 2024) rigorously demonstrated the mechanism: recursively training on a model's own output causes the tails of the data distribution to gradually vanish, making outputs increasingly homogeneous.
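The mechanism shows up even in a toy simulation. The sketch below is my own illustration, not the paper's setup: fit a Gaussian to a finite sample, resample from the fit, and repeat. Because each finite sample under-represents the tails, the fitted spread drifts downward across generations.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100                                   # samples per "generation" (deliberately small)
data = rng.normal(0.0, 1.0, size=n)      # generation 0: "human" data

for gen in range(1, 201):
    mu, sigma = data.mean(), data.std()   # "train": fit a Gaussian by MLE
    data = rng.normal(mu, sigma, size=n)  # "generate": resample from the fitted model
    if gen % 40 == 0:
        print(f"gen {gen:3d}: fitted sigma = {sigma:.3f}")
```

Run long enough, sigma wanders toward zero: the one-dimensional analogue of outputs becoming homogeneous.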

But what's actually happening diverges from this narrative in a key way: not all content enters the training pipeline. Search engines and RAG systems use the strongest LLMs to score and filter content quality. Starting with Google's March 2024 core update, low-quality AI content has been systematically downranked; Google's stated figure is a 45% reduction in low-quality, unoriginal content in search results. Other search engines run similar deduplication and quality checks. AI-generated content is indeed increasing, but the data that enters training pipelines has been rigorously filtered.
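In miniature, such a filter stage might look like the sketch below. Everything here is a stand-in: `quality_score` is a placeholder for an LLM judge or learned classifier, the hash-based dedup substitutes for MinHash/SimHash-style near-duplicate detection, and the threshold is arbitrary.

```python
import hashlib
from typing import List

def dedup_key(text: str) -> str:
    # Crude dedup: hash of whitespace-normalized text. Real pipelines use
    # MinHash/SimHash to catch *near*-duplicates; this only catches exact ones.
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()

def quality_score(text: str) -> float:
    # Placeholder for an LLM judge. Toy heuristic: penalize very short
    # documents and heavy word repetition.
    words = text.split()
    if len(words) < 20:
        return 0.0
    return len(set(words)) / len(words)   # lexical diversity in [0, 1]

def filter_corpus(docs: List[str], threshold: float = 0.5) -> List[str]:
    seen, kept = set(), []
    for doc in docs:
        key = dedup_key(doc)
        if key in seen:
            continue                      # drop duplicates outright
        seen.add(key)
        if quality_score(doc) >= threshold:
            kept.append(doc)              # only high-scoring docs survive
    return kept

corpus = [
    "word " * 50,  # repetitive spam: low diversity, filtered out
    "A genuinely informative passage with varied vocabulary, long enough "
    "to pass the length check, describing how modern search pipelines "
    "deduplicate and score crawled documents before indexing them.",
]
print(len(filter_corpus(corpus)))  # -> 1
```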

As a result, the internet and LLMs have already formed a human-in-the-loop cycle. Humans write content and models learn from it; models generate content, humans and stronger models jointly filter what's worth keeping, and what survives enters the next round of training. Structurally, this process resembles RLHF, except the feedback signal comes not from annotator preference scores but from search engine rankings, user clicks and dwell time, and quality-filtering models.
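Reduced to its shape, one turn of that cycle looks like the sketch below. Every name here is a hypothetical stand-in, not a real API: `train` for a training run, `generate` for models publishing content, `feedback` for the implicit signals just listed.

```python
from typing import Callable, List

def ecosystem_round(
    corpus: List[str],
    train: Callable[[List[str]], object],     # stand-in for a training run
    generate: Callable[[object], List[str]],  # stand-in for models publishing
    feedback: Callable[[str], float],         # stand-in for rankings/clicks/dwell time
    threshold: float,
) -> List[str]:
    """One publish-filter-retrain turn; the RLHF-like step is the
    feedback threshold, with implicit signals replacing annotator scores."""
    model = train(corpus)            # learn from everything that survived so far
    candidates = generate(model)     # model-generated content goes public
    survivors = [doc for doc in candidates if feedback(doc) >= threshold]
    return corpus + survivors        # only filtered content reaches the next round
```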

A linguistic analogy helps here. Language itself has always been evolving: every generation invents new words, new usages, new grammatical variants. By the same panic logic, language should have degenerated into nothing but slang and abbreviations long ago. Why hasn't it? Because the contexts in which language is used have built-in selection mechanisms: everyday conversation, formal writing, and news reporting all exert continuous selective pressure, preserving the language's core expressive capacity. The spread of AI-generated content on the internet faces similar selective pressure.

This doesn’t mean problems don’t exist. Low-quality AI content polluting long-tail queries is real, and information quality in certain vertical domains is genuinely declining. But the causal chain of “AI content floods the internet causing model training to collapse” doesn’t hold up, in my view.

So the entire ecosystem looks more like a spontaneous, large-scale reinforcement learning loop. In a sense, with AI capabilities augmenting human output, the efficiency of producing high-quality data has dramatically increased. At that point, the question becomes analogous to the old one about the steam engine, and now AI: did automation eliminate jobs, or did it increase overall demand?

As for whether high-quality data is hand-crafted by humans, AI-assisted, or purely AI-generated — it doesn’t seem to matter that much. Black cat, white cat — if it efficiently expresses and conveys information, it’s a good cat.

Perhaps those worried about training data running out have underestimated both how fast filtering mechanisms evolve and humanity’s capacity to keep producing quality content.