01 What model collapse is

AI language models are trained on large datasets of text. During training, the model learns statistical patterns from this data. If the training data is high-quality and diverse, the model learns accurate and varied representations of language, knowledge, and reasoning.

Model collapse describes what happens when a model is trained on data that includes a significant proportion of AI-generated text. AI-generated text reflects the patterns and limitations of the models that produced it. A new model trained on this data learns not just the genuine underlying patterns but also the biases, errors, and characteristic limitations of earlier AI models. When each generation is trained on the outputs of its predecessors, successive generations of models become less accurate and less diverse.

Shumailov et al. [1], a team including researchers from Oxford and Toronto, demonstrated this experimentally. Their paper was first posted to arXiv in 2023, with a follow-up published in Nature in 2024. Models trained iteratively on their own outputs showed significant performance degradation within a small number of generations. The effect has been compared to photocopying a photocopy: each generation introduces noise that compounds.
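The compounding effect can be illustrated with a toy simulation. The sketch below is not a reproduction of the paper's experiments: it assumes the "model" is just a Gaussian fitted to data, and the function name, sample size, generation count, and seed are all illustrative choices. Each generation draws a small synthetic dataset from the previous fit and refits to that data alone, so no fresh human data ever enters the loop.

```python
import random
import statistics


def simulate_recursive_fitting(generations=300, sample_size=20, seed=0):
    """Repeatedly fit a Gaussian to samples drawn from the previous fit.

    Returns the fitted standard deviation at each generation, with the
    original ("human") data distribution at index 0.
    """
    rng = random.Random(seed)   # fixed seed so runs are repeatable
    mean, std = 0.0, 1.0        # generation 0: the original data distribution
    std_history = [std]

    for _ in range(generations):
        # The current "model" generates a finite synthetic training set...
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        # ...and the next "model" is fitted to that synthetic data only.
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)
        std_history.append(std)

    return std_history


if __name__ == "__main__":
    history = simulate_recursive_fitting()
    for generation in (0, 10, 50, 100, 300):
        print(f"generation {generation:3d}: std = {history[generation]:.6f}")
```

In this toy setting the fitted spread tends to shrink generation after generation, because each refit sees only a finite sample of the previous fit and the rare values in the tails are the first to be lost. That shrinkage is a simple analogue of the reduced diversity described above.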

02 Why it matters for enterprise AI strategy

Model collapse is primarily a concern for AI model developers, not for organisations using models via API. The major AI providers are actively working to ensure their training pipelines maintain data quality and do not inadvertently train on AI-generated content without appropriate filtering.

Model collapse does, however, have implications for enterprise knowledge management. Organisations that use AI to generate significant volumes of internal content (policy documents, knowledge base articles, training materials, reports) and then use that content as grounding data for internal AI systems, whether through retrieval-augmented generation (RAG) or fine-tuning, risk creating a comparable degradation cycle within their own organisation. With RAG the model itself is not retrained, but errors in the retrieved content still flow into its answers.

If an AI writes your knowledge base articles, and those articles are then fed back to an AI to answer employee questions, and that AI's answers are then used to update the knowledge base, you are building a closed loop of AI-generated content that will degrade in quality over time as errors compound and genuine human knowledge recedes.

03 Preserving human knowledge in an AI-augmented organisation

The model collapse risk points to a broader governance principle: human knowledge and human review must remain active components of knowledge management systems in organisations using AI.

In practice, this means three things. First, AI-generated content should be reviewed and validated by human subject matter experts before being published to knowledge bases that will be used as AI grounding data. Second, the provenance of knowledge base content should be tracked so that organisations can identify what is human-generated and what is AI-generated. Third, knowledge bases should be audited regularly to ensure they remain accurate and that errors introduced by AI have not propagated.
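These controls can be sketched as a small data model. The example below is a minimal illustration, not a reference implementation: the `Article` record, the `Origin` labels, the `reviewed_by` field, and both helper functions are hypothetical names chosen for this sketch, and a real knowledge base would hold this metadata in its own content management system.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Origin(Enum):
    """Provenance label recorded when an article is created (hypothetical scheme)."""
    HUMAN = "human"
    AI = "ai"


@dataclass(frozen=True)
class Article:
    title: str
    origin: Origin
    reviewed_by: Optional[str] = None  # subject matter expert who validated it, if any


def eligible_for_grounding(article: Article) -> bool:
    """Review gate: AI-generated content needs a named human reviewer."""
    return article.origin is Origin.HUMAN or article.reviewed_by is not None


def unreviewed_ai_share(articles: List[Article]) -> float:
    """Audit metric: fraction of articles that are AI-generated and unreviewed."""
    if not articles:
        return 0.0
    flagged = [a for a in articles if not eligible_for_grounding(a)]
    return len(flagged) / len(articles)


if __name__ == "__main__":
    knowledge_base = [
        Article("Expenses policy", Origin.HUMAN),
        Article("VPN setup guide", Origin.AI, reviewed_by="IT service desk lead"),
        Article("Onboarding FAQ", Origin.AI),  # not yet reviewed
    ]
    grounding_set = [a.title for a in knowledge_base if eligible_for_grounding(a)]
    print("Eligible for grounding:", grounding_set)
    print(f"Unreviewed AI share: {unreviewed_ai_share(knowledge_base):.0%}")
```

Recording provenance at creation time is what makes the other two controls enforceable: the review gate decides what reaches the grounding data, and the audit metric shows whether unreviewed AI content is accumulating.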

This is not an argument against using AI for knowledge management; the productivity benefits are real. It is an argument for maintaining human oversight as an active quality control layer rather than automating that layer away.

04 The longer-horizon risk

Model collapse is a sign of a broader risk: as AI-generated content becomes more prevalent, human-generated data makes up a shrinking share of the text available for training, and the quality of the data on which AI capabilities depend may degrade. This is a systemic risk for AI development, not just for individual organisations.

For boards, the relevant question is whether their organisation's content governance practices maintain the integrity of the human knowledge that makes AI useful, or whether they are inadvertently contributing to a degradation cycle. Policies on AI content labelling, human review of AI-generated material before publication, and audit of knowledge bases used for AI grounding are governance mechanisms that address this risk directly.

Key Takeaways

1. Model collapse is the degradation that occurs when AI models are trained on AI-generated data, compounding errors and reducing output diversity across successive generations.
2. For enterprises, the immediate risk is creating internal model collapse cycles: using AI to generate knowledge base content that is then used as grounding data for other AI systems.
3. Human review and validation of AI-generated content before it enters knowledge bases is the primary governance control.
4. Content provenance tracking (human-generated vs AI-generated) and regular knowledge base audits are practical governance mechanisms.
5. The longer-horizon systemic risk is degradation of the human-generated data on which AI quality depends as AI-generated content becomes a larger proportion of available text.

References & Further Reading

[1] Shumailov et al., "The Curse of Recursion: Training on Generated Data Makes Models Forget", arXiv, 2023.
