AI data commons
AI model collapse and the need for high-quality training data
Sir Thomas More's Utopia, the book that gave the ideal alternate society its name, was published just over 500 years ago. Utopia has no greed, corruption or power struggles, because there is no money or private property. There are few laws; people live communally and sustain themselves by growing and making what they need.
Utopia is a place of productivity and lifelong learning. Labour is reduced and everyone learns a trade of their choice. Once that is mastered, people may learn a new trade if they wish. "They learn it as though it were a game, not just by observation." (More 1516) There are even public lectures before daybreak, which workers attend as part of their leisure time.
More's Utopia was a satirical critique of 16th-century Europe, and to contemporary eyes it looks more like a dystopia. But the point of utopian thinking is to compel us to think beyond our current reality, which it continues to do.
Critique aside, one aspect of More's Utopia with fresh relevance today is the notion of the commons. Not in terms of material property, but a new kind of intangible resource that increasingly permeates and influences our everyday lives: data.
Abundant, high-quality data freely available on the web has been a communal resource, but generative AI is changing the landscape.
Tina
★
AI data commons
Recent studies have revealed two pressing challenges in generative AI development: the risk of model collapse when AI models are trained on AI-generated content, and the shrinking pool of freely accessible, high-quality training data.
The development of popular generative AI technologies has so far relied on vast amounts of human-generated data, such as text and images, sourced from content-rich public websites including media publications and forums. But since the widespread release of generative AI capabilities, AI-generated content has been populating the web. At this stage there is no way of knowing how much online content is AI-generated, or of reliably checking whether a given piece of content was created by a human or a machine. This creates a problem for future AI model development.
A recent study published in Nature found that when a text-based generative AI model is trained heavily on AI-generated content, it produces nonsense after just a few training generations. AI-generated content narrows the knowledge base by omitting the tails of the content distribution, such as rare events, unique perspectives and nuances of understanding. Models instead churn out likely sequences from the training data while injecting their own unlikely ones, warping performance, propagating bias and producing defective outputs: what the researchers call model collapse. Although the research focused on text, the issue could also affect AI models that produce images or video.
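To make the mechanism concrete, here is a minimal toy sketch (an illustration of the idea, not the paper's experimental setup): a Gaussian "model" is repeatedly refit to samples drawn from its predecessor. Estimation error compounds across generations, the fitted variance drifts towards zero, and rare tail events are the first thing to disappear.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n = 20                # samples per "generation"; kept small so the drift is visible
mu, sigma = 0.0, 1.0  # generation 0: the human-data distribution, N(0, 1)

for gen in range(1, 101):
    samples = rng.normal(mu, sigma, n)         # "train" on the parent model's output
    mu, sigma = samples.mean(), samples.std()  # refit by maximum likelihood
    if gen % 20 == 0:
        # Probability the current model assigns to the originally "rare"
        # region |x| > 2 (approximation that holds while mu stays near 0).
        tail = math.erfc(2 / (sigma * math.sqrt(2)))
        print(f"generation {gen:3d}: sigma = {sigma:.4f}, P(|x| > 2) = {tail:.5f}")
```

In this toy setting, the maximum-likelihood variance estimate shrinks by a factor of (1 − 1/n) per generation in expectation, so tail events, the analogue of rare facts and minority perspectives in web data, vanish long before the model's typical outputs look obviously wrong.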
The risk of model collapse is compounded by the shrinking of the data commons. Until recently, data on the world wide web has been a kind of communal asset. By choice or by accident, people posting public content online contributed to a burgeoning pool of data, free to use by anyone with a computer and the means to collect it. The mammoth datasets used to train the most successful generative AI models to date were created in this way.
[Figure: model collapse, from Shumailov et al. 2024]
However, research recently published by the Data Provenance Initiative highlights that many important web sources used for training AI models have restricted the use of their data over the past year. The researchers estimate that across three commonly used AI training datasets (C4, RefinedWeb and Dolma), 5% of all data, and 25% of data from the highest-quality sources, is now restricted.
Because AI companies used data from content creators and organisations without consent, there has been a wave of legal action, and publishers have built opt-out requests into their websites (although 'do not crawl' instructions are non-binding and can be ignored by unscrupulous actors). Whilst some major publishers have made deals with big tech companies to allow access to their human-generated content, these agreements are out of reach for researchers and non-profits involved in AI development or scrutiny.
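As an illustration, a publisher's opt-out is typically expressed in the site's robots.txt file. The crawler names below are real, commonly blocked tokens, but the directives are purely advisory:

```text
# robots.txt: advisory only. Compliant crawlers honour these rules;
# unscrupulous ones can simply ignore them.

User-agent: GPTBot            # OpenAI's training-data crawler
Disallow: /

User-agent: CCBot             # Common Crawl's crawler
Disallow: /

User-agent: Google-Extended   # opts content out of Google's AI training
Disallow: /
```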
Creating safe, responsible and diverse uses of AI requires experimentation and progress by a range of AI ecosystem stakeholders, not just the biggest technology companies. Whilst some believe AI models can be trained on synthetic data (quality data generated by AI specifically for training), this method is unproven.
To address this consent crisis and counter monopolistic barriers to AI development, we need broadly accessible, high-quality datasets that can be used by non-commercial and emerging service providers; better mechanisms that let content creators decide who can use their publicly available data, and for what purposes; and ways to trace and prove the sources of content online in our increasingly AI-saturated world.
References