We Are Running Out of Real World Data for AI Training: Experts

Artificial intelligence experts, including Elon Musk, have claimed that the real-world data available for training AI models has largely been exhausted. Speaking during a live-streamed conversation with Stagwell chairman Mark Penn on X, Musk said:

“We’ve now exhausted basically the cumulative sum of human knowledge in AI training. That happened basically last year.”

Musk’s comments align with statements from Ilya Sutskever, former OpenAI chief scientist, who addressed the issue of “peak data” at the NeurIPS machine learning conference in December. Sutskever predicted that AI models would need to rely more heavily on synthetic data because of the scarcity of real-world data.

Rise of Synthetic Data

Musk emphasized synthetic data as the future of AI training. He explained that AI models could generate training data themselves, describing this as a process of “self-learning” where AI “grades itself.” Several major tech companies have already adopted this approach.

Microsoft, Meta, OpenAI, Anthropic, and others have incorporated synthetic data into their AI development pipelines. For instance, Microsoft’s Phi-4 and Google’s Gemma models use synthetic data alongside real-world data, while Meta fine-tuned its Llama series and Anthropic developed Claude 3.5 Sonnet with similar methods.

Synthetic data offers significant cost advantages. Writer, an AI startup, claimed that its Palmyra X 004 model, built almost entirely using synthetic data, cost only $700,000 to develop. By comparison, a similarly sized OpenAI model reportedly cost $4.6 million.

Challenges of Synthetic Data

Despite its benefits, synthetic data poses risks. Some research suggests it can lead to model collapse, where a model’s outputs become less creative and more biased. This occurs because synthetic data inherits limitations and biases from the models that generate it, potentially reinforcing those flaws over time. These risks underline the importance of balancing synthetic data with quality real-world data to maintain functionality and fairness in AI systems.
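The feedback loop behind model collapse can be illustrated with a toy simulation. The Python sketch below is not drawn from the research mentioned above, and it involves no real AI model. It assumes the simplest possible stand-in: a "model" that is just a bell curve fitted to data, where each new generation is trained only on a small sample produced by the previous one. The function name `simulate_collapse` and its parameters are hypothetical, chosen for illustration.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=10, seed=0):
    """Repeatedly fit a Gaussian to samples drawn from the previous fit.

    Returns the fitted standard deviation at every generation,
    starting with generation 0.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0 stands in for "real-world" data
    history = [std]
    for _ in range(generations):
        # The current "model" generates a small synthetic dataset...
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        # ...and the next "model" is fitted to that synthetic data alone.
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)
        history.append(std)
    return history


if __name__ == "__main__":
    history = simulate_collapse()
    for generation in (0, 10, 50, 100, 200):
        print(f"generation {generation:3d}: std = {history[generation]:.6f}")
```

In this toy setting the fitted spread tends to shrink from one generation to the next, because every refit loses a little of the variety in the data it was given and no fresh real-world data is ever added back. It is a loose analogy for outputs becoming narrower over time, not a measurement of how any production AI system behaves.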

As the AI industry navigates this shift, experts continue to debate how to mitigate these challenges while leveraging synthetic data’s cost and scalability benefits.