Major websites and data providers are now charging AI companies between $5 million and $250 million for training datasets [1].

This shift marks a fundamental change in how web content is collected for AI training. For years, AI developers relied on free web crawling to build large language models, but the owners of high-quality data are now treating their archives as premium assets.

Websites are increasingly blocking AI crawlers to force developers into licensing agreements. This move transforms the economics of the web, turning public-facing content into a gated commodity for those building the next generation of chatbots.

One prominent example is Reddit, which secured an API licensing deal valued at $60 million [1]. Corporate deals reach these heights, but the value of an individual's contribution remains low: personal data is estimated to be worth only a few dollars per person [1].

"From Reddit's $60 million API deal to personal data worth 'a few dollars,' the new economics of AI training data are taking shape," a Yahoo Finance summary said [1].

The trend highlights a growing divide between the large companies capable of paying millions for data and smaller developers who may lose access to the open web. As the supply of high-quality, human-generated text becomes scarcer, the cost of acquiring it is expected to remain high.

The transition from an open-web crawling model to a paid licensing regime creates a significant financial barrier to entry for AI development. By monetizing their data, platforms like Reddit are asserting ownership over the intellectual property used to train AI, which could concentrate AI power among the wealthiest firms that can afford these multimillion-dollar datasets.