OpenAI accused again of unauthorised training for ChatGPT

  • A research paper has been released alleging that OpenAI trained ChatGPT on unlicensed material.
  • In particular, the company was found to have trained the chatbot on books that were behind a paywall.
  • It is the latest allegation made against the company, which has quickly run out of free public data to train its LLMs on.

OpenAI may be the leading generative AI platform in the market at the moment, but it is facing several problems. These include increasing competition from Chinese rivals able to offer far more affordable products, along with simply running out of freely accessible training data for the LLMs that power offerings like ChatGPT.

The latter issue has seen several allegations levelled against OpenAI in recent years, and the latest comes with evidence to back it up: a research paper published by the AI Disclosures Project finds that OpenAI has been accessing content behind paywalls in order to train its models.

Naturally, it has no authorisation to do so and has not licensed access to the content, something other companies have accused the generative AI startup of doing in the past.

Much of the training that OpenAI's and other AI companies' models undergo draws on freely accessible data on the internet. It is what powers generative AI features like the recent viral Ghibli-style images that have flooded social media, but many of these companies have hit a plateau in the amount of free content available online.

It is part of the reason why OpenAI is looking to push legislation in the United States that would allow it to circumvent copyright law in order to train ChatGPT and other platforms that leverage LLMs. The startup has also shrewdly framed its need to infringe copyright as a matter of keeping the US ahead in the AI race.

“America’s robust, balanced intellectual property system has long been key to our global leadership on innovation. We propose a copyright strategy that would extend the system’s role into the Intelligence Age by protecting the rights and interests of content creators while also protecting America’s AI leadership and national security. The federal government can both secure Americans’ freedom to learn from AI, and avoid forfeiting our AI lead to the PRC by preserving American AI models’ ability to learn from copyrighted material,” OpenAI wrote in a blog post last month.

Shifting back to the most recent allegations, as TechCrunch points out, the paper was published by three researchers under the banner of the AI Disclosures Project. One of the three is Tim O'Reilly, CEO of O'Reilly Media, who also happens to be a co-founder of the Project.

As for what prompted the research, it may have something to do with the fact that OpenAI's GPT-4o model displayed "strong recognition of paywalled O'Reilly book content".

“GPT-4o exhibits far stronger recognition of non-public O’Reilly book content compared to publicly accessible samples, with AUROC scores of 82% (non-public) vs 64% (public). We would expect the opposite, since public data is more easily accessible and repeated across the internet. This highlights the value-add of paywalled high-quality data to a model’s training,” the researchers added.
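
For context, AUROC measures how well a classifier separates two groups; here, it is the probability that an excerpt suspected to be in the training set receives a higher "recognition" score than one the model should never have seen, with 50% being pure chance. Below is a minimal Python sketch of how such a score is computed. The labels and scores are synthetic illustrations invented for this example, not the paper's data, and the probing method that would produce such scores is not shown:

```python
# Minimal sketch: how an AUROC score quantifies a membership-inference test.
# All numbers below are hypothetical, for illustration only.
from sklearn.metrics import roc_auc_score

# Label 1 = excerpt suspected to be in the training set ("member"),
# label 0 = excerpt the model should not have seen ("non-member").
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hypothetical recognition scores obtained by probing the model on each excerpt.
scores = [0.91, 0.84, 0.77, 0.60, 0.55, 0.42, 0.38, 0.21]

# AUROC = probability that a randomly chosen member excerpt outscores a
# randomly chosen non-member one. 0.5 is chance; the paper reports 82%
# for non-public O'Reilly content versus 64% for public samples.
print(f"AUROC: {roc_auc_score(labels, scores):.2f}")
```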

Unsurprisingly, the research paper advocates for greater transparency on training data, as well as compensation for accessing such content.

“Our findings highlight the need for stronger accountability in AI companies’ model pre-training process. Liability provisions that incentivize improved corporate transparency in disclosing data provenance may be an important step to facilitating commercial markets for training data licensing and remuneration,” it continues.

Whether such a scenario can indeed be reached remains to be seen, but as LLMs hit a wall in the amount of data available for training, companies like OpenAI will keep turning to government to help them circumvent longstanding protections.

You can read the research in full for yourself here.

[Image – Photo by Levart_Photographer on Unsplash]
