
AI development needs to be paused until the theft problem is sorted out

  • A dataset used by Apple, Nvidia, Salesforce and Anthropic reportedly contains subtitles and transcripts from popular YouTube videos.
  • YouTuber Marques Brownlee and late night TV shows are among the channels that have had their data effectively stolen.
  • This is just the latest example of copyright being ignored in pursuit of AI development.

Another day, another accusation that a company developing artificial intelligence stole data to train its models. This time, the company in question is actually several companies, with Apple, Nvidia, Salesforce and Anthropic all finding themselves on the wrong end of the AI discourse.

As discovered by Proof News in a remarkable investigation (which we highly recommend you read), subtitles and transcripts from 173,536 YouTube videos were used to train AI models. This data was reportedly scraped from the YouTube channels of late night TV shows as well as larger creators such as Marques Brownlee.

The dataset seen by Proof News was compiled by EleutherAI, a non-profit that is focussed on aligning the multitude of AI models and giving independent researchers better insight into how AI models function. However, it appears as if EleutherAI may have been scraping data it didn't have permission to scrape.

“According to a research paper published by EleutherAI, the dataset is part of a compilation the nonprofit released called the Pile. The developers of the Pile included material from not just YouTube but also the European Parliament, English Wikipedia, and a trove of Enron Corporation employees’ emails that was released as part of a federal investigation into the firm,” writes Proof News.

The Pile (an awful name) was reportedly used by the aforementioned big tech firms to train their AI models. While the dataset doesn't contain video footage, it does comprise video transcripts. That doesn't make matters any better though because, as Brownlee notes on X, many creators pay people to transcribe their videos for them.

Proof News’ report draws comment from a number of YouTube creators, each similarly shocked and frustrated that their content was being used in this manner.

Of course, this situation has become par for the course for AI companies, which have truly embraced the adage that it's easier to ask for forgiveness than it is to ask for permission. Copyright owners have launched multiple lawsuits against AI companies, including OpenAI among others. The most recent sees the Recording Industry Association of America suing Udio and Suno for alleged copyright infringement.

However, no lawsuit has played out fully just yet, so no precedent has been set. This means that AI is still very much a lawless Wild West where anything goes.

Many AI makers argue that they are operating under fair use protections and that the data was “publicly available”, but it's unclear whether these claims would hold up to legal scrutiny. From a moral and ethical viewpoint, however, there is an argument to be made that scraping the internet for content with a view to ultimately profiting off of that content to the tune of billions is wrong.

But AI companies don't seem to care. Microsoft AI boss Mustafa Suleyman seems to believe that nothing posted to the internet is protected by copyright. OpenAI admits that without content theft, its platform would be nothing more than an interesting experiment, and there are many other examples of AI firms using content they don't have permission to use for training. The worst, however, is when companies are cagey about how their new wondrous product was trained. Earlier this year, OpenAI's chief technology officer Mira Murati couldn't disclose what data was used to train the company's Sora platform.

While we're all for technological progress, it has become increasingly apparent that companies that would be quick to launch legal battles against those infringing on their copyright are far less concerned with infringing on the copyright of others.

Maybe, then, we should hit pause on AI development until the mountain of legal challenges against companies developing the tech has been wrapped up. Until then, AI will remain little more than an interesting experiment in copyright infringement and theft.
