Reddit issues warning to AI companies scraping its platform

  • Reddit has announced that it plans to update its policies regarding data scraping of its platform by bots.
  • In particular, it plans to rate-limit and block bots from companies it does not have an agreement with.
  • AI companies, however, have regularly been found to have bypassed Reddit’s protections on data scraping.

Data is the gold that AI companies use to train their large language models (LLMs) and generative AI (genAI) platforms, which makes a site like Reddit a goldmine for scraping data that can help build more conversational and human-like systems.

While Reddit has agreements in place with several companies to facilitate data scraping, it seems like not all companies are playing by the rules. This has prompted the website to announce a forthcoming update to its Robot Exclusion Protocol, and in particular its robots.txt file.

For those unfamiliar with this file, it tells third parties which parts of a website they may and may not crawl, with the specifics varying according to the policy of the site owner. In the case of Reddit, those rules are now set to become stricter than before.
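To illustrate the kind of "block by default" policy described here, the sketch below parses a hypothetical robots.txt with Python's standard-library `urllib.robotparser`. The directives shown are an assumption for illustration; Reddit's actual file may differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt illustrating a "block by default" policy;
# the real reddit.com file may contain different directives.
robots_txt = """\
User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Under this policy, an unknown crawler is denied everywhere on the site.
print(rp.can_fetch("UnknownBot", "https://www.reddit.com/r/technology/"))  # False
```

A site can still grant access to partners it trusts by adding per-agent `User-agent` sections with `Allow` rules above the catch-all block.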

“Along with our updated robots.txt file, we will continue rate-limiting and/or blocking unknown bots and crawlers from accessing reddit.com. This update shouldn’t impact the vast majority of folks who use and enjoy Reddit. Good faith actors – like researchers and organizations such as the Internet Archive – will continue to have access to Reddit content for non-commercial use,” the company shared in a blog post.

“Anyone accessing Reddit content must abide by our policies, including those in place to protect redditors. We are selective about who we work with and trust with large-scale access to Reddit content,” it added.

While the Internet Archive was highlighted as an organization that will retain access, it is unclear which companies have been scraping without Reddit’s approval.

The platform has also been careful not to single out any specific AI company for violating its scraping rules.

“This update isn’t meant to single any one entity out; it’s meant to protect Reddit while keeping the internet open,” an unnamed spokesperson told Engadget in a statement.

“In the next few weeks, we’ll be updating our robots.txt instructions to be as clear as possible: if you are using an automated agent to access Reddit, regardless of what type of company you are, you need to abide by our terms and policies, and you need to talk to us. We believe in the open internet, but we do not believe in the misuse of public content,” they added.

As the race to have better AI platforms continues to heat up, the data that these models are trained on will become of increasing importance, as will the way in which they gain access to said data for scraping.
