“New age of internet censorship”: Reddit to block the Internet Archive from indexing its site. Here’s why it matters

https://www.dailydot.com/news/reddit-to-block-the-internet-archive-from-indexing-the-site/

Braden Bjella Aug 14, 2025 · 5 mins read

There’s an old saying that everything that goes on the internet stays on the internet.

Of course, this is only true to a certain extent. According to a 2024 Pew study, one in four webpages that were online at some point between 2013 and 2023 is no longer accessible. For older pages, the problem is even more pronounced: the study states that 38 percent of webpages that were accessible in 2013 are no longer available.

This is where services like the Internet Archive and its Wayback Machine come in. Described on the site as a “digital library of Internet sites and other cultural artifacts in digital form,” the Wayback Machine allows users to look at defunct websites and older versions of current-day sites. This is an invaluable tool for researchers, as it allows them to see information that is no longer online in addition to how and when sites and articles have been edited.

However, this tool is about to be slightly less effective, as Reddit recently announced that it would be blocking the service from indexing most of the site moving forward. The reason? A.I.

A History of Reddit Limiting Access

As reported by The Verge, Reddit will now block the Internet Archive from indexing many of the pages on the site. While the Wayback Machine will still be able to index the homepage, showing which threads on the site were the most popular at a given date and time, Reddit will no longer allow the service to save individual threads.

The reason for this, the social media site says, is the rise of artificial intelligence and large language models.

In short, while Reddit used to allow free and open access to its API, it has slowly begun to implement fees to use its vast array of content. In 2023, the company announced that it would begin charging companies for developer access to its API, and in 2024, it began to charge search engines to index its content.

Why the sudden clampdown? Since ChatGPT debuted, interest in large language models (LLMs) has grown across the tech sector. Because Reddit is a massive and constantly updating repository of naturalistic user-generated content in multiple languages, it has become a prime source of data for training these LLMs.

Why Is Reddit Blocking the Internet Archive from Indexing the Site?

Seeing that LLMs were using Reddit’s data, the site began to charge companies for its use, striking deals with OpenAI and Google to allow their LLMs to be trained on its data.

Reddit says its recent clampdown on the Internet Archive is related to the use of this data. While companies are supposed to pay Reddit to access its broad swath of content, Reddit spokesperson Tim Rathschmidt claims that some companies are circumventing this by downloading the site’s content from saved versions on the Internet Archive.

“Internet Archive provides a service to the open web, but we’ve been made aware of instances where AI companies violate platform policies, including ours, and scrape data from the Wayback Machine,” Rathschmidt told The Verge.

However, this doesn’t appear to be the only reason. Rathschmidt added that “until [the Internet Archive is] able to defend their site and comply with platform policies (e.g., respecting user privacy, re: deleting removed content) we’re limiting some of their access to Reddit data to protect redditors.”

These limitations will be implemented slowly, with the company saying that it will “inform [the Internet Archive] of the limits before they go into effect.” In response, Mark Graham, director of the Wayback Machine, said in a statement to The Verge that “We have a longstanding relationship with Reddit and continue to have ongoing discussions about this matter.”

Redditors React

On Reddit, a thread on the r/technology subreddit about this news quickly racked up over 30,000 upvotes, with many claiming that stories like these show that the days of a free and open internet are gradually coming to an end.

“Outrageous, especially with how often posts, threads and users get deleted,” wrote a user.

“New age of internet censorship,” declared a second, citing issues like the U.K.’s new age verification law.

Others questioned whether Reddit was being truthful in its statements, claiming that “scraping” the Internet Archive would be a difficult and time-consuming process. Instead, they alleged, other factors may be at play.

“It’s just bull****. The internet archive has pretty aggressive rate limiting, and the loading speed isn’t very fast in the first place,” said a commenter. “Scraping the Wayback machine isn’t exactly efficient. It’s just a false pretense to squeeze them for some money.”

“This makes zero sense. If anyone has used the Internet Archive, they will quickly realize how difficult it would be to scrape because it is so d***ed slow!” exclaimed another.

“Reddit can’t have people recording all of the admin/moderator manipulation. It ruins their platform’s credibility. And thus its cultural relevance and shareholder value,” suggested a third.

We’ve reached out to Reddit and the Internet Archive via email.
