Reddit is now blocking the Internet Archive (IA) from indexing popular Reddit threads after allegedly catching sneaky AI firms—restricted from scraping Reddit—instead simply scraping data from IA's archived content.
Where before IA's Wayback Machine dependably archived Reddit pages, profiles, and comments—as part of its mission to archive the Internet—moving forward, only screenshots of the Reddit homepage will be archived. As The Verge noted, this means the archive will only be useful as a snapshot of popular posts and news headlines each day, rather than providing a backup documenting deleted posts or a window into various Reddit subcultures or any given user's activity.
Reddit has not confirmed which AI firms were scraping its data from the Wayback Machine. The company's spokesperson, Tim Rathschmidt, would only confirm to Ars that Reddit has become "aware of instances where AI companies violate platform policies, including ours, and scrape data from the Wayback Machine."
Rathschmidt suggested there may be steps that IA could take to better defend against the AI scraping of archived Reddit content. That could perhaps lead Reddit to lift its restrictions on IA's archiving. The Verge reported that those restrictions will be ramping up across Reddit starting today.
But Reddit is also taking this time to address other apparently longstanding privacy concerns, adding that restrictions are appropriate since the Wayback Machine problematically archives content that users have deleted.
"Until they’re able to defend their site and comply with platform policies (e.g., respecting user privacy, re: deleting removed content) we’re limiting some of their access to Reddit data to protect redditors," Rathschmidt said.
A review of social media comments suggests that some Redditors have used the Wayback Machine in the past to research deleted comments or threads. Those commenters noted that myriad other tools exist for surfacing deleted posts or researching a user's activity, with some suggesting that the Wayback Machine may not have been the easiest platform to navigate for that purpose.
Redditors have also turned to resources like IA during times when Reddit's platform changes trigger content removals. Most recently in 2023, when changes to Reddit's public API threatened to kill beloved subreddits, archives stepped in to preserve content before it was lost.
IA has not signaled whether it's looking into fixes to get Reddit's restrictions lifted and did not respond to Ars' request to comment on how this change might impact the archive's utility as an open web resource, given Reddit's popularity.
The director of the Wayback Machine, Mark Graham, told Ars that IA has "a longstanding relationship with Reddit" and continues to have "ongoing discussions about this matter."
It seems likely that Reddit is financially motivated to restrict AI firms from taking advantage of Wayback Machine archives, perhaps hoping to spur more lucrative licensing deals like the ones Reddit struck with OpenAI and Google. The terms of the OpenAI deal were kept quiet, but the Google deal was reportedly worth $60 million, and over the next three years, Reddit expects to make more than $200 million off such licensing deals.