News Publishers Block Internet Archive Access

What happened

Around 245 global news organisations across nine countries are trying to stop the Internet Archive's crawlers from capturing their content for the Wayback Machine. The move is driven by concerns that AI companies are using archived news to train LLMs without permission or payment. More than 20 major news organisations already block ia_archiverbot, and 241 sites block at least one of the Archive's four crawling bots; many of these sites are owned by USA Today Co. New York Times spokesperson Graham James said that AI companies are using Times content in violation of copyright, in ways that compete directly with the publisher.
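The brief does not say how the blocks are implemented. A common mechanism for turning away a named crawler is a robots.txt rule, so the sketch below assumes that approach. The robots.txt content, the example URL, the second bot name, and the helper `blocked_agents` are all hypothetical and are not taken from any publisher's site; only the user-agent string ia_archiverbot comes from the text above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only; not copied from a real site.
ROBOTS_TXT = """\
User-agent: ia_archiverbot
Disallow: /

User-agent: *
Allow: /
"""


def blocked_agents(robots_txt, agents, url="https://news.example/story"):
    """Return the user agents that the given robots.txt text bars from url."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    # "SomeOtherBot" is a made-up name standing in for any unlisted crawler.
    print(blocked_agents(ROBOTS_TXT, ["ia_archiverbot", "SomeOtherBot"]))
```

Run over a list of publisher sites, a check like this is one way a count such as "sites blocking at least one bot" could be reproduced. Note that robots.txt is advisory: it only has effect if the crawler chooses to honour it.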

Why it matters

Access to high-quality, structured historical news data for AI model training is diminishing. This directly affects AI model developers and research teams that rely on such content to build more human-like language models. Procurement teams and legal counsel face greater complexity in sourcing training data, which could raise costs or limit model capabilities as structured, attributed, and dated content becomes harder to obtain. The move follows the UK government's reversal of its AI copyright stance after backlash, and it highlights ongoing tensions around intellectual property and AI data use.
