How AI scraping is undermining Open Access: A challenge for GLAM and Publishers
- Louise Arnold
GLAM (Galleries, Libraries, Archives, and Museums) institutions and publishers share a common mission: to make knowledge accessible. But as a new wave of AI scraping bots descends on their sites, that mission is being put under serious strain.

The irony: open access leads to outages
AI bots harvesting data are overwhelming infrastructure, disrupting user experience, and in some cases, knocking platforms offline.
This GLAM-E Lab report outlines a growing problem:
What happens when demand for digital knowledge becomes so great that it threatens the very platforms designed to provide it? It’s an ironic twist we all need to consider carefully.
Why AI scraping matters now
Over the last year, we've seen a dramatic increase in publishers and GLAM institutions struggling to manage bot traffic.
Wikimedia reports that AI scraping is straining its servers, with a 50% year-on-year increase in bandwidth for multimedia content, and that bots generate 65% of its most expensive traffic despite accounting for just 35% of total visits.
Platforms like DiscoverLife have seen scraping traffic surges that rendered their sites unusable.
There’s been a shift from AI training crawlers toward real-time retrieval bots used for retrieval-augmented generation (RAG), which fetch content on demand and scrape at a much faster rate.
Bots often ignore protections like robots.txt, change IPs frequently, and don’t identify themselves as bots, making them nearly impossible to block with traditional tools.
The result? Degraded CX: slower pages, patchy availability, frustrated real users, and ultimately outages.
In many cases, the warning signs appear too late. Analytics tools often filter out bot traffic by default, which means the early signs of a problem may be hidden. A moderate spike might not raise alarms, but by the time traffic surges past capacity thresholds, real users are already being affected.
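One practical workaround is to look at raw server logs rather than filtered dashboards. As a minimal sketch (the log path, format, and field handling below are illustrative assumptions, not a prescription), you can compare total requests per minute with requests whose user agent self-identifies as a bot:

```python
# Minimal sketch: surface traffic spikes in raw access logs, including
# traffic that analytics dashboards filter out. Assumes a combined-format
# log file named access.log (path and format are illustrative).
import re
from collections import Counter

LINE = re.compile(r'\[(?P<ts>[^\]]+)\] "(?:[A-Z]+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"')
DECLARED_BOT = re.compile(r'bot|crawler|spider', re.IGNORECASE)

total = Counter()     # requests per minute, all clients
declared = Counter()  # requests per minute from self-identified bots

with open("access.log") as f:
    for line in f:
        m = LINE.search(line)
        if not m:
            continue
        minute = m.group("ts")[:17]  # e.g. "12/Jun/2025:14:03"
        total[minute] += 1
        if DECLARED_BOT.search(m.group("ua")):
            declared[minute] += 1

for minute in sorted(total):
    t, d = total[minute], declared[minute]
    print(f"{minute}  total={t:5d}  declared_bots={d:5d}  undeclared={t - d:5d}")
```

A wide gap between the total and declared-bot columns during a spike is a hint that stealth crawlers, invisible to filtered analytics, are driving the load.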
Blocking isn’t so simple
While there are sophisticated techniques to detect and deter bots, ranging from device fingerprinting to rate limiting, the rapidly evolving nature of scraping means no approach remains effective or affordable for long.
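To illustrate, the mechanics are rarely the problem; the durability of the signal you key on is. Here is a minimal token-bucket rate limiter in Python, keyed by client IP, which is exactly the field scrapers rotate (the rates shown are illustrative):

```python
# Minimal token-bucket rate limiter, keyed by client IP. Illustrative only:
# a scraper that rotates IPs gets a fresh bucket each time, which is why
# rate limiting alone does not stay effective for long.
import time

class TokenBucket:
    def __init__(self, rate: float, burst: int):
        self.rate, self.burst = rate, burst   # tokens/second, bucket size
        self.tokens, self.last = float(burst), time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the time elapsed, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets: dict[str, TokenBucket] = {}

def allow_request(client_ip: str) -> bool:
    # One bucket per IP: 2 requests/second sustained, bursts of 10.
    bucket = buckets.setdefault(client_ip, TokenBucket(rate=2.0, burst=10))
    return bucket.allow()
```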
Should institutions block bots outright? Or try to shape how they interact?
It’s not easy when many bots act more like stealth crawlers than cooperative agents. Some claim to be retrieving data “on behalf of a user,” bypassing bot rules entirely.
This creates a difficult balancing act: staying open and accessible, while protecting infrastructure and budgets from being quietly drained.
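For context, robots.txt, the web's main opt-out mechanism, is purely advisory. A sketch of a typical policy (GPTBot and CCBot are published crawler tokens; the specific rules here are illustrative):

```
# Compliant crawlers honour these rules; stealth crawlers simply ignore the file.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Crawl-delay: 10
```

Well-behaved crawlers respect this; the stealth crawlers and "on behalf of a user" agents described above often never fetch the file at all.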
A broader issue: ethics, economics, and expectations
This isn’t just a tech issue; it’s a governance issue. Cultural and publishing organisations can’t keep scaling servers indefinitely. But they’re also hesitant to restrict access, especially if that means turning away legitimate users.
As AI use grows, open access economics are faltering. Bot traffic brings no referrals or engagement; it just extracts. As a TollBit report shared with Forbes found, AI agents send back 96% less referral traffic than a traditional Google search.
Toward a more sustainable response
There’s no single fix, but several ideas are emerging:
Analytics: Understanding who’s scraping and what they're scraping is step one.
Fair use models: Publishers are testing tools like TollBit and Cloudflare’s permission-based paywalls. These may not suit all GLAM institutions but show how norms might evolve.
Legislation & standards: Regulating bots that impersonate humans may be more impactful than trying to police content reuse.
Monitoring and Load Testing: Identify CX and performance risks before they turn into outages, whether from bots or real users. Proactive CX Monitoring and Real-world Load Testing are part of digital resilience (a minimal sketch follows below).
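To make that last point concrete, here is a minimal load-test sketch using only Python's standard library. The target URL, concurrency, and request counts are placeholder assumptions; a real-world exercise would replay realistic user journeys against a staging environment, with the site owner's agreement:

```python
# Minimal load-test sketch: fire concurrent requests at a placeholder
# endpoint and report latency percentiles. Illustration only; never point
# this at a production site without agreement.
import time
import statistics
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

TARGET = "https://staging.example.org/"  # hypothetical endpoint
CONCURRENCY = 20
REQUESTS = 200

def timed_fetch(_):
    start = time.monotonic()
    with urlopen(TARGET, timeout=10) as resp:
        resp.read()
    return time.monotonic() - start

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = sorted(pool.map(timed_fetch, range(REQUESTS)))

print(f"median: {statistics.median(latencies):.3f}s")
print(f"p95:    {latencies[int(len(latencies) * 0.95)]:.3f}s")
```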
Let’s talk, not just block
AI developers, publishers, and cultural institutions all have a stake in keeping the web’s knowledge infrastructure healthy.
We need shared signals. Clearer expectations. And yes, possibly new legislation. But above all, we need collaboration, because building more servers won't fix what is, at heart, a governance problem.
Let’s start that conversation.
Further resources
If you're looking to get ahead of CX and performance problems and identify them before they escalate, learn more about our Real-world CX Monitoring and Managed Load Testing Services.