Robots.txt is an open standard specifically intended to communicate access rules. Thus, while an open web is averse to centralization and proprietary technologies, that does not necessarily mean a porous web. The open web does not necessarily come without financial cost to human users. I see no reason the same principle should not be applied to robots, too.
Therein lies the problem. Site authors can use open standards to restrict access to their content, but the approach for restricting incoming traffic from AI bots has the unintended effect of restricting access to human beings who use AI to navigate the open web. Remember, AI is another tool to surface content. It may be misused/abused in practice, but the philosophical drift of what we know as the open web should allow it.
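To make that tension concrete, here is a minimal robots.txt sketch of the blunt instrument available to site authors today. The bot names are just illustrative examples of crawlers that identify themselves; compliance is voluntary, and a rule like this cannot tell a training crawler apart from a person asking an AI assistant to fetch a page on their behalf.

```
# Illustrative example: block self-identified AI crawlers, allow everything else
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# All other crawlers remain welcome
User-agent: *
Allow: /
```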
It’s a convergence of concerns: What is an “open” web that restricts access by tools that extract the content site owners create, maintain, and publish, for use in proprietary services and platforms that are effectively walled gardens?
And if you’re thinking that scraping open content is inherently wrong (there’s good reason for that), it’s worth mentioning that the Internet Archive itself is a giant scraper, albeit one used for the noble purpose of archiving and preserving the web, which is constantly changing and evolving.
Websites like 404 Media have explicitly cited A.I. scraping as the reason for imposing a login wall. A cynical person might view this as a convenient excuse to collect ever-important email addresses and, while I cannot disprove that, it is still a barrier to entry. Then there are the unintended consequences of trying to impose limits on scraping. After Reddit announced it would block the Internet Archive, probably to comply with some kind of exclusivity expectations in its agreements with Google and OpenAI, it implied the Archive does not pass along the robots.txt rules of the sites in its collection. If a website administrator truly does not want the material on their site to be used for A.I. training, they would need to prevent the Internet Archive from scraping as well — and that would be a horrible consequence.
This is the first time I’ve heard of the Really Simple Licensing (RSL) standard, which debuted yesterday:
One thing that might help, not suggested by Masnick, is improving the controls available to publishers. Today marked the launch of the Really Simple Licensing standard offering publishers a way to define machine-readable licenses. These can be applied site-wide, sure, but also at a per-page level. It is up to A.I. companies to adhere to the terms but with an exception — there are ways to permit access to encrypted material.
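I have not dug into the spec yet, so the snippet below is only a hypothetical sketch of what a machine-readable, per-page license could look like; the element names are placeholders of my own, not actual RSL markup.

```xml
<!-- Hypothetical sketch only: placeholder element names, not the real RSL schema -->
<license page="https://example.com/essays/open-web">
  <permits>search-indexing, archiving</permits>
  <prohibits>ai-training</prohibits>
  <compensation type="per-crawl" currency="USD">0.01</compensation>
  <attribution required="true"/>
</license>
```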
Compensation and attribution is the nail that the RSL hammer appears to be hitting. Unfortunately, that does nothing to prevent a move towards what Heer explains is the web splitting in two:
I, too, am saddened by an increasingly walled-off web, whether through payment gates or the softer barriers of login or email subscriptions.
Walled gardens. We’ve been concerned about them forever, but most notably with the emergence of Facebook and its propensity to restrict access to shared content behind a login. The same is true even of publishing platforms like Medium. It’s a curated version of the web that feels a lot like the AOL pattern of yesteryear. The difference is that we’re talking about the entire corpus of the open web scraped, repurposed, and redistributed in a completely separate corner of some other web.