AI Preferences Protocol – like robots.txt, but for controlling AI scraping and content use?

written by Gagan Ghotra


Yesterday I posted a transcript of an interview with Googler Gary Illyes. In that interview he was asked about publishers who don’t want their content used for AI training, and whether Google should provide site owners with more granular controls over how AI systems use their content.

Gary talked about an IETF (Internet Engineering Task Force) working group that is working on an AI Preferences protocol. The goals of this group are:

  1. a common vocabulary to express authors’ and publishers’ preferences regarding use of their content for AI training and related tasks, and
  2. means of attaching that vocabulary to content on the Internet, either by embedding it in the content or by formats similar to robots.txt, and a standard mechanism to reconcile multiple expressions of preferences. 

And in terms of technical details, the group’s goal is to deliver:

  • A standard track document covering vocabulary for expressing AI-related preferences, independent of how those preferences are associated with content.
  • Standard track document(s) describing means of attaching or associating those preferences with content in IETF-defined protocols and formats, including but not limited to using Well-Known URIs (RFC 8615) such as the Robots Exclusion Protocol (RFC 9309), and HTTP response header fields.
  • A standard method for reconciling multiple expressions of preferences.
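To make the third deliverable concrete, here is a minimal sketch of what reconciling multiple preference expressions could look like. The vocabulary labels (`train-ai`, `search`) and the precedence rule (a per-response expression overriding a site-wide one) are my assumptions for illustration only — the working group hasn’t finalized any of this.

```python
# Purely illustrative sketch: the AIPREF vocabulary is still being drafted,
# so the labels below ("train-ai", "search") and the precedence rule are
# assumptions, not the final spec.

def parse_prefs(s):
    """Parse a comma-separated preference string like 'train-ai=n, search=y'."""
    prefs = {}
    for part in s.split(","):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition("=")
        prefs[key.strip()] = value.strip()
    return prefs

def reconcile(*expressions):
    """Merge multiple preference expressions; later (more specific)
    sources override earlier (more general) ones."""
    merged = {}
    for expr in expressions:
        merged.update(parse_prefs(expr))
    return merged

# Hypothetical example: a site-wide default from a well-known URI,
# overridden by a per-response HTTP header field.
site_wide = "train-ai=n, search=y"
per_page = "train-ai=y"
print(reconcile(site_wide, per_page))  # {'train-ai': 'y', 'search': 'y'}
```

Whatever the final vocabulary looks like, the standard will need exactly this kind of deterministic merge rule so that a preference embedded in content and one declared site-wide can’t contradict each other ambiguously.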

Googler John Mueller attended the most recent meeting of this group, and representatives from both Google and Bing were at the IETF 121 meeting in Dublin back in November last year to discuss ideas around improving the Robots Exclusion Protocol, adding new AI controls, and a few other things too.

I think it’s nice that Googlers are participating in this! Let’s see what comes out of it, and hopefully Google as a company will comply with it just like they do with robots.txt, even though it’s only a voluntary protocol and there’s no legal requirement to follow it.

I also think site owners are better off waiting for the official update from the IETF (expected by the end of August, which is about three weeks from when I’m writing this) and seeing how the big tech companies respond (comply or not), rather than implementing an llms.txt file — which most AI companies are ignoring nowadays anyway.

SEOs have been talking about llms.txt for a while now and recommending that site owners set it up just like robots.txt, but Google has confirmed that they don’t use llms.txt and aren’t planning to! It’s still kind of weird that llms.txt keeps getting described as “a treasure map for AI” and recommended. Practical Ecommerce even said “Llms.txt Could Help AI Find Your Store,” which is just wrong: there is no supporting evidence that llms.txt helps AI search companies find products in a store more easily, or that having one boosts the visibility of products in AI answers.

From my personal experience working with site owners, I’ve noticed that no AI companies are respecting llms.txt when scraping.

That said, scraping bots from different AI companies are still trying to fetch the llms.txt file, as Ray Martinez found in his test experiment. But again, there’s no clear evidence whether the rules in the file are being respected or not.
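If you want to check this on your own site, a quick pass over your server access logs is enough to see which user agents are requesting llms.txt. A minimal sketch, assuming the common Apache/Nginx “combined” log format (the sample log lines and the bot name are made up for illustration):

```python
import re
from collections import Counter

# Minimal sketch, assuming the Apache/Nginx "combined" access log format.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+)[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def llms_txt_fetchers(log_lines):
    """Count user agents that requested /llms.txt."""
    counts = Counter()
    for line in log_lines:
        m = LOG_LINE.search(line)
        if m and m.group("path").rstrip("/").endswith("/llms.txt"):
            counts[m.group("agent")] += 1
    return counts

# Hypothetical log lines — "ExampleAI-Bot" is a placeholder, not a real crawler.
sample = [
    '1.2.3.4 - - [01/Aug/2025:10:00:00 +0000] "GET /llms.txt HTTP/1.1" 200 512 "-" "ExampleAI-Bot/1.0"',
    '1.2.3.5 - - [01/Aug/2025:10:00:01 +0000] "GET /index.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]
print(llms_txt_fetchers(sample))  # Counter({'ExampleAI-Bot/1.0': 1})
```

This only tells you who is *fetching* the file — as noted above, a fetch says nothing about whether the rules inside it are actually being honored.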

So should you set up llms.txt? Sure, if you want to! But there isn’t any clear evidence that AI companies follow the rules in the file while scraping. And as I mentioned above, only once the AI Preferences protocol information is out from the IETF (Internet Engineering Task Force) will we get some clarity on how to control which content on a site AI companies can use — whether for training or at inference time while generating an answer.
