GPTBot Poking Around the Web – How Will It Work?
New info appeared on OpenAI’s website, in the ChatGPT documentation section, about GPTBot, a crawler that will live-scan the Internet, exactly as Googlebot or crawlers from other tools (such as Ahrefs) currently do. The information gathered from the sites will potentially be used to improve OpenAI’s AI models in the future.
OpenAI claims that granting its crawler free access to websites will help create better language models in the future. However, some larger and more savvy site owners may well block GPTBot, for instance for fear of losing the uniqueness of the content on their webpages.
What Sites Will GPTBot Not Reach?
GPTBot is also supposed to filter out sites that use a paywall, which means they won’t be scanned. This is quite different from Googlebot: even if you have your content behind a paywall (which is mostly true for press publishers), you still want Googlebot to have access to that paid content so that it can index it and display it on Google. OpenAI apparently wants to avoid accusations of intellectual property infringement, so it doesn’t want to crawl content from behind a paywall. (All in all, rightly so; we can easily imagine the avalanche of problems this would cause for OpenAI.)
Sites that collect personal information (e.g., social media) or those that contain text that violates OpenAI’s standards will also not be crawled.
How to Modify Robots.txt File for the GPTBot?
It’s not difficult. GPTBot’s access to a site can be blocked or moderated in exactly the same way as Googlebot’s, i.e. with a robots.txt file.
To block GPTBot’s access to the entire site, type:
User-agent: GPTBot
Disallow: /
To customize its access instead, for example so that GPTBot can enter one directory but not another, type:
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
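If you want to check how rules like these are interpreted before publishing them, Python’s standard urllib.robotparser module can parse a robots.txt body and tell you whether a given user agent may fetch a given URL. The snippet below is only a minimal sketch: the helper name is_allowed, the constants BLOCK_ALL and PARTIAL, and the example.com URLs are made up for illustration, and Python’s parser is not necessarily identical to the way GPTBot itself reads robots.txt.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Hypothetical helper: True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The two rule sets from the article (constant names are illustrative).
BLOCK_ALL = """\
User-agent: GPTBot
Disallow: /
"""

PARTIAL = """\
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
"""

if __name__ == "__main__":
    # example.com is a placeholder domain, not a real test target.
    for name, rules in (("BLOCK_ALL", BLOCK_ALL), ("PARTIAL", PARTIAL)):
        for path in ("/directory-1/page", "/directory-2/page", "/other/page"):
            verdict = is_allowed(rules, "GPTBot", "https://example.com" + path)
            print(name, path, "allowed" if verdict else "blocked")
```

Note that with the second rule set, paths outside both directories are still allowed, because anything not matched by a Disallow line is permitted by default. Crawlers that are not named in the file, such as Googlebot, are unaffected by either rule set.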
So, What’s Next?
It’s a good question. We wonder what resources GPTBot will have to crawl the entire Internet. If ChatGPT becomes even more popular, many sites will want its responses to be based on their content; we just wonder whether ChatGPT will cite sources, as Google and Bard do.