
robots.txt and AI bots: complete configuration guide 2026

The proliferation of AI bots has made robots.txt configuration more complex. Each crawler has its own user-agent string, sometimes multiple. This guide lists all major AI bots with their exact user-agents, common use cases, and ready-to-copy robots.txt file examples.

Updated 22 April 2026 · 12 min read

Complete reference: AI bot user-agents in 2026

Most AI companies deploy several bots with distinct roles. Here is the reference of user-agent tokens to know:

User-agent | Company | Role | Impact if blocked
GPTBot | OpenAI | Model training | Exclusion from future OpenAI corpora
OAI-SearchBot | OpenAI | ChatGPT Search (retrieval) | Not cited in ChatGPT Search
ChatGPT-User | OpenAI | User-initiated ChatGPT browsing | No ChatGPT browsing on your site
PerplexityBot | Perplexity | Perplexity indexation + retrieval | Not cited in Perplexity
Perplexity-User | Perplexity | Perplexity user queries | Reduced Perplexity visibility
ClaudeBot | Anthropic | Claude training + retrieval | Exclusion from Anthropic corpus
Claude-Web | Anthropic | Claude web browsing | No Claude browsing on your site
anthropic-ai | Anthropic | Generic Anthropic crawler | Exclusion from Anthropic corpus
Google-Extended | Google | Gemini training | Exclusion from Gemini corpus (not SERPs)
Applebot-Extended | Apple | Apple Intelligence training | Exclusion from Apple Intelligence corpus
CCBot | Common Crawl | Open web corpus (used by many LLMs) | Exclusion from many open-source LLM corpora
cohere-ai | Cohere | Cohere model training | Exclusion from Cohere corpus
meta-externalagent | Meta | Llama / Meta AI training | Exclusion from Meta corpus
Bytespider | ByteDance | ByteDance model training | Exclusion from ByteDance corpus

The 4 standard robots.txt configurations

Configuration 1 - Allow everything (maximum visibility strategy)

No specific directives for AI bots: they follow the general rules of your robots.txt. Recommended if your objective is maximum visibility across all LLMs and AI engines.

User-agent: *
Disallow:

# Sitemap
Sitemap: https://yoursite.com/sitemap.xml

Configuration 2 - Block training, allow retrieval

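Block training bots (GPTBot, Google-Extended, CCBot, cohere-ai, meta-externalagent, Bytespider, anthropic-ai, Applebot-Extended) while allowing real-time retrieval bots (OAI-SearchBot, ChatGPT-User, PerplexityBot, Perplexity-User). You keep visibility in ChatGPT Search and Perplexity without feeding training corpora. Note that ClaudeBot is left allowed in this example even though the reference table lists it as covering both training and retrieval; change its rule to Disallow: / if excluding training data matters more to you than visibility in Claude.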

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Applebot-Extended
Disallow: /

# Retrieval allowed
User-agent: OAI-SearchBot
Disallow:

User-agent: ChatGPT-User
Disallow:

User-agent: PerplexityBot
Disallow:

User-agent: Perplexity-User
Disallow:

User-agent: ClaudeBot
Disallow:

User-agent: *
Disallow:

Sitemap: https://yoursite.com/sitemap.xml
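
Maintaining one group per bot by hand becomes error-prone as the list grows. The following is a minimal sketch, not a definitive tool, of generating the file from two lists. The names TRAINING_BOTS, RETRIEVAL_BOTS and build_robots_txt are hypothetical and introduced only for illustration; the bot lists mirror Configuration 2 above.

```python
# Hypothetical policy lists mirroring Configuration 2 above.
TRAINING_BOTS = [
    "GPTBot", "Google-Extended", "CCBot", "cohere-ai",
    "meta-externalagent", "Bytespider", "anthropic-ai", "Applebot-Extended",
]
RETRIEVAL_BOTS = [
    "OAI-SearchBot", "ChatGPT-User", "PerplexityBot",
    "Perplexity-User", "ClaudeBot",
]


def build_robots_txt(blocked, allowed, sitemap_url):
    """Return robots.txt text with one group per bot, then the * group."""
    overlap = set(blocked) & set(allowed)
    if overlap:
        # A bot listed twice would get two contradictory groups.
        raise ValueError(f"bots listed as both blocked and allowed: {sorted(overlap)}")

    groups = [f"User-agent: {agent}\nDisallow: /" for agent in blocked]
    groups += [f"User-agent: {agent}\nDisallow:" for agent in allowed]
    groups.append("User-agent: *\nDisallow:")
    groups.append(f"Sitemap: {sitemap_url}")
    # Blank lines between groups keep the file readable.
    return "\n\n".join(groups) + "\n"


robots_txt = build_robots_txt(
    TRAINING_BOTS, RETRIEVAL_BOTS, "https://yoursite.com/sitemap.xml"
)
print(robots_txt)
```

Keeping the policy in two lists makes the quarterly review a one-line change per new bot, and the overlap check catches a bot accidentally placed in both categories.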

Configuration 3 - Block everything (defensive strategy)

Block all known AI bots. Use this only if you have strong legal or commercial reasons (proprietary content, copyright, direct competition with LLMs). Impact: near-absence from LLM and AI engine responses.

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: *
Disallow:

Sitemap: https://yoursite.com/sitemap.xml

Configuration 4 - Selective folder blocking

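Allow general crawling but block specific sections (paid content, proprietary data, archives). Useful for media outlets and SaaS products with a public and a private part. The example restricts GPTBot only; to apply the same paths to other bots, list several User-agent lines above the same Disallow rules or repeat the group for each bot. Keep in mind that robots.txt is advisory: truly private areas such as /app/ still need authentication.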

User-agent: GPTBot
Disallow: /premium-content/
Disallow: /proprietary-data/
Disallow: /app/

User-agent: *
Disallow:

Sitemap: https://yoursite.com/sitemap.xml
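
To extend the same folder rules to several bots, one group can carry multiple User-agent lines. The sketch below checks such a group with Python's standard urllib.robotparser. The choice of bots in the group and the test paths are assumptions for illustration, and this parser is only an approximation of how each real crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical variant of Configuration 4: one group shared by three bots.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: CCBot
User-agent: meta-externalagent
Disallow: /premium-content/
Disallow: /proprietary-data/
Disallow: /app/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """Return True if `agent` may fetch `path` according to ROBOTS_TXT."""
    return parser.can_fetch(agent, "https://yoursite.com" + path)


for agent in ("GPTBot", "CCBot", "PerplexityBot"):
    for path in ("/blog/post", "/premium-content/report"):
        print(f"{agent:15} {path:25} allowed={allowed(agent, path)}")
```

Bots named in the shared group are kept out of the listed folders, while a bot that is not named falls through to the * group and can still reach them.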

Common configuration pitfalls

Pitfall 1 - Confusing GPTBot and OAI-SearchBot

This is the most common error. A site that blocks GPTBot thinking it is blocking ChatGPT Search has only blocked OpenAI training. OAI-SearchBot continues to crawl freely. Verify that your rules target the right user-agents for your actual objectives.

Pitfall 2 - Rule order in robots.txt

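Compliant parsers do not simply apply the first matching rule in the file. Under RFC 9309, a crawler uses the group whose User-agent line matches it most specifically, wherever that group sits, and falls back to the User-agent: * group only when no specific group matches. Two consequences follow. A bot with its own group ignores the * group entirely, so any Disallow you want it to obey must be repeated in its group. And within a group, the most specific (longest) matching path wins, not the first one listed. Placing specific groups before the * group remains a sensible convention: it keeps the file readable and protects you against simplistic parsers that read top to bottom.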
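
Group selection can be checked locally. The sketch below uses Python's standard urllib.robotparser, which, like RFC 9309, picks a bot's own group before the * group regardless of where each appears in the file. The two sample files and the helper name can_fetch are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Same two groups, opposite order (hypothetical sample files).
WILDCARD_FIRST = """\
User-agent: *
Disallow: /

User-agent: OAI-SearchBot
Disallow:
"""

SPECIFIC_FIRST = """\
User-agent: OAI-SearchBot
Disallow:

User-agent: *
Disallow: /
"""


def can_fetch(robots_txt, agent, url="https://yoursite.com/page"):
    """Parse `robots_txt` and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


for name, text in (("wildcard first", WILDCARD_FIRST), ("specific first", SPECIFIC_FIRST)):
    for agent in ("OAI-SearchBot", "GPTBot"):
        print(f"{name:15} {agent:14} allowed={can_fetch(text, agent)}")
```

Both orderings give the same answers: OAI-SearchBot is allowed because it has its own group, and GPTBot, which has none, falls back to the * group and is blocked.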

Pitfall 3 - Case sensitivity in user-agents

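User-agent tokens in robots.txt are matched case-insensitively (RFC 9309): GPTBot and gptbot select the same group. Paths are the case-sensitive part: Disallow: /Premium/ does not cover /premium/. Using the official casing published by each company (reference in the table above) is still good practice for readability and for parsers that do not follow the standard, but check the case of your paths first.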
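
The difference between user-agent matching and path matching can be observed with Python's standard urllib.robotparser. This is a small sketch with a hypothetical rule set; other parsers may differ in detail.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: lowercase user-agent token, mixed-case path.
ROBOTS_TXT = """\
User-agent: gptbot
Disallow: /Private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The user-agent token is compared without regard to case,
# so "GPTBot" selects the "gptbot" group.
same_case_path = parser.can_fetch("GPTBot", "https://yoursite.com/Private/report")

# The path comparison is case-sensitive, so /private/ is not covered
# by the rule written for /Private/.
other_case_path = parser.can_fetch("GPTBot", "https://yoursite.com/private/report")

print("GPTBot may fetch /Private/report:", same_case_path)
print("GPTBot may fetch /private/report:", other_case_path)
```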

Pitfall 4 - Forgetting Crawl-delay for aggressive bots

Crawl-delay is a non-standard directive (it is not part of RFC 9309), so support varies by crawler. Some bots reported as less well-behaved (Bytespider is a frequent example) may ignore it. For bots that respect it, a value of 10 to 30 seconds reduces server load without blocking the crawl. For bots that ignore it, a WAF rule (Cloudflare, for example) keyed on the user-agent is more effective.
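
Crawl-delay is set per group, so it applies only to the bots named in that group. As a rough illustration, Python's standard urllib.robotparser exposes the value through crawl_delay(). The 20-second value and the choice of bot are assumptions, and reading the value locally says nothing about whether a given crawler honors it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: a delay for one bot, none for everyone else.
ROBOTS_TXT = """\
User-agent: CCBot
Crawl-delay: 20
Disallow:

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The delay declared in the bot's own group, in seconds.
ccbot_delay = parser.crawl_delay("CCBot")

# A bot without its own group uses the * group, which declares no delay.
gptbot_delay = parser.crawl_delay("GPTBot")

print("CCBot delay:", ccbot_delay)
print("GPTBot delay:", gptbot_delay)
```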

Pitfall 5 - Not updating robots.txt after new bots

New AI bots appear regularly. In 2025, Amazon Alexa AI, Grok (xAI), and several open-source LLM crawlers were deployed. Check and update your robots.txt quarterly by consulting official announcements from major AI companies.

Verifying and testing your configuration

Test via curl

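Request your robots.txt with each bot's user-agent to confirm that the server, CDN, or WAF actually serves the file to that bot (a 200 response with the expected content, not a 403 or a challenge page). The file is the same for every client, so this check does not show how the rules are interpreted: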

# Test as GPTBot
curl -A "Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.2; +https://openai.com/gptbot)" https://yoursite.com/robots.txt

# Test as PerplexityBot
curl -A "PerplexityBot/1.0" https://yoursite.com/robots.txt

Test via Google Search Console

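The robots.txt report in GSC (Settings > robots.txt) shows the robots.txt files Google found for your property, when each was last fetched, and any parsing errors or warnings. It replaced the legacy robots.txt Tester in late 2023 and does not evaluate arbitrary user-agents, so use it to confirm that the file is reachable and syntactically valid, and check individual AI user-agents with a local parser or a third-party tester.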

Server log monitoring

Nginx/Apache/Cloudflare logs show requests from each bot with their actual user-agent. Filter with grep -i "gptbot\|oai-searchbot\|perplexitybot" to see their activity. This is also the method to detect bots that ignore your robots.txt.
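
For a per-bot breakdown rather than a raw list of matching lines, the same filter can be expressed as a short script. This is a sketch under assumptions: the token list is a subset of the table above, the sample log lines are invented for illustration, and, like grep, it searches the whole line rather than only the user-agent field.

```python
import re
from collections import Counter

# Subset of the user-agent tokens from the reference table.
AI_BOT_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "PerplexityBot",
    "Perplexity-User", "ClaudeBot", "CCBot", "Bytespider",
    "meta-externalagent",
]
BOT_PATTERN = re.compile(
    "|".join(re.escape(token) for token in AI_BOT_TOKENS), re.IGNORECASE
)
CANONICAL = {token.lower(): token for token in AI_BOT_TOKENS}


def count_bot_hits(log_lines):
    """Count log lines per AI bot, using the first token found on each line."""
    hits = Counter()
    for line in log_lines:
        match = BOT_PATTERN.search(line)
        if match:
            hits[CANONICAL[match.group(0).lower()]] += 1
    return hits


# Invented access-log lines in combined log format, for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [22/Apr/2026:10:00:00 +0000] "GET /blog/ HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '203.0.113.9 - - [22/Apr/2026:10:00:02 +0000] "GET /robots.txt HTTP/1.1" 200 310 "-" '
    '"PerplexityBot/1.0"',
    '198.51.100.7 - - [22/Apr/2026:10:00:05 +0000] "GET /premium-content/a HTTP/1.1" 200 900 "-" '
    '"Mozilla/5.0 (compatible; gptbot/1.2)"',
    '192.0.2.44 - - [22/Apr/2026:10:00:09 +0000] "GET / HTTP/1.1" 200 100 "-" '
    '"Mozilla/5.0 (Windows NT 10.0) Firefox/125.0"',
]

hits = count_bot_hits(SAMPLE_LOG)
for bot, count in hits.most_common():
    print(f"{bot}: {count}")
```

In practice the list passed to count_bot_hits would be an open log file. A bot that keeps requesting paths you have disallowed is a candidate for a WAF rule, since the user-agent header alone can also be spoofed.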

FAQ - robots.txt and AI bots

Does a Disallow on GPTBot block ChatGPT Search?
No. GPTBot and OAI-SearchBot are two distinct bots. Blocking GPTBot leaves OAI-SearchBot free to crawl. You must target each bot separately according to your objectives.
Is robots.txt the only way to block AI bots?
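No. The X-Robots-Tag header, the meta robots tag, and WAF/Cloudflare rules are alternatives. The noai and noimageai meta values are non-standard and honored by few crawlers, so treat them as a complementary signal. robots.txt remains the most widely recognized signal and the easiest to maintain, but compliance is voluntary; only a WAF rule actually enforces a block.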
How do you verify that your robots.txt rules are being applied?
Via the GSC robots.txt report (fetch status and parsing errors), via curl with each bot's user-agent to confirm the file is served, and via server logs to confirm that bots respect your directives.
Should you use a Crawl-delay for AI bots?
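Only if your server is under pressure, and only as a hint, since Crawl-delay is non-standard. Well-behaved bots (GPTBot, PerplexityBot) are expected to respect 429 and Retry-After. Note: Googlebot ignores Crawl-delay, and the Search Console crawl-rate setting was retired in January 2024; to slow Googlebot, temporarily return 429 or 503 responses.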

robots.txt AI bots checklist (7 points)

  1. The robots.txt configuration matches your strategy (maximum visibility, block training only, defensive, or selective folder blocking).
  2. GPTBot and OAI-SearchBot have separate rules if your objectives differ.
  3. User-agent tokens are spelled exactly as published (GPTBot, OAI-SearchBot, PerplexityBot) and Disallow paths use the correct case.
  4. Each bot with its own group carries every rule it needs, and specific groups precede the generic User-agent: * group for readability.
  5. The file has been checked via the GSC robots.txt report and/or curl for each relevant bot.
  6. Server logs are configured to monitor AI bot activity.
  7. A quarterly review is planned to incorporate new AI bots.