Complete reference: AI bot user-agents in 2026
Each AI company deploys multiple bots with distinct roles. Here is the complete reference of user-agent strings to know:
| User-agent | Company | Role | Impact if blocked |
|---|---|---|---|
| GPTBot | OpenAI | Model training | Exclusion from future OpenAI corpora |
| OAI-SearchBot | OpenAI | ChatGPT Search (retrieval) | Not cited in ChatGPT Search |
| ChatGPT-User | OpenAI | ChatGPT browsing (plugins) | No ChatGPT browsing on your site |
| PerplexityBot | Perplexity | Perplexity indexation + retrieval | Not cited in Perplexity |
| Perplexity-User | Perplexity | Perplexity user queries | Reduced Perplexity visibility |
| ClaudeBot | Anthropic | Claude training + retrieval | Exclusion from Anthropic corpus |
| Claude-Web | Anthropic | Claude web browsing | No Claude browsing on your site |
| anthropic-ai | Anthropic | Generic Anthropic crawler | Exclusion from Anthropic corpus |
| Google-Extended | Google | Gemini training | Exclusion from Gemini corpus (not SERPs) |
| Applebot-Extended | Apple | Apple Intelligence training | Exclusion from Apple Intelligence corpus |
| CCBot | Common Crawl | Open source corpus (used by many LLMs) | Exclusion from many open-source LLM corpora |
| cohere-ai | Cohere | Cohere model training | Exclusion from Cohere corpus |
| meta-externalagent | Meta | Llama / Meta AI training | Exclusion from Meta corpus |
| Bytespider | ByteDance | ByteDance model training | Exclusion from ByteDance corpus |
The 4 standard robots.txt configurations
Configuration 1 - Allow everything (maximum visibility strategy)
No specific directives for AI bots: they follow the general rules of your robots.txt. Recommended if your objective is maximum visibility across all LLMs and AI engines.
User-agent: *
Disallow:
# Sitemap
Sitemap: https://yoursite.com/sitemap.xml
Configuration 2 - Block training, allow retrieval
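Block training bots (GPTBot, Google-Extended, CCBot, cohere-ai, meta-externalagent, Bytespider, anthropic-ai, Applebot-Extended) while allowing real-time retrieval bots (OAI-SearchBot, ChatGPT-User, PerplexityBot, Perplexity-User). You keep visibility in ChatGPT Search and Perplexity without feeding training corpora. ClaudeBot is also left allowed in this configuration; because the table above lists it as serving both training and retrieval, disallow it as well if keeping content out of training matters more to you than Claude visibility.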
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: meta-externalagent
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Applebot-Extended
Disallow: /
# Retrieval allowed
User-agent: OAI-SearchBot
Disallow:
User-agent: ChatGPT-User
Disallow:
User-agent: PerplexityBot
Disallow:
User-agent: Perplexity-User
Disallow:
User-agent: ClaudeBot
Disallow:
User-agent: *
Disallow:
Sitemap: https://yoursite.com/sitemap.xml
Configuration 3 - Block everything (defensive strategy)
Block all known AI bots. To use only if you have strong legal or commercial reasons (proprietary content, copyright, direct competition with LLMs). Impact: near-absence from LLM and AI engine responses.
User-agent: GPTBot
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Perplexity-User
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: meta-externalagent
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: *
Disallow:
Sitemap: https://yoursite.com/sitemap.xml
Configuration 4 - Selective folder blocking
Allow general crawling but block specific sections (paid content, proprietary data, archives). Useful for media outlets and SaaS with a public and a private part.
User-agent: GPTBot
Disallow: /premium-content/
Disallow: /proprietary-data/
Disallow: /app/
User-agent: *
Disallow:
Sitemap: https://yoursite.com/sitemap.xml
Common configuration pitfalls
Pitfall 1 - Confusing GPTBot and OAI-SearchBot
This is the most common error. A site that blocks GPTBot thinking it is blocking ChatGPT Search has only blocked OpenAI training. OAI-SearchBot continues to crawl freely. Verify that your rules target the right user-agents for your actual objectives.
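One way to catch this mismatch is to write down the intended outcome per bot and compare it with what a parser actually concludes. The sketch below uses Python's standard `urllib.robotparser`; the `ROBOTS_TXT` sample, the `EXPECTED` mapping, and the `mismatches` helper are illustrative names, not part of any official tool, and real crawlers may read edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Sample file (hypothetical): blocks OpenAI training only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow:
"""

# Intended outcome per bot: True = may crawl, False = blocked.
# Here the owner wrongly assumes the file also blocks ChatGPT Search.
EXPECTED = {"GPTBot": False, "OAI-SearchBot": False}


def mismatches(robots_txt, expected, url="https://yoursite.com/"):
    """Return {agent: actual_verdict} for bots whose verdict differs from the intent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    result = {}
    for agent, wanted in expected.items():
        actual = parser.can_fetch(agent, url)
        if actual != wanted:
            result[agent] = actual
    return result


if __name__ == "__main__":
    for agent, actual in mismatches(ROBOTS_TXT, EXPECTED).items():
        print(f"{agent}: intended allowed={EXPECTED[agent]}, actual allowed={actual}")
```

With this sample, only OAI-SearchBot is reported: the GPTBot group does not apply to it, so it falls through to the permissive `*` group.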
Pitfall 2 - Rule order in robots.txt
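A crawler that follows RFC 9309 obeys the group whose User-agent line matches its own name and falls back to the * group only when no named group matches; the position of the groups in the file does not change this. A User-agent: * Disallow: / group therefore blocks every bot that has no named group, wherever it appears in the file. Some simpler parsers stop at the first group that matches, however, so listing specific groups before the * group remains the safest layout.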
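Whether group order changes the outcome depends on the parser, so it is worth checking rather than assuming. A minimal sketch with Python's standard `urllib.robotparser` (one parser among many; `verdict` and the two sample files are illustrative names) evaluates the same rules in both orders:

```python
from urllib.robotparser import RobotFileParser

# Same two groups, written in opposite orders (hypothetical samples).
STAR_FIRST = """\
User-agent: *
Disallow: /

User-agent: OAI-SearchBot
Disallow:
"""

SPECIFIC_FIRST = """\
User-agent: OAI-SearchBot
Disallow:

User-agent: *
Disallow: /
"""


def verdict(robots_txt, agent, url="https://yoursite.com/page"):
    """True if this parser lets `agent` fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for name, text in (("* first", STAR_FIRST), ("specific first", SPECIFIC_FIRST)):
        print(name, {a: verdict(text, a) for a in ("OAI-SearchBot", "SomeOtherBot")})
```

With this parser both layouts give the same answer: the named group applies to OAI-SearchBot and the `*` group to every other bot. A crawler with a simpler parser could still behave differently, which is the case the specific-first convention guards against.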
Pitfall 3 - Case sensitivity in user-agents
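RFC 9309 requires crawlers to match the user-agent token case-insensitively, so a compliant bot treats GPTBot and gptbot as the same name. Not every parser is compliant, however, so write each token in the official case published by the company (reference in the table above). Paths in Disallow and Allow rules, by contrast, are matched case-sensitively.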
Pitfall 4 - Forgetting Crawl-delay for aggressive bots
Crawl-delay is a non-standard directive (it is not part of RFC 9309), so support varies from bot to bot. Some less well-behaved bots (notably CCBot and Bytespider) may ignore it. For bots that respect it, a value of 10 to 30 seconds reduces server load without blocking the crawl. For bots that ignore it, a WAF rule (Cloudflare) keyed on the user-agent is more effective.
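As an illustration of the syntax, the sketch below declares a Crawl-delay for one bot and reads it back with Python's standard `urllib.robotparser`. `ExampleBot`, the 10-second value, and the `delay_for` helper are placeholders chosen for the example; whether a real crawler honors the directive is up to that crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: one named group with a Crawl-delay, plus an open * group.
ROBOTS_TXT = """\
User-agent: ExampleBot
Crawl-delay: 10
Disallow: /app/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def delay_for(agent):
    """Crawl-delay in seconds declared for this agent, or None if none applies."""
    return parser.crawl_delay(agent)


if __name__ == "__main__":
    for agent in ("ExampleBot", "GPTBot"):
        print(agent, delay_for(agent))
```

The delay is read from the group that applies to the agent, so a bot without its own group only gets a delay if the `*` group declares one.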
Pitfall 5 - Not updating robots.txt after new bots
New AI bots appear regularly. In 2025, Amazon Alexa AI, Grok (xAI), and several open-source LLM crawlers were deployed. Check and update your robots.txt quarterly by consulting official announcements from major AI companies.
Verifying and testing your configuration
Test via curl
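Send each bot's user-agent to confirm that your server or WAF returns the expected robots.txt to it (curl only fetches the file; it does not evaluate the rules):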
# Test as GPTBot
curl -A "Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.2; +https://openai.com/gptbot)" https://yoursite.com/robots.txt
# Test as PerplexityBot
curl -A "PerplexityBot/1.0" https://yoursite.com/robots.txt
Test via Google Search Console
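The robots.txt report in GSC (Settings > robots.txt) shows the robots.txt files Google found for your site, when each was last fetched, and any warnings or errors raised while parsing. It reflects how Google reads the file; it does not test other user-agents, so check AI bots with curl and server logs.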
Server log monitoring
Nginx/Apache/Cloudflare logs show requests from each bot with their actual user-agent.
Filter with grep -i "gptbot\|oai-searchbot\|perplexitybot" to see their activity.
This is also the method to detect bots that ignore your robots.txt.
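The same filter can be turned into a per-bot count. The sketch below rests on a few assumptions: log lines carry the user-agent somewhere in the line (as in the common combined log format), a subset of the bot names from the table is matched case-insensitively, and the first match on a line decides which bot it is counted under. `AI_BOTS`, `count_bot_hits`, and the sample lines are illustrative.

```python
import re
from collections import Counter

# Subset of the tokens from the reference table; extend as needed.
AI_BOTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "PerplexityBot",
    "Perplexity-User", "ClaudeBot", "CCBot", "Bytespider", "meta-externalagent",
]

# Case-insensitive alternation, equivalent in spirit to `grep -i "a\|b\|c"`.
BOT_PATTERN = re.compile("|".join(re.escape(b) for b in AI_BOTS), re.IGNORECASE)
CANONICAL = {b.lower(): b for b in AI_BOTS}


def count_bot_hits(lines):
    """Count log lines per AI bot. Matches anywhere in the line, so a URL
    that happens to contain a bot name would also be counted."""
    hits = Counter()
    for line in lines:
        match = BOT_PATTERN.search(line)
        if match:
            hits[CANONICAL[match.group(0).lower()]] += 1
    return hits


# Made-up sample lines in combined log format (documentation IP ranges).
SAMPLE = [
    '203.0.113.5 - - [01/Jan/2026:00:00:01 +0000] "GET /robots.txt HTTP/1.1" 200 512 "-" "Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '203.0.113.5 - - [01/Jan/2026:00:00:09 +0000] "GET /blog/ HTTP/1.1" 200 20480 "-" "Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '198.51.100.7 - - [01/Jan/2026:00:01:30 +0000] "GET /pricing HTTP/1.1" 200 8192 "-" "PerplexityBot/1.0"',
    '192.0.2.44 - - [01/Jan/2026:00:02:00 +0000] "GET / HTTP/1.1" 200 4096 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

if __name__ == "__main__":
    for bot, n in count_bot_hits(SAMPLE).most_common():
        print(bot, n)
```

In practice the lines would come from the access log file rather than a list, for example `count_bot_hits(open(path))` with `path` pointing at your own log. Comparing the requested paths against your Disallow rules is the natural next step for spotting bots that ignore them.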
FAQ - robots.txt and AI bots
- Does a Disallow on GPTBot block ChatGPT Search?
- No. GPTBot and OAI-SearchBot are two distinct bots. Blocking GPTBot leaves OAI-SearchBot free to crawl. You must target each bot separately according to your objectives.
- Is robots.txt the only way to block AI bots?
- No. The meta robots tag (noai, noimageai), the X-Robots-Tag header, and WAF/Cloudflare rules are alternatives. robots.txt remains the most universal and easiest signal to maintain.
- How do you verify that your robots.txt rules are being applied?
- Via GSC (robots.txt testing tool), via curl simulating the user-agent, and via server logs to confirm that bots respect your directives.
- Should you use a Crawl-delay for AI bots?
- Only if your server is under pressure. Well-configured bots (GPTBot, PerplexityBot) respect 429 and Retry-After. Note: Googlebot ignores Crawl-delay; use GSC parameters to regulate it.
robots.txt AI bots checklist (7 points)
- The robots.txt configuration matches your strategy (maximum visibility, training blocked with retrieval allowed, defensive, or selective folder blocking).
- GPTBot and OAI-SearchBot have separate rules if your objectives differ.
- User-agent strings are in the correct case (GPTBot, OAI-SearchBot, PerplexityBot).
- Specific rules precede the generic User-agent: * rule.
- The file has been tested via GSC and/or curl for each relevant bot.
- Server logs are configured to monitor AI bot activity.
- A quarterly review is planned to incorporate new AI bots.