TUTORIAL · 9 MIN READ

robots.txt for AI Crawlers: GPTBot, ClaudeBot & the Rest

The exact directives, rate-limit patterns, and verification steps for keeping the right AI bots in and the wrong ones out.

Published April 2026 · By The CiteGEO Editorial Team

By default, half the AI ecosystem can't read your site. Most ship-it-and-forget-it robots.txt files were written for Googlebot in 2014, and they accidentally block — or accidentally over-allow — the LLM crawlers that decide whether your brand shows up in answers.

This tutorial gives you the exact rules to copy in. We'll walk through every major AI crawler, the three sensible recipes for handling them, the rate-limiting pattern that prevents abuse, and the verification step most teams skip.

TL;DR

Allow GPTBot, ClaudeBot, PerplexityBot, Google-Extended, and CCBot on your public, indexable content. They're the five that actually drive AI visibility.
Disallow them on sensitive paths like /admin, /api, /checkout, and authenticated routes — same as you would Googlebot.
Rate-limit at the WAF or CDN layer, not in robots.txt — bots ignore the Crawl-delay directive in practice.
Verify via server logs. Look for the User-Agent strings each crawler announces. If you don't see GPTBot in your logs after 30 days, something's blocking it upstream.

The Stakes

Blocking an AI crawler doesn't just mean you don't get cited — it means the model has no fresh information about you. Whatever the training data captured last is what the model knows, indefinitely. Six months later, when the next training cycle completes without your domain, your brand fades from the citation set entirely.

Conversely, the AI crawlers respect robots.txt better than most adversarial bots do. The major LLM crawlers — OpenAI, Anthropic, Google, Perplexity — all publicly commit to honoring your directives. If you don't want them on a specific path, they really won't go there. Use that.

The Crawlers, By Name

Here are the User-Agent strings to address in robots.txt and what each crawler is for:

User-Agent	Operated by	Purpose
`GPTBot`	OpenAI	Training + RAG for ChatGPT
`OAI-SearchBot`	OpenAI	Real-time search for ChatGPT
`ClaudeBot`	Anthropic	Training + RAG for Claude
`Claude-Web`	Anthropic	Claude's web-browse tool
`Google-Extended`	Google	Gemini training (separate from Googlebot)
`PerplexityBot`	Perplexity	Real-time retrieval for Perplexity
`Perplexity-User`	Perplexity	User-initiated fetches
`CCBot`	Common Crawl	Upstream training data for many LLMs
`cohere-ai`	Cohere	Enterprise LLM training
`Bytespider`	ByteDance	Training for Doubao

A few notes on this list. Google-Extended is the one most teams miss. It is a separate robots.txt token from Googlebot, and rules written for Googlebot do not apply to it, so you must address the two separately. It is also a control token only: Google fetches pages with its existing crawlers, so the name Google-Extended never appears as a User-Agent in your logs. Similarly, OpenAI's OAI-SearchBot is distinct from GPTBot; one fetches for real-time search, the other for training and RAG. If you only allow GPTBot you'll still be invisible in ChatGPT's live search results.

Groq doesn't operate its own crawler — it draws from upstream model training and retrieves through the application layer. So there's no Groq-specific robots.txt rule needed, but the upstream models it serves rely on the bots above.

Recipe: Allow All AI Bots

The simplest and most common recipe: let everyone in, blocking only the sensitive paths you'd block from any crawler.

# robots.txt — allow all major AI crawlers
User-agent: GPTBot
Allow: /
Disallow: /admin
Disallow: /api
Disallow: /checkout

User-agent: OAI-SearchBot
Allow: /
Disallow: /admin
Disallow: /api
Disallow: /checkout

User-agent: ClaudeBot
Allow: /
Disallow: /admin
Disallow: /api
Disallow: /checkout

User-agent: Claude-Web
Allow: /
Disallow: /admin
Disallow: /api
Disallow: /checkout

User-agent: Google-Extended
Allow: /
Disallow: /admin
Disallow: /api
Disallow: /checkout

User-agent: PerplexityBot
Allow: /
Disallow: /admin
Disallow: /api
Disallow: /checkout

User-agent: Perplexity-User
Allow: /
Disallow: /admin
Disallow: /api
Disallow: /checkout

User-agent: CCBot
Allow: /
Disallow: /admin
Disallow: /api
Disallow: /checkout

Sitemap: https://example.com/sitemap.xml

This is the right default for any public-facing marketing site. You explicitly opt in each crawler, you keep them out of admin and purchase flows, and you point them at your sitemap.
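Because the recipe repeats the same group for every crawler, it can be easier to generate than to hand-copy. The following is a minimal sketch, not a required tool: `build_robots_txt`, `AI_BOTS`, and `BLOCKED_PATHS` are hypothetical names introduced here, and the sketch assumes every listed bot gets the same blocked paths.

```python
# Sketch: generate the allow-all recipe from two lists.
# AI_BOTS and BLOCKED_PATHS are illustrative; adjust them to your site.
AI_BOTS = [
    "GPTBot", "OAI-SearchBot", "ClaudeBot", "Claude-Web",
    "Google-Extended", "PerplexityBot", "Perplexity-User", "CCBot",
]
BLOCKED_PATHS = ["/admin", "/api", "/checkout"]


def build_robots_txt(bots, blocked_paths, sitemap_url):
    """Return robots.txt text with one identical group per bot, then the sitemap."""
    groups = []
    for bot in bots:
        lines = [f"User-agent: {bot}", "Allow: /"]
        lines += [f"Disallow: {path}" for path in blocked_paths]
        groups.append("\n".join(lines))
    groups.append(f"Sitemap: {sitemap_url}")
    # Blank line between groups, single trailing newline at the end of the file.
    return "\n\n".join(groups) + "\n"
```

Calling `build_robots_txt(AI_BOTS, BLOCKED_PATHS, "https://example.com/sitemap.xml")` returns text you can write to your web root. Adding a path or a crawler then means changing one list entry rather than eight groups.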

Recipe: Selective Allow

If you have legal or compliance reasons to be selective (e.g. some teams choose not to allow training but do allow live retrieval), the split looks like this:

# Allow live retrieval; deny training-only crawlers
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

This is a real and increasingly common configuration in regulated industries. The tradeoff: you stay reachable for real-time citations, but your content slowly fades from the underlying model knowledge as new training cycles run without you. Most teams should not do this unless they have a specific compliance driver.

Recipe: Block All AI Bots

Sometimes you really don't want to be in the answer layer. Pre-launch products, regulated documents, internal-only content published mistakenly to the public web. The full block:

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: Bytespider
Disallow: /

Important caveat: this only stops well-behaved crawlers. Adversarial scrapers that ignore robots.txt still hit you. For genuinely sensitive content, use authentication, not robots.txt.

Rate-Limiting Pattern

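robots.txt has a Crawl-delay directive, but it is a non-standard extension that is not part of RFC 9309, and crawler support for it is inconsistent. Many crawlers apply their own internal rate limits and ignore the per-site override, so don't rely on it.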

If you have a real rate-limit problem (you'll know — your server logs will tell you), do it at the WAF or CDN layer:

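Cloudflare: use a rate-limiting rule scoped to requests whose User-Agent contains one of the AI crawler names from the table above (a bare match on “Bot” also catches unrelated crawlers). Set a generous threshold, say 60 requests per minute per IP; anything stricter starts blocking legitimate retrievals.
AWS WAF: a RateBasedStatement with a scope-down statement that matches the AI User-Agents. Use a comparably generous limit; too aggressive costs you visibility.
Nginx / direct: limit_req_zone with a rate=60r/m zone. Keying the zone on $http_user_agent gives every request that shares a User-Agent string one common bucket; key on $binary_remote_addr if you want the limit per IP.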

For most sites this is over-engineering. The total request volume from all AI crawlers combined is usually a fraction of normal user traffic. Don't pre-emptively rate-limit unless you have an actual problem.
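To make the "60 requests per minute per IP" threshold concrete, here is a minimal sliding-window sketch of what such a rule does. It only illustrates the logic; it does not replace a WAF or CDN rule. `SlidingWindowLimiter` is a hypothetical class, and the caller supplies timestamps so the behavior is easy to reason about.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window_seconds` for each key (e.g. a client IP)."""

    def __init__(self, limit=60, window_seconds=60.0):
        self.limit = limit
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # key -> timestamps of accepted requests

    def allow(self, key, now):
        """Return True and record the request if `key` is under its limit at time `now`."""
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # over the limit: a WAF would block or challenge here
        hits.append(now)
        return True
```

With the defaults, the 61st request from one IP inside a minute is rejected while other IPs are unaffected, which is the behavior the Cloudflare and AWS WAF bullets describe.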

How to Verify Bots Are Crawling

Three weeks after pushing your robots.txt change, check that the bots actually showed up. The fastest way is grep on your access logs:

# Sample: count requests per AI crawler in access.log
# (point it at 7 days of logs; Google-Extended never appears as a User-Agent)
grep -oE '(GPTBot|OAI-SearchBot|ClaudeBot|PerplexityBot|CCBot)' access.log \
  | sort | uniq -c | sort -rn

You want to see each crawler with a non-trivial count: at least dozens of hits per week for any reasonably sized site. Zero hits for a specific crawler means either (1) something upstream is blocking it (CDN rule, WAF, IP block), or (2) your robots.txt still has a stale disallow you forgot about. The one exception is Google-Extended, which is a robots.txt token rather than a request User-Agent, so it never shows up in logs.
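If you would rather run the same count from a script (for example inside a scheduled report), the pipeline above can be sketched in Python. `count_crawler_hits` is a hypothetical helper, and the token list is an assumption you should extend to match the crawlers you care about.

```python
import re
from collections import Counter

# Tokens the crawlers put in their User-Agent header. Google-Extended is left
# out on purpose: it is a robots.txt token only and is not sent in requests.
CRAWLER_PATTERN = re.compile(r"GPTBot|OAI-SearchBot|ClaudeBot|PerplexityBot|CCBot")


def count_crawler_hits(log_lines):
    """Count access-log lines per crawler token (first token found on each line)."""
    counts = Counter()
    for line in log_lines:
        match = CRAWLER_PATTERN.search(line)
        if match:
            counts[match.group(0)] += 1
    return counts
```

A typical call is `count_crawler_hits(open("access.log", encoding="utf-8", errors="replace"))`, followed by `.most_common()` to get the same descending order as `sort -rn`.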

For verification beyond log inspection: several crawler operators publish the IP ranges they crawl from. You can check the source IP of a request claiming to be GPTBot against OpenAI's published ranges, or run a reverse-DNS lookup where the operator supports it, to confirm the request really comes from OpenAI. CiteGEO does this verification on every audit; if you'd rather not script it yourself, a free account gets you the full crawler-access readiness check.
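The IP check itself is a few lines with Python's standard `ipaddress` module. This is a sketch under stated assumptions: `is_verified_crawler` is a hypothetical helper, and the ranges below are placeholders from the reserved documentation blocks, not real crawler ranges. Load the list each operator currently publishes instead.

```python
import ipaddress

# Placeholder CIDR blocks (reserved documentation ranges), NOT real crawler
# ranges. Replace them with the list the operator currently publishes.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24", "198.51.100.0/25"],
}


def is_verified_crawler(claimed_bot, remote_ip, published_ranges=PUBLISHED_RANGES):
    """Return True if remote_ip falls inside a range published for claimed_bot."""
    ip = ipaddress.ip_address(remote_ip)
    networks = (
        ipaddress.ip_network(cidr)
        for cidr in published_ranges.get(claimed_bot, [])
    )
    return any(ip in network for network in networks)
```

A request whose User-Agent says GPTBot but whose source IP fails this check is most likely a scraper borrowing the name, and it is safe to treat it like any other unknown client.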

robots.txt vs llms.txt

They're complementary, not redundant. robots.txt tells crawlers where they can and can't go. llms.txt tells them what your site is about and which pages matter most for their use case. Most teams should ship both — start with robots.txt because it's ubiquitously supported, add llms.txt once the basics are in place.

If you only ship one in the next 30 minutes, ship robots.txt. The cost of getting it wrong is invisibility; the cost of getting it right is zero.

Common Pitfalls

Inheriting a `User-agent: *` block

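A blanket User-agent: * with Disallow: / (sometimes left over from a staging environment) blocks every crawler that does not have its own named group. Under RFC 9309 a crawler obeys the group that names it, wherever that group sits in the file, so the bot-specific groups above do override the wildcard regardless of order. The trap is the crawlers you didn't name: cohere-ai, Bytespider, or any new bot falls back to the wildcard and is locked out. Name every crawler you care about, or remove the leftover wildcard.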
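Group selection as described in RFC 9309 can be sketched in a few lines: a crawler uses the group that names its token (matched case-insensitively) and falls back to `*` only when no group names it. `select_group` is a hypothetical helper, and the sketch simplifies by assuming each token appears in at most one group.

```python
def select_group(groups, crawler_token):
    """Pick the rule group a crawler should obey.

    `groups` maps a user-agent token, as written in robots.txt, to its rules.
    A group naming the crawler wins; "*" is only a fallback. Matching is
    case-insensitive and does not depend on the order of groups in the file.
    """
    wanted = crawler_token.lower()
    for token, rules in groups.items():
        if token != "*" and token.lower() == wanted:
            return rules
    # No named group: use the wildcard group, or no rules at all.
    return groups.get("*", [])
```

For a file with a leftover `*` group that disallows everything plus a GPTBot group that allows `/`, this returns the allow rules for GPTBot and the blanket disallow for every unnamed crawler.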

Case sensitivity

Crawlers match the user-agent token case-insensitively, but some validators don't. Stick to the exact casing each operator publishes (as shown in the table above) and you're safe everywhere.

Path syntax mismatch

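Disallow: /admin is a prefix match: it blocks /admin and everything below it, but also /administrator and /admin-login. Disallow: /admin/ blocks only the contents of the directory, and Disallow: /admin$ blocks only the exact path. Most teams want the first, but a small minority want one of the others, so be deliberate.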
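The matching rules can be checked with a small evaluator. This is a minimal sketch of the RFC 9309 behavior (prefix match, `*` wildcard, trailing `$` anchor, longest pattern wins, ties go to Allow), not a full parser; `pattern_to_regex` and `is_allowed` are hypothetical helpers.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern: '*' is a wildcard, a trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(regex + (r"\Z" if anchored else ""))


def is_allowed(rules, path):
    """Decide access for `path` given (directive, pattern) pairs from one group.

    The longest matching pattern wins, ties go to Allow, and a path that
    matches no rule is allowed.
    """
    best = None  # (pattern length, is_allow)
    for directive, pattern in rules:
        # re.match anchors at the start of the path, which gives prefix matching.
        if pattern and pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), directive.lower() == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]
```

With the rules `[("Allow", "/"), ("Disallow", "/admin")]`, the paths `/admin`, `/admin/users`, and `/administrator` are all blocked while `/blog` is allowed. Swapping in `("Disallow", "/admin$")` blocks only `/admin` itself.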

Forgetting Google-Extended

The most common “why am I invisible in Gemini” root cause. Google-Extended is its own robots.txt token. Googlebot rules don't apply to it.

Hosting a `robots.txt` behind auth

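A surprisingly common mistake: the staging-environment robots.txt got promoted to production behind a basic auth header. Crawlers get a 401 instead of your rules. Under RFC 9309 a 4xx response on robots.txt is treated as if no file exists, so compliant crawlers may fetch anything, including the paths you meant to block, and behavior outside the standard varies. Either way your directives are not being applied. Your robots.txt must be reachable unauthenticated on the public web.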

Ship the right robots.txt today, verify in three weeks, then move on. The leverage is enormous and the work is one-and-done.