GPTBot traffic grew 305% in a single year. Cloudflare's 2025 crawler report put a number on what most site owners felt but couldn't prove: AI crawlers are hitting your site harder than ever. The question is whether they can actually read what they find.
Here's the problem. 83% of sites that mention GPTBot in their robots.txt are fully blocking it. If you copied a "protect your content" robots.txt snippet from a blog post in 2024, there's a real chance you're invisible to ChatGPT right now. And you'd never know unless you checked.
This checklist covers the three files that control your AI visibility: robots.txt, llms.txt, and your sitemap. Copy-paste configs included. Total time: about 30 minutes.
Every AI Crawler You Need to Know
Before touching any config file, you need to understand who's crawling and why. Not all AI bots are the same. Some crawl for search results, others for model training, and a few do both depending on the day. The part that trips people up: the user-agent your robots.txt targets only affects one of those modes.
| Crawler | Company | Purpose | robots.txt Respected? |
|---|---|---|---|
| GPTBot | OpenAI | Training + search retrieval | Yes |
| ChatGPT-User | OpenAI | Live browsing (user-initiated) | No (acts like a browser) |
| ClaudeBot | Anthropic | Training + search retrieval | Yes |
| PerplexityBot | Perplexity | Search retrieval | Yes |
| GoogleOther | Google | AI features, Gemini | Yes |
| Google-Extended | Google | Training only | Yes |
| Bytespider | ByteDance | Training (TikTok/Doubao) | Sometimes |
| CCBot | Common Crawl | Open training dataset | Yes |
One thing I should've mentioned earlier: ChatGPT-User and Claude-User don't follow robots.txt because they're acting as browsers on behalf of a human user. Blocking GPTBot in robots.txt stops OpenAI's crawler from indexing your content for search, but it won't stop a ChatGPT user from browsing to your page in a conversation. Two different systems, two different user-agents.
robots.txt: What to Allow and What to Block
The safe default for most SaaS sites is: allow search-facing crawlers, block training-only crawlers. Here's the robots.txt snippet:
# Allow AI search crawlers
User-agent: GPTBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: GoogleOther
Allow: /
# Block training-only crawlers
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Bytespider
Disallow: /
Three mistakes that silently kill your AI visibility:
- A blanket `Disallow: /` under `User-agent: *` that you forgot about. This blocks every crawler that doesn't have its own specific rules, including newer AI bots you haven't heard of yet.
- Blocking GPTBot because a 2024 blog told you to. This was reasonable advice when AI training was the main concern. In 2026, GPTBot also powers ChatGPT's search feature. Blocking it means ChatGPT can't retrieve your pages when users search.
- Missing `GoogleOther` entirely. Google uses `GoogleOther` for AI features like Gemini and AI Overviews. If your robots.txt only mentions `Googlebot`, you're covered for traditional search but invisible to Google's AI products.
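The first mistake is easy to reproduce locally. Here's a minimal sketch using Python's standard-library `urllib.robotparser`. The sample rules and the `SomeNewAIBot` name are made up for illustration, and real crawlers may resolve edge cases differently from this parser, so treat the result as a sanity check rather than proof.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: two named groups plus a forgotten blanket block.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Disallow: /
"""


def crawler_access(robots_txt, agents, url="https://yoursite.com/"):
    """Return {agent: bool} saying whether each agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


access = crawler_access(
    SAMPLE_ROBOTS_TXT, ["GPTBot", "Google-Extended", "SomeNewAIBot"]
)
for agent, allowed in access.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

GPTBot keeps access because it has its own group. The unnamed bot falls through to the `User-agent: *` group and gets blocked, which is exactly the silent failure described above.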
Cloudflare data shows 76.2% of AI crawler traffic is for training, 18.5% for search, and 4.1% for user-initiated browsing. The selective approach above keeps you visible in that 18.5% search slice while opting out of the 76.2% training portion.
What llms.txt Actually Does (And What It Doesn't)
Let me back up for a second. There's a heated debate in the SEO community about whether llms.txt does anything at all. Multiple practitioners have run controlled tests and found no measurable impact on AI citations. SE Ranking published a full analysis titled "Why Brands Rely On It and Why It Doesn't Work." Reddit threads are even harsher.
Here's the honest take: no major AI search engine has publicly confirmed that they read llms.txt for ranking or citation decisions. Not OpenAI, not Anthropic, not Perplexity, not Google.
That said, llms.txt serves a different purpose than most people think. It was never designed to be "robots.txt for AI." Search Engine Land called it a "treasure map," which is closer. It's a structured summary of your site that AI systems can consume quickly. If an AI agent reads your domain, llms.txt gives it a starting point.
Here's a minimal llms.txt file:
# YourCompany
> Short one-line description of what your company does.
## Docs
- [Getting Started](https://yoursite.com/docs/getting-started): Setup guide for new users
- [API Reference](https://yoursite.com/docs/api): Complete API documentation
- [Pricing](https://yoursite.com/pricing): Plans and pricing details
## Blog
- [Most Important Post](https://yoursite.com/blog/key-post): Description of the post
Place it at yoursite.com/llms.txt. Keep it under 50 lines. Only include your highest-value pages. This is a curated list, not a sitemap dump.
One common criticism we've seen: most llms.txt files just list URLs with zero context. If you're going to create one, write the descriptions in plain English. Explain what each page does and who it's for. A list of bare URLs is barely more useful than a sitemap.
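If you want to catch bare URLs before you deploy, a small lint script helps. This is a sketch under assumptions: the `lint_llms_txt` name is made up, the regex only understands the `- [Title](url): description` link style used in the example above, and the 50-line limit is this article's suggestion rather than part of any spec.

```python
import re

# Matches "- [Title](url)" with an optional ": description" tail.
LINK_LINE = re.compile(
    r"^-\s*\[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)\s*(?::\s*(?P<desc>.*))?$"
)


def lint_llms_txt(text, max_lines=50):
    """Return a list of human-readable problems found in an llms.txt body."""
    problems = []
    lines = text.splitlines()
    if len(lines) > max_lines:
        problems.append(f"{len(lines)} lines; suggested maximum is {max_lines}")
    if not lines or not lines[0].startswith("# "):
        problems.append("first line is not an H1 title")
    for number, line in enumerate(lines, start=1):
        match = LINK_LINE.match(line.strip())
        # A link with a missing or empty description is a bare URL.
        if match and not (match.group("desc") or "").strip():
            problems.append(
                f"line {number}: '{match.group('title')}' has no description"
            )
    return problems


SAMPLE = """\
# YourCompany
> Short one-line description of what your company does.
## Docs
- [Getting Started](https://yoursite.com/docs/getting-started): Setup guide for new users
- [Pricing](https://yoursite.com/pricing)
"""

problems = lint_llms_txt(SAMPLE)
print(problems)
```

An empty list means every link carries a description. It says nothing about whether the descriptions are any good; that part still needs a human.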

The Complete 2026 AI Visibility Checklist
Here's every technical configuration that affects whether AI search engines can find, read, and cite your content. We track these across our customer base and the pattern is clear: sites that nail all seven see 3-4x more AI citations than sites that only handle two or three.
- robots.txt allows GPTBot, ClaudeBot, PerplexityBot, GoogleOther. Check for a blanket `Disallow: /` under `User-agent: *` that catches any crawler without its own group. Test with a robots.txt validator (the robots.txt report in Google Search Console is one option).
- No `X-Robots-Tag: noindex` on key pages. HTTP headers can block AI crawlers even when robots.txt is clean. Curl your important pages and check the response headers.
- llms.txt at domain root. Include your 10-20 most important pages with clear descriptions. Keep it curated. Update quarterly.
- Sitemap.xml submitted and current. AI crawlers use sitemaps to discover pages. If your sitemap is stale or broken, new content won't get found. Check that `lastmod` dates are accurate.
- FAQ schema on key pages. `FAQPage` structured data in JSON-LD format. AI models parse structured data faster than unstructured prose. Our guide to getting cited in 48 hours walks through the implementation.
- Content starts with direct answers. AI models pull citations from the first 30% of page content. If your opening paragraph is brand storytelling instead of answering the user's question, you won't get cited.
- Server response time under 3 seconds. AI crawlers have timeout limits. Slow pages get abandoned mid-crawl. If you're running server-side rendering with heavy database queries, your technical pages might be timing out for bots even when they load fine for humans.
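The sitemap item is the easiest one to automate. The sketch below flags `lastmod` values that are missing, older than a threshold, or dated in the future. A script can't tell whether a date is accurate, only whether it looks suspicious. The `audit_lastmod` name, the 365-day threshold, and the sample URLs are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import date, timedelta

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def audit_lastmod(sitemap_xml, today, max_age_days=365):
    """Return (missing, stale, future) lists of URLs based on <lastmod>."""
    missing, stale, future = [], [], []
    root = ET.fromstring(sitemap_xml)
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        if not lastmod:
            missing.append(loc)
            continue
        # lastmod is a date or a full datetime; the first 10 characters are the date.
        # A malformed value raises ValueError, which is worth seeing rather than hiding.
        modified = date.fromisoformat(lastmod[:10])
        if modified > today:
            future.append(loc)
        elif today - modified > timedelta(days=max_age_days):
            stale.append(loc)
    return missing, stale, future


SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://yoursite.com/</loc><lastmod>2026-01-10</lastmod></url>
  <url><loc>https://yoursite.com/pricing</loc><lastmod>2023-02-01T09:00:00+00:00</lastmod></url>
  <url><loc>https://yoursite.com/blog/key-post</loc></url>
</urlset>
"""

missing, stale, future = audit_lastmod(SAMPLE_SITEMAP, today=date(2026, 2, 1))
print("missing:", missing)
print("stale:", stale)
print("future:", future)
```

Passing `today` in explicitly keeps the check reproducible. In a real run you would pass `date.today()`.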
Worth noting: this checklist intersects with traditional SEO and GEO strategy. Getting the crawl layer right is necessary but not sufficient. Content quality and topical authority still drive most citation decisions. For the deeper technical layer on how AI agents perceive your product, see our guide on making your SaaS discoverable by AI agents.
How to Verify Your Setup
Configuring these files takes 30 minutes. Verifying they work takes another 15. Don't skip this step.
Check robots.txt rules for each AI crawler:
curl -s https://yoursite.com/robots.txt | grep -A 2 "GPTBot"
curl -s https://yoursite.com/robots.txt | grep -A 2 "ClaudeBot"
curl -s https://yoursite.com/robots.txt | grep -A 2 "PerplexityBot"
Test that pages are accessible with AI crawler user-agents:
curl -s -o /dev/null -w "%{http_code}" \
-H "User-Agent: Mozilla/5.0 (compatible; GPTBot/1.0)" \
https://yoursite.com/your-important-page
A 200 response means the page is accessible. A 403 or 429 means your server or CDN is blocking AI bots at the infrastructure level, even if your robots.txt says "Allow."
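A 200 on its own doesn't rule out a `noindex` header, so it helps to look at the status and the headers together. Here's a rough Python equivalent of the curl check, using only the standard library. The function names are made up, and the user-agent string is the same simplified placeholder as in the curl command, not the exact string a real crawler sends.

```python
import urllib.error
import urllib.request

# Simplified placeholder user-agent, mirroring the curl example.
GPTBOT_UA = "Mozilla/5.0 (compatible; GPTBot/1.0)"


def fetch_status_and_headers(url, user_agent=GPTBOT_UA, timeout=10):
    """Fetch url with the given User-Agent; return (status, {header: value})."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, dict(response.headers.items())
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses raise, but still carry a status code and headers.
        return error.code, dict(error.headers.items())


def interpret(status, headers):
    """Turn a status code and headers into a list of findings."""
    findings = []
    if status in (403, 429):
        findings.append(
            f"{status}: likely blocked or rate-limited before reaching the page"
        )
    elif status != 200:
        findings.append(f"{status}: unexpected status, check redirects and errors")
    robots_header = next(
        (value for name, value in headers.items() if name.lower() == "x-robots-tag"),
        "",
    )
    if "noindex" in robots_header.lower():
        findings.append(f"X-Robots-Tag contains noindex: {robots_header}")
    return findings
```

To use it, call `fetch_status_and_headers("https://yoursite.com/your-important-page")` and pass the two results to `interpret`. An empty list means neither check found a problem.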
Check server logs for AI bot traffic:
Search your access logs for these user-agent strings: GPTBot, ClaudeBot, PerplexityBot, Bytespider. If you're getting zero hits from any of these over a 30-day window, either your robots.txt is blocking them or they haven't discovered your site yet.
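Here's a small sketch of that log search. It counts lines that mention each bot name and lists the bots with zero hits. The sample log lines are invented, and matching on the user-agent substring can't tell a real crawler from a spoofed one, so read the counts as a rough signal.

```python
from collections import Counter

AI_BOTS = ("GPTBot", "ClaudeBot", "PerplexityBot", "Bytespider")


def count_ai_bot_hits(log_lines, bots=AI_BOTS):
    """Count access-log lines that mention each bot name (case-insensitive)."""
    hits = Counter({bot: 0 for bot in bots})
    for line in log_lines:
        lowered = line.lower()
        for bot in bots:
            if bot.lower() in lowered:
                hits[bot] += 1
    return hits


# Invented sample lines in a common access-log layout.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Feb/2026:10:00:00 +0000] "GET /pricing HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [01/Feb/2026:10:05:00 +0000] "GET /docs HTTP/1.1" 403 0 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '198.51.100.7 - - [01/Feb/2026:10:06:00 +0000] "GET / HTTP/1.1" 200 900 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]

hits = count_ai_bot_hits(SAMPLE_LOG)
silent = [bot for bot in AI_BOTS if hits[bot] == 0]
print(dict(hits))
print("no hits from:", silent)
```

For a real log, pass an open file instead of the sample list, for example `count_ai_bot_hits(open("access.log"))`, where the path is whatever your server uses.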
Verify llms.txt is accessible:
curl -s -o /dev/null -w "%{http_code}" https://yoursite.com/llms.txt
If that returns 404, your file isn't deployed. Check your hosting platform's static file configuration.
The CDN and WAF Gotcha
This catches people more than you'd expect. Your robots.txt looks perfect. Your llms.txt is deployed. But AI crawlers still can't read your site because Cloudflare's Bot Fight Mode, AWS WAF, or Vercel's bot protection is blocking them at the infrastructure layer before the request even reaches your server.
Check your CDN's bot management settings. Cloudflare's Bot Fight Mode is aggressive by default and can block legitimate AI crawlers. If you're seeing 403 responses when you curl with AI user-agents but 200 responses with a normal browser user-agent, this is almost certainly the problem.
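That comparison rule is simple enough to write down. The sketch below takes two status codes, one from a request with an AI user-agent and one from a request with a browser user-agent, however you collected them (the curl commands earlier work fine). The function name and the wording of the verdicts are my own.

```python
def diagnose_bot_block(bot_status, browser_status):
    """Compare status codes from an AI-bot request and a browser-like request."""
    blocked = (403, 429)
    if bot_status in blocked and browser_status == 200:
        return "bot-specific block: check CDN/WAF bot management"
    if bot_status in blocked and browser_status in blocked:
        return "blocked for everyone: not specific to AI bots"
    if bot_status == 200 and browser_status == 200:
        return "no infrastructure-level block detected"
    return "inconclusive: inspect the responses manually"


print(diagnose_bot_block(403, 200))
print(diagnose_bot_block(200, 200))
```

A clean result from your own machine isn't conclusive. Some bot protection also keys on IP ranges, so a spoofed user-agent from your laptop may be treated differently from the real crawler.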
The fix varies by provider. On Cloudflare, add WAF custom rules to allow specific AI bot user-agents. On Vercel, check your vercel.json for bot blocking headers. On AWS, update your WAF rules to whitelist GPTBot, ClaudeBot, and PerplexityBot user-agent strings.
Honestly, this one cost us two weeks of debugging on a client site last quarter. Everything looked right in robots.txt. The issue was three layers deeper in the infrastructure stack.
The Part Nobody Talks About: Monitoring
Here's the real challenge with AI crawler access. You set it up once. It works. Then six months later, a developer pushes a new robots.txt rule, your CDN vendor changes their bot protection defaults, or a WordPress plugin update overwrites your config. Your AI visibility drops to zero and nobody notices for weeks.
Configuring AI crawl access is a one-time job. Knowing when it breaks is the actual hard part.
Our AI visibility tracking runs these checks continuously. Every page, every week. When a robots.txt change blocks a crawler or a server starts returning 403s to AI bots, you get an alert before your citations disappear.
You can do all of this monitoring manually. Curl your robots.txt once a week, check server logs for bot traffic, manually query ChatGPT for your brand name. Total time: 2-3 hours per week if you're thorough. Or RankControl's agents handle it automatically while you focus on building your product.
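If you go the manual route, the robots.txt part is the easiest to script. The sketch below compares two versions of a robots.txt and reports which watched crawlers gained or lost access. It assumes you save a copy each week yourself. The function names and sample files are invented, and Python's `urllib.robotparser` may not match every crawler's parsing exactly.

```python
from urllib.robotparser import RobotFileParser

WATCHED_AGENTS = ("GPTBot", "ClaudeBot", "PerplexityBot", "GoogleOther")


def _access(robots_txt, agents, url):
    """Return {agent: bool} for one robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


def access_changes(old_robots_txt, new_robots_txt, agents=WATCHED_AGENTS,
                   url="https://yoursite.com/"):
    """Return {agent: (was_allowed, now_allowed)} for agents whose access changed."""
    before = _access(old_robots_txt, agents, url)
    after = _access(new_robots_txt, agents, url)
    return {a: (before[a], after[a]) for a in agents if before[a] != after[a]}


# Invented example: last week's file versus one after a careless edit.
OLD = "User-agent: *\nAllow: /\n"
NEW = "User-agent: GPTBot\nAllow: /\n\nUser-agent: *\nDisallow: /\n"

changes = access_changes(OLD, NEW)
for agent, (was, now) in changes.items():
    print(f"{agent}: {'allowed' if was else 'blocked'} -> {'allowed' if now else 'blocked'}")
```

Run it from a weekly cron job and alert on any non-empty result. It won't catch CDN or WAF changes, which is why the status-code checks still matter.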