What is llms.txt and how does it differ from robots.txt?

llms.txt is a plain text file at your domain root that tells AI models which pages contain your most important content. robots.txt controls crawler access (allow/block). They serve different purposes: robots.txt is a gate, llms.txt is a map.

Which AI crawlers should I allow in robots.txt?

At minimum, allow GPTBot (OpenAI/ChatGPT), ClaudeBot (Anthropic/Claude), PerplexityBot (Perplexity), and GoogleOther (Google AI). Block only training-specific crawlers like Google-Extended or CCBot if you want to prevent training while keeping search visibility.

Does llms.txt actually improve AI search visibility?

llms.txt alone does not guarantee citations. No major AI search engine has confirmed reading it for ranking purposes. But it is part of a broader technical checklist that includes robots.txt access, structured data, and content formatting. Skip the whole stack and you are invisible.

How do I verify AI crawlers can access my site?

Fetch your robots.txt and check for Disallow rules under AI user-agents. Use curl to request pages with each bot's user-agent string. Check server logs for GPTBot, ClaudeBot, and PerplexityBot hits. If you see zero AI bot traffic, something is blocking them.

Should I block AI crawlers to protect my content?

Blocking all AI crawlers means opting out of AI search entirely. A better approach is selective blocking: allow search-facing crawlers like GPTBot and PerplexityBot while blocking training-only crawlers. This limits how much of your content ends up in training datasets while keeping you visible in AI answers.

llms.txt & robots.txt for AI Crawlers: 2026 Checklist

GPTBot traffic grew 305% in a single year. Cloudflare's 2025 crawler report put a number on what most site owners felt but couldn't prove: AI crawlers are hitting your site harder than ever. The question is whether they can actually read what they find.

Here's the problem. 83% of sites that mention GPTBot in their robots.txt are fully blocking it. If you copied a "protect your content" robots.txt snippet from a blog post in 2024, there's a real chance you're invisible to ChatGPT right now. And you'd never know unless you checked.

This checklist covers the three files that control your AI visibility: robots.txt, llms.txt, and your sitemap. Copy-paste configs included. Total time: about 30 minutes to configure, plus 15 to verify.

Every AI Crawler You Need to Know

Before touching any config file, you need to understand who's crawling and why. Not all AI bots are the same. Some crawl for search results, others for model training, and a few do both. The part that trips people up: a robots.txt rule targets one user-agent, so blocking one of a company's bots says nothing about its other bots.

Crawler	Company	Purpose	robots.txt Respected?
GPTBot	OpenAI	Training + search retrieval	Yes
ChatGPT-User	OpenAI	Live browsing (user-initiated)	No (acts like a browser)
ClaudeBot	Anthropic	Training + search retrieval	Yes
PerplexityBot	Perplexity	Search retrieval	Yes
GoogleOther	Google	AI features, Gemini	Yes
Google-Extended	Google	Training only	Yes
Bytespider	ByteDance	Training (TikTok/Doubao)	Sometimes
CCBot	Common Crawl	Open training dataset	Yes

One caveat on that table: ChatGPT-User and Claude-User don't follow robots.txt because they're acting as browsers on behalf of a human user. Blocking GPTBot in robots.txt stops OpenAI's crawler from indexing your content for search, but it won't stop a ChatGPT user from browsing to your page in a conversation. Two different systems, two different user-agents.

robots.txt: What to Allow and What to Block

The safe default for most SaaS sites is: allow search-facing crawlers, block training-only crawlers. Here's the robots.txt snippet:

# Allow AI search crawlers
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GoogleOther
Allow: /

# Block training-only crawlers
User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

Three mistakes that silently kill your AI visibility:

1. A blanket Disallow: / under User-agent: * that you forgot about. This blocks every crawler that doesn't have its own specific rules, including newer AI bots you haven't heard of yet.
2. Blocking GPTBot because a 2024 blog told you to. This was reasonable advice when AI training was the main concern. In 2026, GPTBot also powers ChatGPT's search feature. Blocking it means ChatGPT can't retrieve your pages when users search.
3. Carving out Googlebot but not GoogleOther. Google uses GoogleOther for AI features like Gemini and AI Overviews. A crawler with no group of its own falls back to the User-agent: * rules, so if those rules block everything and your only Google exception is Googlebot, you're covered for traditional search but invisible to Google's AI products.
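
The fallback behavior behind the first mistake can be checked locally. Here is a minimal sketch using Python's standard urllib.robotparser; the robots.txt content, the example.com URL, and the NewAIBot name are all hypothetical, and real crawlers may resolve edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: a forgotten blanket block plus one explicit AI group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: GPTBot
Allow: /
"""

def crawl_access(robots_txt, bots, url):
    """Return {bot: True/False} for whether each bot token may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Pass the bare product token (e.g. "GPTBot"), not a full User-Agent header.
    return {bot: parser.can_fetch(bot, url) for bot in bots}

# "NewAIBot" is a made-up name standing in for a crawler with no group of its own.
access = crawl_access(
    ROBOTS_TXT,
    ["GPTBot", "ClaudeBot", "NewAIBot"],
    "https://example.com/pricing",
)
for bot, allowed in access.items():
    print(f"{bot}: {'allowed' if allowed else 'blocked'}")
```

Only GPTBot has its own group here, so it is the only bot that escapes the blanket rule. ClaudeBot and the made-up NewAIBot both fall back to User-agent: * and are blocked.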

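Cloudflare data shows 76.2% of AI crawler traffic is for training, 18.5% for search, and 4.1% for user-initiated browsing. The selective approach above keeps you visible in that 18.5% search slice while opting out of the training-only crawlers. It doesn't opt you out of training entirely: per the table above, GPTBot and ClaudeBot serve both purposes, so allowing them for search also allows them for training.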

What llms.txt Actually Does (And What It Doesn't)

Let me back up for a second. There's a heated debate in the SEO community about whether llms.txt does anything at all. Multiple practitioners have run controlled tests and found no measurable impact on AI citations. SE Ranking published a full analysis titled "Why Brands Rely On It and Why It Doesn't Work." Reddit threads are even harsher.

Here's the honest take: no major AI search engine has publicly confirmed that they read llms.txt for ranking or citation decisions. Not OpenAI, not Anthropic, not Perplexity, not Google.

That said, llms.txt serves a different purpose than most people think. It was never designed to be "robots.txt for AI." Search Engine Land called it a "treasure map," which is closer. It's a structured summary of your site that AI systems can consume quickly. If an AI agent reads your domain, llms.txt gives it a starting point.

Here's a minimal llms.txt file:

# YourCompany

> Short one-line description of what your company does.

## Docs

- [Getting Started](https://yoursite.com/docs/getting-started): Setup guide for new users
- [API Reference](https://yoursite.com/docs/api): Complete API documentation
- [Pricing](https://yoursite.com/pricing): Plans and pricing details

## Blog

- [Most Important Post](https://yoursite.com/blog/key-post): Description of the post

Place it at yoursite.com/llms.txt. Keep it under 50 lines. Only include your highest-value pages. This is a curated list, not a sitemap dump.

One common criticism we've seen: most llms.txt files just list URLs with zero context. If you're going to create one, write the descriptions in plain English. Explain what each page does and who it's for. A list of bare URLs is barely more useful than a sitemap.
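
If you maintain the file by hand, descriptions tend to go missing over time. As a rough sketch, a small generator can enforce the two rules above: every link has a description, and the file stays under 50 lines. The Page class, the build_llms_txt function, and the yoursite.com URLs are illustrative names, not part of any llms.txt tooling.

```python
from dataclasses import dataclass

@dataclass
class Page:
    title: str
    url: str
    description: str  # plain-English summary of what the page does and who it's for

def build_llms_txt(name, summary, sections, line_limit=50):
    """Render an llms.txt body; sections maps a heading to a list of Page entries."""
    lines = [f"# {name}", "", f"> {summary}"]
    for heading, pages in sections.items():
        lines += ["", f"## {heading}", ""]
        for page in pages:
            # Reject bare URLs: a link without context adds little over a sitemap.
            if not page.description.strip():
                raise ValueError(f"missing description for {page.url}")
            lines.append(f"- [{page.title}]({page.url}): {page.description}")
    if len(lines) >= line_limit:
        raise ValueError(f"{len(lines)} lines; keep the file under {line_limit}")
    return "\n".join(lines) + "\n"

llms_txt = build_llms_txt(
    "YourCompany",
    "Short one-line description of what your company does.",
    {
        "Docs": [
            Page("Getting Started", "https://yoursite.com/docs/getting-started",
                 "Setup guide for new users"),
            Page("Pricing", "https://yoursite.com/pricing",
                 "Plans and pricing details"),
        ],
    },
)
print(llms_txt)
```

Failing loudly on an empty description is a deliberate choice here: it is easier to fix at build time than to notice later that the file has drifted into a list of bare links.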

The Complete 2026 AI Visibility Checklist

Here's every technical configuration that affects whether AI search engines can find, read, and cite your content. We track these across our customer base and the pattern is clear: sites that nail all seven see 3-4x more AI citations than sites that only handle two or three.

1. robots.txt allows GPTBot, ClaudeBot, PerplexityBot, GoogleOther. Check for a blanket Disallow: / under User-agent: * that catches every crawler without its own group. Verify with the robots.txt report in Google Search Console.
2. No X-Robots-Tag: noindex on key pages. This HTTP response header can keep a page out of an index even when robots.txt is clean. Curl your important pages and check the response headers.
3. llms.txt at domain root. Include your 10-20 most important pages with clear descriptions. Keep it curated. Update quarterly.
4. Sitemap.xml submitted and current. AI crawlers use sitemaps to discover pages. If your sitemap is stale or broken, new content won't get found. Check that lastmod dates are accurate.
5. FAQ schema on key pages. FAQPage structured data in JSON-LD format. AI models parse structured data faster than unstructured prose. Our guide to getting cited in 48 hours walks through the implementation.
6. Content starts with direct answers. AI models pull citations from the first 30% of page content. If your opening paragraph is brand storytelling instead of answering the user's question, you won't get cited.
7. Server response time under 3 seconds. AI crawlers have timeout limits. Slow pages get abandoned mid-crawl. If you're running server-side rendering with heavy database queries, your technical pages might be timing out for bots even when they load fine for humans.
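
For the FAQ schema item, here is a minimal sketch of what FAQPage JSON-LD looks like when generated from question and answer pairs. The faq_jsonld helper is a hypothetical name, and the sample answer is paraphrased from this article; the @type and property names follow the schema.org FAQPage vocabulary.

```python
import json

def faq_jsonld(pairs):
    """Build schema.org FAQPage JSON-LD from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }

faq = faq_jsonld([
    (
        "What is llms.txt and how does it differ from robots.txt?",
        "robots.txt is a gate that controls crawler access; "
        "llms.txt is a map of your most important pages.",
    ),
])

# The JSON goes inside a JSON-LD script tag in the page's HTML.
print('<script type="application/ld+json">')
print(json.dumps(faq, indent=2))
print("</script>")
```

Keep the question and answer text identical to what is visible on the page, so the structured data describes content a reader can actually see.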

Worth noting: this checklist intersects with traditional SEO and GEO strategy. Getting the crawl layer right is necessary but not sufficient. Content quality and topical authority still drive most citation decisions. For the deeper technical layer on how AI agents perceive your product, see our guide on making your SaaS discoverable by AI agents.

How to Verify Your Setup

Configuring these files takes 30 minutes. Verifying they work takes another 15. Don't skip this step.

Check robots.txt rules for each AI crawler:

# -i because user-agent names in robots.txt are matched case-insensitively
curl -s https://yoursite.com/robots.txt | grep -i -A 2 "GPTBot"
curl -s https://yoursite.com/robots.txt | grep -i -A 2 "ClaudeBot"
curl -s https://yoursite.com/robots.txt | grep -i -A 2 "PerplexityBot"

Test that pages are accessible with AI crawler user-agents:

curl -s -o /dev/null -w "%{http_code}" \
  -H "User-Agent: Mozilla/5.0 (compatible; GPTBot/1.0)" \
  https://yoursite.com/your-important-page

A 200 response means the page is accessible. A 403 or 429 means your server or CDN is blocking AI bots at the infrastructure level, even if your robots.txt says "Allow."

Check server logs for AI bot traffic:

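Search your access logs for these user-agent strings: GPTBot, ClaudeBot, PerplexityBot, Bytespider. If none of them appear over a 30-day window, either something is blocking them before the request is logged or they haven't discovered your site yet. A crawler that obeys a robots.txt Disallow still has to request /robots.txt itself, so a bot blocked that way usually shows hits on that one file and nothing else.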
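
The log search can be scripted. This is a rough sketch that counts lines per bot by substring match; the sample lines are made up, and matching anywhere in the line is a simplification, so a URL path containing a bot name would be counted too.

```python
import re
from collections import Counter

AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Bytespider"]
BOT_PATTERN = re.compile("|".join(re.escape(bot) for bot in AI_BOTS), re.IGNORECASE)

def count_ai_bot_hits(log_lines):
    """Count access-log lines per AI bot, matching on the bot name substring."""
    canonical = {bot.lower(): bot for bot in AI_BOTS}
    hits = Counter({bot: 0 for bot in AI_BOTS})
    for line in log_lines:
        match = BOT_PATTERN.search(line)
        if match:
            hits[canonical[match.group(0).lower()]] += 1
    return hits

# Made-up sample lines in combined log format; read your real access log instead.
sample = [
    '203.0.113.5 - - [01/Feb/2026:10:00:00 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [01/Feb/2026:10:05:00 +0000] "GET /docs HTTP/1.1" 403 312 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '198.51.100.7 - - [01/Feb/2026:10:06:00 +0000] "GET / HTTP/1.1" 200 9000 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]
hits = count_ai_bot_hits(sample)
for bot in AI_BOTS:
    print(f"{bot}: {hits[bot]}")
```

To run it against a real file, pass the open file object, for example count_ai_bot_hits(open("access.log")), where access.log stands in for wherever your server writes its logs.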

Verify llms.txt is accessible:

curl -s -o /dev/null -w "%{http_code}" https://yoursite.com/llms.txt

If that returns 404, your file isn't deployed. Check your hosting platform's static file configuration.

The CDN and WAF Gotcha

This catches people more than you'd expect. Your robots.txt looks perfect. Your llms.txt is deployed. But AI crawlers still can't read your site because Cloudflare's Bot Fight Mode, AWS WAF, or Vercel's bot protection is blocking them at the infrastructure layer before the request even reaches your server.

Check your CDN's bot management settings. Cloudflare's Bot Fight Mode is aggressive by default and can block legitimate AI crawlers. If you're seeing 403 responses when you curl with AI user-agents but 200 responses with a normal browser user-agent, this is almost certainly the problem.
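
That browser-versus-bot comparison is easy to automate. The sketch below assumes the simplified GPTBot string from the curl test and a generic browser string; fetch_status and diagnose are illustrative names. A spoofed user-agent only exercises rules that match on the header, so it won't reproduce protections keyed to IP ranges or behavior.

```python
import urllib.error
import urllib.request

BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
BOT_UA = "Mozilla/5.0 (compatible; GPTBot/1.0)"  # simplified, as in the curl test

def fetch_status(url, user_agent, timeout=10):
    """Return the HTTP status code for url when requested with user_agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # urllib raises on 4xx/5xx; the code is still what we want

def diagnose(browser_status, bot_status):
    """Interpret a pair of status codes from the browser and bot requests."""
    if browser_status == 200 and bot_status in (403, 429):
        return "bot-specific block: check CDN/WAF bot settings"
    if browser_status == 200 and bot_status == 200:
        return "ok: page reachable with both user-agents"
    return "inconclusive: investigate the page itself first"

# Offline demonstration with the status pair described above.
print(diagnose(200, 403))
```

Against a live page, the call would be diagnose(fetch_status(page, BROWSER_UA), fetch_status(page, BOT_UA)), where page is one of your own URLs.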

The fix varies by provider. On Cloudflare, add WAF custom rules to allow specific AI bot user-agents. On Vercel, check your vercel.json for bot blocking headers. On AWS, update your WAF rules to whitelist GPTBot, ClaudeBot, and PerplexityBot user-agent strings.

Honestly, this one cost us two weeks of debugging on a client site last quarter. Everything looked right in robots.txt. The issue was three layers deeper in the infrastructure stack.

The Part Nobody Talks About: Monitoring

Here's the real challenge with AI crawler access. You set it up once. It works. Then six months later, a developer pushes a new robots.txt rule, your CDN vendor changes their bot protection defaults, or a WordPress plugin update overwrites your config. Your AI visibility drops to zero and nobody notices for weeks.

Configuring AI crawl access is a one-time job. Knowing when it breaks is the actual hard part.

Our AI visibility tracking runs these checks continuously. Every page, every week. When a robots.txt change blocks a crawler or a server starts returning 403s to AI bots, you get an alert before your citations disappear.

You can do all of this monitoring manually. Curl your robots.txt once a week, check server logs for bot traffic, manually query ChatGPT for your brand name. Total time: 2-3 hours per week if you're thorough. Or RankControl's agents handle it automatically while you focus on building your product.
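
For the manual route, the robots.txt part of that weekly check can be reduced to a comparison between last week's file and this week's. A minimal sketch, assuming you keep a saved copy of the previous file; the before and after contents below are hypothetical, and the function names are illustrative.

```python
from urllib.robotparser import RobotFileParser

WATCHED_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "GoogleOther"]

def allowed_bots(robots_txt, url="/"):
    """Return the set of watched bots that may fetch url under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot for bot in WATCHED_BOTS if parser.can_fetch(bot, url)}

def newly_blocked(previous_txt, current_txt, url="/"):
    """List bots that were allowed under the previous file but are blocked now."""
    return sorted(allowed_bots(previous_txt, url) - allowed_bots(current_txt, url))

# Hypothetical before/after contents; in practice, load last week's saved copy.
previous = "User-agent: GPTBot\nAllow: /\n"
current = "User-agent: GPTBot\nDisallow: /\n"

blocked = newly_blocked(previous, current)
print("ALERT:" if blocked else "OK:", blocked)
```

Comparing access decisions instead of raw text means cosmetic edits such as reordered groups or new comments don't trigger an alert; only a change in who can crawl does.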
