Specification: A User Manual for Your Site
The llms.txt file provides a standardized way to give instructions to AI models about your site's usage policy. It functions like robots.txt but focuses on generative usage permissions rather than just crawl access.
Placement & Format:
Place the file in the /.well-known/ directory: https://yourdomain.com/.well-known/llms.txt. It uses a field: value format.
Key Fields:
- User-Agent: Targets specific bots (* for all, or a named bot such as ClaudeBot).
- Allow / Disallow: Controls which directories and pages are permitted for training.
- Allow-Citing: Explicitly permits citation in model outputs.
Implementation Example:
# Default policy for all LLM agents
User-Agent: *
Disallow: /members/
Disallow: /private-data/
# Allow all bots to cite our public articles
User-Agent: *
Allow-Citing: /articles/
# Specific rules for ClaudeBot, if needed
User-Agent: ClaudeBot
Allow: /
Pros & Cons:
- Pro: Machine-readable usage terms replace buried human-readable ToS pages. Signals technical readiness.
- Con: Still a proposal; not all vendors honor it yet (e.g., OpenAI currently relies on robots.txt). Requires maintenance as site architecture evolves.
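To make the field: value layout concrete, here is a minimal Python sketch of how a consumer might read such a file. It assumes the format exactly as described in this section (rules belong to the most recent User-Agent line, and # starts a comment); the function name parse_llms_policy is hypothetical and not part of any published tooling.

```python
from collections import defaultdict


def parse_llms_policy(text: str) -> dict[str, list[tuple[str, str]]]:
    """Group `Field: value` rules under the User-Agent line that precedes them."""
    policy: dict[str, list[tuple[str, str]]] = defaultdict(list)
    current_agent = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        if field.lower() == "user-agent":
            current_agent = value
            policy[current_agent]  # register the agent even if no rules follow
        elif current_agent is not None:
            policy[current_agent].append((field, value))
    return dict(policy)


EXAMPLE = """\
# Default policy for all LLM agents
User-Agent: *
Disallow: /members/
Disallow: /private-data/
User-Agent: *
Allow-Citing: /articles/
User-Agent: ClaudeBot
Allow: /
"""

policy = parse_llms_policy(EXAMPLE)
for agent, rules in policy.items():
    print(agent, rules)
```

Repeated User-Agent blocks for the same agent are merged here, which is one plausible reading of the example above; a real consumer may interpret them differently.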
2. JSON-LD: Spoon-Feeding Structured Data to Machines
JSON-LD embeds structured data directly in HTML using Schema.org vocabulary. It tells AI agents exactly what a page represents, eliminating guesswork. Place the script tag within the <head> of your HTML.
Key Schemas for AI Agents:
- Article: Defines author, date, headline, and body for accurate attribution.
- Product: Exposes pricing, availability, and reviews for comparison queries.
- FAQPage: Pre-packaged Q&A pairs that agents can directly surface.
- HowTo: Breaks processes into discrete, reformatable steps.
Implementation Examples:
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "How to Make Your Website AI-Agent Readable",
  "author": {
    "@type": "Organization",
    "name": "GuardLabs"
  },
  "datePublished": "2024-05-21"
}

{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Website Care Plan",
  "image": "https://guardlabs.online/images/care-icon.png",
  "description": "Annual website maintenance and support.",
  "offers": {
    "@type": "Offer",
    "priceCurrency": "USD",
    "price": "240.00"
  }
}
Critical Constraint: Schema accuracy is paramount. Mismatched data (e.g., HTML price vs. JSON-LD price) triggers bot distrust and reduces citation likelihood.
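The consistency constraint can be checked automatically. The sketch below, using only the Python standard library, extracts JSON-LD blocks from a page and compares the Product price with the price shown in the visible HTML. The page markup, the class="price" selector, and the function name are all assumptions for illustration; adapt the pattern to your own templates.

```python
import json
import re
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed body of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._in_jsonld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


def jsonld_price_matches_page(html: str, visible_price_pattern: str) -> bool:
    """True if every Product price in JSON-LD equals the visible price.

    Assumes each Product has a single Offer object under "offers".
    """
    extractor = JsonLdExtractor()
    extractor.feed(html)
    visible = re.search(visible_price_pattern, html)
    schema_prices = [
        block["offers"]["price"]
        for block in extractor.blocks
        if isinstance(block, dict) and block.get("@type") == "Product" and "offers" in block
    ]
    if visible is None or not schema_prices:
        return False
    return all(float(price) == float(visible.group(1)) for price in schema_prices)


# Hypothetical page: the visible price lives in <span class="price">.
PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Website Care Plan",
 "offers": {"@type": "Offer", "priceCurrency": "USD", "price": "240.00"}}
</script>
</head><body><span class="price">$240.00</span></body></html>
"""
PRICE_PATTERN = r'class="price">\$([\d.]+)<'

print(jsonld_price_matches_page(PAGE, PRICE_PATTERN))
```

Running a check like this in CI or after each CMS publish is one way to catch drift between template prices and schema prices before a crawler does.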
3. MCP Cards: A Business Card for Your Server
The Machine-readable Citable Page (MCP) protocol provides a parallel JSON file containing core citable facts, bypassing HTML parsing overhead. Agents fetch https://yourdomain.com/my-article.mcp.json for clean, structured data.
Implementation Strategy:
- Deploy MCP cards only for data-rich, citable content (reports, product pages, reference guides).
- Host static JSON files at predictable URLs (append .mcp.json).
- Link to the MCP card from the HTML page using a <link rel="mcp-card" href="..."> tag in the <head>.
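As a rough sketch of the strategy above, the following Python helper derives the card URL from a page URL, serializes a card, and builds the matching link tag. The text does not define a card schema, so the field names used here (source, headline, datePublished) are purely illustrative assumptions, as is the function name build_mcp_card.

```python
import html
import json


def build_mcp_card(page_url: str, facts: dict) -> tuple[str, str, str]:
    """Return (card_url, card_json, link_tag) for one page.

    The card URL follows the "append .mcp.json" convention described above.
    """
    card_url = page_url.rstrip("/") + ".mcp.json"
    card_json = json.dumps({"source": page_url, **facts}, indent=2, sort_keys=True)
    link_tag = f'<link rel="mcp-card" href="{html.escape(card_url, quote=True)}">'
    return card_url, card_json, link_tag


card_url, card_json, link_tag = build_mcp_card(
    "https://yourdomain.com/my-article",
    {
        # Hypothetical fields: pick whatever facts you want agents to cite.
        "headline": "How to Make Your Website AI-Agent Readable",
        "datePublished": "2024-05-21",
    },
)
print(card_url)
print(link_tag)
```

Generating the card and the link tag from the same function keeps the two in sync, which matters because a link tag pointing at a missing card is worse than no tag at all.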
4. AI Crawler Management & robots.txt Configuration
AI crawlers are actively ingesting the web. Understanding their purpose and configuring robots.txt correctly is foundational.
| Crawler | Company | Purpose |
|---|---|---|
| GPTBot | OpenAI | Crawls web data to improve future ChatGPT models. |
| ClaudeBot | Anthropic | Used for training Claude models. |
| PerplexityBot | Perplexity AI | Crawls the web to find answers for Perplexity's conversational search engine. |
| Google-Extended | Google | A robots.txt control token rather than a separate crawler: it governs whether content fetched by Google's existing crawlers may be used to improve Gemini (formerly Bard). Opting out here does not affect Google Search. |
| CCBot | Common Crawl | Not a company, but a non-profit that crawls and archives the web. Its data is widely used to train many open-source and commercial LLMs. |
Permissive Configuration Example:
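# A crawler that matches a named group ignores the "User-agent: *" group,
# so sensitive paths are repeated here. Disallow lines come first so that
# first-match parsers and longest-match parsers agree.
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
User-agent: Google-Extended
Disallow: /admin
Disallow: /private/
Allow: /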
# You might want to disallow CCBot if you are concerned about
# your content being in a public dataset forever.
User-agent: CCBot
Disallow: /
# Keep your existing rules for other bots
User-agent: *
Disallow: /admin
Disallow: /private/
Note: For most sites the bandwidth cost of these crawlers is small, though it is worth confirming in your own logs. The larger risk is over-restricting access and dropping out of the training/citation ecosystem.
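Before deploying a robots.txt draft, it can help to test it the way a parser would. The sketch below uses Python's standard urllib.robotparser; the rules, bot names, and paths are sample values, and access_matrix is a hypothetical helper name. Note that Python's parser applies the first matching rule within a group, whereas crawlers following RFC 9309 use the longest match, so results can differ when Allow and Disallow overlap inside one group.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /admin
Disallow: /private/
"""


def access_matrix(robots_txt, agents, paths, base="https://yourdomain.com"):
    """Map (agent, path) to whether the parsed rules permit fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        (agent, path): parser.can_fetch(agent, base + path)
        for agent in agents
        for path in paths
    }


matrix = access_matrix(
    ROBOTS_TXT,
    agents=["GPTBot", "CCBot", "SomeOtherBot"],
    paths=["/articles/intro", "/admin"],
)
for (agent, path), allowed in matrix.items():
    print(f"{agent:13} {path:16} {'allowed' if allowed else 'blocked'}")
```

In this sample GPTBot is allowed to fetch /admin: a crawler that matches a named group follows only that group and does not inherit the rules under User-agent: *. Sensitive paths therefore need to be listed in every named group as well.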
5. Verification & Testing
You cannot rely on assumptions. Verify ingestion from the agent's perspective:
- Server Logs: Filter access logs for known user agents:
  grep "GPTBot" /var/log/nginx/access.log
  Look for 200 OK. 403 or 503 indicates blocking.
- curl Impersonation: Simulate crawler requests to debug CDN/firewall rules:
  curl -A "GPTBot" -I https://yourdomain.com/my-article
  Expect HTTP/2 200. CAPTCHAs or 403 responses mean security layers are blocking ingestion.
- Citation Validation: After 2 to 4 weeks of confirmed crawling, test prompts against the agents to verify whether they cite your structured data in generated responses.
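The log check can be taken one step further than grep by tallying status codes per crawler. This sketch assumes nginx's default "combined" log format; the sample lines, IP addresses, and user agent strings are made up for illustration. Google-Extended is left out because it is a robots.txt token and does not appear as its own user agent string in requests.

```python
import re
from collections import Counter

AI_AGENTS = ("GPTBot", "ClaudeBot", "PerplexityBot", "CCBot")

# Captures the status code and user agent from a "combined" format line:
# ... "REQUEST" STATUS BYTES "REFERER" "USER-AGENT"
LOG_LINE = re.compile(r'" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"')


def crawl_status_counts(lines):
    """Count (bot, status) pairs for requests made by known AI crawlers."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue
        for bot in AI_AGENTS:
            if bot in match.group("agent"):
                counts[(bot, int(match.group("status")))] += 1
    return counts


# Illustrative sample lines, not real traffic.
SAMPLE = [
    '203.0.113.7 - - [21/May/2024:10:00:00 +0000] "GET /my-article HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.7 - - [21/May/2024:10:00:05 +0000] "GET /pricing HTTP/1.1" 403 162 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [21/May/2024:10:01:00 +0000] "GET /my-article HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '198.51.100.4 - - [21/May/2024:10:02:00 +0000] "GET / HTTP/1.1" 200 901 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

counts = crawl_status_counts(SAMPLE)
for (bot, status), n in sorted(counts.items()):
    print(bot, status, n)
```

To run it against a real log, pass an open file handle instead of SAMPLE, for example crawl_status_counts(open("/var/log/nginx/access.log")). A rising share of 403 or 503 responses for one bot usually points to a CDN or firewall rule rather than robots.txt.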
Pitfall Guide
- Incomplete or Inconsistent JSON-LD: Providing schema that doesn't match visible HTML content (e.g., mismatched pricing or dates) causes LLMs to flag the page as unreliable, drastically reducing citation probability.
- Over-Restrictive robots.txt Defaults: Blocking all unknown or AI-specific crawlers by default guarantees exclusion from the generative AI ecosystem. Adopt a permissive baseline for verified AI bots and restrict only sensitive paths.
- MCP Card Over-Deployment: Creating .mcp.json files for every page introduces unnecessary server overhead and maintenance debt. Reserve MCP cards for high-value, data-dense, and frequently cited content types.
- Treating llms.txt as Set-and-Forget: Site architecture changes (new directories, renamed sections) break llms.txt rules if not updated. Treat it as a living configuration file that syncs with your CMS routing.
- Ignoring Crawl Verification: Assuming ingestion without checking server logs or testing with curl leads to false confidence. Always validate 200 OK responses and monitor crawl frequency before expecting citation shifts.
- Relying Solely on llms.txt for Policy Enforcement: llms.txt is currently a proposal. Most major AI vendors still prioritize robots.txt for access control. Use both in tandem for maximum compatibility.
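The set-and-forget pitfall lends itself to a simple automated check: flag any rule path that no longer matches a live route. The sketch below assumes you can export a list of current URL paths from your CMS or sitemap; how that export works is outside its scope, and both the route list and the stale_rules helper are hypothetical.

```python
def stale_rules(rule_paths, live_routes):
    """Return rule paths that are not a prefix of any live route."""
    return [
        rule
        for rule in rule_paths
        if rule != "/" and not any(route.startswith(rule) for route in live_routes)
    ]


# Paths referenced by Allow / Disallow / Allow-Citing rules.
RULE_PATHS = ["/members/", "/private-data/", "/articles/"]

# Hypothetical export of current routes, after /members/ was renamed to /account/.
LIVE_ROUTES = ["/articles/intro", "/account/settings", "/private-data/export"]

print(stale_rules(RULE_PATHS, LIVE_ROUTES))
```

Here /members/ would be reported as stale, which signals that the renamed /account/ section is no longer covered by any Disallow rule.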
Deliverables
- Agent-Readiness Blueprint: A step-by-step architectural guide mapping your CMS, server configuration, and content taxonomy to AI crawler requirements. Includes routing rules for /.well-known/llms.txt, JSON-LD injection points, and MCP card generation workflows.
- Implementation Checklist: A technical validation list covering robots.txt allow-lists, schema.org markup validation, MCP card URL conventions, log monitoring setup, and curl verification protocols.
- Configuration Templates: Ready-to-deploy llms.txt, robots.txt, and JSON-LD snippet templates tailored for Article, Product, FAQPage, and HowTo content types, with environment-specific variables for staging/production sync.