The Challenge
A content-heavy corporate website with over 70 pages and a dynamic blog was generating significant traffic from AI agents — ChatGPT, Perplexity, Claude, and various automated crawlers. However, these agents were receiving the same raw HTML served to browsers: full of navigation bars, cookie consent banners, footer links, inline scripts, SVG icons, and styling markup. Analysis showed that approximately 80% of the tokens AI agents consumed were noise — markup that contributed nothing to understanding the page's actual content. The agents were wasting compute, the site's content was being poorly represented in AI-generated summaries, and there was no mechanism to signal content permissions to these automated visitors.
The Approach
We designed and implemented a structured content negotiation strategy:
- Content Audit: Mapped all 70+ static pages and dynamic blog posts, identifying which elements constitute content vs. chrome
- HTTP Content Negotiation: Implemented Accept header detection at the nginx level — agents sending Accept: text/markdown receive clean markdown instead of HTML
- Hybrid Architecture: Pre-generated markdown for static pages (which change rarely) combined with real-time conversion for blog posts (which update frequently)
- Metadata Enrichment: Added YAML frontmatter with title, description, canonical URL, language, and keywords to every markdown response
- Permission Signalling: Implemented X-Content-Signal headers to communicate usage permissions to AI agents
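Putting those pieces together, a negotiated exchange might look like the following on the wire. This is an illustrative sketch only: the domain, the `X-Content-Signal` values, and the exact frontmatter fields are assumptions, not a published specification.

```http
GET /blog/some-post HTTP/1.1
Host: example.com
Accept: text/markdown

HTTP/1.1 200 OK
Content-Type: text/markdown; charset=utf-8
X-Content-Signal: ai-train=no, search=yes
Vary: Accept

---
title: Some Post
description: A short summary of the post
canonical: https://example.com/blog/some-post
lang: en
keywords: [content-negotiation, markdown]
---

# Some Post
...
```

The `Vary: Accept` header matters here: it tells caches that the HTML and markdown variants of the same URL must be stored separately.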
The Solution
A CDN-independent Markdown for Agents implementation using nginx content negotiation and a FastAPI conversion service. AI agents now receive clean, structured markdown with rich metadata, while human visitors continue to see the full HTML experience unchanged.
Architecture
Detection Layer
nginx map directive inspects Accept headers for text/markdown preference
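The detection step can be sketched with an nginx `map` block like this (a minimal illustration; a production config would also need to weigh q-values and multi-type Accept headers, which a single regex does not handle):

```nginx
# Flag requests whose Accept header mentions text/markdown.
map $http_accept $prefer_markdown {
    default          0;
    ~*text/markdown  1;
}

server {
    listen 80;

    location / {
        # Route markdown-preferring agents to the pre-generated
        # .md files / conversion service; everyone else gets HTML.
        if ($prefer_markdown) {
            rewrite ^(.*)$ /markdown$1 last;
        }
        # ...normal HTML handling...
    }
}
```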
Static Cache
Pre-generated markdown files for 70+ pages, regenerated nightly via scheduler
Dynamic Conversion
Real-time HTML-to-markdown conversion for blog posts via FastAPI service
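The conversion itself can be illustrated with a stdlib-only sketch of the kind of logic the FastAPI service would wrap in an endpoint. Tag coverage here is deliberately tiny; the class and the set of stripped "chrome" elements are assumptions, not the service's actual code.

```python
# Minimal HTML-to-markdown converter sketch (illustrative, not the
# production service): strips chrome elements, keeps headings, text,
# and list items.
from html.parser import HTMLParser

class MarkdownConverter(HTMLParser):
    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
    SKIP = {"script", "style", "nav", "footer", "svg"}  # page chrome

    def __init__(self):
        super().__init__()
        self.out = []
        self.skip_depth = 0  # >0 while inside a chrome element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.HEADINGS:
            self.out.append("\n" + self.HEADINGS[tag])
        elif tag == "p":
            self.out.append("\n")
        elif tag == "li":
            self.out.append("\n- ")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.out.append(data)

def html_to_markdown(html: str) -> str:
    parser = MarkdownConverter()
    parser.feed(html)
    return "".join(parser.out).strip()
```

In the real architecture this function would sit behind a FastAPI route so blog posts are converted per-request rather than cached.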
Response Layer
Clean markdown with YAML frontmatter, token count headers, and permission signalling
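The response-header side might look like the sketch below. The `X-Token-Count` name, the `X-Content-Signal` values, and the 4-characters-per-token heuristic are all assumptions; a production setup might count real tokens with a tokenizer library instead.

```python
# Sketch of markdown response headers (hypothetical names/values).
def markdown_response_headers(markdown: str) -> dict[str, str]:
    approx_tokens = max(1, len(markdown) // 4)  # rough heuristic
    return {
        "Content-Type": "text/markdown; charset=utf-8",
        "X-Token-Count": str(approx_tokens),            # hypothetical header
        "X-Content-Signal": "ai-train=no, search=yes",  # hypothetical values
        "Vary": "Accept",  # keep HTML and markdown variants cached apart
    }
```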
Results
- Token consumption for AI agents reduced by approximately 80%
- 70+ pages serving clean markdown via content negotiation
- Blog posts converting in real time with no additional latency perceptible to agents
- llms.txt and llms-full.txt complementing the content negotiation layer
- Zero impact on human visitor experience — same HTML, same performance
- Nightly regeneration keeping markdown cache synchronised with site changes
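For context, the llms.txt file mentioned above follows the llms.txt proposal's format: an H1 title, a blockquote summary, and H2 sections of annotated links. A minimal hypothetical example (domain and page names invented):

```markdown
# Example Corp

> Corporate site for Example Corp. Every page is also available as
> clean markdown via content negotiation (Accept: text/markdown).

## Pages

- [About](https://example.com/about.md): company overview
- [Services](https://example.com/services.md): what we offer

## Blog

- [Blog index](https://example.com/blog/index.md): all articles
```

`llms-full.txt` extends this by inlining the full markdown content rather than linking to it.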
Facing similar challenges?
Every organisation's situation is unique. Let's discuss how we can help with yours.