The Challenge
A content-heavy corporate website with over 70 pages and a dynamic blog was generating significant traffic from AI agents — ChatGPT, Perplexity, Claude, and various automated crawlers. However, these agents were receiving the same raw HTML served to browsers: full of navigation bars, cookie consent banners, footer links, inline scripts, SVG icons, and styling markup. Analysis showed that approximately 80% of the tokens AI agents consumed were noise — markup that contributed nothing to understanding the page's actual content. The agents were wasting compute, the site's content was being poorly represented in AI-generated summaries, and there was no mechanism to signal content permissions to these automated visitors.
The Approach
We designed and implemented a structured content negotiation strategy:
- Content Audit: Mapped all 70+ static pages and dynamic blog posts, identifying which elements constitute content vs. chrome
- HTTP Content Negotiation: Implemented Accept header detection at the nginx level — agents sending Accept: text/markdown receive clean markdown instead of HTML
- Hybrid Architecture: Pre-generated markdown for static pages (which change rarely) combined with real-time conversion for blog posts (which update frequently)
- Metadata Enrichment: Added YAML frontmatter with title, description, canonical URL, language, and keywords to every markdown response
- Permission Signalling: Implemented X-Content-Signal headers to communicate usage permissions to AI agents
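Putting those pieces together, a negotiated exchange might look like the following on the wire. This is an illustrative sketch only: the domain, the `X-Content-Signal` values, and the exact frontmatter fields are assumptions, not a published specification.

```http
GET /blog/some-post HTTP/1.1
Host: example.com
Accept: text/markdown

HTTP/1.1 200 OK
Content-Type: text/markdown; charset=utf-8
X-Content-Signal: ai-train=no, search=yes
Vary: Accept

---
title: Some Post
description: A short summary of the post
canonical: https://example.com/blog/some-post
lang: en
keywords: [content-negotiation, markdown]
---

# Some Post
...
```

The `Vary: Accept` header matters here: it tells caches that the HTML and markdown variants of the same URL must be stored separately.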
The Solution
A CDN-independent Markdown for Agents implementation using nginx content negotiation and a FastAPI conversion service. AI agents now receive clean, structured markdown with rich metadata, while human visitors continue to see the full HTML experience unchanged.
Architecture
Detection Layer
nginx map directive inspects Accept headers for text/markdown preference
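The detection step can be sketched with an nginx `map` block like this (a minimal illustration; a production config would also need to weigh q-values and multi-type Accept headers, which a single regex does not handle):

```nginx
# Flag requests whose Accept header mentions text/markdown.
map $http_accept $prefer_markdown {
    default          0;
    ~*text/markdown  1;
}

server {
    listen 80;

    location / {
        # Route markdown-preferring agents to the pre-generated
        # .md files / conversion service; everyone else gets HTML.
        if ($prefer_markdown) {
            rewrite ^(.*)$ /markdown$1 last;
        }
        # ...normal HTML handling...
    }
}
```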
Static Cache
Pre-generated markdown files for 70+ pages, regenerated nightly via scheduler
Dynamic Conversion
Real-time HTML-to-markdown conversion for blog posts via FastAPI service
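The conversion itself can be illustrated with a stdlib-only sketch of the kind of logic the FastAPI service would wrap in an endpoint. Tag coverage here is deliberately tiny; the class and the set of stripped "chrome" elements are assumptions, not the service's actual code.

```python
# Minimal HTML-to-markdown converter sketch (illustrative, not the
# production service): strips chrome elements, keeps headings, text,
# and list items.
from html.parser import HTMLParser

class MarkdownConverter(HTMLParser):
    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
    SKIP = {"script", "style", "nav", "footer", "svg"}  # page chrome

    def __init__(self):
        super().__init__()
        self.out = []
        self.skip_depth = 0  # >0 while inside a chrome element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.HEADINGS:
            self.out.append("\n" + self.HEADINGS[tag])
        elif tag == "p":
            self.out.append("\n")
        elif tag == "li":
            self.out.append("\n- ")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.out.append(data)

def html_to_markdown(html: str) -> str:
    parser = MarkdownConverter()
    parser.feed(html)
    return "".join(parser.out).strip()
```

In the real architecture this function would sit behind a FastAPI route so blog posts are converted per-request rather than cached.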
Response Layer
Clean markdown with YAML frontmatter, token count headers, and permission signalling
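The response-header side might look like the sketch below. The `X-Token-Count` name, the `X-Content-Signal` values, and the 4-characters-per-token heuristic are all assumptions; a production setup might count real tokens with a tokenizer library instead.

```python
# Sketch of markdown response headers (hypothetical names/values).
def markdown_response_headers(markdown: str) -> dict[str, str]:
    approx_tokens = max(1, len(markdown) // 4)  # rough heuristic
    return {
        "Content-Type": "text/markdown; charset=utf-8",
        "X-Token-Count": str(approx_tokens),            # hypothetical header
        "X-Content-Signal": "ai-train=no, search=yes",  # hypothetical values
        "Vary": "Accept",  # keep HTML and markdown variants cached apart
    }
```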
Results
- Token consumption for AI agents reduced by approximately 80%
- 70+ pages serving clean markdown via content negotiation
- Blog posts converting in real time with no additional latency perceptible to agents
- llms.txt and llms-full.txt complementing the content negotiation layer
- Zero impact on human visitor experience — same HTML, same performance
- Nightly regeneration keeping markdown cache synchronised with site changes
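For context, the llms.txt file mentioned above follows the llms.txt proposal's format: an H1 title, a blockquote summary, and H2 sections of annotated links. A minimal hypothetical example (domain and page names invented):

```markdown
# Example Corp

> Corporate site for Example Corp. Every page is also available as
> clean markdown via content negotiation (Accept: text/markdown).

## Pages

- [About](https://example.com/about.md): company overview
- [Services](https://example.com/services.md): what we offer

## Blog

- [Blog index](https://example.com/blog/index.md): all articles
```

`llms-full.txt` extends this by inlining the full markdown content rather than linking to it.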
Facing similar challenges?
Every organisation's situation is unique. Let's discuss how we can help with yours.