
I gave my blog a Markdown twin for agents

I don’t have many readers. I have no evidence that any LLM cites my blog, and probably none do yet. So this wasn’t a traffic play. I wanted to understand how the “agent-readable web” stuff works, and a personal blog is a cheap place to learn by doing it for real.

The premise is simple. When an agent fetches one of my posts, it downloads my HTML: <div> soup, CSS classes, inline scripts, hydration noise, navigation chrome. The prose I wrote is a sliver of the bytes. The same post in Markdown runs 60–90% smaller. So the goal: hand agents clean Markdown, give them a map of the site, and advertise that both exist, all on a static site with no server and no edge functions.

I’ll walk through everything I built. Stack is Astro 5 with content collections, but the shape ports to any static generator.

The one decision that shapes everything

There are two ways to serve Markdown to an agent:

  • Content negotiation: same URL, return Markdown when the request sends Accept: text/markdown. Clean, but it needs a server or edge function to branch on the header.
  • .md suffix: publish a second static file at /blog/post.md next to /blog/post/. Pure files, works on any static host.

My site is static, so the .md-suffix route wins by default. It covers the same need with zero runtime. If you’re on Cloudflare, the easy path is to flip on their Markdown for Agents toggle. It serves a Markdown version of any page on Accept: text/markdown with no code at all. Doing it by hand is ~80 lines, but you learn what’s happening and you’re not tied to one host.
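For contrast, here is roughly the decision a content-negotiation setup has to make on every request. This is a sketch, not something my static site runs: `parseAccept` and `prefersMarkdown` are hypothetical helpers, and they only handle the common shape of an Accept header (media types with optional q-values).

```typescript
/** One parsed Accept entry, e.g. "text/html;q=0.8" -> { type: "text/html", q: 0.8 }. */
interface AcceptEntry {
  type: string;
  q: number;
}

/** Parse an Accept header into entries. Entries without a q parameter default to q=1. */
export function parseAccept(header: string): AcceptEntry[] {
  return header
    .split(",")
    .map((part) => part.trim())
    .filter((part) => part.length > 0)
    .map((part) => {
      const [type, ...params] = part.split(";").map((s) => s.trim());
      const qParam = params.find((p) => p.startsWith("q="));
      const q = qParam ? Number.parseFloat(qParam.slice(2)) : 1;
      return { type: (type ?? "").toLowerCase(), q: Number.isNaN(q) ? 1 : q };
    });
}

/** True when the client asks for text/markdown at least as strongly as text/html. */
export function prefersMarkdown(acceptHeader: string | null): boolean {
  if (!acceptHeader) return false;
  const entries = parseAccept(acceptHeader);
  const qFor = (mime: string) => entries.find((e) => e.type === mime)?.q ?? 0;
  const md = qFor("text/markdown");
  return md > 0 && md >= qFor("text/html");
}
```

Something has to run this per request, which is exactly the server or edge function a static host doesn't give you. The .md suffix moves the same choice into the URL.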

Step 1 — A shared Markdown serializer

Three places need to turn a post into clean Markdown: the per-page .md files, llms-full.txt, and the RSS feed. Write the serializer once. My posts are MDX, so they can contain import lines and JSX components that mean nothing to an LLM. Strip those out.

src/lib/llm-markdown.ts:

import type { CollectionEntry } from "astro:content";

/** Strip MDX-only syntax so the body is portable CommonMark. */
export function stripMdxSyntax(content: string): string {
  return content
    .replace(/^import\s+.*?from\s+["'].*?["'];?\s*$/gm, "")
    .replace(/^export\s+.*$/gm, "")
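    // Self-closing components, with or without attributes: <Foo /> or <Foo bar="x" />
    .replace(/<[A-Z][A-Za-z0-9]*(?:\s[^>]*)?\/>/g, "")
    // Paired components; \1 forces the closing tag to match the opening one
    .replace(/<([A-Z][A-Za-z0-9]*)\b[^>]*>[\s\S]*?<\/\1>/g, "")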
    .trim();
}

const isoDate = (d: Date) => d.toISOString().slice(0, 10);

/** Render one entry as a self-contained doc: H1, summary, meta, body. */
export function entryToMarkdown(entry: CollectionEntry<"blog">, canonicalURL: string): string {
  const { data, body } = entry;
  const lines = [`# ${data.title}`, ""];
  if (data.description) lines.push(`> ${data.description}`, "");
  lines.push(
    `- **Published:** ${isoDate(data.date)}`,
    ...(data.tags?.length ? [`- **Tags:** ${data.tags.join(", ")}`] : []),
    `- **Canonical:** ${canonicalURL}`,
    "", "---", "",
    stripMdxSyntax(body ?? ""),
  );
  return lines.join("\n") + "\n";
}

The metadata block matters. An agent that fetches one .md file in isolation never sees your HTML <head>, so it has no idea when the post was written or what URL to cite. The H1, blockquote summary, and meta lines give it that inline. The canonical URL is the important one: it’s how the agent attributes the content back to you.
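To make that concrete, here is a sketch of how a consumer could recover the metadata from a standalone .md file. The parser is hypothetical (I don't know how any real agent reads these files), and the sample document uses an invented title and URL.

```typescript
interface DocMeta {
  title?: string;
  summary?: string;
  published?: string;
  canonical?: string;
}

/** Read the header the serializer writes: H1, "> summary", "- **Key:** value" lines, up to the first "---". */
export function readMetaBlock(markdown: string): DocMeta {
  const meta: DocMeta = {};
  for (const line of markdown.split("\n")) {
    if (line === "---") break; // the metadata block ends at the separator
    if (line.startsWith("# ")) meta.title = line.slice(2).trim();
    else if (line.startsWith("> ")) meta.summary = line.slice(2).trim();
    else {
      const match = /^- \*\*(\w+):\*\* (.+)$/.exec(line);
      if (match?.[1] === "Published") meta.published = match[2];
      if (match?.[1] === "Canonical") meta.canonical = match[2];
    }
  }
  return meta;
}

// Invented sample in the shape the serializer produces.
const sample = [
  "# Example post",
  "",
  "> A one-line summary.",
  "",
  "- **Published:** 2025-01-15",
  "- **Canonical:** https://example.com/blog/example-post/",
  "",
  "---",
  "",
  "Body text.",
].join("\n");

const meta = readMetaBlock(sample);
// meta.title is "Example post"; meta.canonical is the URL to cite.
```

Everything the reader needs to date and attribute the post sits above the first separator, so it never has to touch the HTML.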

Step 2 — A .md twin for every post

Astro lets a file route emit non-HTML. Name it [...id].md.ts and it generates one static .md per post at build time, mirroring the canonical URL.

src/pages/blog/[...id].md.ts:

import type { APIRoute, GetStaticPaths } from "astro";
import { getCollection, type CollectionEntry } from "astro:content";
import { entryToMarkdown } from "@lib/llm-markdown";

export const getStaticPaths: GetStaticPaths = async () => {
  const posts = (await getCollection("blog")).filter((p) => !p.data.draft);
  return posts.map((post) => ({ params: { id: post.id }, props: { post } }));
};

export const GET: APIRoute = ({ props, site }) => {
  const { post } = props as { post: CollectionEntry<"blog"> };
  const canonicalURL = new URL(`/blog/${post.id}/`, site).toString();
  return new Response(entryToMarkdown(post, canonicalURL), {
    headers: { "Content-Type": "text/markdown; charset=utf-8" },
  });
};

One gotcha worth knowing: in a static build, the Content-Type you set here is advisory, because the host serves the physical file and picks the type from its extension. On Vercel, Cloudflare, and Netlify, .md resolves to text/markdown. The bytes are Markdown either way, so agents are fine.

I only emit .md for blog posts, because those are the only things with an individual canonical URL to mirror. Projects and list pages get covered by the next step.

Step 3 — llms.txt, the site map

One small file an agent fetches first to learn the whole site without crawling it. It follows the llmstxt.org shape: an H1 name, a > summary, ## sections of links, and an ## Optional section agents may skip when they need a shorter context.

src/pages/llms.txt.ts (abridged):

import type { APIRoute } from "astro";
import { getCollection } from "astro:content";
import { SITE } from "@consts";

export const GET: APIRoute = async ({ site }) => {
  const base = site!.toString().replace(/\/$/, "");
  const blog = (await getCollection("blog"))
    .filter((p) => !p.data.draft)
    .sort((a, b) => b.data.date.valueOf() - a.data.date.valueOf());

  const lines = [
    `# ${SITE.TITLE}`, "",
    `> ${SITE.DESCRIPTION}`, "",
    "Every blog post is available as raw Markdown by appending `.md` to its URL.", "",
    "## Blog", "",
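    // description is optional in the serializer, so skip the ": ..." suffix when it is missing
    ...blog.map((p) => `- [${p.data.title}](${base}/blog/${p.id}.md)${p.data.description ? `: ${p.data.description}` : ""}`),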
    "", "## Optional", "",
    `- [Full site content (single file)](${base}/llms-full.txt)`,
    `- [RSS feed](${base}/rss.xml)`,
  ];
  return new Response(lines.join("\n"), {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
};

The links point at the .md versions, not the HTML pages, so an agent following the map lands straight on the token-light file. That’s the entire point of having a map.
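Seen from the other side, a consumer could walk the map like this. It's a sketch under assumptions: `parseLlmsTxt` and `urlsToFetch` are hypothetical names, and the parser only understands the `- [title](url): note` link lines and `## ` headings that the file above emits.

```typescript
interface LlmsLink {
  title: string;
  url: string;
  note?: string;
}

interface LlmsSection {
  name: string;
  links: LlmsLink[];
}

/** Group "- [title](url): note" links under their "## Section" heading. */
export function parseLlmsTxt(text: string): LlmsSection[] {
  const sections: LlmsSection[] = [];
  let current: LlmsSection | undefined;
  for (const line of text.split("\n")) {
    if (line.startsWith("## ")) {
      current = { name: line.slice(3).trim(), links: [] };
      sections.push(current);
      continue;
    }
    const match = /^- \[(.+?)\]\((\S+?)\)(?::\s*(.*))?$/.exec(line);
    if (match && current) {
      const link: LlmsLink = { title: match[1] ?? "", url: match[2] ?? "" };
      if (match[3]) link.note = match[3];
      current.links.push(link);
    }
  }
  return sections;
}

/** URLs worth fetching, skipping the "Optional" section when context is tight. */
export function urlsToFetch(text: string, includeOptional: boolean): string[] {
  const urls: string[] = [];
  for (const section of parseLlmsTxt(text)) {
    if (section.name === "Optional" && !includeOptional) continue;
    urls.push(...section.links.map((link) => link.url));
  }
  return urls;
}
```

With `includeOptional` set to false, the agent gets only the per-post .md links and never touches the heavy single-file dump.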

Step 4 — llms-full.txt, the whole site in one fetch

Some agents would rather grab everything once than make twenty requests. src/pages/llms-full.txt.ts maps every post through the same entryToMarkdown serializer and joins them with --- separators. Mine came out around 87 KB, nothing for a context window. Park the link to it in the ## Optional section of llms.txt, since it’s the heavy alternative to the per-page files.
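The join itself is tiny. A minimal sketch, assuming the posts have already been run through the serializer; `joinDocuments` is a hypothetical name, not the code from my repo:

```typescript
/** Join already-serialized posts into one document, separated by horizontal rules. */
export function joinDocuments(docs: string[]): string {
  return docs.map((doc) => doc.trim()).join("\n\n---\n\n") + "\n";
}

// In the route this would be fed something like:
//   posts.map((post) => entryToMarkdown(post, canonicalURLFor(post)))
// where canonicalURLFor is whatever builds the post's canonical URL.
```

Each post already contains a `---` between its metadata and body, so the separator alone doesn't mark a post boundary. The H1 at the top of each entry does.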

Step 5 — Make all of it discoverable

A .md twin nobody knows about is useless. Three signals advertise it, at three layers.

In the HTML <head>, point at the Markdown twin so an agent that did land on the HTML page can find it:

{markdownURL && (
  <link rel="alternate" type="text/markdown" title="Markdown version" href={markdownURL} />
)}

Use rel="alternate" plus the media type. There’s no IANA-registered relation for Markdown variants or llms.txt. Inventing rel="llms-txt" would be non-standard. alternate is the honest, spec-safe choice.
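The `markdownURL` in that snippet has to come from somewhere. One way to derive it, sketched as a hypothetical `markdownTwinPath` helper that assumes twins exist only under /blog/, as on my site:

```typescript
/**
 * Map a page path to its Markdown twin, or undefined when the page has none.
 * Assumes twins exist only for individual posts under /blog/.
 */
export function markdownTwinPath(pathname: string): string | undefined {
  if (pathname.endsWith(".md")) return undefined; // already the twin
  const match = /^\/blog\/(.+?)\/?$/.exec(pathname);
  if (!match) return undefined; // list pages, projects, home: nothing to advertise
  return `/blog/${match[1]}.md`;
}
```

Returning undefined for everything else is what keeps the `<link>` out of pages that have no twin, so the head never advertises a file that doesn't exist.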

In an HTTP Link header (RFC 8288), repeat the pointer so agents see it in the response headers before downloading the body. Static files can’t emit their own headers; the host does. On Cloudflare Pages or Netlify, that means public/_headers:

/*
  Link: </llms.txt>; rel="alternate"; type="text/markdown", </sitemap-index.xml>; rel="sitemap"; type="application/xml"

On Vercel, put the same Link value under a headers rule for /(.*) in vercel.json. Ship both files if your host isn’t pinned in the repo; each host ignores the other’s file.
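If you'd rather generate the header value than hand-type it in two places, a small builder keeps both files in sync. This is a sketch: `linkHeaderValue` and `vercelConfig` are names I made up, and the object mirrors the `headers` rule shape that vercel.json expects.

```typescript
interface LinkTarget {
  href: string;
  rel: string;
  type?: string;
}

/** Serialize targets into a single RFC 8288 Link header value. */
export function linkHeaderValue(targets: LinkTarget[]): string {
  return targets
    .map(({ href, rel, type }) => {
      const params = [`rel="${rel}"`];
      if (type) params.push(`type="${type}"`);
      return [`<${href}>`, ...params].join("; ");
    })
    .join(", ");
}

const value = linkHeaderValue([
  { href: "/llms.txt", rel: "alternate", type: "text/markdown" },
  { href: "/sitemap-index.xml", rel: "sitemap", type: "application/xml" },
]);

/** The object to write out as vercel.json (merge it into an existing file if you have one). */
export const vercelConfig = {
  headers: [{ source: "/(.*)", headers: [{ key: "Link", value }] }],
};
```

`JSON.stringify(vercelConfig, null, 2)` gives the file contents, and `value` is the same string that goes after `Link:` in `_headers`.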

In robots.txt, state your AI-usage preferences with a Content Signal:

# /llms.txt — curated Markdown index of this site
User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=yes
Allow: /

Sitemap: https://nbarwicki.com/sitemap-index.xml

search means appear in AI search, ai-input means usable as answer/RAG input, ai-train means usable for training. I allow all three. To permit citation but block training, you’d set ai-train=no. That’s the one knob to flip.
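If robots.txt is generated at build time, the three switches can live in one typed object. A minimal sketch; `ContentSignals` and `contentSignalLine` are hypothetical names:

```typescript
interface ContentSignals {
  search: boolean;
  aiInput: boolean;
  aiTrain: boolean;
}

/** Format the Content-Signal line for robots.txt from three booleans. */
export function contentSignalLine(signals: ContentSignals): string {
  const yesNo = (allowed: boolean) => (allowed ? "yes" : "no");
  return (
    `Content-Signal: search=${yesNo(signals.search)}, ` +
    `ai-input=${yesNo(signals.aiInput)}, ai-train=${yesNo(signals.aiTrain)}`
  );
}

// Citation allowed, training blocked:
const citeOnly = contentSignalLine({ search: true, aiInput: true, aiTrain: false });
```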

The part to skip — don’t fake capabilities

When I ran my site through an “is your site agent-ready?” scanner, it flagged a pile of missing files: OAuth and OIDC discovery documents, an API catalog, an MCP server card, an auth.md, DNS records advertising agent endpoints.

I skipped every one of them, and you should too. For a content site, they’re not a harmless box-tick. An authorization_endpoint in your metadata is a promise that an auth server exists at that address. An MCP server card is a promise of callable tools. When an agent takes you up on a promise you can’t keep, you haven’t looked more capable. You’ve sent it down a dead end and taught it to distrust your site’s signals.

Only advertise capabilities you can back. The scanner doesn’t know what your site is; it knows what the spec menu offers and flags every blank. Most of that menu is built for sites with APIs, auth, or tools. A blog has none, so most of it is noise and some of it is misleading. A short, honest set of signals beats an impressive pile of metadata pointing at things that aren’t there.

Verify it works

npm run build                      # must pass clean

ls dist/blog/*.md                  # one .md per post
head -30 dist/llms.txt             # the curated map
wc -c dist/llms-full.txt           # whole site in one file
cat dist/_headers                  # Link header present

# And after deploy:
curl -sI https://yoursite.com/ | grep -i '^link:'       # RFC 8288 header
curl -s  https://yoursite.com/blog/<id>.md | head        # token-light post

The satisfying check is the byte diff: curl -s <html-page> | wc -c against the .md version. Expect that 60–90% drop.
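The arithmetic behind that check, as a sketch. `percentSmaller` is a hypothetical helper, and the byte counts in the example are invented to show the calculation, not measured from my site.

```typescript
/** Percentage by which the Markdown twin is smaller than the HTML page, to one decimal place. */
export function percentSmaller(htmlBytes: number, markdownBytes: number): number {
  if (htmlBytes <= 0) throw new RangeError("htmlBytes must be positive");
  return Math.round((1 - markdownBytes / htmlBytes) * 1000) / 10;
}

// Invented numbers: a 40,000-byte HTML page against an 8,000-byte .md twin.
const saving = percentSmaller(40_000, 8_000); // 80
```

Feed it the two `wc -c` results for any post to see where it lands in the range.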

The honest part

None of this is a ratified standard. llms.txt is a 2024 proposal; the .md-suffix convention and Content Signals are adopted in practice but guaranteed by no one. robots.txt and Content Signals are preferences, not enforcement. Cloudflare has documented crawlers fetching content through undeclared, rotating user-agents that ignore the rules. You state intent; you don’t get compliance.

And adoption is thin. The big doc platforms publish llms.txt, but no major LLM provider has committed to consuming it. So I won’t pretend this drove traffic. I have no readers to speak of and no proof an agent ever reached for the .md. I came out with a static site that’s now cheap to read, the build code to do it again, and a clear sense of which half of the “agent-ready” checklist is real. For an evening of work, that was the payoff.