
- The ai-catalog.json file is a new open standard from Google, Microsoft, and Hugging Face that acts like a sitemap.xml for AI agents, allowing them to discover and use your product's capabilities.
- The representativeQueries field is the most important for discovery, requiring task-based phrases that mirror how an AI agent searches for a tool to solve a problem.
- Publish ai-catalog.json to gain a competitive edge in AI search.

If sitemap.xml tells Google what pages you have, ai-catalog.json tells AI agents what capabilities you offer. It's a machine-readable file you host on your own domain that describes your tools, APIs, and agents in a structured format so AI agent registries can discover and index them automatically.
This file is the centerpiece of the Agentic Resource Discovery (ARD) specification, an open standard from Google, Microsoft, and Hugging Face. This post covers exactly what ai-catalog.json is, what fields go in it, where to host it, why representativeQueries is the most important field you've never heard of, and how it fits into the broader agentic discovery stack alongside llms.txt and AGENTS.md.
ai-catalog.json is a JSON file you host at /.well-known/ai-catalog.json on your domain. It describes your product's capabilities — not pages, not marketing copy, but the actual things an AI agent can call: APIs, MCP servers, A2A agents, skills.
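Here's the mechanism: you publish the file on your domain, ARD registry crawlers discover and index it, and an agent that needs a tool queries the registry, then loads the artifact your entry points to.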
This is a fundamental shift from the current reality: most so-called AI integrations are just "dumping catalog text into embeddings" once and calling it done. That approach breaks the moment inventory changes or prices update — it's stale data by design. ai-catalog.json moves the model to structured data + real-time function calls, which is what makes agents reliable, as this discussion shows.
The ARD spec is Apache 2.0 licensed and currently at v0.9 draft. Co-authors are Junjie Bu (Google), R.V. Guha (Microsoft), and Shaun Smith (Hugging Face). The acknowledgements include AWS, Cisco, GitHub, Salesforce, Snowflake, and Nvidia — broad industry backing from day one.
| | sitemap.xml | ai-catalog.json |
|---|---|---|
| What it declares | Your pages | Your capabilities |
| Who reads it | Search engine crawlers | ARD registry crawlers |
| Hosted at | /sitemap.xml or /sitemap_index.xml | /.well-known/ai-catalog.json |
| Format | XML | JSON |
| Auto-discovered | Yes | Yes |
Both are declarative, static (or statically served), and crawled automatically once discoverable. The parallel is intentional — the ARD spec was designed to feel familiar.
What about llms.txt?

They serve different layers and are complementary, not competing:

- llms.txt is for the retrieval layer: it guides agents that are already browsing your site, pointing them to the best documentation to read.
- ai-catalog.json is for the registry layer: it lets agents find your tools before they ever visit your site, through a structured programmatic directory.

If you already have llms.txt, you still need ai-catalog.json for registry-based discoverability. If you only have ai-catalog.json, agents that land on your site directly won't have the retrieval guidance they need. You want both.
Below is a complete, annotated example for a fictional email API product called Sendcraft.
```json
{
  "specVersion": "1.0",
  "host": {
    "displayName": "Sendcraft",
    "identifier": "did:web:sendcraft.com"
  },
  "entries": [
    {
      "identifier": "urn:ai:sendcraft.com:server:email",
      "displayName": "Sendcraft Email API",
      "type": "application/mcp-server+json",
      "url": "https://api.sendcraft.com/mcp.json",
      "description": "Send transactional and marketing emails with delivery tracking.",
      "representativeQueries": [
        "send a welcome email to a new user",
        "track whether an email was delivered or bounced",
        "create a reusable email template"
      ],
      "capabilities": ["SendEmail", "TrackDelivery", "ManageTemplates"],
      "tags": ["email", "transactional", "notifications"]
    }
  ]
}
```
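A catalog like the one above can be sanity-checked locally before it is published. The sketch below encodes only the rules this post describes for each field (a did:web: host identifier, URN-shaped entry identifiers, exactly one of url or data, and 2 to 5 representativeQueries). It is not the official schema, and the names `check_catalog` and `URN_PATTERN` are made up for illustration.

```python
import re

# URN shape described in this post: urn:ai:<your-domain>:<namespace>:<name>
URN_PATTERN = re.compile(r"^urn:ai:[^:\s]+:[^:\s]+:[^:\s]+$")


def check_catalog(catalog: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []

    if "specVersion" not in catalog:
        problems.append("missing specVersion")

    host_id = catalog.get("host", {}).get("identifier", "")
    if not host_id.startswith("did:web:"):
        problems.append("host.identifier should be a did:web: identifier")

    entries = catalog.get("entries", [])
    if not entries:
        problems.append("entries must contain at least one capability")

    for index, entry in enumerate(entries):
        label = f"entries[{index}]"
        if not URN_PATTERN.match(entry.get("identifier", "")):
            problems.append(f"{label}: identifier is not a urn:ai:<domain>:<namespace>:<name> URN")
        # url and data are mutually exclusive: exactly one must be present.
        if ("url" in entry) == ("data" in entry):
            problems.append(f"{label}: provide exactly one of url or data")
        query_count = len(entry.get("representativeQueries", []))
        if not 2 <= query_count <= 5:
            problems.append(f"{label}: expected 2-5 representativeQueries, found {query_count}")

    return problems
```

Loading the file with `json.load` and passing the result to `check_catalog` gives a quick pre-flight list of problems. Treat it as a complement to the official conformance test, not a replacement.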
specVersion. The version of the AI Catalog standard you're conforming to. Currently "1.0". Always set this explicitly.

host. Describes you, the publisher.

- displayName: your human-readable brand name.
- identifier: a machine-readable, globally unique ID. Use a Decentralized Identifier (DID) anchored to your domain: did:web:yourdomain.com. This is the simplest valid form and requires no external registration.

entries. An array where each object is a single capability: one tool, one API, one agent. A single ai-catalog.json can list multiple entries.

identifier. (The most important field) The capability's permanent global ID. Format: urn:ai:<your-domain>:<namespace>:<name>. This URN is stable even if your infrastructure changes. If you move your MCP server to a new URL next year, the identifier stays the same and only the url field changes. Never use an HTTP URL here.

type. The media type of the artifact. Tells the agent what kind of thing it's about to load:

- application/mcp-server+json: an MCP server
- application/a2a-agent-card+json: a Google A2A agent
- application/ai-skill: an agent skill (SKILL.md format)

url vs data. Two mutually exclusive ways to provide the full capability definition. Use exactly one, never both.

- url: a link to the full artifact (e.g. your mcp.json or OpenAPI spec). Preferred. Keeps your catalog file lightweight.
- data: embeds the full artifact JSON inline. Only appropriate for very small, static artifacts.

representativeQueries. (The most underrated field; see Section 4) 2–5 natural language phrases describing how an AI agent would request your tool. This is the primary signal registries use for semantic ranking.

- ❌ "email sending API", "SMTP integration": these are product names, not tasks
- ✅ "send a welcome email to a new user", "check if an email bounced": these are agent tasks

capabilities. A string array of the specific functions or tools your server exposes. It enables fast structured filtering by registries without them having to fetch and parse the full artifact at your url.

tags. Additional keywords for structured filtering and categorization. Broadens your surface area in faceted registry search.

The ARD spec defines four discovery mechanisms. Registry crawlers use these in order of priority.
Host the file at:
https://yourdomain.com/.well-known/ai-catalog.json
Crawlers check this path automatically, the same way they check robots.txt. No configuration needed beyond placing the file there. This is the single most important step.
Add one line to your existing robots.txt:
```
# robots.txt
User-agent: *
Disallow: /private/
Agentmap: https://yourdomain.com/.well-known/ai-catalog.json
```
This acts as an explicit pointer for crawlers that parse robots.txt before checking the well-known path. If you're hosting the file at the standard location, the URL here will be the same — but the directive matters for crawlers that might not auto-check the well-known path.
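To see how a crawler might pick up that pointer, here is a minimal sketch that extracts Agentmap URLs from a robots.txt body. The function name `find_agentmap_urls` is hypothetical. Matching the field name case-insensitively and stripping `#` comments are assumptions borrowed from how robots.txt fields are usually handled, not behavior this post confirms for ARD crawlers.

```python
def find_agentmap_urls(robots_txt: str) -> list[str]:
    """Collect the URL from every Agentmap line in a robots.txt body."""
    urls = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        # partition(":") splits at the first colon only, so "https://..." stays intact.
        field, separator, value = line.partition(":")
        if separator and field.strip().lower() == "agentmap" and value.strip():
            urls.append(value.strip())
    return urls
```

Running it against your own robots.txt is a quick way to confirm the directive is present and spelled the way you intended.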
<link> Tag (Tertiary)

In your website's <head>:

```html
<head>
  <link rel="ai-catalog" href="https://yourdomain.com/.well-known/ai-catalog.json">
</head>
```
This is the agentic equivalent of a canonical tag — useful for web-crawling agents that arrive at your homepage and inspect the HTML before querying a registry.
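A web-crawling agent inspecting that HTML could find the pointer along these lines. This is a sketch using Python's standard `html.parser`. The class and function names are illustrative, and real agents may parse pages quite differently.

```python
from html.parser import HTMLParser


class CatalogLinkFinder(HTMLParser):
    """Collects href values from <link rel="ai-catalog"> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel may hold several space-separated tokens, so compare token by token.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "ai-catalog" in rel_tokens and attributes.get("href"):
            self.hrefs.append(attributes["href"])


def find_catalog_links(html: str) -> list[str]:
    """Return every ai-catalog link target found in the given HTML."""
    finder = CatalogLinkFinder()
    finder.feed(html)
    return finder.hrefs
```

Feeding your rendered homepage through `find_catalog_links` confirms the tag survives your templating and build pipeline.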
For large organizations with multiple subdomains or complex infrastructure, the ARD spec supports publishing DNS records that point to the catalog. This is an advanced path — don't start here unless you have a specific need.
Start with methods 1 and 2. They cover approximately 95% of crawler discovery scenarios. Method 3 adds a small amount of coverage. Method 4 is for enterprise environments only.
The served file must meet these requirements:

- Public, with no authentication required.
- Content-Type: application/json. The HTTP header must be set correctly.

The file can be static (just drop it in your /.well-known/ directory) or dynamically generated by your server, as long as it meets the above requirements.
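These serving requirements are easy to check against a fetched response. The sketch below assumes you already have the status code, headers, and body from whichever HTTP client you use. `check_catalog_response` is a hypothetical helper, and treating 401 and 403 as "requires authentication" is an assumption about how a misconfigured server would respond.

```python
import json


def check_catalog_response(status: int, headers: dict[str, str], body: bytes) -> list[str]:
    """Return problems found in an HTTP response for the catalog URL."""
    problems = []

    if status in (401, 403):
        problems.append(f"status {status}: the file must be readable without authentication")
    elif status != 200:
        problems.append(f"status {status}: expected 200")

    # Header names are case-insensitive, so normalise before looking one up.
    normalised = {name.lower(): value for name, value in headers.items()}
    content_type = normalised.get("content-type", "")
    # Ignore parameters such as "; charset=utf-8" when comparing the media type.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type != "application/json":
        problems.append(f"Content-Type is {content_type!r}, expected application/json")

    try:
        json.loads(body)
    except ValueError:
        problems.append("body is not valid JSON")

    return problems
```

Keeping the check as a pure function over status, headers, and body means it works the same for a static file behind a CDN and for a dynamically generated response.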
Use the official ARD conformance CLI from the ard-spec repository to confirm your file is valid:
```shell
# Clone the official spec repository and move into it
git clone https://github.com/ards-project/ard-spec
cd ard-spec
# Run the conformance test against your live URL
./conformance/bin/conformance-test manifest https://yourdomain.com/.well-known/ai-catalog.json
```
Fix any errors before your file is crawled. An invalid file that returns a 200 status can still be skipped or rejected by registries if it doesn't conform to the schema.
representativeQueries Is the SEO of the Agentic Web

Most developers will fill out representativeQueries in about 30 seconds with a few product feature names and move on. That's a significant mistake: this field determines your ranking in agent registry search results.
Here's how it works technically: ARD registries take the phrases in your representativeQueries array and convert them into vector embeddings. When an AI agent queries the registry with a natural language task — say, "find me a tool to send transactional emails" — the registry performs a semantic similarity search against those embeddings to rank results.
If your queries are feature names like "email API" or "SMTP service", the semantic distance between your entry and the agent's task-phrased query will be high. You won't rank at the top, even if your tool is the best fit.
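The effect can be illustrated with a toy ranking. The sketch below deliberately swaps real vector embeddings for a much cruder stand-in, cosine similarity over word counts, so it only captures shared vocabulary and not meaning. The function names and the sample agent query are invented for the example, and actual registry scoring will differ.

```python
import math
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    """Lowercase the text and count its words (a crude stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match_score(agent_query: str, representative_queries: list[str]) -> float:
    """Score an entry by its closest representative query."""
    query_vector = word_counts(agent_query)
    return max(
        (cosine_similarity(query_vector, word_counts(phrase)) for phrase in representative_queries),
        default=0.0,
    )


agent_query = "send a welcome email to a user who just signed up"
task_phrased = ["send a welcome email to a new user", "check if an email bounced"]
feature_named = ["email sending API", "SMTP integration"]

task_score = best_match_score(agent_query, task_phrased)
feature_score = best_match_score(agent_query, feature_named)
```

Even with this simple measure, the task-phrased entry scores well above the feature-named one for the same agent query, which is the gap the paragraph above describes.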
This is task-phrase research — the direct equivalent of keyword research in traditional SEO. Tools like Synscribe's LLM Keyword Platform can help identify these task-based phrases by analyzing search intent across both traditional and AI search engines, giving you a head start.
Think like an agent completing a task, not a marketer describing a product.
Agents don't query registries with product names. They query with tasks: "I need to do X." Your representativeQueries need to mirror that framing.
| ❌ Feature-name framing | ✅ Task-phrase framing |
|---|---|
| "email sending API" | "send a welcome email to a new user" |
| "SMTP integration" | "check if a transactional email was delivered" |
| "template management" | "create a reusable email template for onboarding" |
Use imperative voice. "Send X", "create Y", "check if Z", "retrieve the current status of X." This directly mirrors how agents formulate internal task descriptions.
Cover distinct use cases. Don't write five variations of "send an email." Cover the breadth of what your tool does — sending, tracking, templating, bounces. Each distinct query expands your semantic coverage in the registry index.
Ask yourself: "What would a coding agent type into a tool registry when it needs to solve this specific problem?" That's your representativeQuery.
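These guidelines can be turned into a rough lint pass. The heuristic below is an assumption-laden sketch: the verb list is a hypothetical starter set you would extend for your own domain, and flagging a repeated leading verb is only a crude proxy for "cover distinct use cases".

```python
# Hypothetical starter list of imperative verbs; extend it for your own domain.
IMPERATIVE_VERBS = {"send", "create", "check", "retrieve", "track", "list", "update", "delete", "find", "get"}


def lint_queries(queries: list[str]) -> list[str]:
    """Flag representativeQueries that look like feature names or near-duplicates."""
    warnings = []
    seen_verbs = set()
    for query in queries:
        words = query.lower().split()
        first_word = words[0] if words else ""
        if first_word not in IMPERATIVE_VERBS:
            warnings.append(f"{query!r} does not start with a known imperative verb")
            continue
        if first_word in seen_verbs:
            warnings.append(f"{query!r} repeats the verb {first_word!r}; consider a distinct use case")
        seen_verbs.add(first_word)
    return warnings
```

A warning here is a prompt to reread the query, not proof that it is wrong; a legitimate task phrase can start with a verb the list does not know about.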
representativeQueries is to agent registries what <meta name="description"> was to early Google search results — it's the text that determines whether the "user" (here, the AI agent) decides your result is relevant enough to open. Except in this case, the stakes are higher: if the agent doesn't open your result, it never discovers your tool exists at all.
This is the difference between being discoverable and being chosen. Invest the time here.
Publishing ai-catalog.json is an important step, but it's one layer in a multi-layer stack. Understanding where it sits in the agent decision pipeline helps you avoid over-indexing on it — and helps you identify what else needs to be in place.
When an AI agent needs an external tool, it runs through a pipeline to find one.
1. Training prior. What does the base model already know? If your product is well-known and frequently discussed in training data, the model may reach for it directly.
2. Web search + fetch. The agent searches for tools in your category. Research suggests agents visit only ~6% of surfaced domains; most results are skipped after the snippet.
3. Retrieval. The agent visits your site, checks for llms.txt, and reads your documentation. This is where llms.txt lives.
4. Registry query. This is where ai-catalog.json fires. The agent queries a trusted ARD registry for a tool matching its need. Google's Agent Registry (part of the Gemini Enterprise Agent Platform) is the flagship enterprise implementation of this layer. If your ai-catalog.json is indexed, you surface here.
5. Environment. Files like AGENTS.md or CLAUDE.md in a code repository are loaded at session start, before anything else. These provide explicit instructions and pre-loaded tool definitions.
Where ai-catalog.json gives you the most leverage

Step 4 (registry query) is the most direct discovery path. Unlike web search, where an agent might encounter hundreds of results and only visit a handful, a registry query is targeted and structured. If your ai-catalog.json is well-formed and your representativeQueries are strong, you can surface in front of agents that would never have found you through web search.
But step 4 doesn't replace the others. A company that only publishes ai-catalog.json but has no llms.txt, no AGENTS.md, and no agent-readable documentation will be discoverable but not chosen. The selection decision — which tool the agent actually calls — happens across all five steps, and steps 1–3 often have more bearing on the final pick than registry discovery alone.
| File | Layer | What it does |
|---|---|---|
| /.well-known/ai-catalog.json | Registry | Structured capability declaration for ARD registries |
| /llms.txt | Retrieval | Guides agents browsing your site to the right docs |
| AGENTS.md (in repo root) | Environment | Pre-session context for agents working in your codebase |
These serve different parts of the pipeline. Implement all three for full coverage. For a detailed breakdown of the full stack and how each file interacts with agent decision-making, see the Agentic Discovery Playbook.
Use this to make sure you've covered everything before your file goes live.
- Create ai-catalog.json with specVersion, host (including a did:web: identifier), and at least one entry
- Set an identifier in urn:ai:<yourdomain>:<namespace>:<name> format for every entry; never use an HTTP URL
- Write 2–5 representativeQueries per capability (imperative voice, agent task framing)
- Provide either url or data per entry, never both
- Host the file at /.well-known/ai-catalog.json: public, no auth, Content-Type: application/json
- Add Agentmap: https://yourdomain.com/.well-known/ai-catalog.json to your robots.txt
- Add <link rel="ai-catalog" href="..."> to your site's <head> for HTML-layer discoverability

ai-catalog.json is your official declaration to the agentic web about what your tools can do. It moves your product from being something an agent might scrape to something an agent can discover on demand through a structured, semantic registry query.
The web had sitemap.xml to make pages discoverable. The agentic web has ai-catalog.json to make capabilities discoverable. The transition is the same—the only question is whether you publish the file before your competitors do. Implementing these new standards is complex but crucial. To ensure your full agentic discovery stack is implemented correctly, explore Synscribe's solutions.
What is ai-catalog.json?

ai-catalog.json is a machine-readable file that declares your product's capabilities, like APIs and tools, to AI agents. Hosted on your domain, it allows AI agent registries to automatically discover and index what your service can do. This is a core part of the Agentic Resource Discovery (ARD) specification, acting like a sitemap.xml but for agent capabilities instead of web pages.

Why do I need ai-catalog.json?

It makes your tools and APIs discoverable to AI agents through structured registries, like Google's Agent Registry. Without it, your product is invisible to agents performing targeted capability searches. This file is your entry ticket to the agentic web, ensuring your services can be programmatically invoked by AI agents to solve user tasks in real time.

How is ai-catalog.json different from sitemap.xml?

While sitemap.xml lists web pages for search engines, ai-catalog.json lists functional capabilities (APIs, tools) for AI agent registries. Both files help with automated discovery, but they serve different consumers and declare different types of resources. One is for human-readable content discovery (pages), while the other is for machine-callable function discovery (capabilities).

Where should I host ai-catalog.json?

The primary and standard location for your ai-catalog.json file is in the .well-known directory at the root of your domain. The full path should be https://yourdomain.com/.well-known/ai-catalog.json. This well-known URI allows ARD registry crawlers to find the file automatically. You should also add an Agentmap directive in your robots.txt pointing to this location.

Why is representativeQueries so important?

The representativeQueries field is the most important factor for ranking in AI agent registry searches. It provides natural language examples of tasks your tool can perform. AI registries use these queries to perform semantic similarity searches against an agent's request. Well-crafted, task-oriented queries are the direct equivalent of keyword research for traditional SEO.

Do I need both ai-catalog.json and llms.txt?

Yes, you should implement both because they serve different layers of the agent discovery process. ai-catalog.json is for the registry layer, allowing agents to discover your tools before visiting your site. llms.txt is for the retrieval layer, guiding agents that are already browsing your site to the correct documentation. They are complementary, not competing.

How do I validate my ai-catalog.json file?

You can validate your file using the official Agentic Resource Discovery (ARD) conformance command-line tool. The tool is available in the official ard-spec GitHub repository. Running the conformance test against your live URL ensures your file meets the schema requirements before it gets crawled by registries.