What Is ai-catalog.json? The New Standard for Making Your Product Discoverable to AI Agents

Summary

The ai-catalog.json file is a new open standard from Google, Microsoft, and Hugging Face that acts like a sitemap.xml for AI agents, allowing them to discover and use your product's capabilities.
This file is critical for visibility on the emerging "agentic web," enabling your tools to be found and called by AI agents through dedicated registries.
The representativeQueries field is the most important for discovery, requiring task-based phrases that mirror how an AI agent searches for a tool to solve a problem.
Navigating new standards like Agentic Resource Discovery is complex; Synscribe helps companies implement technical frameworks like ai-catalog.json to gain a competitive edge in AI search.

If sitemap.xml tells Google what pages you have, ai-catalog.json tells AI agents what capabilities you offer. It's a machine-readable file you host on your own domain that describes your tools, APIs, and agents in a structured format so AI agent registries can discover and index them automatically.

This file is the centerpiece of the Agentic Resource Discovery (ARD) specification, an open standard from Google, Microsoft, and Hugging Face. This post covers exactly what ai-catalog.json is, what fields go in it, where to host it, why representativeQueries is the most important field you've never heard of, and how it fits into the broader agentic discovery stack alongside llms.txt and AGENTS.md.

What Is ai-catalog.json?

ai-catalog.json is a JSON file you host at /.well-known/ai-catalog.json on your domain. It describes your product's capabilities — not pages, not marketing copy, but the actual things an AI agent can call: APIs, MCP servers, A2A agents, skills.

Here's the mechanism:

1. You publish the file at the well-known path.
2. ARD registries (crawlers that index agent capabilities, analogous to how Googlebot indexes pages) automatically discover and fetch it.
3. When an AI agent queries a registry for a specific capability ("find me a tool to send transactional emails"), the registry surfaces your entry if it's a semantic match.
4. The agent can then inspect and invoke your tool at runtime, with no pre-installation required.

This is a fundamental shift from the current reality: most so-called AI integrations are just "dumping catalog text into embeddings" once and calling it done. That approach breaks the moment inventory changes or prices update — it's stale data by design. ai-catalog.json moves the model to structured data + real-time function calls, which is what makes agents reliable, as this discussion shows.

Who's behind it?

The ARD spec is Apache 2.0 licensed and currently at v0.9 draft. Co-authors are Junjie Bu (Google), R.V. Guha (Microsoft), and Shaun Smith (Hugging Face). The acknowledgements include AWS, Cisco, GitHub, Salesforce, Snowflake, and Nvidia — broad industry backing from day one.

The sitemap analogy, explained

	`sitemap.xml`	`ai-catalog.json`
What it declares	Your pages	Your capabilities
Who reads it	Search engine crawlers	ARD registry crawlers
Hosted at	`/sitemap.xml` or `/sitemap_index.xml`	`/.well-known/ai-catalog.json`
Format	XML	JSON
Auto-discovered	Yes	Yes

Both are declarative, static (or statically served), and crawled automatically once discoverable. The parallel is intentional — the ARD spec was designed to feel familiar.

How is it different from `llms.txt`?

They serve different layers and are complementary, not competing:

llms.txt is for the retrieval layer — it guides agents that are already browsing your site, pointing them to the best documentation to read.
ai-catalog.json is for the registry layer — it lets agents find your tools before they ever visit your site, through a structured programmatic directory.

If you already have llms.txt, you still need ai-catalog.json for registry-based discoverability. If you only have ai-catalog.json, agents that land on your site directly won't have the retrieval guidance they need. You want both.

Anatomy of an ai-catalog.json File

Below is a complete, annotated example for a fictional email API product called Sendcraft.

{
  "specVersion": "1.0",
  "host": {
    "displayName": "Sendcraft",
    "identifier": "did:web:sendcraft.com"
  },
  "entries": [
    {
      "identifier": "urn:ai:sendcraft.com:server:email",
      "displayName": "Sendcraft Email API",
      "type": "application/mcp-server+json",
      "url": "https://api.sendcraft.com/mcp.json",
      "description": "Send transactional and marketing emails with delivery tracking.",
      "representativeQueries": [
        "send a welcome email to a new user",
        "track whether an email was delivered or bounced",
        "create a reusable email template"
      ],
      "capabilities": ["SendEmail", "TrackDelivery", "ManageTemplates"],
      "tags": ["email", "transactional", "notifications"]
    }
  ]
}

Field-by-Field Breakdown

specVersion. The version of the AI Catalog standard you're conforming to. Currently "1.0". Always set this explicitly.
host. Describes you, the publisher.
- displayName — your human-readable brand name.
- identifier — a machine-readable, globally unique ID. Use a Decentralized Identifier (DID) anchored to your domain: did:web:yourdomain.com. This is the simplest valid form and requires no external registration.
entries. An array where each object is a single capability—one tool, one API, one agent. A single ai-catalog.json can list multiple entries.
identifier. (The most important field for long-term stability) The capability's permanent global ID. Format: urn:ai:<your-domain>:<namespace>:<name>. This URN is stable even if your infrastructure changes. If you move your MCP server to a new URL next year, the identifier stays the same; only the url field changes. Never use an HTTP URL here.
type. The media type of the artifact. Tells the agent what kind of thing it's about to load:
- application/mcp-server+json — an MCP server
- application/a2a-agent-card+json — a Google A2A agent
- application/ai-skill — an agent skill (SKILL.md format)
url vs data. Two mutually exclusive ways to provide the full capability definition. Use exactly one, never both.
- url — a link to the full artifact (e.g. your mcp.json or OpenAPI spec). Preferred. Keeps your catalog file lightweight.
- data — embeds the full artifact JSON inline. Only appropriate for very small, static artifacts.
representativeQueries. (The most underrated field; see "Why representativeQueries Is the SEO of the Agentic Web" below) 2–5 natural language phrases describing how an AI agent would request your tool. This is the primary signal registries use for semantic ranking.
- ❌ "email sending API", "SMTP integration" — these are product names, not tasks
- ✅ "send a welcome email to a new user", "check if an email bounced" — these are agent tasks
capabilities. A string array of the specific functions or tools your server exposes. It enables fast structured filtering by registries without them having to fetch and parse the full artifact at your url.
tags. Additional keywords for structured filtering and categorization. Broadens your surface area in faceted registry search.
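
The rules in this breakdown can be turned into a quick pre-flight check before you run a real validator. The sketch below is only an illustration: check_catalog and URN_PATTERN are hypothetical names, the URN regular expression is an assumption derived from the format described above rather than taken from the spec, and the did:web: check follows this article's recommendation rather than a confirmed schema requirement.

```python
import re

# Assumed URN shape, based on the format described above:
# urn:ai:<your-domain>:<namespace>:<name>
URN_PATTERN = re.compile(r"^urn:ai:[A-Za-z0-9.-]+:[A-Za-z0-9_-]+:[A-Za-z0-9_-]+$")


def check_catalog(catalog: dict) -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if "specVersion" not in catalog:
        problems.append("missing specVersion")
    host = catalog.get("host", {})
    if not str(host.get("identifier", "")).startswith("did:web:"):
        problems.append("host.identifier should be a did:web: identifier")
    entries = catalog.get("entries", [])
    if not entries:
        problems.append("entries must contain at least one capability")
    for i, entry in enumerate(entries):
        label = f"entries[{i}]"
        if not URN_PATTERN.match(str(entry.get("identifier", ""))):
            problems.append(f"{label}: identifier is not a urn:ai: URN")
        # url and data are mutually exclusive: exactly one must be present.
        if ("url" in entry) == ("data" in entry):
            problems.append(f"{label}: provide exactly one of url or data")
        queries = entry.get("representativeQueries", [])
        if not 2 <= len(queries) <= 5:
            problems.append(f"{label}: expected 2-5 representativeQueries")
    return problems


# A trimmed copy of the Sendcraft example from this article.
sendcraft = {
    "specVersion": "1.0",
    "host": {"displayName": "Sendcraft", "identifier": "did:web:sendcraft.com"},
    "entries": [
        {
            "identifier": "urn:ai:sendcraft.com:server:email",
            "displayName": "Sendcraft Email API",
            "type": "application/mcp-server+json",
            "url": "https://api.sendcraft.com/mcp.json",
            "representativeQueries": [
                "send a welcome email to a new user",
                "track whether an email was delivered or bounced",
            ],
        }
    ],
}

print(check_catalog(sendcraft))  # prints []
```

A check like this catches the most common mistakes early, but it is not a substitute for the conformance tooling that ships with the spec.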

Where to Host It and How Agents Find It

The ARD spec defines four discovery mechanisms. Registry crawlers use these in order of priority.

1. Well-Known URI (Primary)

Host the file at:

https://yourdomain.com/.well-known/ai-catalog.json

Crawlers check this path automatically, the same way they check robots.txt. No configuration needed beyond placing the file there. This is the single most important step.
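
To make the lookup concrete, here is a rough sketch of how a crawler might derive and fetch the well-known URL from a bare domain. The function names catalog_url and fetch_catalog are hypothetical, and real registry crawlers may add redirects, caching, or retries that are not modeled here.

```python
import json
import urllib.request
from urllib.parse import urlunsplit

WELL_KNOWN_PATH = "/.well-known/ai-catalog.json"


def catalog_url(domain: str) -> str:
    """Build the well-known catalog URL for a bare domain such as 'yourdomain.com'."""
    return urlunsplit(("https", domain, WELL_KNOWN_PATH, "", ""))


def fetch_catalog(domain: str) -> dict:
    """Fetch and parse the catalog (performs a network request)."""
    with urllib.request.urlopen(catalog_url(domain), timeout=10) as response:
        return json.load(response)


print(catalog_url("yourdomain.com"))
# prints https://yourdomain.com/.well-known/ai-catalog.json
```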

2. Agentmap Directive in robots.txt (Secondary)

Add one line to your existing robots.txt:

# robots.txt
User-agent: *
Disallow: /private/

Agentmap: https://yourdomain.com/.well-known/ai-catalog.json

This acts as an explicit pointer for crawlers that parse robots.txt before checking the well-known path. If you're hosting the file at the standard location, the URL here will be the same — but the directive matters for crawlers that might not auto-check the well-known path.
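
For illustration, a crawler could extract the directive with a few lines of parsing. This is a minimal sketch: find_agentmap_urls is a hypothetical helper, and treating the directive name as case-insensitive is an assumption borrowed from how Sitemap lines are commonly handled, not something confirmed by the spec.

```python
def find_agentmap_urls(robots_txt: str) -> list[str]:
    """Collect the URLs of all Agentmap directives in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop comments; note this would also cut a URL containing '#'.
        line = line.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "agentmap":
            urls.append(value.strip())
    return urls


robots = """\
# robots.txt
User-agent: *
Disallow: /private/

Agentmap: https://yourdomain.com/.well-known/ai-catalog.json
"""

print(find_agentmap_urls(robots))
# prints ['https://yourdomain.com/.well-known/ai-catalog.json']
```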

3. HTML `<link>` Tag (Tertiary)

In your website's <head>:

<head>
  <link rel="ai-catalog" href="https://yourdomain.com/.well-known/ai-catalog.json">
</head>

This is the agentic equivalent of a canonical tag — useful for web-crawling agents that arrive at your homepage and inspect the HTML before querying a registry.
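
As a sketch of what such an agent might do, the snippet below scans a page for the link tag using Python's standard html.parser module. CatalogLinkFinder and find_catalog_links are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class CatalogLinkFinder(HTMLParser):
    """Collects href values from <link rel="ai-catalog"> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated tokens.
        rel_values = (attr_map.get("rel") or "").lower().split()
        if "ai-catalog" in rel_values and attr_map.get("href"):
            self.hrefs.append(attr_map["href"])


def find_catalog_links(html: str) -> list[str]:
    finder = CatalogLinkFinder()
    finder.feed(html)
    return finder.hrefs


page = """
<head>
  <link rel="ai-catalog" href="https://yourdomain.com/.well-known/ai-catalog.json">
</head>
"""

print(find_catalog_links(page))
# prints ['https://yourdomain.com/.well-known/ai-catalog.json']
```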

4. DNS Service Binding (Enterprise)

For large organizations with multiple subdomains or complex infrastructure, the ARD spec supports publishing DNS records that point to the catalog. This is an advanced path — don't start here unless you have a specific need.

Practical advice

Start with methods 1 and 2. They cover approximately 95% of crawler discovery scenarios. Method 3 adds a small amount of coverage. Method 4 is for enterprise environments only.

Hosting requirements

Publicly accessible. No authentication, no login walls.
Content-Type: application/json. The HTTP header must be set correctly.
No JS rendering required. The file must be returned as raw JSON, not rendered client-side.

The file can be static (just drop it in your /.well-known/ directory) or dynamically generated by your server, as long as it meets the above requirements.
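
These requirements are easy to check mechanically once you have a response in hand. The sketch below assumes you have already fetched the URL without credentials and captured the status code, Content-Type header, and body. check_response is a hypothetical helper, and treating HTTP 200 as the expected status is an assumption.

```python
import json


def check_response(status: int, content_type: str, body: bytes) -> list[str]:
    """Return hosting problems for an unauthenticated fetch of the catalog URL."""
    problems = []
    if status != 200:
        problems.append(f"expected HTTP 200, got {status}")
    # Ignore parameters such as "; charset=utf-8" when comparing media types.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type != "application/json":
        problems.append(f"Content-Type is {media_type or 'missing'}, expected application/json")
    try:
        json.loads(body)
    except ValueError:
        # Covers both invalid JSON and undecodable bytes.
        problems.append("body is not raw JSON")
    return problems


print(check_response(200, "application/json; charset=utf-8", b'{"specVersion": "1.0"}'))
# prints []
```

A login wall or a client-rendered page typically fails two or three of these checks at once, for example a 401 status with a text/html body.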

Validate before submitting

Use the official ARD conformance CLI from the ard-spec repository to confirm your file is valid:

# Clone the official spec repository and enter it
git clone https://github.com/ards-project/ard-spec
cd ard-spec

# Run the conformance test against your live URL
./conformance/bin/conformance-test manifest https://yourdomain.com/.well-known/ai-catalog.json

Fix any errors before your file is crawled. An invalid file that returns a 200 status can still be skipped or rejected by registries if it doesn't conform to the schema.

Why `representativeQueries` Is the SEO of the Agentic Web

Most developers will fill out representativeQueries in about 30 seconds with a few product feature names and move on. That's a significant mistake — this field determines your ranking in agent registry search results.

Here's how it works technically: ARD registries take the phrases in your representativeQueries array and convert them into vector embeddings. When an AI agent queries the registry with a natural language task — say, "find me a tool to send transactional emails" — the registry performs a semantic similarity search against those embeddings to rank results.
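
The ranking step can be illustrated with a toy sketch. It is not how any real registry works: word-count vectors stand in for a real embedding model, and vectorize, cosine, best_match_score, and the stop-word list are all assumptions made for the example. Only the overall shape (embed, compare by cosine similarity, rank) mirrors the description above.

```python
import math
import re
from collections import Counter

# Tiny stop-word list so filler words do not dominate the toy vectors.
STOP_WORDS = {"a", "an", "the", "to", "me", "of", "for", "if"}


def vectorize(text: str) -> Counter:
    """Toy stand-in for an embedding model: lower-cased word counts."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def best_match_score(agent_query: str, representative_queries: list[str]) -> float:
    """Score an entry by its closest representative query."""
    query_vec = vectorize(agent_query)
    return max((cosine(query_vec, vectorize(q)) for q in representative_queries), default=0.0)


agent_query = "find me a tool to send transactional emails"
task_phrased = ["send a welcome email to a new user"]
feature_named = ["email sending API"]

# The task-phrased entry scores higher in this toy model.
print(best_match_score(agent_query, task_phrased) > best_match_score(agent_query, feature_named))
# prints True
```

A real embedding model captures far more than word overlap, so the size of the gap will differ in practice. The sketch only shows that ranking depends on how closely your phrases resemble the agent's request.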

If your queries are feature names like "email API" or "SMTP service", the semantic distance between your entry and the agent's task-phrased query will be high. You won't rank at the top, even if your tool is the best fit.

This is task-phrase research — the direct equivalent of keyword research in traditional SEO. Tools like Synscribe's LLM Keyword Platform can help identify these task-based phrases by analyzing search intent across both traditional and AI search engines, giving you a head start.

The strategy

Think like an agent completing a task, not a marketer describing a product.

Agents don't query registries with product names. They query with tasks: "I need to do X." Your representativeQueries need to mirror that framing.

❌ Feature-name framing	✅ Task-phrase framing
`"email sending API"`	`"send a welcome email to a new user"`
`"SMTP integration"`	`"check if a transactional email was delivered"`
`"template management"`	`"create a reusable email template for onboarding"`

Use imperative voice. "Send X", "create Y", "check if Z", "retrieve the current status of X." This directly mirrors how agents formulate internal task descriptions.

Cover distinct use cases. Don't write five variations of "send an email." Cover the breadth of what your tool does — sending, tracking, templating, bounces. Each distinct query expands your semantic coverage in the registry index.

Ask yourself: "What would a coding agent type into a tool registry when it needs to solve this specific problem?" That's your representativeQuery.
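
If you maintain many entries, a crude linter can flag queries that ignore this advice. The sketch below is a heuristic only: lint_queries and the IMPERATIVE_VERBS starter list are hypothetical, and a query that fails the check may still be perfectly good.

```python
# Hypothetical starter list of imperative verbs; extend it for your own domain.
IMPERATIVE_VERBS = {
    "send", "create", "check", "retrieve", "track",
    "list", "update", "delete", "find", "get",
}


def lint_queries(queries: list[str]) -> list[str]:
    """Return warnings for queries that look like feature names or near-duplicates."""
    warnings = []
    leading_verbs = []
    for query in queries:
        words = query.lower().split()
        first = words[0] if words else ""
        if first not in IMPERATIVE_VERBS:
            warnings.append(f"'{query}' does not start with a known imperative verb")
        else:
            leading_verbs.append(first)
    # Several queries opening with the same verb may be variations of one use case.
    for verb in sorted(set(leading_verbs)):
        count = leading_verbs.count(verb)
        if count > 1:
            warnings.append(f"{count} queries start with '{verb}'; check they cover distinct use cases")
    return warnings


good = [
    "send a welcome email to a new user",
    "track whether an email was delivered or bounced",
    "create a reusable email template",
]
print(lint_queries(good))  # prints []
print(len(lint_queries(["email sending API", "SMTP integration"])))  # prints 2
```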

The analogy

representativeQueries is to agent registries what <meta name="description"> was to early Google search results — it's the text that determines whether the "user" (here, the AI agent) decides your result is relevant enough to open. Except in this case, the stakes are higher: if the agent doesn't open your result, it never discovers your tool exists at all.

This is the difference between being discoverable and being chosen. Invest the time here.

How ai-catalog.json Fits the Bigger Picture

Publishing ai-catalog.json is an important step, but it's one layer in a multi-layer stack. Understanding where it sits in the agent decision pipeline helps you avoid over-indexing on it — and helps you identify what else needs to be in place.

The agent decision pipeline (simplified)

When an AI agent needs an external tool, it runs through a pipeline to find one.

1. Training prior. What does the base model already know? If your product is well-known and frequently discussed in training data, the model may reach for it directly.
2. Web search + fetch. The agent searches for tools in your category. Research suggests agents visit only ~6% of surfaced domains; most results are skipped after the snippet.
3. Retrieval. The agent visits your site, checks for llms.txt, and reads your documentation. This is where llms.txt lives.
4. Registry query. This is where ai-catalog.json fires. The agent queries a trusted ARD registry for a tool matching its need. Google's Agent Registry (part of the Gemini Enterprise Agent Platform) is the flagship enterprise implementation of this layer. If your ai-catalog.json is indexed, you surface here.
5. Environment. Files like AGENTS.md or CLAUDE.md in a code repository are loaded at session start, before anything else. These provide explicit instructions and pre-loaded tool definitions.

Where `ai-catalog.json` gives you the most leverage

Step 4 (registry query) is the most direct discovery path. Unlike web search, where an agent might encounter hundreds of results and only visit a handful, a registry query is targeted and structured. If your ai-catalog.json is well-formed and your representativeQueries are strong, you can surface in front of agents that would never have found you through web search.

But step 4 doesn't replace the others. A company that only publishes ai-catalog.json but has no llms.txt, no AGENTS.md, and no agent-readable documentation will be discoverable but not chosen. The selection decision — which tool the agent actually calls — happens across all five steps, and steps 1–3 often have more bearing on the final pick than registry discovery alone.

The complete implementation stack

File	Layer	What it does
`/.well-known/ai-catalog.json`	Registry	Structured capability declaration for ARD registries
`/llms.txt`	Retrieval	Guides agents browsing your site to the right docs
`AGENTS.md` (in repo root)	Environment	Pre-session context for agents working in your codebase

These serve different parts of the pipeline. Implement all three for full coverage. For a detailed breakdown of the full stack and how each file interacts with agent decision-making, see the Agentic Discovery Playbook.

Quick Reference: ai-catalog.json Implementation Checklist

Use this to make sure you've covered everything before your file goes live.

Create ai-catalog.json with specVersion, host (including a did:web: identifier), and at least one entry
Set identifier in urn:ai:<yourdomain>:<namespace>:<name> format for every entry — never use an HTTP URL
Write 2–5 task-phrased representativeQueries per capability (imperative voice, agent task framing)
Use url or data per entry — never both
Host at /.well-known/ai-catalog.json — public, no auth, Content-Type: application/json
Add Agentmap: https://yourdomain.com/.well-known/ai-catalog.json to your robots.txt
Validate with the ARD conformance CLI before considering it done
Add <link rel="ai-catalog" href="..."> to your site's <head> for HTML-layer discoverability

Gain Your Edge on the Agentic Web

ai-catalog.json is your official declaration to the agentic web about what your tools can do. It moves your product from being something an agent might scrape to something an agent can discover on demand through a structured, semantic registry query.

The web had sitemap.xml to make pages discoverable. The agentic web has ai-catalog.json to make capabilities discoverable. The transition is the same—the only question is whether you publish the file before your competitors do. Implementing these new standards is complex but crucial. To ensure your full agentic discovery stack is implemented correctly, explore Synscribe's solutions.

Frequently Asked Questions

What is ai-catalog.json?

ai-catalog.json is a machine-readable file that declares your product's capabilities, like APIs and tools, to AI agents. Hosted on your domain, it allows AI agent registries to automatically discover and index what your service can do. This is a core part of the Agentic Resource Discovery (ARD) specification, acting like a sitemap.xml but for agent capabilities instead of web pages.

Why is having an ai-catalog.json file important for my business?

It makes your tools and APIs discoverable to AI agents through structured registries, like Google's Agent Registry. Without it, your product is invisible to agents performing targeted capability searches. This file is your entry ticket to the agentic web, ensuring your services can be programmatically invoked by AI agents to solve user tasks in real-time.

How does ai-catalog.json differ from sitemap.xml?

While sitemap.xml lists web pages for search engines, ai-catalog.json lists functional capabilities (APIs, tools) for AI agent registries. Both files help with automated discovery, but they serve different consumers and declare different types of resources. One is for human-readable content discovery (pages), while the other is for machine-callable function discovery (capabilities).

Where is the correct place to host the ai-catalog.json file?

The primary and standard location for your ai-catalog.json file is in the .well-known directory at the root of your domain. The full path should be https://yourdomain.com/.well-known/ai-catalog.json. This well-known URI allows ARD registry crawlers to find the file automatically. You should also add an Agentmap directive in your robots.txt pointing to this location.

What makes the representativeQueries field so critical?

The representativeQueries field is the most important factor for ranking in AI agent registry searches. It provides natural language examples of tasks your tool can perform. AI registries use these queries to perform semantic similarity searches against an agent's request. Well-crafted, task-oriented queries are the direct equivalent of keyword research for traditional SEO.

Do I need both ai-catalog.json and llms.txt?

Yes, you should implement both because they serve different layers of the agent discovery process. ai-catalog.json is for the registry layer, allowing agents to discover your tools before visiting your site. llms.txt is for the retrieval layer, guiding agents that are already browsing your site to the correct documentation. They are complementary, not competing.

How can I check if my ai-catalog.json file is valid?

You can validate your file using the official Agentic Resource Discovery (ARD) conformance command-line tool. The tool is available in the official ard-spec GitHub repository. Running the conformance test against your live URL ensures your file meets the schema requirements before it gets crawled by registries.