---
title: "Measure AI Visibility: A Metric for Each of the 4 Stages"
description: "How to measure AI visibility with one metric per stage (find, research, shortlist, act), and why citation trackers miss what agents actually choose."
slug: /agentic-discovery/measure-ai-visibility
series: The Agentic Discovery Playbook · Part 5 of 6
last_verified: 2026-06-12
---

# Measuring Agentic Visibility (and Proving It Worked)

> **In short:** Measure one number for each step of the agent's journey: **Find** (are you surfaced and ranked?), **Research** (do agents open, read, and trust your docs?), **Shortlist** (does your presence flip the pick?), **Act** (can an agent integrate you end-to-end, and do your rules stick?). Most of these you read from your own surfaces. The open-web part of Find you measure by instrumenting the agent's own run. Citation trackers only count mentions in answers, and mention is not selection.

![The four findings layers where you can intervene in an agent’s research, from page up to user. Per-page (per fetch): claims extracted from a single opened URL; survive the fetch cut and be verifiable. Per-search (per query): a query’s ranked results; rank for a self-authored query. Per-agent (per subagent): a confidence-graded brief; win a sub-category. Synthesis (per conversation): the final answer to the user; be in the verdict.](/agentic-discovery/images/diagram-findings-layers.png "Four layers an agent’s research rolls up through, and where you can intervene at each.")

## Do this now

- [ ] Stand up a weekly tracker: your Context7 search-API position for 5 target task queries, plus your entry's benchmark score and "Updated" field.
- [ ] Add the npm weekly-downloads line for your MCP package. It's the honest ledger every other number calibrates against.
- [ ] Add skills.sh install counts (treat as order-of-magnitude) and a VS Code gallery presence check.
- [ ] Set a freshness alert: entry shows more than 7 days since update → trigger a re-parse.
- [ ] Schedule eval re-runs: monthly, plus within 2 weeks of every major model release.
- [ ] Gate docs and rules-file changes in CI on non-regressing eval deltas.
- [ ] If you pay for an AI citation tracker, rename that dashboard column from "AI visibility" to "mentions."
- [ ] Once a quarter, instrument an agent's open-web "help me choose" run for your category (Birdseye, or manually): log the queries it writes, the pages it opens, and whether your claims survive verification.
- [ ] Date-stamp every snapshot and treat any single reading as ±10%.

> 📥 **Free resource:** [AI Visibility Tracker spreadsheet template](/agentic-discovery/resources/visibility-tracker-template)

Most teams asking "how do we track AI visibility?" buy a citation tracker, watch a mentions graph, and call it measurement. That instrument watches one channel, the answers chatbots give humans. It misses the channel where agents read docs, pick a product, and write the integration. This page is the measurement system for that second channel: what to count at each of the four stages, how often, and which numbers actually move when your [playbook work](/agentic-discovery/agentic-discovery-playbook) lands.

One honesty note up front, repeated wherever we cite our experiments: E1–E5 are pilot-grade, run 2026-06-11. The agent trials use a single model family with n=2–3 per arm, and E5 is a correlational audit of n=17 entries. They establish direction, not population estimates, and we label them that way every time.

## Why is "AI visibility" four different numbers?

Because the agent's four stages (find, research, shortlist, act) each have their own source of truth. A single score hides which stage is broken. Map every metric back to the stage it proves:

| Stage | Question it answers | Core metrics | Source of truth | Plays |
|---|---|---|---|---|
| **Find** (open web) | Do agents surface and trust you in their *own* web search? | Surfaced-rate for agent queries; claim-survival under verification | Agent-run instrumentation (Birdseye) | [Play 1](/agentic-discovery/ai-agent-web-search-and-fetch) |
| **Find** (indexes) | When an agent searches a structured index, do you appear, and do you appear first? | Search position for 5 task queries; registry/listing presence | Docs-retrieval index (e.g. Context7) search API | [Plays 2, 11](/agentic-discovery/ai-agent-registries-and-directories) |
| **Research** | When agents open your docs, do they get clean, current, convincing text? | Fetch/open rate; benchmark score; freshness ("Updated" field); llms.txt + `.md` probe pass rate | Retrieval-index entry page + agent-run trace | [Plays 5–7](/agentic-discovery/llms-txt) |
| **Shortlist** | When it compares the finalists, does your presence change the pick? | Flip rate (choice with vs. without your surfaces); operability signals an agent can see | Flip-rate eval | [Plays 3, 4, 6, 7](/agentic-discovery/mcp-server-distribution) |
| **Act** | Can an agent integrate you end-to-end, correctly, and do your rules persist? | Deprecation-eval emission rate (<5%); integration-completion; time-to-first-successful-API-call; skills.sh installs; npm weekly downloads | Your eval harness + skills.sh + npm | [Plays 8–11](/agentic-discovery/stop-ai-using-deprecated-apis) |

Our pilot experiments rank the later stages highest for *selection* impact: **Act > Shortlist > Research > Find** (E1's 3/3 rules-file flip and E3's 100%→0% directive fix are both Act-stage; E4's operability flip is Shortlist). So weight your dashboard the same way. A skills-install or flip-rate trend deserves more attention than a mentions graph. **Find is the entry gate** (you can't be researched, shortlisted, or integrated if the agent never finds you), and its open-web half is measured differently from the rest, because it depends on the open web, not your own surfaces.

## How do you measure the search/fetch surface? (instrument the agent)

You can't see this layer from your own analytics. Your server logs a bare `WebFetch` hit with no query, no competitors, no verdict. So you measure it by instrumenting the agent's run instead. Our free Mac app **[Birdseye](/agentic-discovery/resources/birdseye)** (agent-observability) reconstructs a research run as four debuggable findings layers, and each pinpoints a different reason you did or didn't get chosen:

| Findings layer | What it shows | Your failure mode if you're missing |
|---|---|---|
| **Per-search** | The agent's self-authored queries and the domains each surfaced | Not surfaced (a ranking/recency problem) |
| **Per-page** | Which domains it actually fetched and the claims it extracted | Surfaced but not opened (a snippet/authority problem) |
| **Per-agent** | A sub-agent's confidence-graded brief on its slice | Lost a sub-category to a better-covered competitor |
| **Synthesis** | The final recommendation and the claims behind it | Out-positioned, or your claim was killed in verification |

To approximate it today without the tool: give an agent your category's "help me choose" task with web search on, and record the queries it writes, the pages it opens, and whether your claims survive. That's the manual version of the four-layer trace. [Play 1](/agentic-discovery/ai-agent-web-search-and-fetch) covers what to fix at each layer. <!-- EXT: Birdseye self-serve search/fetch instrumentation: slot for tool launch -->
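The manual trace is easier to compare quarter over quarter if you record it in a fixed shape. Below is a minimal sketch of one way to do that. The schema (`SearchStep`, `AgentRunTrace`, and every field name) is hypothetical and is not Birdseye's format, and the per-agent layer is left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class SearchStep:
    """Per-search layer: one self-authored query and the domains it surfaced."""
    query: str
    surfaced_domains: list[str]


@dataclass
class AgentRunTrace:
    """Hand-recorded trace of one 'help me choose' run (hypothetical schema)."""
    searches: list[SearchStep]
    fetched_domains: list[str]  # per-page layer: domains the agent opened
    # Your claims mapped to whether each survived the agent's verification.
    claims_verified: dict[str, bool] = field(default_factory=dict)
    in_final_answer: bool = False  # synthesis layer


def surfaced_rate(trace: AgentRunTrace, domain: str) -> float:
    """Share of the agent's queries whose results included `domain`."""
    if not trace.searches:
        return 0.0
    hits = sum(domain in step.surfaced_domains for step in trace.searches)
    return hits / len(trace.searches)


def claim_survival(trace: AgentRunTrace) -> float:
    """Share of your recorded claims that survived verification."""
    if not trace.claims_verified:
        return 0.0
    return sum(trace.claims_verified.values()) / len(trace.claims_verified)


def first_break(trace: AgentRunTrace, domain: str) -> str:
    """Name the earliest layer at which `domain` dropped out of the run."""
    if surfaced_rate(trace, domain) == 0:
        return "not surfaced"
    if domain not in trace.fetched_domains:
        return "surfaced but not opened"
    if not trace.in_final_answer:
        return "opened but not in the verdict"
    return "in the verdict"
```

`first_break` mirrors the failure-mode column of the table above: it reports the earliest layer where you disappeared, which is the layer to fix first.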

## What goes in the weekly tracker?

Five automated checks, one quarterly audit. This is the measurement plan from our [directory-landscape study](/agentic-discovery/ai-agent-registries-and-directories), as of 2026-06-11:

| # | Check | Cadence | Alert |
|---|---|---|---|
| 1 | Context7 search-API position for your 5 target task queries, plus your entry's benchmark and freshness fields | Weekly | Position drops; benchmark −3 pts; "Updated" > 7 days |
| 2 | skills.sh install counts for your skills repo | Weekly | Trend reversal (order-of-magnitude only; observed counts were inconsistent across cached reads) |
| 3 | npm weekly downloads of your MCP package | Weekly | The honest ledger: calibrate every other number against it |
| 4 | VS Code gallery / GitHub MCP Registry presence check | Weekly | Listing missing or stale |
| 5 | PulseMCP visitor estimate | Weekly | Secondary trend line only (the figures are explicitly estimates) |
| 6 | Re-verify the registry tier table itself | Quarterly | The landscape reshuffled twice in the past year |

Why npm is the ledger: Context7's MCP shows 1.14M weekly npm installs but only 6.8K visible "uses" on Smithery. Roughly 99.4% of real distribution bypasses standalone directories. Directory dashboards flatter or starve you. Package installs don't.
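The 99.4% figure is simple arithmetic on the two numbers quoted on this page. A small sketch, assuming weekly npm installs and visible directory "uses" are comparable enough to divide (the function name is illustrative):

```python
def directory_bypass_share(package_installs: int, directory_uses: int) -> float:
    """Fraction of installs that never show up as directory 'uses'."""
    if package_installs <= 0:
        raise ValueError("package_installs must be positive")
    return 1 - directory_uses / package_installs


# Figures quoted on this page, as of 2026-06-11:
# Context7's MCP npm installs per week vs. its visible Smithery "uses".
share = directory_bypass_share(1_136_447, 6_800)
print(f"{share:.1%}")  # 99.4%
```

Run the same division on your own package and listing to see how much of your distribution a directory dashboard can actually observe.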

Why weekly, not quarterly: openclaw lost 50% of its retrieval share in 30 days while still ranked #10. This layer moves in weeks. And treat any single snapshot as ±10%. We observed Stripe's stripe-node snippet count read 130 on one Context7 surface and 207 on another the same day.
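The alert rules for check 1 and the ±10% noise band can be sketched as a comparison between two weekly snapshots. This is an illustration under assumptions: the `Snapshot` fields are hypothetical, "benchmark −3 pts" is read as a drop of three or more points week over week, and how you fetch the values is left out.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Snapshot:
    taken_on: date
    positions: dict[str, int]  # task query -> search position (1 = first)
    benchmark: float
    updated_on: date           # the entry's "Updated" field


def weekly_alerts(prev: Snapshot, curr: Snapshot) -> list[str]:
    """Apply the check-1 alert rules to two consecutive snapshots."""
    alerts = []
    for query, before in prev.positions.items():
        now = curr.positions.get(query)
        if now is None:
            alerts.append(f"no longer listed for {query!r}")
        elif now > before:  # a larger position number means a lower rank
            alerts.append(f"position dropped for {query!r}: {before} -> {now}")
    drop = prev.benchmark - curr.benchmark
    if drop >= 3:
        alerts.append(f"benchmark down {drop:.1f} pts")
    age_days = (curr.taken_on - curr.updated_on).days
    if age_days > 7:
        alerts.append(f"entry not updated for {age_days} days: trigger a re-parse")
    return alerts


def within_noise(old: float, new: float, tolerance: float = 0.10) -> bool:
    """True when a change sits inside the ±10% band of a single reading."""
    if old == 0:
        return new == 0
    return abs(new - old) / abs(old) <= tolerance
```

Use `within_noise` on count-style metrics before reacting to a trend. The 130 vs. 207 snippet-count discrepancy quoted above falls well outside the band, which is a reason to compare readings only from the same surface.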

## How often should you re-run evals?

The four eval suites (deprecation, flip-rate, integration-completion, and the 25-question doc eval) are specified in [Play 11](/agentic-discovery/ai-evals-and-leaderboards). The cadence that keeps them meaningful:

1. **Within 2 weeks of every major model release.** A new frontier model resets what agents memorize and flub; your stale window moves.
2. **Monthly otherwise.** Drift happens without releases: index re-parses, docs edits, registry changes.
3. **On every docs or rules-file change** (internal suites only), wired as a CI gate: surface changes ship only with non-regressing eval deltas.
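The CI gate in step 3 can be as small as a script that compares the candidate run's scores with the stored baseline and fails the build on any regression. A minimal sketch, assuming hypothetical metric names and that your harness writes scores as plain numbers:

```python
import sys

# Hypothetical metric names. True means a higher value is better.
HIGHER_IS_BETTER = {
    "integration_completion": True,
    "flip_rate": True,
    "doc_eval_score": True,
    "deprecated_emission_rate": False,
}


def regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.0,
) -> list[str]:
    """List every metric that moved in the wrong direction beyond `tolerance`."""
    failed = []
    for metric, higher_is_better in HIGHER_IS_BETTER.items():
        if metric not in baseline or metric not in candidate:
            continue  # only compare metrics present in both runs
        delta = candidate[metric] - baseline[metric]
        worse = delta < -tolerance if higher_is_better else delta > tolerance
        if worse:
            failed.append(f"{metric}: {baseline[metric]} -> {candidate[metric]}")
    return failed


def gate(baseline: dict[str, float], candidate: dict[str, float]) -> int:
    """Return a process exit code: 0 lets the change ship, 1 blocks it."""
    failed = regressions(baseline, candidate)
    for line in failed:
        print("REGRESSION", line, file=sys.stderr)
    return 1 if failed else 0
```

In a pipeline step you would load both score files and call `sys.exit(gate(baseline, candidate))`. With small trial counts, a nonzero `tolerance` keeps run-to-run noise from blocking every merge.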

Run n≥5 trials per model per task (deliberately above our pilot n=2–3) and pin model IDs, prompt hashes, and the docs-surface git SHA. A score without a surface version is unreproducible, and unreproducible scores can't prove your work worked.
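One way to make that pinning hard to skip is to refuse to record a score without a manifest. The sketch below is illustrative: the `EvalRunManifest` fields and the example values are assumptions, not a format the eval suites require.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

MIN_TRIALS = 5  # n>=5 trials per model per task


def prompt_hash(prompt: str) -> str:
    """Stable SHA-256 fingerprint of the exact prompt text."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class EvalRunManifest:
    model_id: str       # pinned model identifier, not an alias like "latest"
    task: str
    prompt_sha256: str
    docs_git_sha: str   # commit of the docs surface the agent was shown
    trials: int

    def validate(self) -> None:
        """Reject runs that could not be reproduced or compared later."""
        if self.trials < MIN_TRIALS:
            raise ValueError(f"need at least {MIN_TRIALS} trials, got {self.trials}")
        if not self.model_id or not self.docs_git_sha:
            raise ValueError("model_id and docs_git_sha must both be pinned")

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


# Placeholder values for illustration only.
manifest = EvalRunManifest(
    model_id="example-model-2026-06",
    task="integrate-send-email",
    prompt_sha256=prompt_hash("Add transactional email to this app."),
    docs_git_sha="0123abc",
    trials=5,
)
manifest.validate()
```

Store the JSON next to the scores so any number on the dashboard can be traced back to the model, prompt, and docs commit that produced it.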

## Why do AI citation trackers mislead?

Citation trackers (the Otterly / Ahrefs-checker category) re-run prompt panels against chat models and count how often your brand appears in the generated answers. That is a real channel, worth watching if humans ask chatbots about your category. The problem is what the number gets called: "AI visibility." It is mention frequency in *answers*. Agents choosing products in *code* is a different layer that these tools do not see.

Two pieces of evidence for the gap:

- **The academic finding:** published work reports that LLMs show low consistency between the products they recommend in prose and the ones they actually use when writing code. Mention and selection are weakly coupled behaviors in the same model. <!-- EXT: add formal citation for the academic mention-vs-selection consistency study pre-publish -->
- **Our E4 (pilot-grade, n=2 per arm):** agents picking a transactional email API chose Postmark 2/2 on names alone, then flipped to the alternative 2/2 when told it shipped an MCP server, llms.txt, and agent skills. The deciding fact was *operability*, a property that lives in docs and registries. No citation tracker has a column for it.

There's a subtler inversion, too: high retrieval volume can signal a *weak* training prior. Agents look up what they can't remember (Tailwind's docs are heavily fetched partly because utility classes are unmemorizable). A mentions or fetch count read naively can mark you "winning" on the layer that matters least.

Keep the tracker if you have one. Label the column "mentions," and put the four-layer stack above it. <!-- EXT: mention-vs-selection gap study (Study 6): slot for our own headline data --> <!-- EXT: agent-traffic telemetry panel: slot for future data -->

## What does the dashboard look like?

One spreadsheet, one row per weekly snapshot, four panels matching the four stages:

- **Find:** search position per target query (5 columns), registry presence, plus surfaced-rate from the quarterly agent-run trace.
- **Research:** fetch/open rate, benchmark score, hours-since-update, llms.txt/`.md` probe pass rate.
- **Shortlist:** flip rate from the latest eval run (does your presence change the pick?).
- **Act:** deprecation-eval emission rate, integration-completion rate, time-to-first-successful-API-call, skills.sh installs, npm weekly downloads.

Every panel carries four columns: **baseline** (the value the day you started), **current**, **Δ**, and **alert** (the thresholds from the tracker table). A metadata block pins snapshot date, model versions used in evals, and the docs git SHA. That's the whole template. The [downloadable version](/agentic-discovery/resources/visibility-tracker-template) has the formulas and alert conditions pre-wired.

"Proving it worked" is then a before/after read: docs change ships → re-parse triggers → benchmark and positions move within days (our H2 expectation), eval deltas confirm behavior change, installs follow with a lag. If a number doesn't move, the layer above it tells you where the chain broke.

## The receipts

*Research layer. Skip if you just want the tracker.*

**E5: what correlates with retrieval-quality scores** (n=17 audited Context7 entries, 2026-06-11; Pearson r / Spearman ρ):

| Predictor | r | ρ | Reading |
|---|---|---|---|
| log(hours since update) | −0.46 | **−0.54** | Strongest signal: staleness ↔ lower benchmark |
| Trust score | +0.50 | +0.51 | Partly mechanical (shared quality inputs) |
| log(tokens) | +0.39 | +0.30 | Corpus mass buys little |
| log(snippets) | +0.35 | +0.24 | Same: corpus mass buys little |

Freshest-5 entries averaged benchmark **83.6**; stalest-5 averaged **72.3**, an 11.3-point gap. Caveats: correlational, non-random sample of audited winners plus counterexamples; the freshness–benchmark link could partly reflect that actively maintained products also write better docs.
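To run the same kind of check on your own audit, Pearson r and Spearman ρ need only the standard library. The sketch below uses made-up illustrative values, not the audited E5 data, and computes Spearman as Pearson on average ranks.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values: list[float]) -> list[float]:
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation: Pearson applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Illustrative values only. These are NOT the E5 measurements.
hours_since_update = [2, 12, 48, 200, 900]
benchmark = [88, 84, 85, 76, 70]
log_hours = [math.log(h) for h in hours_since_update]

print(round(pearson(log_hours, benchmark), 2))
print(round(spearman(log_hours, benchmark), 2))
```

Taking the log of hours-since-update, as the table does, keeps a single very stale entry from dominating r. ρ is unaffected by the log because ranks are preserved.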

**E4: mention vs selection, the controlled version.** Task: "choose a transactional email API (Postmark / Mailgun / SendGrid)." Control (names only): Postmark 2/2. Treatment (one added fact: Mailgun ships MCP + llms.txt + skills, a deliberately synthetic stimulus, not a claim about Mailgun): Mailgun 2/2, both rationales citing the agent tooling as decisive. Pilot-grade: single model, n=2 per arm, fact surfaced explicitly rather than discovered organically.
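As a worked example, the E4 result can be expressed as a flip rate. The definition used here, the pick rate in the treatment arm minus the pick rate in the control arm, is one reasonable reading and an assumption of this sketch.

```python
def pick_rate(picks: list[str], product: str) -> float:
    """Share of trials in one arm where the agent chose `product`."""
    if not picks:
        raise ValueError("need at least one trial")
    return sum(p == product for p in picks) / len(picks)


def flip_rate(control: list[str], treatment: list[str], product: str) -> float:
    """Change in pick rate when the product's surfaces are present."""
    return pick_rate(treatment, product) - pick_rate(control, product)


# E4 as reported above: n=2 per arm.
control = ["Postmark", "Postmark"]    # names only
treatment = ["Mailgun", "Mailgun"]    # one added operability fact
print(flip_rate(control, treatment, "Mailgun"))  # 1.0
```

With n=2 per arm the value can only be −1, −0.5, 0, 0.5, or 1, which is why the page treats this result as directional.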

**The distribution ledger:** skills.sh's CLI shows 1,376,225 npm downloads/week; Context7's MCP 1,136,447/week; Smithery's visible "uses" for Context7: 6.8K. Numbers as of 2026-06-11. They're the basis for treating package installs, not directory stats, as ground truth.

**Instrument caveats** (full list in the [Data Room](/agentic-discovery/data)): Context7 measures retrieval demand, not selection; its traffic skews ~73% terminal agents (Claude Code 43.4%) on the TypeScript/React stack; share-of-voice percentages are relative within its top 50; it is a vendor-owned instrument; same-day metric discrepancies were observed. Date-stamp everything you quote.

## FAQ

**How do I track my brand's visibility in ChatGPT?**
Citation trackers re-run prompt panels and count brand mentions in chat answers. That covers the answer channel. For the agent channel (coding agents selecting and integrating products), track registry positions, benchmark and freshness scores, install counts, and your own eval pass rates instead. No off-the-shelf tracker measures those.

**What is the single best metric for AI visibility?**
There isn't one. Visibility is four layers with four sources of truth. If forced to pick two: flip rate (does agent product choice change with your surfaces present?) for selection, and npm weekly downloads of your MCP package for distribution.

**Do AI citation trackers actually work?**
They measure exactly one thing: mention frequency in generated answers. That channel is real. They mislead only when mentions get relabeled "AI visibility": published academic work finds low consistency between what LLMs recommend in prose and use in code, and our pilot E4 flipped selection 2/2 with an operability fact no answer-panel tracker observes.

**How often should I check my Context7 ranking?**
Weekly and automated (positions, benchmark score, and the "Updated" field), with an alert when the entry goes more than 7 days without an update. Monthly is too slow: openclaw lost 50% of its retrieval share inside 30 days.

**How do I prove docs changes improved AI visibility?**
Baseline first, then ship, then re-measure on the same instrument. Index-layer changes show up within a re-parse cycle (days) in benchmark and position; behavior changes show up as eval deltas; distribution follows with a lag in installs. Without the dated baseline you have anecdotes, not proof.

**How do I measure whether AI agents find me in their web search?**
You instrument the agent's run, not your own analytics (which only show a bare fetch hit). Give an agent a "help me choose a [category]" task with web search on and trace four findings layers: which queries surfaced you (per-search), which pages it opened (per-page), how each sub-agent rated you (per-agent), and whether you made the final answer (synthesis). Our Birdseye tool automates this; [Play 1](/agentic-discovery/ai-agent-web-search-and-fetch) explains the fixes.

---

*Last verified 2026-06-12. We re-test the claims on this page quarterly. Changes are logged in the [Data Room](/agentic-discovery/data).*

**Part of [The Complete Playbook to Agentic Discovery](/agentic-discovery).**
