---
title: Inside Convex's Eval-Driven Agent Strategy
description: How Convex tunes agent rules files with evals, runs a public LLM leaderboard, and scored 91.6 on Context7, the highest benchmark we observed.
slug: /agentic-discovery/case-studies/convex
series: The Agentic Discovery Playbook · Case Study
last_verified: 2026-06-11
---

# Inside Convex's Eval-Driven Agent Strategy

> **The lesson:** Convex runs agent compatibility like an engineering discipline. Rules files tuned with rigorous evals. A public LLM leaderboard as proof. A CLI that keeps every project's AGENTS.md current. A curated docs-site index entry that outscores its own repo by 11.7 points. Measure first, then ship surfaces. The order is the strategy.

## At a glance

| | |
|---|---|
| Category | Backend platform (reactive database + serverless functions) |
| Context7 benchmark, docs site (`websites/convex_dev`) | **91.6**, the highest we observed across 18 audited products (2026-06-11) |
| Context7 benchmark, repo (`get-convex/convex-backend`) | 79.9, same day. An 11.7-point gap |
| Docs-site entry size | 1,245 snippets / 100,248 tokens, last updated 1 week before the snapshot |
| Agent-skills installs (org total) | ~362.1K (order of magnitude only; telemetry-based) |

## What they built

Convex runs the most disciplined agentic-discovery program we found in 18 product audits. Most teams ship an llms.txt and stop. Convex closed the loop. Every agent-facing surface is tested against evals, and the eval results are public.

The center of the system is `convex_rules.txt`, the AI rules file Convex describes as tuned "using rigorous evals." The evals are open-sourced. The results power a public leaderboard at convex.dev/llm-leaderboard that ranks how well frontier models write Convex code. The rules files aren't copy left to age in a repo. They're the output of a measurement pipeline.

| Surface | What Convex ships |
|---|---|
| Rules files | `convex_rules.txt`, tuned "using rigorous evals" |
| Public evals | convex.dev/llm-leaderboard + open evals repo |
| Context-file CLI | `npx convex ai-files` writes *managed sections* into AGENTS.md / CLAUDE.md; per-agent targeting via convex.json (`"agents": ["claude-code","codex","cursor"]`) |
| Agent Skills | `npx skills add get-convex/agent-skills` → `/convex-quickstart`, `/convex-setup-auth` |
| Deployment MCP | `npx convex mcp start`; production access gated behind `--dangerously-enable-production-deployments` |
| Per-tool docs | Dedicated pages for Claude Code, Codex, Cursor, Copilot, and Conductor |
| Background agents | Recipes for scoped throwaway deployments and deployment-scoped deploy keys |

Two design choices stand out. First, the managed-section pattern. `npx convex ai-files` maintains delimited blocks inside AGENTS.md and CLAUDE.md and updates them idempotently. So Convex's guidance in a user's repo stays current without anyone re-pasting. That's a durable context presence, not a one-time paste. Second, the safety rail on the MCP. Production capability sits behind a deliberately scary flag name. That makes the default install safe enough that docs can recommend it without caveats.
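
The managed-section idea can be sketched in a few lines. This is a minimal illustration, not Convex's implementation: the marker strings and the `upsertManagedSection` name are hypothetical, and the delimiters `npx convex ai-files` actually writes may look different. What it shows is the property that matters: running the update twice yields the same file, and text outside the markers is never touched.

```typescript
// Hypothetical markers, chosen for this sketch only.
const MANAGED_START = "<!-- example:managed:start -->";
const MANAGED_END = "<!-- example:managed:end -->";

/**
 * Insert or replace the managed block inside a context file's text
 * (for example the contents of AGENTS.md). User-written content
 * outside the markers is preserved as-is.
 */
function upsertManagedSection(fileText: string, guidance: string): string {
  const block = MANAGED_START + "\n" + guidance.trim() + "\n" + MANAGED_END;
  const start = fileText.indexOf(MANAGED_START);
  const end = fileText.indexOf(MANAGED_END);

  if (start !== -1 && end > start) {
    // Markers already present: swap only the span between them.
    return fileText.slice(0, start) + block + fileText.slice(end + MANAGED_END.length);
  }

  // First run: append the block after the existing content.
  const base = fileText.replace(/\s+$/, "");
  return (base.length > 0 ? base + "\n\n" : "") + block + "\n";
}

// Usage: the second call with the same guidance is a no-op.
const firstPass = upsertManagedSection("# My project\n", "Prefer typed validators.");
const secondPass = upsertManagedSection(firstPass, "Prefer typed validators.");
console.log(firstPass === secondPass); // true
```

Because the update is keyed on the markers rather than on the previous guidance text, a CLI built this way can ship new guidance on every release without diffing against what it wrote before.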

And Convex argues the product itself is part of the strategy:

> "Queries are just TypeScript... AI can generate database code using the large training set of TypeScript code without switching to SQL."
> (docs.convex.dev/ai)

That's agent-friendliness as an architecture argument, not a docs argument. Lean on the language the models already know best.

The background-agent recipes round out the picture. Convex documents how to provision *scoped throwaway deployments* with deployment-scoped deploy keys. So an autonomous coding agent can be handed a real backend it cannot use to damage anything that matters. Most vendors haven't yet acknowledged that background agents exist. Convex ships provisioning instructions for them.

## The receipts

All figures observed 2026-06-11 via Context7's library pages and search API. Single-day snapshots carry ±10% error bars (we watched Stripe's snippet count differ between two surfaces on the same day).

**The standout number: 91.6 vs 79.9.** Convex maintains two Context7 entries. The comparison is the cleanest controlled experiment in our dataset: same product, same org, same trust score, same freshness.

| Entry | Tokens | Snippets | Trust | Benchmark | Updated |
|---|---|---|---|---|---|
| `websites/convex_dev` (docs site) | 100,248 | 1,245 | 9.9 | **91.6** | 1 week |
| `get-convex/convex-backend` (repo) | 249,564 | 3,362 | 9.9 | **79.9** | 1 week |

The smaller, curated source wins by 11.7 points. The repo entry has 2.5× the tokens and 2.7× the snippets, and the lower score. Every variable Context7's reputation formula cares about is held constant. The only difference is what got indexed.
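
The arithmetic behind those figures is easy to reproduce from the table. The sketch below only recomputes the gap and the two ratios; the `IndexEntry` shape and the `round1` helper are illustrative names, not part of any Context7 API.

```typescript
interface IndexEntry {
  id: string;
  tokens: number;
  snippets: number;
  benchmark: number;
}

// Values copied from the table above (observed 2026-06-11).
const docsSiteEntry: IndexEntry = { id: "websites/convex_dev", tokens: 100248, snippets: 1245, benchmark: 91.6 };
const repoEntry: IndexEntry = { id: "get-convex/convex-backend", tokens: 249564, snippets: 3362, benchmark: 79.9 };

// Round to one decimal place to match the precision quoted in the text.
function round1(x: number): number {
  return Math.round(x * 10) / 10;
}

const benchmarkGap = round1(docsSiteEntry.benchmark - repoEntry.benchmark);
const tokenRatio = round1(repoEntry.tokens / docsSiteEntry.tokens);
const snippetRatio = round1(repoEntry.snippets / docsSiteEntry.snippets);

console.log(benchmarkGap); // 11.7
console.log(tokenRatio);   // 2.5
console.log(snippetRatio); // 2.7
```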

The explanation is mechanical, not mysterious. Repos carry READMEs, contributor docs, and issue templates. That's metadata noise the c7score rubric explicitly penalizes (the "Project Metadata" metric). Docs sites are pre-filtered for user-facing content. The same pattern recurs across the audit. Tailwind's indexed entry is its *site*, and Stripe's site entry is its powerhouse. Sites beat repos as agent indexes.

**Where 91.6 sits in the field.** For calibration against the rest of our 17-entry audit batch on the same day: Next.js scores 84.9, Bun 84.4, shadcn/ui 87.1, Supabase 80.3, Resend's site entry 92.3. Convex's docs site sits at the high end of that range. It's also the only product whose internal process (eval-gated rules and docs changes) plausibly explains *staying* there, since benchmark scores are recomputed after every parse and move continuously.

**Skills adoption:** ~362.1K total installs across Convex's skills org on skills.sh, with `convex-quickstart` showing 43–64K depending on cache. These counts are opt-out telemetry and cache-inconsistent. Treat them as order-of-magnitude evidence of mainstream adoption, nothing more precise.

**The eval claim is verifiable.** "Tuned using rigorous evals" is usually marketing copy. Here it resolves to a public leaderboard and an open evals repo anyone can re-run. In our survey, only Next.js (nextjs.org/evals) operates a comparable public benchmark.

## What to copy

- [ ] Point your Context7 entry at your **docs site, not your GitHub repo**. Convex's 11.7-point gap is the cost of indexing the wrong source. ([Play 2](/agentic-discovery/ai-agent-registries-and-directories), [Play 7](/agentic-discovery/code-snippets-for-ai-agents))
- [ ] Build the eval harness before the surfaces, and gate rules-file changes on it. Write "tuned using rigorous evals" only when a CI gate makes it true. ([Play 11](/agentic-discovery/ai-evals-and-leaderboards))
- [ ] Ship a managed-section CLI (`npx {product} ai-files` equivalent) so your AGENTS.md/CLAUDE.md presence updates itself, with per-agent config. ([Play 4](/agentic-discovery/agent-skills-and-agents-md), [Play 9](/agentic-discovery/scaffolder-rules-claude-md))
- [ ] Gate destructive MCP capability behind an explicit, scary flag. Safe defaults are what make "install our MCP" a recommendable instruction. ([Play 3](/agentic-discovery/mcp-server-distribution))
- [ ] Write per-tool setup pages (Claude Code, Codex, Cursor, Copilot) instead of one generic "AI" page. ([Play 10](/agentic-discovery/agent-first-onboarding))
- [ ] Publish the leaderboard only after the internal loop is running. The tuning loop is the ROI. The public board is amplification. ([Play 11](/agentic-discovery/ai-evals-and-leaderboards))
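
The eval-gate item in the checklist above can be sketched as a single comparison that a CI job runs before a rules-file change merges. This is one possible shape under stated assumptions, not a description of Convex's pipeline: `EvalRun`, `gateRulesChange`, and the one-point default tolerance are all hypothetical.

```typescript
interface EvalRun {
  passed: number; // eval cases that passed
  total: number;  // eval cases attempted
}

function passRate(run: EvalRun): number {
  return run.total === 0 ? 0 : run.passed / run.total;
}

/**
 * Accept a rules-file change only if the candidate's pass rate does not
 * fall below the baseline by more than `tolerance` (a fraction, so 0.01
 * means one percentage point). The tolerance value is an assumption.
 */
function gateRulesChange(
  baseline: EvalRun,
  candidate: EvalRun,
  tolerance: number = 0.01
): { ok: boolean; delta: number } {
  const delta = passRate(candidate) - passRate(baseline);
  return { ok: delta >= -tolerance, delta: delta };
}

// Usage: an improvement passes the gate, a ten-point regression fails it.
console.log(gateRulesChange({ passed: 80, total: 100 }, { passed: 85, total: 100 }).ok); // true
console.log(gateRulesChange({ passed: 80, total: 100 }, { passed: 70, total: 100 }).ok); // false
```

A CI step would exit non-zero when `ok` is false, which is what makes "tuned using rigorous evals" a statement the build enforces rather than one the docs assert.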

## What NOT to over-copy

- **Survivorship.** We studied Convex because it scores well. We did not observe how much of its fetch demand the eval program *caused* versus how much a well-funded backend platform would have drawn anyway.
- **The architecture argument doesn't transfer.** "Queries are just TypeScript" works because Convex genuinely replaced SQL with TypeScript. If your product's interface isn't natively in the models' largest training distribution, the docs claim won't make it so.
- **A public leaderboard is a quarter-two project, not a week-one project.** It needs N≥5 trials per model per task, automated scoring, and re-runs after every major model release. Built without the internal loop, it's an expensive static page.
- **Snapshot error.** All Context7 metrics here are single-day reads with ±10% error bars. Benchmark scores are recomputed continuously and move.
- **Install counts are directional.** The ~362.1K figure is opt-out telemetry with observed cache inconsistency on other vendors' pages. Never quote such numbers as ground truth.
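
The N≥5 requirement in the leaderboard caveat above translates into a simple aggregation rule: a task counts toward a model's score only once it has enough trials. The sketch below is a hypothetical scoring scheme (mean of per-task pass rates), not the method used by Convex's or Next.js's leaderboards; the `Trial` shape and function name are illustrative.

```typescript
interface Trial {
  model: string;
  task: string;
  passed: boolean;
}

/**
 * Score each model as the mean of its per-task pass rates, ignoring any
 * task with fewer than `minTrials` runs. A model with no qualifying task
 * gets `null` rather than a misleading number.
 */
function leaderboardScores(trials: Trial[], minTrials: number = 5): Record<string, number | null> {
  const counts: Record<string, Record<string, { passed: number; total: number }>> = {};
  for (const t of trials) {
    const byTask = counts[t.model] || (counts[t.model] = {});
    const cell = byTask[t.task] || (byTask[t.task] = { passed: 0, total: 0 });
    cell.total += 1;
    if (t.passed) cell.passed += 1;
  }

  const scores: Record<string, number | null> = {};
  for (const model of Object.keys(counts)) {
    const rates: number[] = [];
    for (const task of Object.keys(counts[model])) {
      const cell = counts[model][task];
      if (cell.total >= minTrials) rates.push(cell.passed / cell.total);
    }
    scores[model] = rates.length > 0
      ? rates.reduce(function (a, b) { return a + b; }, 0) / rates.length
      : null;
  }
  return scores;
}
```

Returning `null` for under-sampled models keeps a single lucky run from appearing on a public board, which is the failure mode the trial floor exists to prevent.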

## FAQ

**Why does Convex's docs site outscore its own repo on Context7?**
Repos carry metadata noise: READMEs, contributor guides, issue templates. Context7's scoring rubric explicitly penalizes that, while docs sites contain only user-facing content. Convex's docs-site entry benchmarks 91.6 against 79.9 for its repo entry (observed 2026-06-11). The fix is universal: submit and claim your docs site as the canonical entry.

**What does `npx convex ai-files` do?**
It writes and maintains managed sections inside a project's AGENTS.md and CLAUDE.md files, sourced from Convex's eval-tuned guidance, with per-agent targeting configured in convex.json (`"agents": ["claude-code","codex","cursor"]`). Because the sections are delimited and updated idempotently, Convex's instructions stay current across version bumps without users re-pasting anything.

**What is the Convex LLM leaderboard?**
A public benchmark at convex.dev/llm-leaderboard ranking how well LLMs generate Convex code, backed by an open evals repo. It's one of only two public, product-specific agent benchmarks we found (the other is nextjs.org/evals), and it doubles as the QA loop that tunes Convex's own rules files.

**Is `--dangerously-enable-production-deployments` a real flag?**
Yes. Convex's MCP server (`npx convex mcp start`) ships with production access off by default, and enabling it requires that explicit flag. It's the reference pattern for shipping a live-system MCP that docs can safely tell agents to install: destructive capability exists, but only behind a name nobody types by accident.
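
For teams building their own MCP server, the pattern reduces to a default-deny check on deployment targets. The flag name below is the one quoted above; everything else (the `Target` type, both function names, and the way arguments are parsed) is a hypothetical sketch and says nothing about how Convex's server enforces the flag internally.

```typescript
type Target = "dev" | "production";

const PROD_FLAG = "--dangerously-enable-production-deployments";

/** Production is reachable only when the explicit flag is present in argv. */
function allowedTargets(argv: string[]): Target[] {
  return argv.indexOf(PROD_FLAG) !== -1 ? ["dev", "production"] : ["dev"];
}

/** Call before every tool invocation that touches a deployment. */
function assertTargetAllowed(target: Target, argv: string[]): void {
  if (allowedTargets(argv).indexOf(target) === -1) {
    throw new Error(
      "Refusing to touch '" + target + "': restart the server with " + PROD_FLAG + " to allow it."
    );
  }
}

// Usage: the default install can only reach dev deployments.
console.log(allowedTargets([]));          // [ 'dev' ]
console.log(allowedTargets([PROD_FLAG])); // [ 'dev', 'production' ]
```

Checking at call time, rather than only at startup, means a tool added later cannot bypass the gate by accident.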

---

*Snapshot date 2026-06-11; single-day metrics carry ±10% error bars. Part of [Case Studies](/agentic-discovery/case-studies) · [The Complete Playbook to Agentic Discovery](/agentic-discovery).*
