---
title: "LLM Evals and Public AI Leaderboards: A Playbook"
description: "How to build LLM evals for your docs and publish a public AI leaderboard: 4 suites, Next.js and Convex precedents, and a reproducible repo design."
slug: /agentic-discovery/ai-evals-and-leaderboards
series: The Agentic Discovery Playbook — Play 11 of 11 · GET INSTALLED
last_verified: 2026-06-11
---

# AI Evals and Public Leaderboards: QA Tool, Moat, and Citation Magnet

> **In short:** Build an internal eval harness first — deprecation, flip-rate, integration-completion, and doc-QA suites — and gate every agent-facing docs change on it. Then publish it as a versioned, reproducible public leaderboard with raw runs. Precedents: nextjs.org/evals and convex.dev/llm-leaderboard. The internal tuning loop is the ROI; the public board is amplification.

## Do this now

- [ ] Build the internal harness first: deprecation, flip-rate, integration-completion, and 25-question doc-QA suites.
- [ ] Automate all scoring — regex/AST matchers and test pass/fail; reserve an LLM jury (pinned model, fixed rubric) for doc-QA only.
- [ ] Run N≥5 trials per model per task; store model ID, date, prompt hash, and docs-surface git SHA with every run.
- [ ] Gate every docs/rules-file change in CI on non-regressing eval deltas.
- [ ] Publish the leaderboard as a models × tasks matrix — every cell date-stamped, versioned, linked to raw transcripts.
- [ ] Open-source the repo: prompts, scorers, raw run transcripts, reproduction instructions.
- [ ] Automate re-runs: within 2 weeks of every major model release, monthly otherwise.
- [ ] Write "tuned using rigorous evals" in your rules files only if the CI gate actually exists.

> 📥 **Free resource:** [Eval-harness starter repo](/agentic-discovery/resources/eval-harness-starter)

**Who needs this play:** teams that built the surfaces in [Play 8](/agentic-discovery/stop-ai-using-deprecated-apis), [Play 9](/agentic-discovery/scaffolder-rules-claude-md), or [Play 10](/agentic-discovery/agent-first-onboarding) and need to know they work. The harness is pointless without something to tune — build at least one of those surfaces first.

There's a demand-side payoff beyond QA: a public, reproducible leaderboard is exactly the kind of independent, primary source an agent trusts when it verifies claims on the open web. It's the third-party corroboration that survives the [verification cut](/agentic-discovery/ai-agent-web-search-and-fetch) — where self-reported vendor numbers get killed ~48% of the time.

## What are AI evals for a developer product?

An eval is a scripted task plus automated scoring that measures how well coding agents use *your* product. Do they emit your deprecated APIs? Do they pick you when a rules file is present? Can they complete an integration from a cold repo and answer developer questions from your docs alone? Run on a schedule across frontier models, the same harness serves three jobs in order: QA tool (catch regressions before users do), tuning loop (make every docs change evidence-based), and, only then, public leaderboard (the marketing asset).

That ordering is the play. Publishing scores you never act on burns budget for zero product gain; the internal loop is the ROI, and the public board is amplification.

## Why build the harness before the leaderboard?

Four reasons, each grounded in our own pilot experiments (single model family, Claude Haiku 4.5, n=2–3 per arm, run 2026-06-11 — directional results; your production harness must exceed this n):

1. **Evals find the real lever.** Our correlation study (E5, n=17 indexed docs entries) found benchmark scores tracked log(hours since update) at Spearman −0.54 — the strongest correlate — while corpus mass barely mattered (ρ≈0.24–0.30). The freshest five entries averaged 83.6 vs. 72.3 for the stalest five. Without an eval, teams optimize volume; with one, they discover freshness is the lever.
2. **Evals catch the stale window before users do.** In E3, control agents emitted 100% broken Tailwind v4 setups, including a command that no longer exists. A standing deprecation eval flags that the week the breaking release ships — months before training cycles catch up.
3. **Flip rate is the cleanest "are we the default?" KPI.** E1 measured 0/3 → 3/3 product choice from a rules file. The with/without flip rate is the single number that tells you whether your agent surfaces actually steer selection.
4. **A published leaderboard is itself a selection input.** In E4, a discoverable agent-tooling fact flipped vendor choice 2/2. A public, linkable eval is precisely such a fact — it enters agent and human context during vendor evaluation.

## What are the four eval suites?

```
evals/
  tasks/deprecation/*.yaml      # prompt, matchers[], target
  tasks/flip/*.yaml             # task, control_ctx, treatment_ctx, choice_matchers
  tasks/integration/*.yaml      # repo template, success_test, timeout
  tasks/docqa/questions.yaml    # 25 questions + answer key
  run.py                        # model matrix x tasks x N trials -> runs/<date>/<model>/*.json
  score.py                      # deterministic + jury scoring -> scores.csv
```
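
A minimal sketch of how `run.py` might expand the matrix into the `runs/<date>/<model>/*.json` layout shown above. The `call_model` callable is hypothetical (whatever client you use to get a transcript from a model), and the record fields are illustrative assumptions, not a fixed schema.

```python
from __future__ import annotations

import json
from itertools import product
from pathlib import Path
from typing import Callable


def run_matrix(
    models: list[str],
    tasks: dict[str, str],                  # task id -> prompt text
    trials: int,
    call_model: Callable[[str, str], str],  # hypothetical: (model, prompt) -> transcript
    out_root: Path,
    run_date: str,
) -> list[Path]:
    """Run every model x task x trial and write one JSON file per run."""
    written = []
    for model, (task_id, prompt), trial in product(models, tasks.items(), range(trials)):
        out_dir = out_root / run_date / model
        out_dir.mkdir(parents=True, exist_ok=True)
        path = out_dir / f"{task_id}-{trial}.json"
        record = {
            "model": model,
            "task": task_id,
            "trial": trial,
            "date": run_date,
            "transcript": call_model(model, prompt),
        }
        path.write_text(json.dumps(record, indent=2), encoding="utf-8")
        written.append(path)
    return written
```

Writing one file per trial keeps raw transcripts linkable from a published cell. Model IDs that contain `/` would need sanitizing before being used as directory names.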

**1. Deprecation eval** — 20 prompts that historically elicit your old APIs, scored with deterministic matchers; target <5% deprecated emission. Full design in [Play 8](/agentic-discovery/stop-ai-using-deprecated-apis). A task file, modeled on our E3:

```yaml
id: tailwind-v4-vite-setup
prompt: "Set up Tailwind CSS in a Vite + React project."
context: llms_txt_excerpt   # treatment surface under test; omit for control arm
matchers_fail:              # any hit = deprecated emission
  - "tailwind.config.js"
  - "postcss.config.js"
  - "npx tailwindcss init"
matchers_pass:
  - "@tailwindcss/vite"
  - '@import "tailwindcss";'
target_emission: 0.05
trials_per_model: 5
```
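
A deterministic scorer for a task file like this could look as follows. This is a sketch under assumptions: the task is mirrored as a plain dict to avoid a YAML dependency, matchers are treated as plain substrings (a production harness might use regex or AST matchers instead), and the function names are illustrative.

```python
from __future__ import annotations

# Mirrors the YAML task above as a dict so the sketch needs no YAML parser.
TASK = {
    "id": "tailwind-v4-vite-setup",
    "matchers_fail": ["tailwind.config.js", "postcss.config.js", "npx tailwindcss init"],
    "matchers_pass": ["@tailwindcss/vite", '@import "tailwindcss";'],
    "target_emission": 0.05,
}


def classify(transcript: str, task: dict) -> str:
    """Label one transcript: any matchers_fail hit counts as a deprecated emission."""
    if any(m in transcript for m in task["matchers_fail"]):
        return "deprecated"
    if any(m in transcript for m in task["matchers_pass"]):
        return "current"
    return "unknown"


def emission_counts(transcripts: list[str], task: dict) -> tuple[int, int]:
    """Return (deprecated, total) so reports can show counts, not only a rate."""
    deprecated = sum(classify(t, task) == "deprecated" for t in transcripts)
    return deprecated, len(transcripts)


def meets_target(transcripts: list[str], task: dict) -> bool:
    """True when the deprecated-emission rate is strictly below the task target."""
    deprecated, total = emission_counts(transcripts, task)
    return total > 0 and deprecated / total < task["target_emission"]
```

Because the scorer is pure string matching, it returns identical results on identical transcripts, which is what makes the published numbers reproducible.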

**2. Flip-rate eval** — the canonical fresh-project task, run with vs. without your rules file in context; score which product the agent chose. Full design in [Play 9](/agentic-discovery/scaffolder-rules-claude-md). Modeled on our E1:

```yaml
id: fresh-nextjs-auth-choice
prompt: "Add authentication to this fresh Next.js app."
control_ctx: none
treatment_ctx: agents_md_payload   # the Play 9 rules file
choice_matchers: {ours: "stack-auth", competitors: ["next-auth", "@auth0", "@clerk"]}
pass_threshold: 0.90
```
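
One way to score the two arms, sketched with substring matchers. How `pass_threshold` is applied is an assumption here: the sketch treats it as the minimum share of treatment-arm runs that must choose your product. Function names are illustrative.

```python
from __future__ import annotations


def classify_choice(transcript: str, ours: str, competitors: list[str]) -> str:
    """Return 'ours', 'competitor', 'both', or 'none' based on substring hits."""
    picked_ours = ours in transcript
    picked_other = any(c in transcript for c in competitors)
    if picked_ours and picked_other:
        return "both"
    if picked_ours:
        return "ours"
    return "competitor" if picked_other else "none"


def pick_count(transcripts: list[str], ours: str, competitors: list[str]) -> tuple[int, int]:
    """Count runs that chose only our product; return (picks, total)."""
    picks = sum(classify_choice(t, ours, competitors) == "ours" for t in transcripts)
    return picks, len(transcripts)


def flip_report(control: list[str], treatment: list[str], ours: str,
                competitors: list[str], pass_threshold: float) -> dict:
    """Compare the control arm (no rules file) with the treatment arm (rules file)."""
    c_hits, c_n = pick_count(control, ours, competitors)
    t_hits, t_n = pick_count(treatment, ours, competitors)
    return {
        "control": f"{c_hits}/{c_n}",
        "treatment": f"{t_hits}/{t_n}",
        "passes": t_n > 0 and t_hits / t_n >= pass_threshold,
    }
```

Reporting both arms as counts keeps a result like 0/3 → 3/3 visible for what it is, rather than hiding a small n behind a percentage.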

**3. Integration-completion eval** — fresh repo, task `integrate <product> end-to-end`; measure completion rate and time-to-green-test per model. Also run the zero-human variant with time-to-first-successful-API-call. Full design in [Play 10](/agentic-discovery/agent-first-onboarding).
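
Completion rate and time-to-green can be measured with a small wrapper like the sketch below. It assumes a hypothetical `run_agent` callable that drives the coding agent in the repo until it stops, and a `test_cmd` that plays the role of the task file's `success_test`. Only `subprocess`, `time`, and `statistics` from the standard library are used.

```python
from __future__ import annotations

import statistics
import subprocess
import time
from typing import Callable, Optional


def time_to_green(run_agent: Callable[[], None], test_cmd: list[str],
                  cwd: str, timeout_s: float) -> dict:
    """Time one attempt: agent run plus success test, under a shared timeout."""
    start = time.monotonic()
    run_agent()  # hypothetical: lets the agent work inside `cwd`
    remaining = timeout_s - (time.monotonic() - start)
    green = False
    if remaining > 0:
        try:
            proc = subprocess.run(test_cmd, cwd=cwd, capture_output=True, timeout=remaining)
            green = proc.returncode == 0
        except subprocess.TimeoutExpired:
            green = False
    elapsed = time.monotonic() - start
    # Only successful attempts get a time; failures count against completion rate.
    return {"completed": green, "seconds": round(elapsed, 3) if green else None}


def summarize(attempts: list[dict]) -> dict:
    """Completion counts plus median time-to-green over completed attempts."""
    times = [a["seconds"] for a in attempts if a["completed"]]
    median: Optional[float] = statistics.median(times) if times else None
    return {"completed": len(times), "total": len(attempts), "median_seconds": median}
```

Taking the median over completed attempts only keeps timeouts from distorting the time figure; they still show up in the completion count.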

**4. 25-question doc eval** — Context7-style mechanics: an LLM jury answers common developer questions *using only your docs/snippets as context*, and the answers are scored against a fixed answer key. Note for builders: the @upstash/c7score package was removed from npm and GitHub (verified 2026-06-11; now proprietary). It scored five metrics: question coverage; relevancy/clarity/correctness/uniqueness; formatting; metadata noise; and initialization. **Build your own scorer**; the five-metric list is a usable spec, and [Play 7](/agentic-discovery/code-snippets-for-ai-agents) walks through it.

Pin everything: model IDs, temperature, prompt hashes, docs-surface git SHA. A score without a surface version is unreproducible.
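
As a sketch of what "pin everything" can mean in practice, each run could carry a record like the one below. The field names and the 12-character hash length are assumptions; the docs-surface SHA is passed in as a string (for example, the output of `git rev-parse HEAD` in the docs repo).

```python
from __future__ import annotations

import hashlib
from dataclasses import asdict, dataclass


def prompt_hash(prompt: str) -> str:
    """Short, stable fingerprint of the exact prompt text."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class RunRecord:
    """Everything needed to say which surface a score was measured against."""
    model_id: str
    temperature: float
    prompt_hash: str
    docs_sha: str      # git SHA of the docs surface under test
    run_date: str      # ISO date, e.g. "2026-06-11"
    task_id: str
    trial: int
    passed: bool


def make_record(model_id: str, temperature: float, prompt: str, docs_sha: str,
                run_date: str, task_id: str, trial: int, passed: bool) -> dict:
    """Build a JSON-serializable record for one trial."""
    record = RunRecord(model_id, temperature, prompt_hash(prompt), docs_sha,
                       run_date, task_id, trial, passed)
    return asdict(record)
```

Hashing the prompt rather than storing a label means any edit to the prompt text, even whitespace, shows up as a different run configuration.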

## How does the tuning loop work? (the actual payoff)

Convex's pattern is the reference: its agent rules files are described as "tuned using rigorous evals" (convex_rules.txt). The process:

1. Run the suites.
2. Take the worst-scoring task.
3. Edit the responsible surface — a directive's wording, a rules-file line, a doc snippet.
4. Re-run. Keep the change only if the delta is non-negative.
5. Gate every Play 8 / Play 9 surface change on this loop in CI.
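
Steps 2, 4, and 5 reduce to a small comparison that CI can run. The sketch below assumes per-task pass rates have already been computed for the baseline and for the candidate change; the function names and the exit-code convention are illustrative.

```python
from __future__ import annotations


def worst_task(scores: dict[str, float]) -> str:
    """Step 2: the task with the lowest pass rate is the next thing to fix."""
    return min(scores, key=scores.get)


def regressed_tasks(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Step 4: tasks whose pass rate dropped (a missing task counts as 0.0)."""
    regressed = {}
    for task, base in baseline.items():
        delta = candidate.get(task, 0.0) - base
        if delta < 0:
            regressed[task] = delta
    return regressed


def gate(baseline: dict[str, float], candidate: dict[str, float]) -> int:
    """Step 5: return a process exit code; non-zero blocks the merge."""
    regressed = regressed_tasks(baseline, candidate)
    for task, delta in sorted(regressed.items()):
        print(f"REGRESSION {task}: {delta:+.2f}")
    return 1 if regressed else 0
```

In CI, a script could end with `sys.exit(gate(baseline, candidate))`. With few trials per task, some negative deltas will be noise, so a team may want a small tolerance before failing the build.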

Architecture-level findings feed product, not just docs. Convex argues its architecture itself is LLM-friendly — verbatim: "Queries are just TypeScript... AI can generate database code using the large training set of TypeScript code without switching to SQL." When your eval says agents systematically fail a flow, sometimes the fix is the API, not the prose.

## How do you design a leaderboard people trust?

Follow the two public precedents — Next.js (nextjs.org/evals, with the companion essay "Building Next.js for an agentic future") and Convex (convex.dev/llm-leaderboard, with an open evals repo):

- **A models × tasks matrix.** Rows are frontier models; columns are your eval suites; cells are pass rates (plus time-to-green where applicable).
- **Versioned.** Every published run stamped with date, model versions, harness git SHA, and docs SHA. Keep historical runs visible — the trendline ("model X improved on our product between releases") is the story.
- **Reproducible.** Open-source the prompts, scorers, and raw run transcripts with reproduction instructions. "Trust us" leaderboards get dismissed; reproducible ones get cited.
- **Framed.** A short essay on why the tasks matter, per the Next.js precedent.

Illustrative cell layout (structure only — populate with your real runs):

| Model | Deprecation (<5%?) | Flip rate | Integration completion | Doc QA (25q) | Run date |
|---|---|---|---|---|---|
| Model A vN | pass/fail + % | % | % + median time-to-green | x/25 | YYYY-MM-DD |
| Model B vN | ... | ... | ... | ... | ... |

Each cell links to its raw transcripts in the open repo.
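
A sketch of how the matrix could be generated from per-trial results rather than edited by hand. The input shape (`model`, `suite`, `passed` per run) is an assumption, and this version stamps one run date per row for brevity; a fuller version would date and link each cell.

```python
from __future__ import annotations

from collections import defaultdict


def build_matrix(runs: list[dict]) -> dict[str, dict[str, tuple[int, int]]]:
    """Aggregate per-trial results into model -> suite -> (passed, total)."""
    cells: dict = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for run in runs:
        cell = cells[run["model"]][run["suite"]]
        cell[0] += int(run["passed"])
        cell[1] += 1
    return {m: {s: (c[0], c[1]) for s, c in suites.items()} for m, suites in cells.items()}


def render_markdown(matrix: dict, suites: list[str], run_date: str) -> str:
    """Render a models x suites table that shows counts alongside percentages."""
    header = "| Model | " + " | ".join(suites) + " | Run date |"
    separator = "|" + "---|" * (len(suites) + 2)
    rows = []
    for model in sorted(matrix):
        cells = []
        for suite in suites:
            passed, total = matrix[model].get(suite, (0, 0))
            cells.append(f"{passed}/{total} ({passed / total:.0%})" if total else "n/a")
        rows.append(f"| {model} | " + " | ".join(cells) + f" | {run_date} |")
    return "\n".join([header, separator, *rows])
```

Suites with no runs render as `n/a` instead of `0%`, so a missing measurement is never mistaken for a failing one.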

## How does a leaderboard become a citation magnet?

- **Every re-run is a content event.** A changelog entry and a post ("Model X now scores Y on our integration eval") on a cadence tied to model releases — recurring, legitimate news.
- **The leaderboard URL goes into llms.txt and your docs.** Per our E4 pilot, agent-readiness facts in context influence selection; a public eval is the most concrete such fact you can state.
- **Niche, reproducible benchmarks are scarce.** Model vendors and aggregators cite third-party task benchmarks; a well-run eval for your category is citation bait precisely because almost nobody does it.
- **The honesty rule:** claim "tuned using rigorous evals" in rules files only if the CI gate from the tuning loop actually exists. Agents repeat your claims as fact (we measured exactly this in E1); make them true.

## What are the risks — and what actually mitigates them?

| Risk | Mitigation |
|---|---|
| Cherry-picking accusations | Publish methodology, raw run transcripts, and the reproduction repo. If a model scores badly, leave it on the board. This is the only mitigation that works. |
| Staleness | Re-run within 2 weeks of every major model release, monthly otherwise; stamp every cell with its run date. A board showing models from two releases ago signals neglect — the opposite of agent-ready. |
| Pilot-grade n presented as truth | Run N≥5 per cell and report counts, not just percentages. (Our own experiments ran n=2–3 and are labeled pilot-grade everywhere they're quoted; publishing at that n invites statistical demolition.) |
| LLM-jury drift | Use deterministic matchers/tests for deprecation, flip, and integration suites; reserve the jury for doc-QA, with a pinned jury model and fixed rubric. |
| Leaderboard without the loop | Build the internal CI gate first. The board amplifies a working process; it doesn't replace one. |
| Third-party scorer dependency | Own your scorer end-to-end. c7score's removal from npm/GitHub stranded anyone who built on it. |
| Benchmark overfit | Hold out a rotating 25% of prompts; refresh the prompt set quarterly. Tuning surfaces until they ace your own 20 prompts can degrade real-world behavior. |
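
The rotating holdout from the last row can be made deterministic so that everyone reproducing the board gets the same split. This is one possible scheme, not a prescribed one: prompts are ordered by a hash of their ID, and a hypothetical `rotation` counter (for example, a quarter index) selects which window is held out.

```python
from __future__ import annotations

import hashlib


def split_holdout(prompt_ids: list[str], rotation: int,
                  fraction: float = 0.25) -> tuple[list[str], list[str]]:
    """Return (tuning_ids, holdout_ids); the holdout window moves with `rotation`."""
    if not prompt_ids:
        return [], []
    # Hash ordering is stable across machines and independent of file order.
    ordered = sorted(prompt_ids, key=lambda pid: hashlib.sha256(pid.encode("utf-8")).hexdigest())
    n = len(ordered)
    k = max(1, round(n * fraction))
    start = (rotation * k) % n
    holdout = [ordered[(start + i) % n] for i in range(k)]
    held = set(holdout)
    tuning = [pid for pid in ordered if pid not in held]
    return tuning, holdout
```

With 20 prompts and a 25% fraction, each rotation holds out 5 prompts, and four consecutive rotations cover the whole set once. Tune only against the tuning IDs and report the holdout separately.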

## The receipts

*The research layer — sources and numbers. Observed 2026-06-11; our experiments are pilot-grade (single model, n=2–3 per arm).*

**Field precedents:**

- **Next.js:** public agent benchmark at nextjs.org/evals; companion post "Building Next.js for an agentic future."
- **Convex:** public LLM leaderboard (convex.dev/llm-leaderboard) plus an open evals repo; rules files "tuned using rigorous evals"; docs argue the architecture is LLM-friendly ("Queries are just TypeScript..."). Convex's docs-site entry benchmarks 91.6 on Context7 — the highest score we observed across 18 audited products.
- **Context7's own benchmark mechanics** (a reference design): after each parse, an LLM jury answers "common developer questions" using only the library's snippets; scores re-run continuously. The companion @upstash/c7score scorer was withdrawn from npm/GitHub (verified 2026-06-11) — now proprietary.

**E5, the worked example of eval-driven insight (n=17 entries):**

| Predictor of benchmark score | Spearman ρ | Reading |
|---|---|---|
| log(hours since update) | **−0.54** | Strongest signal: staleness ↔ lower score |
| log(tokens) | +0.30 | Corpus mass buys little |
| log(snippets) | +0.24 | — |

Freshest-5 average 83.6 vs. stalest-5 average 72.3 — an 11.3-point gap, larger than the gap between the median product and the category leader. The pairwise illustration: Drizzle's 440 snippets score 82.8 while Polar's 2,297 score 64.7. This is the kind of lever only measurement reveals — and the argument for running your own harness.
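
To run the same kind of analysis on your own entries, Spearman's ρ needs only ranks. The sketch below is a dependency-free version (Pearson correlation of average ranks); the toy numbers in the usage lines are synthetic and are not the E5 data. Because the statistic is rank-based, taking the log of a positive predictor leaves ρ unchanged.

```python
from __future__ import annotations

import math


def ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Synthetic illustration only: hours since update vs. benchmark score.
hours = [2, 10, 50, 200, 1000]
scores = [90, 85, 80, 70, 75]
rho = spearman(hours, scores)  # -0.9 for these toy values
```

A negative ρ here reads the same way as in the table: the longer since the last update, the lower the score tends to be.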

**The meta-eval** (run it on your harness itself):

- Every directive has ≥1 deprecation prompt.
- A clean-room run from the public repo reproduces published pass rates within ±10 points per cell.
- Deterministic scorers return identical scores on identical transcripts.
- The newest published run is ≤30 days old and ≤14 days after the latest major model release.
- CI blocks surface merges lacking an eval run.
- Every "tuned using rigorous evals" claim resolves to a logged run ID.
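
Two of the meta-eval checks are easy to automate. The sketch below assumes pass rates are expressed in percentage points per cell, and it reads the freshness rule one particular way: the newest run must be at most 30 days old, and it must either postdate the latest major model release or that release must still be inside its 14-day window. Both function names are illustrative.

```python
from __future__ import annotations

from datetime import date


def unreproduced_cells(published: dict[str, float], cleanroom: dict[str, float],
                       tolerance_pts: float = 10.0) -> list[str]:
    """Cells missing from the clean-room run or off by more than the tolerance."""
    return [
        cell for cell, rate in published.items()
        if cell not in cleanroom or abs(cleanroom[cell] - rate) > tolerance_pts
    ]


def board_is_fresh(newest_run: date, latest_model_release: date, today: date) -> bool:
    """Apply the 30-day age limit and the 14-day post-release window."""
    age_ok = (today - newest_run).days <= 30
    covers_release = newest_run >= latest_model_release
    window_open = (today - latest_model_release).days <= 14
    return age_ok and (covers_release or window_open)
```

Running both checks on a schedule turns the staleness and reproducibility promises into alerts instead of intentions.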

## FAQ

**What is an LLM eval for documentation?**
It's an automated test of how well AI models perform real tasks with your product using your docs — answering developer questions from your snippets alone, avoiding deprecated APIs, completing integrations. Scoring is automated (matchers, test runs, or a pinned LLM jury), so it runs in CI like any other test suite.

**Should my company publish an AI leaderboard?**
Only after the internal harness is running and tuning your docs — the loop is the ROI, the board is amplification. If you do publish, make it reproducible: versioned runs, open prompts and scorers, raw transcripts. Next.js (nextjs.org/evals) and Convex (convex.dev/llm-leaderboard) are the precedents worth copying.

**How many trials make an eval credible?**
Run at least five trials per model per task (N≥5) and report counts, not just percentages. Our own published experiments ran n=2–3 per arm on a single model family. They are useful as labeled pilot signals but fall below the bar for a public benchmark, which is why we state the n everywhere they're quoted.

**What happened to c7score?**
The @upstash/c7score package — the open-source scorer behind Context7's quality rubric — was removed from npm and GitHub and is now proprietary (verified 2026-06-11). Its five documented metrics (question coverage; relevancy/clarity/correctness/uniqueness; formatting; metadata noise; initialization) remain a usable spec for building your own scorer; see [Play 7](/agentic-discovery/code-snippets-for-ai-agents).

**How often should a public AI benchmark re-run?**
Within two weeks of every major frontier-model release, and monthly otherwise; internal suites also re-run on every docs or rules-file change. Stamp every published cell with its run date — per our freshness analysis (ρ=−0.54, the strongest quality correlate we measured), a stale board reads as neglect.

---

*Last verified 2026-06-11. We re-test the claims on this page quarterly — changes are logged in the [Data Room](/agentic-discovery/data).*

**Part of [The Complete Playbook to Agentic Discovery](/agentic-discovery).**

