> ## Documentation Index
> Fetch the complete guide index at: https://www.synscribe.com/agentic-discovery/llms.txt
> Use this file to discover all pages before exploring further.

---
title: "Code Snippets AI Agents Can Use (and How to Score Them)"
description: "Engineer documentation code snippets AI agents can run: the 5-metric scoring rubric, templates, and why 440 snippets beat 2,297."
slug: /agentic-discovery/code-snippets-for-ai-agents
series: The Agentic Discovery Playbook — Play 7 of 11 · GET READ
last_verified: 2026-06-11
---

# Engineer Code Snippets AI Agents Can Actually Use (and Score Them)

> **In short:** AI retrieval indexes score documentation snippets on five metrics: question coverage, LLM-judged quality (including uniqueness), formatting, metadata noise, and initialization — install and imports included. Density beats bulk: Drizzle's 440 snippets benchmark at 82.8 while Polar's 2,297 score 64.7. Write self-contained, task-shaped, deduplicated snippets, and measure them with your own scorer.

![Scatter plot of seventeen documentation index entries showing benchmark quality score against hours since last update on a log scale. The trend is negative, Spearman rho minus 0.54: fresher entries score higher. The freshest five entries average 83.6; the stalest five average 72.3.](/agentic-discovery/images/f5-freshness-vs-quality.svg "Freshness is the strongest correlate of retrieval-quality score we measured (ρ = −0.54).")

## Do this now

- [ ] Make every snippet self-contained: install command + imports + code + expected output, runnable as pasted.
- [ ] Phrase every doc heading as the task an agent would query — "Hash a password," not "Cryptography utilities."
- [ ] Deduplicate: no two snippets in the corpus demonstrate the same pattern.
- [ ] Map the 25 most common developer questions for your product to exactly one page each.
- [ ] Ship per-framework permutation pages, each fully self-contained.
- [ ] Ship error-message pages: H1 = the exact error string, body = cause + fixed snippet.
- [ ] Point Context7 at your docs **site**, not the GitHub repo, and re-index within 7 days of every release.
- [ ] Build the 5-metric scorer in-house and wire it into CI — the @upstash/c7score package no longer exists (verified 2026-06-11).

*Scope: [Play 5](/agentic-discovery/llms-txt) built the index and [Play 6](/agentic-discovery/markdown-docs-for-ai-agents) made every page fetchable as markdown. This page engineers what's inside the code blocks — and how to score it.*

## Who is scoring your code snippets?

Retrieval indexes are, continuously. Context7 — the docs index behind most coding-agent retrieval — runs a published quality stack: after each parse it generates "common developer questions" about your product (via Gemini plus Google Search), has *your snippets alone* answer them, and an LLM jury (Claude Opus and Gemini Pro as premium juries) scores the answers. The benchmark is re-run after each parse; scores move. Your docs are being examined against the questions developers actually ask the web, whether you participate or not.

That makes snippet engineering the play that moves the benchmark — and the benchmark is an eval you can study for. Resend's docs-site entry scores 92.3, among the highest observed anywhere, on one of the smallest corpora. The mechanics are knowable; this page turns them into writing rules.

**Who needs this play:** anyone whose Context7-class benchmark sits below ~80, whose docs headings read like a feature tour, or whose snippets assume invisible setup. Prerequisites: Plays 5 and 6 — snippets need a clean, indexable surface first.

## What are the five metrics — and what do they demand of your writing?

Upstash published the c7score rubric (weights configurable). Each metric maps directly to a design constraint:

| # | Metric | What it measures | Your design constraint |
|---|---|---|---|
| 1 | Question coverage | Do snippets exist that answer the common developer questions? | One page per common question — write the 25-question list first |
| 2 | LLM judgment | Relevancy, clarity, correctness, **uniqueness across snippets** | Deduplicate; one pattern demonstrated once |
| 3 | Formatting | Fences, language tags, structure | Language-tagged fences; one task per heading |
| 4 | Project metadata | Repo noise: badges, contributor lists, license boilerplate | Index the docs site, not the repo; strip noise |
| 5 | Initialization | Snippets include install + imports | The self-contained template below |

Two of these are deterministic lint checks (3, 4), one is content strategy (1), and two are writing discipline (2, 5). None require guessing what the judge wants — the rubric is documented.
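The formatting check (metric 3) is simple enough to sketch. The function below is an illustrative assumption, not Context7's implementation: it flags opening fences that carry no language tag, and it follows the CommonMark rule that a closing fence must be at least as long as the opening one, so the contents of a nested block (like a four-backtick template) are skipped rather than linted.

```python
import re

# A fence line: up to 3 spaces of indent, then 3+ backticks or tildes, then an info string.
FENCE = re.compile(r"^ {0,3}(`{3,}|~{3,})(.*)$")

def untagged_fences(markdown: str) -> list[int]:
    """Return 1-based line numbers of opening fences that lack a language tag."""
    missing = []
    open_fence = None  # (char, length) of the fence we are inside, else None
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        match = FENCE.match(line)
        if not match:
            continue
        marker, info = match.group(1), match.group(2).strip()
        if open_fence is None:
            open_fence = (marker[0], len(marker))
            if not info:
                missing.append(lineno)
        elif marker[0] == open_fence[0] and len(marker) >= open_fence[1] and not info:
            open_fence = None  # a valid closing fence
    return missing
```

An empty return value corresponds to the "100% language-tagged fences" state; anything else gives you the line numbers to fix.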

## What does a self-contained snippet look like?

Every snippet on every page follows one shape: install, imports, code, expected output. Nothing assumed, nothing "see setup guide."

````markdown
## Refund a payment
<!-- Task-shaped H2: the exact query an agent issues -->

Refund all or part of a captured payment.

```bash
npm install @paykit/node@latest
```

```ts
import { PayKit } from "@paykit/node";

const paykit = new PayKit(process.env.PAYKIT_SECRET_KEY);

const refund = await paykit.refunds.create({
  payment: "pay_123",
  amount: 500, // partial refund, cents; omit for full
});
console.log(refund.status); // "succeeded"
```
````

The rules this template encodes: install line present (metric 5); all imports present (metric 5); env/config visible — no invisible prerequisites; expected output shown as a comment; language-tagged fences (metric 3); one task per heading; and no narrative between install and code that an extractor could split into a partial snippet. A snippet missing its import fails the initialization metric *and* fails the agent at paste time — a double penalty.
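A rough version of the initialization check can be done with two regular expressions. This is a sketch under stated assumptions: the package-manager list and the import patterns are my own illustrative choices (they cover only a few ecosystems), and `initialization_gaps` is a hypothetical helper name, not part of any published scorer.

```python
import re

# Assumed patterns: extend for the package managers and languages your docs use.
INSTALL = re.compile(r"\b(npm (install|i)|pnpm add|yarn add|bun add|pip install)\b")
IMPORT = re.compile(
    r"^\s*(import\s|from\s+\S+\s+import\s|(const|let|var)\s+.+=\s*require\()",
    re.MULTILINE,
)

def initialization_gaps(section: str) -> list[str]:
    """Return which initialization pieces a doc section is missing."""
    gaps = []
    if not INSTALL.search(section):
        gaps.append("install command")
    if not IMPORT.search(section):
        gaps.append("import statement")
    return gaps
```

Run it per heading section rather than per file, since retrieval returns one chunk and each chunk has to pass on its own.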

## Why do headings need to be task-shaped?

Because the heading is the retrieval query. Bun's llms.txt indexes ~190 guide pages phrased exactly as agent queries — "Convert a Blob to a string," "Hash a password" — pre-chunked retrieval units that match the question verbatim. Prisma does the structural version: a "## Common Queries" section placed *before* the reference sections.

The counter-pattern: headings like "Why developers love PayKit" match no agent query and dilute lexical search. Every H2 on a guide page should be a task a developer would type. Build the list empirically: mine support tickets, Discord, GitHub issues, and StackOverflow for the 25 questions developers actually ask — that list is simultaneously your eval (below) and your content roadmap, with every question mapping to exactly one page.
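The "exactly one page per question" rule is easy to audit once the question list exists. A minimal sketch, assuming you maintain a mapping from page path to the questions that page claims to answer (the function name and data shape are hypothetical):

```python
def audit_question_map(questions, page_questions):
    """Find eval questions that have no page, or more than one page.

    questions: list of question strings (the eval list).
    page_questions: dict mapping page path -> list of questions it answers.
    """
    owners = {q: [] for q in questions}
    for page, answered in page_questions.items():
        for q in answered:
            if q in owners:
                owners[q].append(page)
    unmapped = [q for q, pages in owners.items() if not pages]
    contested = {q: pages for q, pages in owners.items() if len(pages) > 1}
    return unmapped, contested
```

Unmapped questions are content gaps; contested questions are candidates for merging or for a one-line link to the canonical page.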

## Why deduplicate — and when is repetition allowed?

Uniqueness is scored, not just tidy. Context7's published example dinged Next.js — the #1 ranked library — for repeating two command patterns across its docs. Duplication actively costs points under metric 2.

The workflow: extract all fenced blocks from the corpus, normalize (strip comments and whitespace, mask identifiers), shingle, and flag pairs above ~0.8 similarity. For every flagged pair, keep the canonical page and replace the duplicate with a one-line link.
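The workflow above can be sketched in a few dozen lines. Several details here are assumptions the text leaves open: Jaccard similarity over 4-token shingles as the similarity measure, a small hand-picked keyword set that survives identifier masking, and naive handling of strings and line comments (no block comments, no escaped quotes).

```python
import re
from itertools import combinations

# Assumed keyword set: everything else that looks like a name is masked as ID.
KEYWORDS = {"import", "from", "const", "let", "await", "async", "function",
            "return", "new", "def", "class", "if", "else", "for"}
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def normalize(code: str) -> list[str]:
    """Mask strings, strip line comments, and mask identifiers and numbers."""
    code = re.sub(r'"[^"\n]*"', "STR", code)   # mask double-quoted strings
    code = re.sub(r"'[^'\n]*'", "STR", code)   # mask single-quoted strings
    code = re.sub(r"(//|#).*", "", code)       # naive line-comment strip
    tokens = []
    for tok in TOKEN.findall(code):
        if tok[0].isdigit():
            tokens.append("NUM")
        elif (tok[0].isalpha() or tok[0] == "_") and tok not in KEYWORDS and tok != "STR":
            tokens.append("ID")
        else:
            tokens.append(tok)
    return tokens

def shingles(tokens: list[str], k: int = 4) -> set[tuple[str, ...]]:
    """All contiguous k-token windows (one short window if fewer than k tokens)."""
    return {tuple(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 1.0

def near_duplicates(snippets: dict[str, str], threshold: float = 0.8):
    """Return (name, name, similarity) for every pair at or above the threshold."""
    sets = {name: shingles(normalize(code)) for name, code in snippets.items()}
    return [(x, y, round(jaccard(sets[x], sets[y]), 2))
            for x, y in combinations(sets, 2)
            if jaccard(sets[x], sets[y]) >= threshold]
```

Masking identifiers is what lets two snippets with different variable names register as the same pattern. It also makes short snippets look alike, so expect to review flagged pairs by hand before replacing anything with a link.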

**The exception — permutation pages.** Agents query "paykit next.js app router," not "paykit." For each top task × top framework, ship a dedicated page (`/docs/frameworks/nextjs.md`) that is *fully* self-contained — do not factor shared steps into a common page, because retrieval returns one chunk, and a chunk with a "see setup guide" hole fails initialization. Permutation pages legitimately repeat install/import scaffolding; dedup applies to patterns within a page-set's purpose, not to required boilerplate across framework variants. Drizzle is the proof this works at scale: its docs are terse permutation pages ("Get started with Drizzle and X" ×60), and it posts the density numbers below. Keep the framework-specific delta prominent — the raw-body webhook middleware in Express, the route handler in Next.js.

## Should you ship error-message pages?

Yes — agents grep error strings. When an agent hits `Error: PAYKIT_WEBHOOK_SIGNATURE_INVALID`, it searches the literal string. Ship `/docs/errors/<code>.md` for every common error: H1 = the exact runtime string verbatim, then the cause, then the fixed snippet (self-contained, per the template), then links to related errors. List them all in llms.txt under an Errors section with the error string in the description — that's the lexical match target.
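Because the H1 has to match the runtime string verbatim, coverage can be checked mechanically. A minimal sketch, assuming you can export the list of runtime error strings from your SDK and load each error page as markdown text (the function name and input shapes are hypothetical):

```python
def missing_error_pages(error_strings: list[str], pages: dict[str, str]) -> list[str]:
    """Return error strings that no page uses verbatim as its H1.

    pages maps a docs path to that page's markdown text.
    """
    h1s = set()
    for text in pages.values():
        for line in text.splitlines():
            if line.startswith("# "):
                # First H1 only; tolerate an H1 wrapped in inline-code backticks.
                h1s.add(line[2:].strip().strip("`"))
                break
    return [err for err in error_strings if err not in h1s]
```

The comparison is deliberately exact: a paraphrased H1 is a miss, because the agent searches the literal string.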

This converts your error vocabulary into retrieval surface no competitor can occupy: nobody else's docs can rank for your exact error strings.

## Does volume help? The density evidence

No — and this is the most counterintuitive, best-supported finding in the dataset. As of 2026-06-11, across 17 audited Context7 entries:

- **Drizzle: 440 snippets, benchmark 82.8. Polar: 2,297 snippets, benchmark 64.7.** Five times the corpus, 18 points worse.
- Corpus mass barely correlates with benchmark at all: ρ≈0.24 for log(snippets), ρ≈0.30 for log(tokens) (Spearman, n=17).
- The strongest correlate is freshness: benchmark vs. log(hours since update), Spearman −0.54. Freshest-5 entries average 83.6; stalest-5 average 72.3 — an 11.3-point gap. (Polar's entry was also 1 month stale; Drizzle's a week.)
- Resend hits 92.3 on one of the smallest corpora in the audit. Convex's curated docs-site entry scores 91.6 vs. 79.9 for its noisy repo entry.

The mechanism is the rubric itself: repeated patterns trigger the uniqueness penalty, repo noise triggers the metadata penalty, and padding dilutes retrieval. Write fewer, denser, self-contained snippets — then re-index often. Don't pad.

## How do you build your own scorer?

First, the story told straight: Upstash shipped c7score as a public npm package and open repo — public as of August 2025 per archived snapshots. **As of 2026-06-11, it's gone**: npm search returns zero results and the repo 404s. The scorer is now proprietary-internal. Don't reference @upstash/c7score in tooling or deliverables — builds depending on it break. But the rubric itself is fully documented in Upstash's published blog posts, which means you can — and should — implement it yourself as LLM-as-judge.


Two components:

**1. The benchmark.** For each of your 25 eval questions, give a judge LLM *only* your extracted snippets (simulate retrieval: top-k chunks by lexical + embedding match) and the question; a second judge grades the answer against a gold answer. Score = % correct.

**2. The quality score** (five metrics, 0–100 each):

- **Coverage:** fraction of the 25 questions with at least one retrieving snippet (deterministic).
- **LLM judgment:** a judge rates sampled snippets for relevancy, clarity, correctness, and cross-snippet uniqueness (feed it pairs from your dedup report).
- **Formatting:** deterministic lint.
- **Metadata noise:** deterministic scan for badges, license text, contributor sections.
- **Initialization:** regex plus judge fallback for install + imports.
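The metadata-noise scan is one of the deterministic pieces. The sketch below counts three kinds of noise; the specific patterns (two badge hosts, the opening phrase of the MIT license, a "Contributors" heading) are illustrative assumptions, and turning the counts into a 0–100 score is left to you.

```python
import re

# Assumed noise signatures; add the ones that actually show up in your corpus.
NOISE_PATTERNS = {
    "badge": re.compile(r"!\[[^\]]*\]\(https?://(?:img\.shields\.io|badgen\.net)[^)]*\)"),
    "license": re.compile(r"(?i)permission is hereby granted, free of charge"),
    "contributors": re.compile(r"(?im)^#{1,6}\s+contributors\b"),
}

def scan_metadata_noise(markdown: str) -> dict[str, int]:
    """Count matches for each noise pattern in one document."""
    return {name: len(pattern.findall(markdown)) for name, pattern in NOISE_PATTERNS.items()}
```

A docs-site page should come back all zeros; a repo README usually will not, which is the practical argument for indexing the site instead.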

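```python
# Sketch: retrieve(), the judge object, and the five metric functions are
# placeholders you implement. Each metric function returns a 0-100 score.

def benchmark(questions, corpus, judge, k=5):
    """Percent of eval questions answered correctly from snippets alone."""
    correct = 0
    for q in questions:
        chunks = retrieve(corpus, q.text, k=k)         # lexical + embedding
        answer = judge.answer(q.text, context=chunks)  # snippets ONLY
        correct += judge.grade(answer, q.gold)         # second pass, binary 0/1
    return 100 * correct / len(questions)

def c7_quality(corpus, questions, judge, weights=(1, 1, 1, 1, 1)):
    """Weighted mean of the five metrics; equal weights are a placeholder."""
    scores = (
        coverage(corpus, questions),       # metric 1, deterministic
        llm_judgment(corpus, judge),       # metric 2, sampled pairs incl. dedup hits
        formatting_lint(corpus),           # metric 3, deterministic
        100 - metadata_noise(corpus),      # metric 4, deterministic
        initialization_check(corpus),      # metric 5, regex + judge fallback
    )
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)
```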
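The `retrieve` call in the block above is left undefined. As a starting point, here is a lexical-only stand-in that ranks chunks by how many distinct query terms they contain. It is an assumption for illustration: it omits the embedding half entirely and is much cruder than a production retriever, but it is enough to get the benchmark loop running end to end.

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"[a-z0-9_]+", text.lower())

def retrieve(corpus: list[str], query: str, k: int = 5) -> list[str]:
    """Return up to k chunks, ranked by count of distinct query terms present."""
    terms = set(tokenize(query))
    scored = [(len(terms & set(tokenize(chunk))), chunk) for chunk in corpus]
    # Stable sort: chunks with equal scores keep their corpus order.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    return [chunk for _, chunk in ranked[:k]]
```

Even this crude version shows why task-shaped headings matter: a chunk whose heading repeats the query words wins on overlap alone.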

Determinism notes: temperature 0 for both judges; run the LLM-judged metrics three times and take the median; pin judge model and prompts in the repo; store per-question results so a regression points at the exact page that broke.
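Two of these notes translate directly into code. The helpers below are a sketch with hypothetical names: `stable_score` takes any zero-argument callable that runs one LLM-judged metric and returns a number, and `regressions` compares stored per-question pass/fail results between two runs.

```python
from statistics import median

def stable_score(run_metric, runs: int = 3) -> float:
    """Run an LLM-judged metric several times and keep the median."""
    return median(run_metric() for _ in range(runs))

def regressions(previous: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Questions that passed in the previous run but fail (or are missing) now."""
    return [q for q, passed in previous.items() if passed and not current.get(q, False)]
```

Since each question maps to exactly one page, the list returned by `regressions` is effectively a list of pages to inspect.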

**Wire it into CI** as a `snippet-quality` job per docs release:

- Self-containment lint on the top-20 pages.
- Formatting lint: 100% language-tagged fences.
- Metadata-noise scan: zero matches.
- Dedup gate: zero pairs at ≥0.8 similarity, with permutation scaffolding allowlisted.
- The 25-question benchmark: block below 70, alert on any 5-point drop.
- Error-page coverage: every SDK error constant has a matching page.
- Freshness alarm: fires if your public index entry exceeds 7 days post-release.

Target: ≥80 on the benchmark, the band the audited leaders occupy.
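The benchmark and freshness thresholds reduce to a small gate function. This is a sketch, assuming your CI job can supply the current and previous benchmark scores plus the release and last-index timestamps; the function name and message wording are hypothetical.

```python
from datetime import datetime, timedelta

def snippet_quality_gate(score, previous_score, released_at, indexed_at, now):
    """Return (failures, alerts): failures block the release, alerts only notify."""
    failures, alerts = [], []
    if score < 70:
        failures.append(f"benchmark {score} is below the blocking floor of 70")
    if previous_score is not None and previous_score - score >= 5:
        alerts.append(f"benchmark dropped {previous_score - score} points")
    # Stale: the index has not been refreshed since the release, 7+ days on.
    if indexed_at < released_at and now - released_at > timedelta(days=7):
        alerts.append("public index entry not refreshed within 7 days of release")
    return failures, alerts
```

Have the CI step exit non-zero when `failures` is non-empty and route `alerts` to wherever the docs team already receives notifications.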

## The receipts

*The research layer. All data collected 2026-06-11; methodology and updates in the [Data Room](/agentic-discovery/data).*

**The density table (Context7 metadata snapshot, n=17 audit; selected rows):**

| Entry | Tokens | Snippets | Benchmark | Updated |
|---|---|---|---|---|
| Resend (websites/resend) | — | small corpus | **92.3** | — |
| Convex docs site (websites/convex_dev) | 100,248 | 1,245 | **91.6** | 1 w |
| Bun (oven-sh/bun) | 1,200,491 | 10,309 | 84.4 | 6 h |
| Drizzle (drizzle-team/drizzle-orm) | 96,732 | 440 | **82.8** | 1 w |
| Convex repo (get-convex/convex-backend) | 249,564 | 3,362 | 79.9 | 1 w |
| Polar (polarsource/polar) | 177,277 | 2,297 | **64.7** | 1 mo |
| Hono (honojs/hono) | 3,267 | 48 | 63.3 | 5 d |

Hono is the parse-failure case, not a density case: exemplary llms.txt tiers, but the index extracted almost nothing — 48 snippets — and Hono is absent from the top 50. Verify the parse, not just the publish. (Snapshot caveat: these metrics are recomputed continuously; Stripe's snippet count read 130 on its library page and 207 in the search API the same day, so any single snapshot carries error bars.)

**Correlations with benchmark (n=17):**

| Variable | Pearson r | Spearman ρ |
|---|---|---|
| log(hours since update) | −0.46 | **−0.54** (strongest signal) |
| log(tokens) | +0.39 | +0.30 |
| log(snippets) | +0.35 | +0.24 |
| tokens-per-snippet | +0.24 | +0.20 |

Tokens-per-snippet barely correlates either, so raw density is not the driver; *self-containedness* is. This sample is non-random (audited winners plus counterexamples), and the freshness–benchmark link could partly reflect that actively maintained products also write better docs.
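To run the same analysis on your own snapshot, Spearman's ρ needs nothing beyond the standard library: rank both variables (averaging ranks for ties) and take the Pearson correlation of the ranks. The sample data below is the six selected rows from the density table that have complete values, with "1 w" read as 168 hours and "1 mo" as roughly 720 hours (my conversion). It is not the full n=17 audit and will not reproduce the figures above.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for idx in order[i:j + 1]:
            ranks[idx] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5

def spearman(x, y):
    """Spearman rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Convex docs, Bun, Drizzle, Convex repo, Polar, Hono (selected table rows only).
hours_since_update = [168, 6, 168, 168, 720, 120]
benchmark_scores = [91.6, 84.4, 82.8, 79.9, 64.7, 63.3]
rho = spearman(hours_since_update, benchmark_scores)
```

Because Spearman works on ranks, taking the log of hours, tokens, or snippets does not change ρ; the log transform only affects the Pearson column.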

**Scoring mechanics, from Upstash's published posts:** benchmark = an LLM jury answers "common developer questions" using only your snippets, re-run after each parse; uniqueness is explicitly penalized — their own published example dinged Next.js for repeating two command patterns. Library owners control parse configuration, benchmark questions, and on-demand re-parsing via the ownership dashboard ([Play 2](/agentic-discovery/ai-agent-registries-and-directories) covers claiming it).

**What good snippets do once retrieved (pilot-grade: single model, n=2 per arm):** in experiment E3, correct current-docs excerpts in context cut stale-config emission from 2/2 to 0/2 on a Tailwind v4 task. Retrieval quality upstream, correct code downstream.

**The c7score timeline:** public npm package and repo, August 2025 (archived snapshot) → verified removed from npm and GitHub, 2026-06-11 → rubric remains fully documented in Upstash's blog posts. Build in-house; cite the posts.

## FAQ

**What makes a code snippet LLM-friendly?**
Self-containment: install command, all imports, the code, and expected output in one block sequence, runnable as pasted. Add a task-shaped heading (the query an agent would issue) and a language-tagged fence. A snippet that assumes invisible setup fails both the scoring rubric and the agent that pastes it.

**What is c7score?**
c7score is Context7's published five-metric quality rubric for documentation snippets: question coverage, LLM judgment (including cross-snippet uniqueness), formatting, project-metadata noise, and initialization. The npm package and public repo were withdrawn — verified gone from npm and GitHub as of 2026-06-11, after being public in August 2025 — but the rubric is documented and reproducible in-house.

**How many code snippets should my docs have?**
There is no target count — density beats volume. Drizzle benchmarks 82.8 with 440 snippets while Polar scores 64.7 with 2,297, and corpus mass correlates with benchmark at only ρ≈0.24–0.30 (n=17). Cover the 25 most common developer questions with one self-contained page each, then stop padding.

**What is a Context7 benchmark score?**
It's the score an LLM jury assigns after answering auto-generated "common developer questions" using only your indexed snippets. It re-runs after each parse, so it responds to docs changes within days. The leaders sit in the 80s and low 90s; treat ≥80 as the bar.

**Should snippets repeat the install command on every page?**
On permutation pages, yes — each framework variant must be fully self-contained because retrieval returns one chunk, and a chunk pointing to a separate setup guide fails. Within a single page-set's purpose, no — deduplicate patterns demonstrated more than once. Scaffolding repetition is exempt; pattern repetition is penalized.

---

*Last verified 2026-06-11. We re-test the claims on this page quarterly — changes are logged in the [Data Room](/agentic-discovery/data).*

**Part of [The Complete Playbook to Agentic Discovery](/agentic-discovery).**


