---
title: "Code Snippets AI Agents Can Use (and How to Score Them)"
description: "Engineer documentation code snippets AI agents can run: the 5-metric scoring rubric, templates, and why 440 snippets beat 2,297."
slug: /agentic-discovery/code-snippets-for-ai-agents
series: The Agentic Discovery Playbook · Play 7 of 12
last_verified: 2026-06-11
---

# Engineer Code Snippets AI Agents Can Actually Use (and Score Them)

> **In short:** AI retrieval indexes score documentation snippets on five metrics: question coverage, LLM-judged quality (including uniqueness), formatting, metadata noise, and initialization (install and imports included). Density beats bulk: Drizzle's 440 snippets benchmark at 82.8 while Polar's 2,297 score 64.7. Write self-contained, task-shaped, deduplicated snippets, and measure them with your own scorer.

**Works in:** Research & Shortlist. Dense, self-contained snippets get you read, then prove the capability.

![Scatter plot of seventeen documentation index entries showing benchmark quality score against hours since last update on a log scale. The trend is negative, Spearman rho minus 0.54: fresher entries score higher. The freshest five entries average 83.6; the stalest five average 72.3.](/agentic-discovery/images/f5-freshness-vs-quality.svg "Freshness is the strongest correlate of retrieval-quality score we measured (rho = minus 0.54).")

## Do this now

- [ ] Make every snippet self-contained: install command + imports + code + expected output, runnable as pasted.
- [ ] Phrase every doc heading as the task an agent would query: "Hash a password," not "Cryptography utilities."
- [ ] Deduplicate: no two snippets in the corpus demonstrate the same pattern.
- [ ] Map the 25 most common developer questions for your product to exactly one page each.
- [ ] Ship per-framework permutation pages, each fully self-contained.
- [ ] Ship error-message pages: H1 = the exact error string, body = cause + fixed snippet.
- [ ] Point Context7 at your docs **site**, not the GitHub repo, and re-index within 7 days of every release.
- [ ] Build the 5-metric scorer in-house and wire it into CI. The @upstash/c7score package no longer exists (verified 2026-06-11).

*Scope: [Play 5](/agentic-discovery/llms-txt) built the index and [Play 6](/agentic-discovery/markdown-docs-for-ai-agents) made every page fetchable as markdown. This page engineers what's inside the code blocks, and how to score it.*

## Who is scoring your code snippets?

Retrieval indexes are, continuously. Context7 (the docs index behind most coding-agent retrieval) runs a published quality stack: after each parse it generates "common developer questions" about your product (via Gemini plus Google Search), has *your snippets alone* answer them, and has an LLM jury (Claude Opus and Gemini Pro as premium juries) score the answers. The benchmark is re-run after each parse, so scores move. Your docs are being examined against the questions developers actually ask on the web, whether you participate or not.

That makes snippet engineering the play that moves the benchmark, and the benchmark is an eval you can study for. Resend's docs-site entry scores 92.3, among the highest observed anywhere, on one of the smallest corpora. The mechanics are knowable; this page turns them into writing rules.

**Who needs this play:** anyone whose Context7-class benchmark sits below ~80, whose docs headings read like a feature tour, or whose snippets assume invisible setup. Prerequisites: Plays 5 and 6. Snippets need a clean, indexable surface first.

## What are the five metrics, and what do they demand of your writing?

Upstash published the c7score rubric (weights configurable). Each metric maps directly to a design constraint:

| # | Metric | What it measures | Your design constraint |
|---|---|---|---|
| 1 | Question coverage | Do snippets exist that answer the common developer questions? | One page per common question; write the 25-question list first |
| 2 | LLM judgment | Relevancy, clarity, correctness, **uniqueness across snippets** | Deduplicate; one pattern demonstrated once |
| 3 | Formatting | Fences, language tags, structure | Language-tagged fences; one task per heading |
| 4 | Project metadata | Repo noise: badges, contributor lists, license boilerplate | Index the docs site, not the repo; strip noise |
| 5 | Initialization | Snippets include install + imports | The self-contained template below |

Two of these are deterministic lint checks (3, 4), one is content strategy (1), and two are writing discipline (2, 5). None require guessing what the judge wants. The rubric is documented.
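The two deterministic metrics are cheap to approximate locally. The following is a minimal sketch, not the c7score implementation: the function names and the noise patterns are assumptions chosen for illustration, and you would extend them to match whatever your own pages carry.

```python
import re

# A fence line: three or more backticks, then an optional info string.
FENCE = re.compile(r"^(`{3,})\s*(\S*)")

# Hypothetical noise patterns (assumption): badges, license and contributor sections.
NOISE_PATTERNS = [
    re.compile(r"img\.shields\.io"),
    re.compile(r"^#+\s*(contributors|license)\b", re.I | re.M),
    re.compile(r"permission is hereby granted", re.I),
]


def untagged_fences(markdown: str) -> list[int]:
    """Return 1-based line numbers of opening fences that lack a language tag."""
    missing, open_len = [], 0
    for number, line in enumerate(markdown.splitlines(), start=1):
        match = FENCE.match(line.strip())
        if not match:
            continue
        ticks, tag = len(match.group(1)), match.group(2)
        if open_len == 0:                      # this line opens a block
            open_len = ticks
            if not tag:
                missing.append(number)
        elif ticks >= open_len and not tag:    # this line closes the open block
            open_len = 0
    return missing


def noise_hits(markdown: str) -> int:
    """Count how many noise patterns appear at least once in the page."""
    return sum(1 for pattern in NOISE_PATTERNS if pattern.search(markdown))
```

A page passes metric 3 when `untagged_fences` returns an empty list and metric 4 when `noise_hits` returns 0. Tracking the opening fence length keeps a four-backtick wrapper from being confused with the three-backtick fences nested inside it.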

## What does a self-contained snippet look like?

Every snippet on every page follows one shape: install, imports, code, expected output. Nothing assumed, nothing "see setup guide."

````markdown
## Refund a payment
<!-- Task-shaped H2: the exact query an agent issues -->

Refund all or part of a captured payment.

```bash
npm install @paykit/node@latest
```

```ts
import { PayKit } from "@paykit/node";

const paykit = new PayKit(process.env.PAYKIT_SECRET_KEY);

const refund = await paykit.refunds.create({
  payment: "pay_123",
  amount: 500, // partial refund, cents; omit for full
});
console.log(refund.status); // "succeeded"
```
````

The rules this template encodes: install line present (metric 5); all imports present (metric 5); env/config visible, no invisible prerequisites; expected output shown as a comment; language-tagged fences (metric 3); one task per heading; and no narrative between install and code that an extractor could split into a partial snippet. A snippet missing its import fails the initialization metric *and* fails the agent at paste time: a double penalty.
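The initialization rules lend themselves to a regex check. Here is a hedged sketch: `initialization_gaps` is a hypothetical helper, and its patterns cover only a handful of package managers and JavaScript/Python import styles, so treat it as a first pass before any judge fallback.

```python
import re

TICKS = "`" * 3  # built this way so the sketch can sit inside a fenced block
BLOCK = re.compile(rf"^{TICKS}[^\n]*\n(.*?)^{TICKS}\s*$", re.M | re.S)

# Assumed patterns; add the package managers and import forms your SDKs use.
INSTALL = re.compile(
    r"^\s*(npm (install|i)|pnpm add|yarn add|bun add|pip install)\s+\S+", re.M
)
IMPORT = re.compile(
    r"^\s*(import\s.+|from\s+\S+\s+import\s.+|.*\brequire\()", re.M
)


def initialization_gaps(section: str) -> list[str]:
    """Return which initialization parts are missing from a section's code."""
    code = "\n".join(BLOCK.findall(section))  # look inside fenced blocks only
    gaps = []
    if not INSTALL.search(code):
        gaps.append("install command")
    if not IMPORT.search(code):
        gaps.append("imports")
    return gaps
```

Run it per heading section rather than per page, so a page with one complete snippet cannot mask three incomplete ones.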

## Why do headings need to be task-shaped?

Because the heading is the retrieval query. Bun's llms.txt indexes ~190 guide pages phrased exactly as agent queries ("Convert a Blob to a string," "Hash a password"): pre-chunked retrieval units that match the question verbatim. Prisma does the structural version: a "## Common Queries" section placed *before* the reference sections.

The counter-pattern: headings like "Why developers love PayKit" match no agent query and dilute lexical search. Every H2 on a guide page should be a task a developer would type. Build the list empirically: mine support tickets, Discord, GitHub issues, and StackOverflow for the 25 questions developers actually ask. That list is simultaneously your eval (below) and your content roadmap, with every question mapping to exactly one page.
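The one-question-to-one-page rule is easy to audit once the list exists. A small sketch, assuming you keep the mapping as a plain dictionary from question to page paths (the structure and the paths in the usage note are hypothetical):

```python
def mapping_problems(question_pages: dict[str, list[str]]) -> dict[str, str]:
    """Flag questions that map to no page or to more than one page."""
    problems = {}
    for question, pages in question_pages.items():
        if not pages:
            problems[question] = "no page"
        elif len(pages) > 1:
            problems[question] = "split across " + ", ".join(sorted(pages))
    return problems
```

For example, a question mapped to both `/docs/refunds.md` and `/docs/payments.md` is reported as split, which is the cue to pick a canonical page and link from the other.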

## Why deduplicate, and when is repetition allowed?

Uniqueness is a scored criterion, not just tidiness. Context7's published example dinged Next.js (the #1 ranked library) for repeating two command patterns across its docs. Duplication actively costs points under metric 2.

The workflow: extract all fenced blocks from the corpus, normalize (strip comments and whitespace, mask identifiers), shingle, and flag pairs above ~0.8 similarity. For every flagged pair, keep the canonical page and replace the duplicate with a one-line link.
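The workflow above can be sketched in a few functions. This is one possible reading, with assumptions labeled: token 5-grams as shingles, Jaccard overlap as the similarity measure, and a deliberately crude normalizer. It does not handle nested fences or block comments.

```python
import re
from itertools import combinations

TICKS = "`" * 3  # built this way so the sketch can sit inside a fenced block
BLOCK = re.compile(rf"^{TICKS}[^\n]*\n(.*?)^{TICKS}\s*$", re.M | re.S)
KEEP = {"STR", "import", "from", "const", "let", "await", "new", "return", "def"}


def normalize(code: str) -> list[str]:
    """Mask string literals, strip line comments, mask identifiers, tokenize."""
    code = re.sub(r"\"[^\"\n]*\"|'[^'\n]*'", "STR", code)
    code = re.sub(r"//[^\n]*|#[^\n]*", "", code)
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)
    return [t if t in KEEP or not (t[0].isalpha() or t[0] == "_") else "ID"
            for t in tokens]


def shingles(tokens: list[str], k: int = 5) -> set[tuple[str, ...]]:
    """Overlapping k-token windows; short snippets yield a single shingle."""
    return {tuple(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}


def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0


def near_duplicates(pages: dict[str, str], threshold: float = 0.8):
    """Return (page, page, similarity) for every block pair at or above threshold."""
    blocks = [(path, shingles(normalize(body)))
              for path, text in pages.items()
              for body in BLOCK.findall(text)]
    return [(p1, p2, round(jaccard(s1, s2), 2))
            for (p1, s1), (p2, s2) in combinations(blocks, 2)
            if jaccard(s1, s2) >= threshold]
```

Masking every identifier is aggressive and will over-flag very short snippets, so review the report by hand before replacing anything with a link.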

**The exception: permutation pages.** Agents query "paykit next.js app router," not "paykit." For each top task × top framework, ship a dedicated page (`/docs/frameworks/nextjs.md`) that is *fully* self-contained. Do not factor shared steps into a common page, because retrieval returns one chunk, and a chunk with a "see setup guide" hole fails initialization. Permutation pages legitimately repeat install/import scaffolding; dedup applies to patterns within a page-set's purpose, not to required boilerplate across framework variants. Drizzle is the proof this works at scale: its docs are terse permutation pages ("Get started with Drizzle and X" ×60), and it posts the density numbers below. Keep the framework-specific delta prominent: the raw-body webhook middleware in Express, the route handler in Next.js.

## Should you ship error-message pages?

Yes. Agents grep error strings. When an agent hits `Error: PAYKIT_WEBHOOK_SIGNATURE_INVALID`, it searches the literal string. Ship `/docs/errors/<code>.md` for every common error: H1 = the exact runtime string verbatim, then the cause, then the fixed snippet (self-contained, per the template), then links to related errors. List them all in llms.txt under an Errors section with the error string in the description. That's the lexical match target.

This converts your error vocabulary into retrieval surface no competitor can occupy: nobody else's docs can rank for your exact error strings.
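Coverage of the error vocabulary can be checked mechanically. A minimal sketch, assuming you can export your SDK's error codes as a list and your published doc paths as a set; both function names are hypothetical, as is the second error code in the usage note.

```python
def missing_error_pages(error_codes: list[str], doc_paths: set[str]) -> list[str]:
    """Return error codes that have no /docs/errors/<code>.md page."""
    return sorted(code for code in error_codes
                  if f"/docs/errors/{code}.md" not in doc_paths)


def h1_is_exact(page: str, runtime_string: str) -> bool:
    """True when the first non-blank line is an H1 equal to the runtime string."""
    first = next((line for line in page.splitlines() if line.strip()), "")
    return first == f"# {runtime_string}"
```

Given codes `PAYKIT_WEBHOOK_SIGNATURE_INVALID` and `PAYKIT_CARD_DECLINED` with only the first page published, `missing_error_pages` returns `["PAYKIT_CARD_DECLINED"]`. The H1 check is strict on purpose: a paraphrased heading loses the literal match an agent searches for.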

## Does volume help? The density evidence

No, and this is the most counterintuitive, best-supported finding in the dataset. As of 2026-06-11, across 17 audited Context7 entries:

- **Drizzle: 440 snippets, benchmark 82.8. Polar: 2,297 snippets, benchmark 64.7.** Five times the corpus, 18 points worse.
- Corpus mass barely correlates with benchmark at all: ρ≈0.24 for log(snippets), ρ≈0.30 for log(tokens) (Spearman, n=17).
- The strongest correlate is freshness: benchmark vs. log(hours since update), Spearman −0.54. Freshest-5 entries average 83.6; stalest-5 average 72.3, an 11.3-point gap. (Polar's entry was also 1 month stale; Drizzle's a week.)
- Resend hits 92.3 on one of the smallest corpora in the audit. Convex's curated docs-site entry scores 91.6 vs. 79.9 for its noisy repo entry.

The mechanism is the rubric itself: repeated patterns trigger the uniqueness penalty, repo noise triggers the metadata penalty, and padding dilutes retrieval. Write fewer, denser, self-contained snippets, then re-index often. Don't pad.

## How do you build your own scorer?

First, the history, stated plainly: Upstash shipped c7score as a public npm package with an open repo, and archived snapshots show both were public as of August 2025. **As of 2026-06-11, it's gone:** npm search returns zero results and the repo 404s. The scorer is now internal and proprietary. Don't reference @upstash/c7score in tooling or deliverables, since builds depending on it break. But the rubric itself is fully documented in Upstash's published blog posts, which means you can, and should, implement it yourself as LLM-as-judge.

Two components:

**1. The benchmark.** For each of your 25 eval questions, give a judge LLM *only* your extracted snippets (simulate retrieval: top-k chunks by lexical + embedding match) and the question; a second judge grades the answer against a gold answer. Score = % correct.

**2. The quality score** (five metrics, 0–100 each):

- **Coverage:** the fraction of the 25 questions with at least one retrieving snippet (deterministic).
- **LLM judgment:** a judge rates sampled snippets for relevancy, clarity, correctness, and cross-snippet uniqueness (feed it pairs from your dedup report).
- **Formatting:** deterministic lint.
- **Metadata noise:** deterministic scan for badges, license text, and contributor sections.
- **Initialization:** regex plus judge fallback for install + imports.

```python
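# retrieve() and the five metric functions are your own implementations;
# each metric is expected to return a 0-100 score.

def benchmark(questions, corpus, judge):
    """Percent of eval questions answered correctly from snippets alone."""
    correct = 0
    for q in questions:
        chunks = retrieve(corpus, q.text, k=5)          # lexical + embedding
        answer = judge.answer(q.text, context=chunks)   # snippets ONLY
        correct += bool(judge.grade(answer, q.gold))    # second pass, binary
    return 100 * correct / len(questions)

def weighted_mean(scores, weights=None):
    """Average the scores; equal weights unless you configure your own."""
    weights = weights or [1] * len(scores)
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

def c7_quality(corpus, questions, judge, weights=None):
    """Combine the five metric scores into one quality score."""
    return weighted_mean([
        coverage(corpus, questions),       # metric 1, deterministic
        llm_judgment(corpus, judge),       # metric 2, sampled pairs incl. dedup hits
        formatting_lint(corpus),           # metric 3, deterministic
        100 - metadata_noise(corpus),      # metric 4, deterministic
        initialization_check(corpus),      # metric 5, regex + judge fallback
    ], weights)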
```
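The `retrieve` call in the sketch is left to you. As a starting point, here is a lexical-only stand-in that ranks chunks by the share of query words they contain. It assumes the corpus is a list of text chunks and leaves out the embedding half entirely, so it understates what a production index would find.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word set used for overlap scoring."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def retrieve(corpus: list[str], query: str, k: int = 5) -> list[str]:
    """Return up to k chunks ranked by the fraction of query words they contain."""
    wanted = tokens(query)
    scored = [
        (len(wanted & tokens(chunk)) / len(wanted) if wanted else 0.0, index)
        for index, chunk in enumerate(corpus)
    ]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [corpus[index] for score, index in ranked[:k] if score > 0]
```

Because headings are part of each chunk, a task-shaped heading that repeats the query verbatim scores a full match here, which is the effect the heading rule is aiming for.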

Determinism notes: temperature 0 for both judges; run the LLM-judged metrics three times and take the median; pin judge model and prompts in the repo; store per-question results so a regression points at the exact page that broke.
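Two of these notes translate directly into code. A minimal sketch with hypothetical helper names, assuming each scoring run is wrapped in a zero-argument callable and per-question results are stored as question-to-boolean dictionaries:

```python
from statistics import median


def stable_score(score_once, runs: int = 3) -> float:
    """Run an LLM-judged metric several times and keep the median."""
    return median(score_once() for _ in range(runs))


def regressions(previous: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Questions that passed in the previous run and fail (or are missing) now."""
    return sorted(q for q, passed in previous.items()
                  if passed and not current.get(q, False))
```

The median discards a single outlier run in either direction, and the regression list names the questions whose pages to inspect first.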

**Wire it into CI** as a `snippet-quality` job per docs release:

- Self-containment lint on the top-20 pages.
- Formatting lint (100% language-tagged fences).
- Metadata-noise scan (zero matches).
- Dedup gate (≥0.8-similarity pairs = 0, permutation scaffolding allowlisted).
- The 25-question benchmark (block below 70, alert on any 5-point drop).
- Error-page coverage (every SDK error constant has a matching page).
- A freshness alarm if your public index entry exceeds 7 days post-release.

Target: ≥80 on the benchmark, the band the audited leaders occupy.
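The benchmark thresholds in that job (block below 70, alert on a 5-point drop) can be expressed as a small gate. This is a sketch under assumptions: `benchmark_gate` is a hypothetical name, and how you load the previous score depends on your CI setup.

```python
from typing import Optional


def benchmark_gate(
    current: float,
    previous: Optional[float] = None,
    floor: float = 70.0,
    max_drop: float = 5.0,
) -> tuple[int, list[str]]:
    """Return (exit_code, messages): 1 blocks the release, 0 lets it through."""
    exit_code, messages = 0, []
    if current < floor:
        exit_code = 1
        messages.append(f"BLOCK: benchmark {current:.1f} is below {floor:.1f}")
    if previous is not None and previous - current >= max_drop:
        messages.append(f"ALERT: benchmark dropped {previous - current:.1f} points")
    return exit_code, messages
```

In a CI script you would print the messages and pass the exit code to `sys.exit`. A drop alone alerts without blocking, matching the split between the two thresholds.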

## The receipts

*The research layer. All data collected 2026-06-11; methodology and updates in the [Data Room](/agentic-discovery/data).*

**The density table (Context7 metadata snapshot, n=17 audit; selected rows):**

| Entry | Tokens | Snippets | Benchmark | Updated |
|---|---|---|---|---|
| Resend (websites/resend) | n/a | small corpus | **92.3** | n/a |
| Convex docs site (websites/convex_dev) | 100,248 | 1,245 | **91.6** | 1 w |
| Bun (oven-sh/bun) | 1,200,491 | 10,309 | 84.4 | 6 h |
| Drizzle (drizzle-team/drizzle-orm) | 96,732 | 440 | **82.8** | 1 w |
| Convex repo (get-convex/convex-backend) | 249,564 | 3,362 | 79.9 | 1 w |
| Polar (polarsource/polar) | 177,277 | 2,297 | **64.7** | 1 mo |
| Hono (honojs/hono) | 3,267 | 48 | 63.3 | 5 d |

Hono is the parse-failure case, not a density case: exemplary llms.txt tiers, but the index extracted almost nothing (48 snippets), and Hono is absent from the top 50. Verify the parse, not just the publish. (Snapshot caveat: these metrics are recomputed continuously; Stripe's snippet count read 130 on its library page and 207 in the search API the same day, so any single snapshot carries error bars.)

**Correlations (n=17, Pearson r / Spearman ρ):** log(hours since update) −0.46 / **−0.54** (strongest signal); log(tokens) +0.39 / +0.30; log(snippets) +0.35 / +0.24; tokens-per-snippet +0.24 / +0.20. Tokens-per-snippet density alone doesn't explain the scores either; what the rubric rewards is *self-containedness*. This sample is non-random (audited winners plus counterexamples), and the freshness/benchmark link could partly reflect that actively maintained products also write better docs.
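To make the Spearman figures concrete, here is a worked sketch on the six table rows above that have a numeric snippet count. It is illustrative only: six selected rows cannot reproduce the n=17 values, and the tie-free formula used here is an assumption that holds for this subset.

```python
def ranks(values: list[float]) -> list[int]:
    """Rank values from 1 (smallest) to n; assumes no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman rho via the tie-free formula 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# (snippets, benchmark) from the density table; Resend has no count, so it is left out.
rows = {
    "Convex docs site": (1245, 91.6),
    "Bun": (10309, 84.4),
    "Drizzle": (440, 82.8),
    "Convex repo": (3362, 79.9),
    "Polar": (2297, 64.7),
    "Hono": (48, 63.3),
}
snippets = [count for count, _ in rows.values()]
scores = [score for _, score in rows.values()]
rho = spearman(snippets, scores)
print(round(rho, 2))  # 0.37
```

Spearman works on ranks, so taking log(snippets) first leaves ρ unchanged; the log transform only affects the Pearson column.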

**Scoring mechanics, from Upstash's published posts:** benchmark = an LLM jury answers "common developer questions" using only your snippets, re-run after each parse; uniqueness is explicitly penalized, since their own published example dinged Next.js for repeating two command patterns. Library owners control parse configuration, benchmark questions, and on-demand re-parsing via the ownership dashboard ([Play 2](/agentic-discovery/ai-agent-registries-and-directories) covers claiming it).

**What good snippets do once retrieved (pilot-grade: single model, n=2 per arm):** in experiment E3, correct current-docs excerpts in context cut stale-config emission from 2/2 to 0/2 on a Tailwind v4 task. Retrieval quality upstream, correct code downstream.

**The c7score timeline:** public npm package and repo, August 2025 (archived snapshot); verified removed from npm and GitHub, 2026-06-11; rubric remains fully documented in Upstash's blog posts. Build in-house; cite the posts.

## FAQ

**What makes a code snippet LLM-friendly?**
Self-containment: install command, all imports, the code, and expected output in one block sequence, runnable as pasted. Add a task-shaped heading (the query an agent would issue) and a language-tagged fence. A snippet that assumes invisible setup fails both the scoring rubric and the agent that pastes it.

**What is c7score?**
c7score is Context7's published five-metric quality rubric for documentation snippets: question coverage, LLM judgment (including cross-snippet uniqueness), formatting, project-metadata noise, and initialization. The npm package and public repo were withdrawn (verified gone from npm and GitHub as of 2026-06-11, after being public in August 2025), but the rubric is documented and reproducible in-house.

**How many code snippets should my docs have?**
There is no target count; density beats volume. Drizzle benchmarks 82.8 with 440 snippets while Polar scores 64.7 with 2,297, and corpus mass correlates with benchmark at only ρ≈0.24–0.30 (n=17). Cover the 25 most common developer questions with one self-contained page each, then stop padding.

**What is a Context7 benchmark score?**
It's the score an LLM jury assigns after answering auto-generated "common developer questions" using only your indexed snippets. It re-runs after each parse, so it responds to docs changes within days. The leaders sit in the 80s and low 90s; treat ≥80 as the bar.

**Should snippets repeat the install command on every page?**
On permutation pages, yes: each framework variant must be fully self-contained because retrieval returns one chunk, and a chunk pointing to a separate setup guide fails. Within a single page-set's purpose, no: deduplicate patterns demonstrated more than once. Scaffolding repetition is exempt; pattern repetition is penalized.

---

*Last verified 2026-06-11. We re-test the claims on this page quarterly; changes are logged in the [Data Room](/agentic-discovery/data).*

**Part of [The Complete Playbook to Agentic Discovery](/agentic-discovery).**

← Previous: [Markdown Docs for AI Agents](/agentic-discovery/markdown-docs-for-ai-agents) · Next: [Stop AI Using Your Deprecated APIs](/agentic-discovery/stop-ai-using-deprecated-apis) →
