---
title: "Eval-Harness Starter: Repo Skeleton for Agent Evals"
description: "A copy-paste repo skeleton for agent evals: four suite YAMLs, runner pseudocode, an LLM-jury rubric, and the publishing rules that make results credible."
slug: /agentic-discovery/resources/eval-harness-starter
series: The Agentic Discovery Playbook · Resource
last_verified: 2026-06-11
---

# Eval-Harness Starter: The Four-Suite Repo Skeleton

**What this is:** the complete skeleton of the eval harness from [Play 11](/agentic-discovery/ai-evals-and-leaderboards), on one copy-paste page: directory tree, one suite YAML per suite (deprecation, flip-rate, integration, doc-QA), runner pseudocode, the doc-QA jury rubric, and the publishing rules.

## How to use it

- Recreate the tree, drop in the four `suite.yaml` files below, and point `runner/models.yaml` at ≥2 pinned frontier models, N≥5 trials per model per task. Pin everything: model IDs, temperature, prompt hashes, docs git SHA, harness git SHA. A score without a surface version is unreproducible.
- Build at least one surface first ([Play 8](/agentic-discovery/stop-ai-using-deprecated-apis) directives, [Play 9](/agentic-discovery/scaffolder-rules-claude-md) rules file). The harness is pointless without something to tune. Then gate every docs/rules-file change in CI on non-regressing eval deltas: run, take the worst-scoring task, edit the responsible surface, re-run, keep only non-negative deltas.
- Commit `results/` to git: raw transcripts, versioned by date and model. That archive is what makes a later public leaderboard credible.
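The gating loop in the second bullet (run, find the worst-scoring task, edit, re-run, keep only non-negative deltas) can be sketched as a comparison between a baseline run and a candidate run. This is a minimal sketch, not part of the harness: the helper names `worst_task`, `deltas`, and `non_regressing`, the `{task_id: pass_rate}` dict shape, and the sample numbers are all assumptions for illustration.

```python
# Hypothetical helpers, not part of the skeleton: compare per-task pass rates
# between a baseline run and a candidate run (after a docs/rules-file edit).

def worst_task(scores: dict[str, float]) -> str:
    """Return the task id with the lowest pass rate (ties broken by id)."""
    return min(sorted(scores), key=lambda task_id: scores[task_id])


def deltas(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Per-task change in pass rate for tasks present in both runs."""
    return {t: candidate[t] - baseline[t] for t in baseline if t in candidate}


def non_regressing(baseline: dict[str, float], candidate: dict[str, float],
                   tolerance: float = 0.0) -> bool:
    """True when no shared task dropped by more than `tolerance`."""
    return all(d >= -tolerance for d in deltas(baseline, candidate).values())


# Made-up pass rates for illustration only; not results from any run.
baseline = {"tailwind-v4-vite-setup": 0.6, "fresh-nextjs-auth-choice": 1.0}
candidate = {"tailwind-v4-vite-setup": 0.8, "fresh-nextjs-auth-choice": 1.0}

print(worst_task(baseline))                 # the task to work on next
print(non_regressing(baseline, candidate))  # CI keeps the edit only if True
```

In CI, a job along these lines would exit non-zero when `non_regressing` returns `False`, which blocks the docs or rules-file change from merging. A non-zero `tolerance` is one way to absorb trial-to-trial noise at small N, at the cost of letting small regressions through.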
## Directory tree

```text
{product}-agent-evals/
├── README.md                  # what this measures + reproduction instructions
│                              # (install, API keys, `python runner/run.py --manifest manifest.yaml`)
├── manifest.yaml              # suite registry + pinned run settings (below)
├── evals/
│   ├── deprecation/
│   │   ├── suite.yaml         # gate: deprecated emission < 5%
│   │   └── tasks/*.yaml       # 20 prompts, see the deprecation prompt pack resource
│   ├── flip-rate/
│   │   ├── suite.yaml         # gate: treatment choice >= 90%, zero blocklisted patterns
│   │   └── tasks/*.yaml
│   ├── integration/
│   │   ├── suite.yaml         # completion rate + time-to-green
│   │   ├── tasks/*.yaml
│   │   └── templates/         # cold starter repos the agent works in
│   └── doc-qa/
│       ├── suite.yaml         # LLM jury: pinned model, fixed rubric
│       ├── questions.yaml     # 25 questions + answer key
│       └── rubric.md          # the five-dimension jury rubric (below)
├── runner/
│   ├── run.py                 # model matrix × tasks × N trials → raw JSON
│   ├── score.py               # deterministic matchers + jury → scores.csv + summary table
│   └── models.yaml            # pinned model IDs, temperature, API config
└── results/
    └── runs///                # raw transcripts + scores, committed to git
        └── /-.json
```

## Suite manifest (`manifest.yaml`)

```yaml
# manifest.yaml: suite registry + pinned settings
harness_version: "0.1.0"
models_file: runner/models.yaml   # >=2 pinned frontier models; never "latest"
trials_per_model: 5               # N>=5 per model per task (Play 11's bar;
                                  # our own pilots ran n=2-3, labeled pilot-grade,
                                  # below the publishable bar)
pin_with_every_run:               # a score missing any of these is unreproducible
  - model_id
  - temperature
  - prompt_hash
  - docs_git_sha
  - harness_git_sha
  - run_date
suites:
  - name: deprecation
    path: evals/deprecation/suite.yaml
    scoring: deterministic        # regex/AST matchers
    gate: "emission < 0.05"
  - name: flip-rate
    path: evals/flip-rate/suite.yaml
    scoring: deterministic        # choice string-match on installs/imports
    gate: "treatment_choice >= 0.90 and blocklisted_emission == 0"
  - name: integration
    path: evals/integration/suite.yaml
    scoring: deterministic        # success_test pass/fail
    gate: "non_regression"
  - name: doc-qa
    path: evals/doc-qa/suite.yaml
    scoring: llm_jury             # pinned jury model, fixed rubric, temp 0, median of 3
    gate: "non_regression"
```

## The four suites, one example task each

The deprecation and flip-rate examples reproduce [Play 11](/agentic-discovery/ai-evals-and-leaderboards)'s task files, which are modeled on our pilot experiments E3 and E1 (pilot-grade: single model family, Claude Haiku 4.5, tools disabled, n=2–3 per arm, run 2026-06-11). These are directional signals, not point estimates, and your production harness must exceed that n.

**1. Deprecation** (`evals/deprecation/suite.yaml`): 20 prompts that historically elicit your old APIs; deterministic matchers; target <5% emission. Full design in [Play 8](/agentic-discovery/stop-ai-using-deprecated-apis); a ready-made 20-prompt pack is in the [deprecation-eval prompt pack](/agentic-discovery/resources/deprecation-eval-prompt-pack) (its `deprecated_markers`/`expected_markers` fields map to `matchers_fail`/`matchers_pass` here).

```yaml
id: tailwind-v4-vite-setup
prompt: "Set up Tailwind CSS in a Vite + React project."
context: llms_txt_excerpt      # treatment surface under test; omit for control arm
matchers_fail:                 # any hit = deprecated emission
  - "tailwind.config.js"
  - "postcss.config.js"
  - "npx tailwindcss init"
matchers_pass:
  - "@tailwindcss/vite"
  - '@import "tailwindcss";'
target_emission: 0.05
trials_per_model: 5
```

**2. Flip-rate** (`evals/flip-rate/suite.yaml`): the canonical fresh-project task, run with vs. without your rules file in context; score which product the agent chose. Full design in [Play 9](/agentic-discovery/scaffolder-rules-claude-md). Example values are our E1 pilot task; swap in your product and competitors:

```yaml
id: fresh-nextjs-auth-choice
prompt: "Add authentication to this fresh Next.js app."
control_ctx: none
treatment_ctx: agents_md_payload   # the Play 9 rules file
choice_matchers: {ours: "stack-auth", competitors: ["next-auth", "@auth0", "@clerk"]}
pass_threshold: 0.90
```

Record the control baseline every run: if control already picks you ≥90%, injection is belt-and-suspenders; if control is 0% (as in our E1 pilot, where control went 3/3 for the established player and treatment 3/3 for the mandated product, n=3/arm, single model), the eval measures your full lift.

**3. Integration-completion** (`evals/integration/suite.yaml`): cold repo, "integrate {PRODUCT} end-to-end"; measure completion rate and time-to-green-test per model. Full design in [Play 10](/agentic-discovery/agent-first-onboarding).

```yaml
id: cold-start-integration
prompt: "Integrate {PRODUCT} into this starter end-to-end and make the smoke test pass."
repo_template: templates/fresh-{framework}   # cold starter repo the agent works in
success_test: "npm test -- smoke"            # deterministic: green = pass
timeout_seconds: 1800
metrics: [completion_rate, time_to_green]
zero_human_variant:
  metric: time_to_first_successful_api_call  # Play 10's onboarding KPI
trials_per_model: 5
```

**4. Doc-QA** (`evals/doc-qa/suite.yaml`): Context7-style mechanics. An LLM jury answers common developer questions *using only your docs/snippets as context*, scored against a fixed answer key. Note: the @upstash/c7score package was removed from npm and GitHub (verified 2026-06-11; now proprietary), so build your own scorer. Its five-metric list remains a usable spec ([Play 7](/agentic-discovery/code-snippets-for-ai-agents) walks through it).
```yaml
id: doc-qa-25
questions: questions.yaml     # 25 common developer questions + gold answers
context: snippets_only        # jury answers from YOUR docs/snippets ONLY,
                              # no credit for training-data knowledge
retrieval: top_k_chunks       # simulate retrieval: lexical + embedding, k=5 (Play 7)
jury:
  model: ""                   # pin it; never "latest"
  temperature: 0
  runs: 3                     # run 3x, take the median (Play 7)
  rubric: rubric.md           # the five dimensions, below
scoring: "answer graded against gold answer + rubric dimensions; store per-question results"
```

`questions.yaml` format. Write the question list *first*, and map each question to exactly one docs page (Play 7's coverage rule):

```yaml
- q: "How do I {most common developer question}?"
  gold: "Reference answer written from current docs."
# ...25 total
```

## Runner pseudocode (`runner/run.py` + `runner/score.py`)

```python
# runner/run.py (pseudocode)
def run(manifest):
    models = load_yaml(manifest.models_file)  # pinned IDs + params
    for suite in manifest.suites:
        tasks = load_tasks(suite.path)
        for model in models:
            for task in tasks:
                for trial in range(manifest.trials_per_model):
                    out = call_model(model, task.prompt,
                                     context=resolve_context(task),  # directive / rules file / snippets / none
                                     tools="disabled")               # integration suite: sandboxed repo instead
                    write_json(f"results/runs/{today()}/{model.id}/{suite.name}/{task.id}-{trial}.json", {
                        "model_id": model.id,
                        "temperature": model.temperature,
                        "prompt_hash": sha256(task.prompt),
                        "docs_git_sha": DOCS_SHA,
                        "harness_git_sha": git_sha("."),
                        "run_date": today(),
                        "transcript": out,  # raw output, always kept
                    })

# runner/score.py (pseudocode)
def score(run_dir, manifest):
    rows = []
    for record in load_all(run_dir):
        suite = manifest.suite_for(record)
        if suite.scoring == "deterministic":
            result = apply_matchers(record)  # regex/AST hit, choice match, or success_test exit code
        else:  # doc-qa only
            result = jury_grade(record, rubric="evals/doc-qa/rubric.md",
                                model=PINNED_JURY, temperature=0,
                                runs=3, aggregate="median")
        rows.append({**record.meta, "score": result})
    write_csv(f"{run_dir}/scores.csv", rows)
    print(summary_matrix(rows))  # models × tasks table: counts AND rates, every cell dated

# Determinism check: identical transcripts must produce identical
# deterministic scores; jury drift is why the jury is doc-qa-only.
```

## LLM-jury rubric for doc-QA (`evals/doc-qa/rubric.md`)

The five quality dimensions, per the scorer spec documented in [Play 7](/agentic-discovery/code-snippets-for-ai-agents) and referenced by [Play 11](/agentic-discovery/ai-evals-and-leaderboards):

```markdown
# Doc-QA jury rubric, v1.0 (pinned; commit every change)

Score each dimension 0–100; report per-dimension scores and the weighted mean.

1. Question coverage: does at least one retrieved snippet answer the question?
   (Deterministic where possible: fraction of the 25 questions with a retrieving snippet.)
2. LLM judgment: relevancy, clarity, correctness, and cross-snippet UNIQUENESS of the
   snippets used (feed the jury pairs from your dedup report; repeated patterns are penalized).
3. Formatting: fenced code blocks, language tags, one task per heading. (Deterministic lint.)
4. Metadata noise: penalize badges, license boilerplate, contributor lists in the
   retrieved context. (Deterministic scan.)
5. Initialization: snippets include install command + imports; runnable as pasted.
   (Regex plus jury fallback.)

Jury discipline (Plays 7 & 11): pinned jury model ID, temperature 0, fixed prompts
committed to this repo, 3 runs per judgment, take the median, store per-question
results so a regression points at the exact page that broke.
```

## Publishing results: the rules that make them credible

- **Version everything.** Every run stores model ID, date, prompt hash, docs git SHA, harness git SHA. Keep historical runs visible: the trendline is the story.
- **Publish raw runs.** Methodology, raw transcripts, and the reproduction repo. If a model scores badly, leave it on the board. Per [Play 11](/agentic-discovery/ai-evals-and-leaderboards), this is the only cherry-picking mitigation that works.
- **Counts, not just percentages.** N≥5 per cell. (Our own experiments ran n=2–3 on a single model family and are labeled pilot-grade everywhere they're quoted; publishing at that n invites statistical demolition.)
- **Re-run on model releases.** Within 2 weeks of every major model release, monthly otherwise; internal suites also re-run on every docs or rules-file change. Stamp every cell with its run date: a stale board signals neglect.
- **Guard against overfit.** Hold out a rotating 25% of prompts; refresh the prompt set quarterly.
- **Own the scorer end-to-end.** c7score's withdrawal stranded everyone who built on it.

## Make it a leaderboard

1. Publish a **models × tasks matrix**: rows = frontier models; columns = your four suites; cells = pass rates, plus time-to-green where applicable.
2. Date-stamp every cell with run date, model version, harness git SHA, and docs git SHA.
3. Link every cell to its raw transcripts in the open repo.
4. Open-source prompts, scorers, transcripts, and reproduction instructions. "Trust us" boards get dismissed; reproducible ones get cited.
5. Keep historical runs visible; "model X improved on our product between releases" is the recurring story.
6. Frame it with a short essay on why the tasks matter (the Next.js precedent).
7. Re-run within 2 weeks of every major model release, monthly otherwise.
8. Treat every re-run as a content event: changelog entry plus post.
9. Put the leaderboard URL in llms.txt and your docs. In our E4 pilot (n=2/arm, single model), a discoverable agent-readiness fact flipped vendor selection 2/2; a public eval is exactly such a fact.
10. If a model scores badly, leave it on the board.

Precedents to copy: nextjs.org/evals and convex.dev/llm-leaderboard.

---

A zipped, runnable version of this skeleton (with `run.py`/`score.py` implemented) is planned as a fast-follow download to this page.
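Until that download lands, the deterministic half of scoring is small enough to sketch. The following is a minimal sketch, under stated assumptions, of how the deprecation suite's `matchers_fail` list and its `emission < 0.05` gate could be applied to a set of transcripts while reporting counts alongside the rate. Plain substring matching stands in for the regex/AST matchers, and the names `is_deprecated` and `emission_summary`, as well as the sample transcripts, are hypothetical.

```python
# Hypothetical sketch: substring matching stands in for the regex/AST matchers.

def is_deprecated(transcript: str, matchers_fail: list[str]) -> bool:
    """Any matcher hit counts as a deprecated emission."""
    return any(marker in transcript for marker in matchers_fail)


def emission_summary(transcripts: list[str], matchers_fail: list[str],
                     gate: float = 0.05) -> dict:
    """Return counts AND rate for one set of trials, plus the gate verdict."""
    if not transcripts:
        raise ValueError("no transcripts to score")
    hits = sum(is_deprecated(t, matchers_fail) for t in transcripts)
    trials = len(transcripts)
    rate = hits / trials
    return {"hits": hits, "trials": trials, "rate": rate, "passes_gate": rate < gate}


# Matchers copied from the example deprecation task above.
matchers_fail = ["tailwind.config.js", "postcss.config.js", "npx tailwindcss init"]

# Made-up transcripts for illustration only; not results from any run.
transcripts = [
    'npm install tailwindcss @tailwindcss/vite\n@import "tailwindcss";',
    "npx tailwindcss init -p",
    'Add @tailwindcss/vite to vite.config.ts, then @import "tailwindcss";',
    "Use the @tailwindcss/vite plugin.",
    "Install @tailwindcss/vite and import it in your CSS.",
]

summary = emission_summary(transcripts, matchers_fail)
print(f"{summary['hits']}/{summary['trials']} deprecated ({summary['rate']:.0%})")
# prints "1/5 deprecated (20%)"
```

One consequence worth noting: at N=5, a single hit already yields 20%, so a 5% gate applied per model × task cell only passes with zero hits. Pooling trials across the suite's 20 prompts (100 trials per model) is one way to make the threshold meaningful; the pooling level is an assumption here, not something this page specifies.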
*This resource accompanies [Play 11: AI Evals and Public Leaderboards](/agentic-discovery/ai-evals-and-leaderboards). Part of [The Complete Playbook to Agentic Discovery](/agentic-discovery). Last verified 2026-06-11.*