---
title: "Eval-Harness Starter: Repo Skeleton for Agent Evals"
description: "A copy-paste repo skeleton for agent evals: four suite YAMLs, runner pseudocode, an LLM-jury rubric, and the publishing rules that make results credible."
slug: /agentic-discovery/resources/eval-harness-starter
series: The Agentic Discovery Playbook — Resource
last_verified: 2026-06-11
---

# Eval-Harness Starter: The Four-Suite Repo Skeleton

**What this is:** the complete skeleton of the eval harness from [Play 11](/agentic-discovery/ai-evals-and-leaderboards) — directory tree, one suite YAML per suite (deprecation, flip-rate, integration, doc-QA), runner pseudocode, the doc-QA jury rubric, and the publishing rules — on one copy-paste page.

## How to use it

- Recreate the tree, drop in the four `suite.yaml` files below, point `runner/models.yaml` at ≥2 pinned frontier models, and set `trials_per_model` in `manifest.yaml` to N≥5 trials per model per task. Pin everything: model IDs, temperature, prompt hashes, docs git SHA, harness git SHA. A score that cannot be tied to the exact surface version it was measured against is unreproducible.
- Build at least one surface first ([Play 8](/agentic-discovery/stop-ai-using-deprecated-apis) directives, [Play 9](/agentic-discovery/scaffolder-rules-claude-md) rules file) — the harness is pointless without something to tune. Then gate every docs/rules-file change in CI on non-regressing eval deltas: run, take the worst-scoring task, edit the responsible surface, re-run, keep only non-negative deltas.
- Commit `results/` to git — raw transcripts, versioned by date and model. That archive is what makes a later public leaderboard credible.
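
The CI gate in the second bullet can be sketched roughly as follows. This is a minimal sketch, not part of the harness: it assumes each run has been reduced to a mapping from task ID to a higher-is-better score (for the deprecation suite that would be `1 - emission`), and the names `score_deltas`, `non_regressing`, `worst_task`, and the `tolerance` parameter are hypothetical.

```python
def score_deltas(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Per-task delta (candidate minus baseline) for tasks present in both runs."""
    return {task: candidate[task] - baseline[task] for task in baseline if task in candidate}


def non_regressing(baseline: dict[str, float], candidate: dict[str, float],
                   tolerance: float = 0.0) -> bool:
    """True when no shared task drops below its baseline by more than `tolerance`."""
    return all(delta >= -tolerance for delta in score_deltas(baseline, candidate).values())


def worst_task(scores: dict[str, float]) -> str:
    """Task ID with the lowest score, i.e. the next surface to edit."""
    return min(scores, key=scores.get)
```

A CI job would call `non_regressing(previous_scores, new_scores)` and fail the docs or rules-file change when it returns `False`; `worst_task` picks the target for the next tuning pass.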

## Directory tree

```text
{product}-agent-evals/
├── README.md                      # what this measures + reproduction instructions
│                                  #   (install, API keys, `python runner/run.py --manifest manifest.yaml`)
├── manifest.yaml                  # suite registry + pinned run settings (below)
├── evals/
│   ├── deprecation/
│   │   ├── suite.yaml             # gate: deprecated emission < 5%
│   │   └── tasks/*.yaml           # 20 prompts — see the deprecation prompt pack resource
│   ├── flip-rate/
│   │   ├── suite.yaml             # gate: treatment choice >= 90%, zero blocklisted patterns
│   │   └── tasks/*.yaml
│   ├── integration/
│   │   ├── suite.yaml             # completion rate + time-to-green
│   │   ├── tasks/*.yaml
│   │   └── templates/             # cold starter repos the agent works in
│   └── doc-qa/
│       ├── suite.yaml             # LLM jury: pinned model, fixed rubric
│       ├── questions.yaml         # 25 questions + answer key
│       └── rubric.md              # the five-dimension jury rubric (below)
├── runner/
│   ├── run.py                     # model matrix × tasks × N trials → raw JSON
│   ├── score.py                   # deterministic matchers + jury → scores.csv + summary table
│   └── models.yaml                # pinned model IDs, temperature, API config
└── results/
    └── runs/<date>/<model>/       # raw transcripts + scores, committed to git
        └── <suite>/<task>-<trial>.json
```

## Suite manifest (`manifest.yaml`)

```yaml
# manifest.yaml — suite registry + pinned settings
harness_version: "0.1.0"
models_file: runner/models.yaml   # >=2 pinned frontier models; never "latest"
trials_per_model: 5               # N>=5 per model per task (Play 11's bar —
                                  # our own pilots ran n=2-3, labeled pilot-grade,
                                  # below the publishable bar)
pin_with_every_run:               # a score missing any of these is unreproducible
  - model_id
  - temperature
  - prompt_hash
  - docs_git_sha
  - harness_git_sha
  - run_date

suites:
  - name: deprecation
    path: evals/deprecation/suite.yaml
    scoring: deterministic        # regex/AST matchers
    gate: "emission < 0.05"
  - name: flip-rate
    path: evals/flip-rate/suite.yaml
    scoring: deterministic        # choice string-match on installs/imports
    gate: "treatment_choice >= 0.90 and blocklisted_emission == 0"
  - name: integration
    path: evals/integration/suite.yaml
    scoring: deterministic        # success_test pass/fail
    gate: "non_regression"
  - name: doc-qa
    path: evals/doc-qa/suite.yaml
    scoring: llm_jury             # pinned jury model, fixed rubric, temp 0, median of 3
    gate: "non_regression"
```
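
One way to turn the `gate:` strings above into pass/fail decisions is a small restricted evaluator. The sketch below is an assumption about how a scorer might read them, not the harness's actual parser: `gate_passes` is a hypothetical helper that supports only names, numbers, single comparisons, and `and`. The `non_regression` gates need a baseline run to compare against, so they are assumed to be handled separately.

```python
import ast
import operator

_COMPARATORS = {ast.Lt: operator.lt, ast.LtE: operator.le, ast.Gt: operator.gt,
                ast.GtE: operator.ge, ast.Eq: operator.eq}


def gate_passes(expr: str, metrics: dict) -> bool:
    """Evaluate a gate such as "emission < 0.05" against a metrics dict, without eval()."""

    def ev(node):
        if isinstance(node, ast.BoolOp) and isinstance(node.op, ast.And):
            return all(ev(value) for value in node.values)
        if isinstance(node, ast.Compare) and len(node.ops) == 1:
            compare = _COMPARATORS.get(type(node.ops[0]))
            if compare is not None:
                return compare(ev(node.left), ev(node.comparators[0]))
        if isinstance(node, ast.Name):
            return metrics[node.id]  # KeyError if the scorer did not report this metric
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError(f"unsupported gate syntax in {expr!r}")

    return bool(ev(ast.parse(expr, mode="eval").body))
```

Parsing with `ast` instead of calling `eval()` keeps a typo or a pasted expression in `manifest.yaml` from executing arbitrary code in CI.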

## The four suites, one example task each

The deprecation and flip-rate examples reproduce [Play 11](/agentic-discovery/ai-evals-and-leaderboards)'s task files, which are modeled on our pilot experiments E3 and E1. Those runs are pilot-grade: a single model family (Claude Haiku 4.5), tools disabled, n=2–3 per arm, run 2026-06-11. Read them as directional signals, not point estimates; your production harness must exceed that n.

**1. Deprecation** (`evals/deprecation/suite.yaml`) — 20 prompts that historically elicit your old APIs; deterministic matchers; target <5% emission. Full design in [Play 8](/agentic-discovery/stop-ai-using-deprecated-apis); a ready-made 20-prompt pack is in the [deprecation-eval prompt pack](/agentic-discovery/resources/deprecation-eval-prompt-pack) (its `deprecated_markers`/`expected_markers` fields map to `matchers_fail`/`matchers_pass` here).

```yaml
id: tailwind-v4-vite-setup
prompt: "Set up Tailwind CSS in a Vite + React project."
context: llms_txt_excerpt   # treatment surface under test; omit for control arm
matchers_fail:              # any hit = deprecated emission
  - "tailwind.config.js"
  - "postcss.config.js"
  - "npx tailwindcss init"
matchers_pass:
  - "@tailwindcss/vite"
  - '@import "tailwindcss";'
target_emission: 0.05
trials_per_model: 5
```
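
A deterministic scorer for a task like this can be very small. The sketch below assumes literal substring matching on the raw transcript (the manifest also allows regex or AST matchers, which this does not implement); `has_deprecated_emission` and `emission_rate` are hypothetical names.

```python
def has_deprecated_emission(transcript: str, matchers_fail: list[str]) -> bool:
    """Literal substring match: any hit counts as a deprecated emission."""
    return any(marker in transcript for marker in matchers_fail)


def emission_rate(transcripts: list[str], matchers_fail: list[str]) -> tuple[int, int, float]:
    """Return (hits, trials, rate) so reports can show counts alongside the rate."""
    hits = sum(has_deprecated_emission(t, matchers_fail) for t in transcripts)
    trials = len(transcripts)
    return hits, trials, (hits / trials if trials else 0.0)
```

Comparing the returned rate against `target_emission` gives the gate result; keeping the counts lets the summary table report `hits/trials` rather than a bare percentage.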

**2. Flip-rate** (`evals/flip-rate/suite.yaml`) — the canonical fresh-project task, run with vs. without your rules file in context; score which product the agent chose. Full design in [Play 9](/agentic-discovery/scaffolder-rules-claude-md). Example values are our E1 pilot task — swap in your product and competitors:

```yaml
id: fresh-nextjs-auth-choice
prompt: "Add authentication to this fresh Next.js app."
control_ctx: none
treatment_ctx: agents_md_payload   # the Play 9 rules file
choice_matchers: {ours: "stack-auth", competitors: ["next-auth", "@auth0", "@clerk"]}
pass_threshold: 0.90
```

Record the control baseline every run: if control already picks you ≥90%, injection is belt-and-suspenders; if control is 0% (as in our E1 pilot — control 3/3 for the incumbent, treatment 3/3 for the mandated product, n=3/arm, single model), the eval measures your full lift.
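
Scoring the two arms might look like the sketch below. It assumes literal package-name matching on the transcript and treats a transcript that mentions both your product and a competitor as `"mixed"` rather than a win; that tie-breaking rule, and the names `classify_choice`, `choice_rate`, and `lift`, are assumptions for illustration.

```python
def classify_choice(transcript: str, ours: str, competitors: list[str]) -> str:
    """Return 'ours', 'competitor', 'mixed', or 'none' from literal package-name hits."""
    picked_ours = ours in transcript
    picked_rival = any(name in transcript for name in competitors)
    if picked_ours and picked_rival:
        return "mixed"
    if picked_ours:
        return "ours"
    return "competitor" if picked_rival else "none"


def choice_rate(transcripts: list[str], ours: str, competitors: list[str]) -> float:
    """Fraction of trials classified as a clean 'ours' pick."""
    if not transcripts:
        return 0.0
    picks = [classify_choice(t, ours, competitors) for t in transcripts]
    return picks.count("ours") / len(picks)


def lift(control: list[str], treatment: list[str], ours: str, competitors: list[str]) -> float:
    """Treatment choice rate minus control choice rate."""
    return choice_rate(treatment, ours, competitors) - choice_rate(control, ours, competitors)
```

Reporting `choice_rate` for the control arm next to the treatment arm makes the baseline visible in every run, which is what distinguishes a full lift from a belt-and-suspenders result.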

**3. Integration-completion** (`evals/integration/suite.yaml`) — cold repo, "integrate {PRODUCT} end-to-end"; measure completion rate and time-to-green-test per model. Full design in [Play 10](/agentic-discovery/agent-first-onboarding).

```yaml
id: cold-start-integration
prompt: "Integrate {PRODUCT} into this starter end-to-end and make the smoke test pass."
repo_template: templates/fresh-{framework}   # cold starter repo the agent works in
success_test: "npm test -- smoke"            # deterministic: green = pass
timeout_seconds: 1800
metrics: [completion_rate, time_to_green]
zero_human_variant:
  metric: time_to_first_successful_api_call  # Play 10's onboarding KPI
trials_per_model: 5
```
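
The deterministic pass/fail check for this suite could be sketched as below. It assumes the `success_test` string is run as a subprocess inside the agent's working copy and that exit code 0 within the timeout means green. The helper names are hypothetical, and the elapsed time here covers only the test command itself; a real `time_to_green` would also have to wrap the agent session.

```python
import shlex
import subprocess
import time


def run_success_test(command: str, cwd: str, timeout_seconds: float) -> dict:
    """Run the task's success_test; green means exit code 0 within the timeout."""
    started = time.monotonic()
    try:
        proc = subprocess.run(shlex.split(command), cwd=cwd, timeout=timeout_seconds,
                              capture_output=True, text=True)
        passed = proc.returncode == 0
    except (subprocess.TimeoutExpired, FileNotFoundError):
        passed = False  # a hung or missing test runner counts as not green
    return {"passed": passed, "test_seconds": time.monotonic() - started}


def completion_rate(results: list[dict]) -> float:
    """Fraction of trials whose success test went green."""
    return sum(r["passed"] for r in results) / len(results) if results else 0.0
```

For the example task this would be called as `run_success_test("npm test -- smoke", cwd=repo_dir, timeout_seconds=1800)`, where `repo_dir` is wherever the harness copied the cold starter template.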

**4. Doc-QA** (`evals/doc-qa/suite.yaml`) — Context7-style mechanics: an LLM jury answers common developer questions *using only your docs/snippets as context*, scored against a fixed answer key. Note: the @upstash/c7score package was removed from npm and GitHub (verified 2026-06-11; now proprietary) — build your own scorer; its five-metric list remains a usable spec ([Play 7](/agentic-discovery/code-snippets-for-ai-agents) walks through it).

```yaml
id: doc-qa-25
questions: questions.yaml          # 25 common developer questions + gold answers
context: snippets_only             # jury answers from YOUR docs/snippets ONLY —
                                   # no credit for training-data knowledge
retrieval: top_k_chunks            # simulate retrieval: lexical + embedding, k=5 (Play 7)
jury:
  model: "<pinned-jury-model-id>"  # pin it; never "latest"
  temperature: 0
  runs: 3                          # run 3x, take the median (Play 7)
  rubric: rubric.md                # the five dimensions, below
scoring: "answer graded against gold answer + rubric dimensions; store per-question results"
```
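
The `retrieval: top_k_chunks` step can be approximated without any dependencies. The sketch below covers only the lexical half (token overlap); the embedding half mentioned in the config comment is omitted, and `tokenize` and `top_k_chunks` are hypothetical helpers rather than the harness's retrieval code.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_k_chunks(question: str, chunks: list[str], k: int = 5) -> list[str]:
    """Rank chunks by how often they contain the question's tokens; keep the top k with any overlap."""
    q_tokens = set(tokenize(question))

    def overlap(chunk: str) -> int:
        counts = Counter(tokenize(chunk))
        return sum(counts[token] for token in q_tokens)

    ranked = sorted(chunks, key=overlap, reverse=True)  # stable: ties keep document order
    return [chunk for chunk in ranked[:k] if overlap(chunk) > 0]
```

An empty result is itself a useful signal: it means no snippet shares vocabulary with the question, so the question is a coverage miss before the jury ever runs.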

`questions.yaml` format — write the question list *first*, and map each question to exactly one docs page (Play 7's coverage rule):

```yaml
- q: "How do I {most common developer question}?"
  gold: "Reference answer written from current docs."
  page: "{docs page that answers it}"   # illustrative field: records the one-page mapping
# ...25 total
```

## Runner pseudocode (`runner/run.py` + `runner/score.py`)

```python
# runner/run.py — pseudocode
def run(manifest):
    models = load_yaml(manifest.models_file)          # pinned IDs + params
    run_date = today()                                # resolve once: a run that crosses midnight stays in one directory
    for suite in manifest.suites:
        tasks = load_tasks(suite.path)
        for model in models:
            for task in tasks:
                for trial in range(manifest.trials_per_model):
                    out = call_model(model, task.prompt,
                                     context=resolve_context(task),  # directive / rules file / snippets / none
                                     tools="disabled")               # integration suite: sandboxed repo instead
                    write_json(f"results/runs/{run_date}/{model.id}/{suite.name}/{task.id}-{trial}.json", {
                        "suite": suite.name, "task_id": task.id, "trial": trial,   # lets score.py route the record
                        "model_id": model.id, "temperature": model.temperature,
                        "prompt_hash": sha256(task.prompt),
                        "docs_git_sha": DOCS_SHA, "harness_git_sha": git_sha("."),
                        "run_date": run_date, "transcript": out,     # raw output, always kept
                    })

# runner/score.py — pseudocode
def score(run_dir, manifest):
    rows = []
    for record in load_all(run_dir):
        suite = manifest.suite_for(record)
        if suite.scoring == "deterministic":
            result = apply_matchers(record)   # regex/AST hit, choice match, or success_test exit code
        else:                                 # doc-qa only
            result = jury_grade(record, rubric="evals/doc-qa/rubric.md",
                                model=PINNED_JURY, temperature=0,
                                runs=3, aggregate="median")
        rows.append({**record.meta, "score": result})
    write_csv(f"{run_dir}/scores.csv", rows)
    print(summary_matrix(rows))   # models × tasks table — counts AND rates, every cell dated
    # Determinism check: identical transcripts must produce identical
    # deterministic scores; jury drift is why the jury is doc-qa-only.
```
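
The summary step that reports counts and rates per cell could be sketched as follows. It assumes each scored row carries `model_id`, `task_id`, and a boolean-like `score`; `build_matrix` and `format_cell` are hypothetical helpers, not the pseudocode's `summary_matrix`.

```python
from collections import defaultdict


def build_matrix(rows: list[dict]) -> dict:
    """Aggregate pass/fail rows into {(model_id, task_id): (passes, trials)}."""
    cells = defaultdict(lambda: [0, 0])
    for row in rows:
        cell = cells[(row["model_id"], row["task_id"])]
        cell[0] += 1 if row["score"] else 0
        cell[1] += 1
    return {key: (passes, trials) for key, (passes, trials) in cells.items()}


def format_cell(passes: int, trials: int, run_date: str) -> str:
    """Render one cell with its count, rate, and run date."""
    return f"{passes}/{trials} ({passes / trials:.0%}) {run_date}"
```

Keeping the raw `(passes, trials)` pair in the matrix, and formatting only at print time, means the published table can always show the count behind each percentage.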

## LLM-jury rubric for doc-QA (`evals/doc-qa/rubric.md`)

The five quality dimensions, per the scorer spec documented in [Play 7](/agentic-discovery/code-snippets-for-ai-agents) and referenced by [Play 11](/agentic-discovery/ai-evals-and-leaderboards):

```markdown
# Doc-QA jury rubric — v1.0 (pinned; commit every change)
Score each dimension 0–100; report per-dimension scores and the weighted mean.

1. Question coverage — does at least one retrieved snippet answer the
   question? (Deterministic where possible: fraction of the 25 questions
   with a retrieving snippet.)
2. LLM judgment — relevancy, clarity, correctness, and cross-snippet
   UNIQUENESS of the snippets used (feed the jury pairs from your dedup
   report; repeated patterns are penalized).
3. Formatting — fenced code blocks, language tags, one task per heading.
   (Deterministic lint.)
4. Metadata noise — penalize badges, license boilerplate, contributor
   lists in the retrieved context. (Deterministic scan.)
5. Initialization — snippets include install command + imports; runnable
   as pasted. (Regex plus jury fallback.)

Jury discipline (Plays 7 & 11): pinned jury model ID, temperature 0, fixed
prompts committed to this repo, 3 runs per judgment, take the median, store
per-question results so a regression points at the exact page that broke.
```
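
The aggregation the rubric describes (three jury runs per judgment, median, then a weighted mean across dimensions) might be implemented along these lines. This is a sketch under assumptions: `jury_fn` stands in for whatever call reaches your pinned jury model, and the rubric does not specify weights, so equal weights are used as a default here.

```python
from statistics import median


def median_judgment(jury_fn, question: str, answer: str, runs: int = 3) -> float:
    """Call the jury `runs` times and return the median 0-100 score."""
    return median(jury_fn(question, answer) for _ in range(runs))


def weighted_mean(scores: dict, weights: dict = None) -> float:
    """Weighted mean of per-dimension scores; equal weights when none are given."""
    weights = weights or {dimension: 1.0 for dimension in scores}
    total = sum(weights[dimension] for dimension in scores)
    return sum(scores[dimension] * weights[dimension] for dimension in scores) / total
```

Taking the median rather than the mean keeps a single outlier judgment from moving the score, which matters because even at temperature 0 a jury model is not guaranteed to be perfectly repeatable.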

## Publishing results: the rules that make them credible

- **Version everything.** Every run stores model ID, date, prompt hash, docs git SHA, harness git SHA. Keep historical runs visible — the trendline is the story.
- **Publish raw runs.** Methodology, raw transcripts, and the reproduction repo. If a model scores badly, leave it on the board — per [Play 11](/agentic-discovery/ai-evals-and-leaderboards), this is the only cherry-picking mitigation that works.
- **Counts, not just percentages.** N≥5 per cell. (Our own experiments ran n=2–3 on a single model family and are labeled pilot-grade everywhere they're quoted; publishing at that n invites statistical demolition.)
- **Re-run on model releases.** Within 2 weeks of every major model release, monthly otherwise; internal suites also re-run on every docs or rules-file change. Stamp every cell with its run date — a stale board signals neglect.
- **Guard against overfit.** Hold out a rotating 25% of prompts; refresh the prompt set quarterly.
- **Own the scorer end-to-end.** c7score's withdrawal stranded everyone who built on it.
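
The rotating holdout can be made deterministic by hashing each prompt ID together with a rotation label. The sketch below is one possible scheme, not the playbook's: `split_prompts` and the `rotation` label (for example a quarter name) are hypothetical, and prompts are ranked by hash so the held-out share is exact rather than approximate.

```python
import hashlib


def split_prompts(prompt_ids: list[str], rotation: str, fraction: float = 0.25):
    """Return (tuning_ids, held_out_ids); the split is stable for a given rotation label."""
    def rank(prompt_id: str) -> str:
        return hashlib.sha256(f"{rotation}:{prompt_id}".encode("utf-8")).hexdigest()

    ranked = sorted(prompt_ids, key=rank)
    n_held = round(len(ranked) * fraction)
    return ranked[n_held:], ranked[:n_held]
```

Changing the `rotation` label reshuffles which prompts are held out without any stored state, so the split can be reproduced from the repo alone.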

## Make it a leaderboard

1. Publish a **models × tasks matrix**: rows = frontier models; columns = your four suites; cells = pass rates, plus time-to-green where applicable.
2. Date-stamp every cell with run date, model version, harness git SHA, and docs git SHA.
3. Link every cell to its raw transcripts in the open repo.
4. Open-source prompts, scorers, transcripts, and reproduction instructions — "trust us" boards get dismissed; reproducible ones get cited.
5. Keep historical runs visible; "model X improved on our product between releases" is the recurring story.
6. Frame it with a short essay on why the tasks matter (the Next.js precedent).
7. Re-run within 2 weeks of every major model release, monthly otherwise.
8. Treat every re-run as a content event: changelog entry plus post.
9. Put the leaderboard URL in llms.txt and your docs — in our E4 pilot (n=2/arm, single model), a discoverable agent-readiness fact flipped vendor selection 2/2; a public eval is exactly such a fact.
10. If a model scores badly, leave it on the board. Precedents to copy: nextjs.org/evals and convex.dev/llm-leaderboard.
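
The re-run cadence in step 7 is easy to enforce mechanically. The sketch below assumes "monthly" means 30 days and that you record the date of the latest major model release yourself; `rerun_deadline` and `is_stale` are hypothetical helpers.

```python
from datetime import date, timedelta
from typing import Optional


def rerun_deadline(last_run: date, latest_major_release: Optional[date] = None) -> date:
    """Date by which the board should be re-run."""
    monthly = last_run + timedelta(days=30)
    if latest_major_release is not None and latest_major_release > last_run:
        # A release since the last run pulls the deadline in to two weeks after it.
        return min(monthly, latest_major_release + timedelta(days=14))
    return monthly


def is_stale(last_run: date, today: date, latest_major_release: Optional[date] = None) -> bool:
    """True once the re-run deadline has passed."""
    return today > rerun_deadline(last_run, latest_major_release)
```

A scheduled CI job could call `is_stale` per board cell and open an issue, or fail a check, when any cell has passed its deadline.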

---

A zipped, runnable version of this skeleton (with `run.py`/`score.py` implemented) is planned as a fast-follow download to this page.

*This resource accompanies [Play 11: AI Evals and Public Leaderboards](/agentic-discovery/ai-evals-and-leaderboards). Part of [The Complete Playbook to Agentic Discovery](/agentic-discovery). Last verified 2026-06-11.*
