> ## Documentation Index
> Fetch the complete guide index at: https://www.synscribe.com/agentic-discovery/llms.txt
> Use this file to discover all pages before exploring further.

---
title: "How AI Agents Actually Search: The 3-Experiment Report"
description: "We instrumented 3 agent research runs. Agents answered from a ~5-month-stale prior and only searched when pushed, then opened ~6% of results and killed ~48% of claims."
slug: /agentic-discovery/agent-search-experiments
series: The Agentic Discovery Playbook · Proof
last_verified: 2026-06-12
---

# How AI Agents Actually Search: The 3-Experiment Report

> **In short:** We instrumented three real "help me decide what to build with" runs with [Birdseye](/agentic-discovery/resources/birdseye), capturing every query, fetch, and verdict. The headline finding is uncomfortable: the agents did **not** search by default. They answered from a training prior that was ~5 months stale, defended it when questioned, and only ran a live web search when pushed. Once they did, behavior varied 57× (6 → 344 web operations), they opened ~6% of the domains they surfaced, and they killed ~48% of the claims they checked.

![Three ways an agent searches the web, by escalating scale. Inline: a single in-turn run made 4 searches and 2 fetches, 6 web operations total. Subagents: a parallel fan-out of 5 specialist subagents ran 16–33 searches each, 116 operations. Workflow: the largest run reached 344 web operations. Win by appearing in the first results of the agent's self-authored queries.](/agentic-discovery/images/diagram-research-techniques.svg "One question, three search shapes: a 57× spread from 6 to 344 web operations across three runs.")

This is the full evidence behind [Play 1, Get Found](/agentic-discovery/ai-agent-web-search-and-fetch) and the search/fetch layer in [how agents choose](/agentic-discovery/how-ai-agents-choose-products). It belongs in the Prove section because it is uncomfortable for us too. It complicates the tidy "agents search and pick the best product" story everyone (including parts of this guide) likes to tell.

> 📥 **Run it yourself:** every number below came out of [Birdseye](/agentic-discovery/resources/birdseye), our free Mac app. Point it at an agent's research run on your own category to see the queries it writes, the pages it opens versus skips, and the claims it keeps or kills.

## What we ran

Three sessions in Claude Code, mid-2026, each a realistic build decision, not a benchmark prompt:

| Run | The question | Model | What the agent eventually did |
|---|---|---|---|
| **Translation** | "Build a dataroom with translation. Which API?" | opus-4.8 | Inline search, 6 web operations |
| **Payments** | "Let my AI agent pay out from a USDC balance. What stack?" | opus-4.7 | 5 parallel research sub-agents, 116 operations |
| **Email** | "Build a multi-inbox cold-email app. What infra?" | opus-4.8 | A 101-sub-agent deep-research workflow, 344 operations |

Every figure below is read straight from the Birdseye traces. **[Birdseye](/agentic-discovery/resources/birdseye)** is our free agent-observability Mac app. It rebuilds a run as four layers you can inspect: per-page, per-search, per-agent, and synthesis. So we can see the queries the agent wrote for itself, the pages it actually opened, and the claims it kept or killed. It is [free to download](/agentic-discovery/resources/birdseye), so you can run it on your own product and read your numbers off the same four layers. One scale caveat, stated once and meant: this is **three runs on a single agent (Claude Code) and one model family**, so directional, not population estimates.

## Finding 1: Agents don't search by default. They defend a stale prior until pushed

This is the finding we did not expect, and it is the most important one. In **all three** runs, the agent's first answer came entirely from its **training prior**: confident, specific, product-named, and with zero web searches. The search only started later, and only because the user pushed. The exact sequences from the traces:

- **Translation (1 push).** Turn 1 ("which API should I use?") gave a prior-only answer naming three brand-name engines, no search. Turn 2 ("*how do you arrive at these as the best, are there more specialised apis?*") made it finally search (6 operations), and it immediately caught itself: *"I anchored on the three brand-name engines and **skipped two whole categories**… You were right to push."*
- **Payments (2 pushes).** Turn 1 gave a prior answer. Turn 2 ("*how did you arrive at these suggestions?*") made it **re-justify the prior in more detail, still without searching.** Only Turn 3 ("*yes do a full research on what's the best way to build this out right now*") triggered the five-sub-agent search.
- **Email (2 pushes).** Turn 1 gave a prior answer. Turn 2 ("*how you know these are the best to build with right now?*") produced **still no search**, just a more confident defense. It took an explicit Turn 3 ("*Run the deep-research workflow*") to start the 344-operation run.

The pattern: asking an agent **"how do you know?"** does not make it check. In two of three runs that question produced a more polished defense of the *same stale answer*. The web search only fired when the user demanded research outright. **The default behavior is confident recall, not verification.**

## Finding 2: The prior the agent defended was ~5 months out of date

Here is why this matters: the prior these agents were defending was old. Watching the runs (June 2026), the model's knowledge tracked to roughly **January 2026, about a five-month lag.** Products, versions, pricing, and "best-in-class" picks had moved in that window, and the prior-only answers reflected the January world, not the June one. The translation agent's self-correction ("skipped two whole categories") is exactly what a five-month-stale prior looks like when it finally checks: whole categories of newer entrants were simply missing from memory.

> *On the 5-month figure: this is an observation from the run dates and the prior's apparent knowledge boundary, not a published cutoff (the trace does not expose one). Treat it as "the prior was visibly months stale," with ~5 months as the best estimate for these sessions.*

Put Findings 1 and 2 together and you get the core risk for any product team. **The default agent answer is months out of date, and the agent will defend it when questioned. Your buyer only gets the current, searched answer if they push past the first response, and many won't.**

## Finding 3: There is no single "agentic search." Behavior varied 57×

Once searching started, the *same class of question* produced three completely different architectures. Search volume ranged from 6 to 344 web operations, a **57× spread**, with no change in model family.

| | Translation | Payments | Email |
|---|---|---|---|
| Technique | **Inline** (one turn) | **Sub-agents** (5, parallel) | **Workflow** (101 sub-agents) |
| Web operations | 6 | 116 | 344 |
| WebSearch / WebFetch | 4 / 2 | 85 / 31 | 181 / 163 |
| Verification | 2 targeted fetches | per-agent H/M/L grades | 87 claims → 25 checked → 12 killed |

- **Inline:** a few broad searches, then two targeted fetches to verify the least-certain claims. The cheap default for a bounded question.
- **Sub-agents:** the orchestrator split the payment stack into five slices (wallet, card, bank, QR, compliance) and ran a specialist on each. *Each slice was judged by a sub-agent that read only that slice.*
- **Workflow:** a typed Scope → Search → Fetch → Verify → Synthesize pipeline, with **75 of its 101 sub-agents being verifiers.** The rigor ceiling.

The driver is **stakes, not topic breadth.** A reversible API pick got 6 operations. A money-moving, regulated payments stack got 116 and five specialists. Reputation-fragile email deliverability got 344 and a verification gauntlet.

## Finding 4: Surfaced is not read. The fetch cut is ~6%

Search surfaces far more than the agent opens. In the email run, the agent **surfaced 215 distinct domains but fetched only ~13**, roughly a **6% open rate.** The agent reads your snippet and decides whether to open based on title, description, domain authority, and apparent freshness. The page content it never sees can't help it. Ranking gets you surfaced. Being *worth opening* is the real gate.

## Finding 5: ~48% of confident claims died in verification

The email workflow extracted **87 claims, escalated 25 to adversarial verification, and killed 12, a 48% kill rate.** Verification ran as **three-voter skeptic panels** ("Be SKEPTICAL. Try to REFUTE this claim. ≥2/3 refutations kill it"). One widely-repeated vendor claim was refuted 3–0 and dropped. Survivors were the claims backed by a **primary source**. Across all three runs the agent explicitly graded source type and down-weighted vendor marketing. In the translation run it flagged that a "best API of 2026" listicle "literally ranks [its own publisher] #1." Self-reported numbers were treated as marketing, not evidence.

## Finding 6: Visibility doesn't transfer between categories

Across the three runs the agents touched **596 distinct domains, and not one *product* domain appeared in more than one category.** The only cross-category repeats were generic (a PR wire, a dev blog). There is no carryover "domain authority." Each category's consideration set was built from scratch, from whoever best answered that specific question.

## What we learned (the takeaways)

1. **Win the prior or win the push.** If you're in the training prior, you're the confident default. But that prior is months stale, so established players are quietly recommending a January world in June. If you're *not* in the prior, your only entry is the search the user triggers by pushing, which is the entire point of [Play 1](/agentic-discovery/ai-agent-web-search-and-fetch).
2. **"How do you know?" is not verification.** Agents will re-justify a stale answer when questioned. Real checking only happens on an explicit research demand. Product-comparison content should assume the agent arrives *already pushed*, in research mode, query in hand.
3. **Be worth opening and be backed by other sources.** The two silent killers are the fetch cut (~6% opened) and the verification cut (~48% killed). Build fetch-worthy titles and snippets, and get every hard claim onto a primary or third-party source.
4. **Match the search regime to your category's stakes.** Low-stakes means be in the first few queries. High-stakes means expect specialist sub-agents and adversarial verification, so win each sub-category and survive the skeptic.

## Limitations

Three runs, one agent (Claude Code), one model family, captured with one instrument, so directional, not representative. The 57× figure spans three *different* questions, so it mixes topic and technique. The fetch-rate and kill-rate each come from the single highest-volume run. The ~5-month staleness is an observation, not a published cutoff. We are scaling this into a recurring, multi-agent measurement (the [Default Index](/agentic-discovery/ai-evals-and-leaderboards)). Until then, treat these as a first honest look inside the surface. <!-- EXT: multi-agent search/fetch instrumentation (n≥20 questions × ≥3 agents); per-model staleness measurement -->

## FAQ

**Do AI agents search the web before recommending a product?**
Often not on the first answer. In all three of our instrumented runs the agent answered from its training prior with no search, and only ran a live web search after the user explicitly pushed for research. Asking "how do you know?" usually produced a more confident defense of the same answer, not a search.

**How out of date is an AI agent's default recommendation?**
In our mid-2026 runs the prior tracked to roughly January 2026, about five months stale. The agent's first, unsearched answers reflected that older world. Whole categories of newer entrants were missing until it searched.

**How much do agents actually read when they do search?**
Less than it looks. In the highest-volume run the agent surfaced 215 domains but opened only ~13 (~6%), and of the claims it extracted and checked, it killed ~48% in adversarial verification. Being surfaced is not being read, and being read is not being believed.

**How is this different from Part 2?**
[Part 2](/agentic-discovery/how-ai-agents-choose-products) covers the whole selection system (prior, search/fetch, retrieval, environment) and the controlled flip experiments. This report is the deep dive on one layer, the open-web search/fetch surface, with the full per-run story behind the numbers.

---

*Last verified 2026-06-12. We re-test the claims on this page quarterly. Changes are logged in the [Data Room](/agentic-discovery/data).*

**Part of [The Complete Playbook to Agentic Discovery](/agentic-discovery).**

← Previous: [ARD / ai-catalog.json Adoption Tracker](/agentic-discovery/ard-adoption-tracker) · Next: [Back to the Hub](/agentic-discovery) →

> **Stay ahead of the agents.** We re-test this playbook quarterly and publish what changed: new data, busted myths, ranking shifts. [Get the update digest →](/agentic-discovery#updates)
>
> **Want this done for you?** Synscribe runs agentic-discovery programs for B2B SaaS and developer platforms. [Talk to us →](/contact)