I Gave My Coding Agent Eyes Three Different Ways. Here's the Honest Scorecard.

Playwright MCP, Chrome DevTools MCP, and Iris all let an AI coding agent see a running web app. I built Iris, so I ran a committed benchmark across all three and wrote down where each one wins and where mine loses.

Divyanshu Shekhar · Founder, Syrin
17 min read

Fair warning before you read further: I built one of the three tools in this comparison. That makes me biased, so I am not going to ask you to trust my adjectives. Every number below comes from a benchmark harness that is committed to the repo, including the case where my own tool loses to a screenshot. Skip to the full scorecard if you only want the table.

Jump to it, or keep reading and earn it.

Your coding agent just finished a change. It says "done ✅". You open the browser and the button does nothing.

This happens because the agent never actually saw the running app. It edited code, reasoned about what should happen, and reported success based on the diff. Andrej Karpathy described coding agents as "effectively programming with a blindfold on", and that line stuck with me because it is exactly the failure mode. The model writes plausible code and has no way to check whether the page it just changed behaves the way it claimed.

There are three real ways to take the blindfold off right now, and all three speak the Model Context Protocol so any MCP agent (Claude Code, Cursor, OpenCode) can use them:

Browser drivers
Playwright MCP & Chrome DevTools MCP

Drive a real browser from the outside. The agent navigates, clicks, takes a screenshot or reads the accessibility tree, and inspects network through the Chrome DevTools Protocol. Works on any URL with zero install.

In-app instrumentation
Iris

Embed a dev-only SDK inside your app so the agent reads the program from the inside: store state, the React commit stream, emitted domain events, network cardinality, console, plus the source file:line that rendered an element. Only works on an app you own.

I built Iris. Below is what happened when I put all three through the same harness. I am going to lead with the cases where Iris loses, because if you are reading this on r/programming you have already smelled the vendor pitch coming and you deserve the caveats first.


01

Let's Start Where My Own Tool Loses

A screenshot is the actual rendered frame. That gives Playwright and DevTools a category of bug that an in-app tool genuinely cannot see.

Here is the one I measured. I dropped a stray CSS filter: hue-rotate(90deg) saturate(1.6) onto the html element. That re-tints the entire rendered page. It changed 21,393 pixels, about 2.3% of the screen. Every computed-style property Iris reads (color, backgroundColor, opacity, geometry) stayed byte-for-byte identical, because the filter happens at paint time, after the values Iris can see.

CAUGHT: Screenshot diff (Playwright / DevTools)
MISSED: Iris always-on read (computed style)
2.3%: Of pixels changed, invisible to Iris

This is the screenshot's home turf and Playwright wins it cleanly. Iris only matched the catch when I explicitly drove it through its opt-in iris_visual_diff (which uses CDP under the hood). The always-on in-app SDK reads computed style, so anything that only shows up in actual pixels (font-load failures, paint order, GPU and compositing bugs) is a real blind spot.
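To make that blind spot concrete, here is a minimal sketch of the two kinds of check side by side. The frame buffers, the `StyleRead` shape, and both helper functions are hypothetical illustrations, not Iris or Playwright APIs, and the pixel values are made up. The sketch only shows why a paint-time filter can change pixels while every computed-style value still compares equal.

```typescript
// Hypothetical sketch: neither helper is part of Iris or Playwright.

// A frame is a flat RGBA byte buffer, 4 bytes per pixel.
type Frame = Uint8Array;

// Count pixels whose RGBA bytes differ between two same-sized frames.
function countChangedPixels(before: Frame, after: Frame): number {
  if (before.length !== after.length) {
    throw new Error("frames must have the same dimensions");
  }
  let changed = 0;
  for (let i = 0; i < before.length; i += 4) {
    if (
      before[i] !== after[i] ||
      before[i + 1] !== after[i + 1] ||
      before[i + 2] !== after[i + 2] ||
      before[i + 3] !== after[i + 3]
    ) {
      changed++;
    }
  }
  return changed;
}

// A subset of the computed-style values an in-page read can compare.
interface StyleRead {
  color: string;
  backgroundColor: string;
  opacity: string;
}

function stylesEqual(a: StyleRead, b: StyleRead): boolean {
  return (
    a.color === b.color &&
    a.backgroundColor === b.backgroundColor &&
    a.opacity === b.opacity
  );
}

// Two pixels: one red, one white (illustrative values).
const frameBefore: Frame = new Uint8Array([255, 0, 0, 255, 255, 255, 255, 255]);
// After a paint-time filter the red pixel is re-tinted; the white one is not.
const frameAfter: Frame = new Uint8Array([120, 140, 0, 255, 255, 255, 255, 255]);

// The filter runs at paint time, so the computed values do not move.
const styleBefore: StyleRead = {
  color: "rgb(255, 0, 0)",
  backgroundColor: "rgb(255, 255, 255)",
  opacity: "1",
};
const styleAfter: StyleRead = { ...styleBefore };

const pixelCheckCaught = countChangedPixels(frameBefore, frameAfter) > 0;
const styleCheckCaught = !stylesEqual(styleBefore, styleAfter);
```

In this toy case the pixel check flags the change and the style check does not, which is the same split the measurement above reports.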

The other places the browser drivers win outright:

  • Any site, zero cooperation. Playwright and DevTools test a URL you have never touched. Iris has to embed @syrin/iris-browser, so it physically cannot test a third-party site you do not ship code into.
  • Trusted native input. Playwright drives real CDP input: native keyboard, mouse, file pickers, drag-and-drop, all with isTrusted: true. Iris defaults to synthetic dispatch and real input is opt-in only.
  • Cross-browser. Playwright runs WebKit, Firefox, and Chromium. Iris runs on whatever single engine your app runs on.
  • Browser-level scope. Multi-tab, popups, cross-origin, downloads, auth dialogs, and network mock/intercept all live at the browser level. Iris is scoped to one page runtime, so it observes network but mocking stays the app's job.
  • Protocol-level debugging. Chrome DevTools MCP speaks raw CDP, so for low-level network and performance traces on any site it is the right tool. Iris only sees app-level network, so the wire protocol is out of its reach.

So if you are testing someone else's site, running a cross-browser release matrix, or chasing a pixel-level visual regression, stop reading and use Playwright or DevTools. That is genuinely their job. Iris is built for a narrower situation: an agent building an app you own, verifying its own work on every edit.

A stray CSS filter re-tinted 2.3% of the page. The screenshot diff caught it; Iris's always-on computed-style read missed it.

Now the other direction.


02

The Token Cost of Looking

When a browser-driver agent wants to know what is on the page, the common move is to dump the full accessibility tree (or a screenshot) and let the model read it. That is a lot of tokens for a question as small as "did the modal open".

Iris asks narrow questions instead. A query plus an observe plus an assert is a few small structured calls.

| Per verify step | Tokens |
| --- | --- |
| Full accessibility-tree snapshot (e.g. Playwright MCP) | ~7,300 |
| Iris verify loop (query + observe + assert) | ~100 |

Across a 20-step flow that is roughly 2,000 tokens with Iris against roughly 146,000 with full-tree snapshots.

Here is the honest asterisk, because this number gets misread. If you force Iris to also dump the whole accessibility tree, the gap shrinks to about 1.8×. The big multiple comes from Iris usually not needing the whole tree to answer the question, so most of the saving is architectural rather than a magic compression trick. I would rather you hear that from me than catch it yourself.
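The 20-step arithmetic above is simple enough to write down. This sketch uses only the approximate per-step figures from the table; the constant and function names are mine, chosen for illustration.

```typescript
// Approximate per-step costs taken from the table above.
const FULL_TREE_TOKENS_PER_STEP = 7300;
const IRIS_TOKENS_PER_STEP = 100;

// Total tokens for a flow where every step pays the same verify cost.
function flowCost(tokensPerStep: number, stepCount: number): number {
  return tokensPerStep * stepCount;
}

const flowSteps = 20;
const fullTreeTotal = flowCost(FULL_TREE_TOKENS_PER_STEP, flowSteps); // 146,000
const irisTotal = flowCost(IRIS_TOKENS_PER_STEP, flowSteps); // 2,000
const narrowQueryRatio = fullTreeTotal / irisTotal; // 73
```

The 73× here is the narrow-query case. It sits next to the roughly 1.8× figure for the forced full-tree case, which is why the saving is best read as architectural.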


03

The Number That Actually Compounds: Re-Running a Regression

A single verification is interesting. The thing a test suite actually does is run the same verification over and over, on every commit, forever. That repetition is where the gap stops being a percentage and starts being orders of magnitude.

A screenshot or DOM agent has to re-drive the whole flow with the LLM on every single run. The model clicks through, reads the page, and judges the result again, paying full token cost each time. Iris records the flow once and replays it with no model in the loop: it re-resolves the anchors, advances the clock, and checks the declared consequence.
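A rough sketch of what "no model in the loop" means in practice. Everything here is a hypothetical illustration rather than the Iris API: the `RecordedFlow` shape, the toy `App` type, and the `replay` function are assumptions, and the app is reduced to a map from anchor to handler. The point is only that a replay built from recorded steps and a declared consequence contains nothing that can be sampled.

```typescript
// Hypothetical sketch of a model-free replay; none of these types are the Iris API.

type Store = Record<string, string>;

interface RecordedStep {
  anchor: string; // stable identifier for the element to act on
}

interface RecordedFlow {
  steps: RecordedStep[];
  // Declared consequence: after the flow, store[key] must equal `equals`.
  expect: { key: string; equals: string };
}

// The app under test, reduced to "anchor -> handler that mutates the store".
interface App {
  createStore: () => Store;
  handlers: Record<string, (store: Store) => void>;
}

type Verdict = "pass" | "fail";

function replay(app: App, flow: RecordedFlow): Verdict {
  const store = app.createStore(); // fresh state on every run
  for (const step of flow.steps) {
    const handler = app.handlers[step.anchor];
    if (handler === undefined) return "fail"; // anchor no longer resolves
    handler(store);
  }
  return store[flow.expect.key] === flow.expect.equals ? "pass" : "fail";
}

function replayMany(app: App, flow: RecordedFlow, runs: number): Verdict[] {
  const verdicts: Verdict[] = [];
  for (let i = 0; i < runs; i++) verdicts.push(replay(app, flow));
  return verdicts;
}

const shipFlow: RecordedFlow = {
  steps: [{ anchor: "ship-button" }],
  expect: { key: "deployments.0.status", equals: "live" },
};

const workingApp: App = {
  createStore: () => ({ "deployments.0.status": "building" }),
  handlers: {
    "ship-button": (s) => {
      s["deployments.0.status"] = "live";
    },
  },
};

// Dead handler: the button resolves and "clicks", but the store never moves.
const brokenApp: App = {
  createStore: () => ({ "deployments.0.status": "building" }),
  handlers: { "ship-button": () => {} },
};

const workingVerdicts = replayMany(workingApp, shipFlow, 8);
const brokenVerdicts = replayMany(brokenApp, shipFlow, 8);
```

Eight replays of the working app give eight identical passes, and eight replays of the broken app give eight identical fails, because the only inputs are the recorded steps and the app itself.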

| Re-verify a known flow | Cost per run | Flake | vs Iris |
| --- | --- | --- | --- |
| Iris deterministic replay | ~175 tok | 0% | baseline |
| Playwright / DevTools (LLM re-drive) | ~30,000 tok | sampled | 128-184× more |
| A 4-flow suite (iris_flow_verify) | ~47 tok, flat in K | 0% | 2,574× |

Two things matter here. First, the per-run cost is two orders of magnitude apart and it compounds with every commit. Second, and I think this is the bigger deal, the flake rate. I replayed the same flow eight times and got one status and one verdict, identical across all eight runs, because there is no LLM sampling temperature or tool ordering to vary. A browser-driver agent re-drives with a live model every run, so its verdict is sampled rather than fixed. Flakiness is the number one tax on any regression suite, and a model-free replay pays zero of it by construction.

The suite-scale number (about 47 tokens whether you have 2 flows or 4) comes from consolidating into one verdict: passing flows are counted, and only failures get detailed. Cost is the only dimension where a multiple this large is physically real. You cannot catch 100× more bugs than exist, but you can re-verify a suite 2,574× cheaper.
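For anyone who wants to check the multiples, here is the arithmetic as a short sketch. The inputs are the per-flow figures reported in the full scorecard (~30,249 and ~32,296 tokens for the two browser drivers, ~175 and ~47 for Iris); the names are mine. The upper end of the per-run range and the suite multiple fall out of these inputs directly. The lower end of the 128-184× range depends on measurements not itemized in this post, so the sketch does not try to reproduce it.

```typescript
// Figures reported in the scorecard; names are illustrative.
const IRIS_REPLAY_TOKENS = 175; // one flow, model-free replay
const IRIS_SUITE_TOKENS = 47; // consolidated verdict, flat in suite size
const PLAYWRIGHT_REDRIVE_TOKENS = 30249; // one flow, LLM re-drive
const DEVTOOLS_REDRIVE_TOKENS = 32296; // one flow, LLM re-drive

// A re-driven suite pays the per-flow cost once per flow.
function redriveSuiteTokens(perFlow: number, flowCount: number): number {
  return perFlow * flowCount;
}

// Cumulative cost of running the same check on every commit.
function cumulativeTokens(perRun: number, commits: number): number {
  return perRun * commits;
}

const playwrightPerRunRatio = PLAYWRIGHT_REDRIVE_TOKENS / IRIS_REPLAY_TOKENS; // about 172.9
const devtoolsPerRunRatio = DEVTOOLS_REDRIVE_TOKENS / IRIS_REPLAY_TOKENS; // about 184.5
const suiteRatio =
  redriveSuiteTokens(PLAYWRIGHT_REDRIVE_TOKENS, 4) / IRIS_SUITE_TOKENS; // about 2,574.4

// Over 1,000 commits the absolute gap is what shows up on the bill.
const replayOverThousandCommits = cumulativeTokens(IRIS_REPLAY_TOKENS, 1000); // 175,000
const redriveOverThousandCommits = cumulativeTokens(PLAYWRIGHT_REDRIVE_TOKENS, 1000); // 30,249,000
```

The ratio itself stays constant per run. What compounds is the absolute difference, which grows linearly with the number of commits.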

Re-verifying one known flow costs about 175 tokens with Iris's model-free replay against roughly 30,000 when a browser-driver agent re-drives with an LLM. 128 to 184 times cheaper per run, 2,574 times across a four-flow suite, at 0% flake.


04

Some Bugs Never Reach the DOM at All

The DOM is a lossy projection of your program. Plenty of real bugs live in the gap between what the program did and what the page shows. A tool standing outside the browser, looking at pixels or markup, has no way to see them. Iris sits in the runtime, so it can.

I built a small suite of bugs that look completely fine on screen and measured who caught what:

  • UI lies about the store. The displayed count says one thing, the store holds another. A screenshot and a DOM read both report the visible value and have no source of truth to contradict it. Iris caught 2 of 2; the browser drivers caught 0 of 2.
  • Dead handler, green button. A "Ship" button is present, clicks fine, and changes nothing in the store. Iris declares the success condition as deployments.0.status == live, so the replay fails when the store never moved, with no element drift to hide behind.
  • Double-submit. An action that fires its network request twice passes any presence check. A declared net { count: 1 } consequence catches the second call. A count: 0 rule catches a forbidden call, like a reverted migration endpoint or a stray privacy beacon that should never fire.
  • Silent console error. The action logs a console.error and the page still renders. A console { absent: true } consequence fails on it, read after the action settles so it cannot pass before the error fires.
  • Wasted-render storm. A React component thrashing at 108 commits per second while producing identical output. There is no DOM mutation, so an outside tool sees an idle page. Iris reads the commit rate off the devtools hook (108 vs a healthy 36) for about 50 tokens.
  • Blast radius. An action corrupts an unrelated piece of store state for a view that is not even on screen. Nothing visible changes anywhere. This is the deepest one, and no out-of-page tool can see it because the damage lives entirely in program state.
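The wasted-render storm in the list above is the easiest of these to sketch, because it reduces to a rate and a comparison. This is a hypothetical illustration, not how Iris reads the devtools hook: the `CommitRecord` shape, the output hash, and the "more than half the commits were wasted" threshold are all assumptions made for the example.

```typescript
// Hypothetical sketch; the commit record shape and thresholds are assumptions.
interface CommitRecord {
  atMs: number; // when the commit happened, relative to the window start
  outputHash: string; // hash of what the commit rendered
}

interface StormReport {
  commitsPerSecond: number;
  wasted: number; // commits whose output matched the previous commit
  storm: boolean;
}

function detectRenderStorm(
  commits: CommitRecord[],
  windowMs: number,
  healthyPerSecond: number
): StormReport {
  const inWindow = commits.filter((c) => c.atMs >= 0 && c.atMs < windowMs);
  let wasted = 0;
  let previousHash: string | undefined;
  for (const commit of inWindow) {
    if (previousHash !== undefined && commit.outputHash === previousHash) wasted++;
    previousHash = commit.outputHash;
  }
  const commitsPerSecond = inWindow.length / (windowMs / 1000);
  // Assumed rule: faster than the healthy baseline AND mostly identical output.
  const storm =
    commitsPerSecond > healthyPerSecond &&
    inWindow.length > 0 &&
    wasted / inWindow.length > 0.5;
  return { commitsPerSecond, wasted, storm };
}

// 108 commits inside one second, all rendering identical output.
const stormCommits: CommitRecord[] = [];
for (let i = 0; i < 108; i++) {
  stormCommits.push({ atMs: Math.floor((i * 1000) / 108), outputHash: "same" });
}

const stormReport = detectRenderStorm(stormCommits, 1000, 36);
```

Since the output never changes, the DOM never mutates, so the only place this signal exists is the commit stream itself.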

I want to be precise about the honest boundary here, because "Iris sees state" can sound like more than it is. Raw network counting and raw console capture are parity: Playwright has route and page.on('console'), DevTools reads both too. The advantage is not that Iris can observe these. It is that the count or the clean console is a declared, deterministic consequence of a replayed flow, checked after settle, that cannot be faked by a locator that healed onto the wrong element. The genuinely Iris-only catches are the ones that require reading the store: the UI-vs-store desync, the dead-handler oracle, the blast radius, and the render storm.
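Here is a minimal sketch of what a "declared consequence" check could look like. The shapes are modeled loosely on the shorthand used above (net { count: 1 }, console { absent: true }, a store equality), but the `Observed` and `Consequence` types and the `checkConsequences` function are hypothetical, not the Iris API. The observation is assumed to be taken after the action has settled.

```typescript
// Hypothetical sketch modeled on the article's shorthand; not the Iris API.
interface Observed {
  requests: string[]; // URLs requested while the action ran and settled
  consoleErrors: string[]; // console.error messages seen by settle time
  store: Record<string, string>; // flattened store snapshot after settle
}

type Consequence =
  | { kind: "net"; url: string; count: number }
  | { kind: "console"; absent: true }
  | { kind: "store"; key: string; equals: string };

// Returns one message per violated consequence; an empty array means pass.
function checkConsequences(observed: Observed, declared: Consequence[]): string[] {
  const failures: string[] = [];
  for (const c of declared) {
    if (c.kind === "net") {
      const url = c.url;
      const seen = observed.requests.filter((u) => u === url).length;
      if (seen !== c.count) {
        failures.push(`net ${url}: expected ${c.count}, saw ${seen}`);
      }
    } else if (c.kind === "console") {
      if (observed.consoleErrors.length > 0) {
        failures.push(`console: expected no errors, saw ${observed.consoleErrors.length}`);
      }
    } else {
      const actual = observed.store[c.key];
      if (actual !== c.equals) {
        failures.push(`store ${c.key}: expected ${c.equals}, saw ${String(actual)}`);
      }
    }
  }
  return failures;
}

const checkoutRules: Consequence[] = [
  { kind: "net", url: "/api/pay", count: 1 }, // fires exactly once
  { kind: "net", url: "/api/beacon", count: 0 }, // must never fire
  { kind: "console", absent: true },
  { kind: "store", key: "order.status", equals: "paid" },
];

// The double-submit case: everything looks right except the request count.
const doubleSubmit: Observed = {
  requests: ["/api/pay", "/api/pay"],
  consoleErrors: [],
  store: { "order.status": "paid" },
};

const doubleSubmitFailures = checkConsequences(doubleSubmit, checkoutRules);
```

The design choice worth noticing is that the rules are data. Nothing in the check depends on which element was clicked, so a locator that drifted onto the wrong element cannot make a failing count pass.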

There is also a time dimension. Iris controls the app's setTimeout and Date, so a flow gated on a 2.6-second transition (a deploy going from "building" to "live") verifies in about 202 milliseconds by freezing and advancing the clock. The browser drivers are perfectly capable of testing this, but they have to sleep through real time and guess the duration. Under-wait and the test goes flaky. That gap scales with the timer: a 5-minute timeout makes it roughly 1000×.
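The clock-control idea can be sketched with a small virtual clock. This is a generic fake-timer illustration, not the Iris implementation: the `VirtualClock` class and its method names are assumptions, and it models only setTimeout, not Date. For scale, the measured ~202 ms against a real wait of at least 2,600 ms is about a 13× difference for this particular flow.

```typescript
// Hypothetical sketch of a virtual clock; not the Iris implementation.
class VirtualClock {
  private nowMs = 0;
  private timers: { dueMs: number; callback: () => void }[] = [];

  now(): number {
    return this.nowMs;
  }

  // Stand-in for setTimeout: registers the callback without waiting.
  setTimeout(callback: () => void, delayMs: number): void {
    this.timers.push({ dueMs: this.nowMs + delayMs, callback });
  }

  // Jump virtual time forward, firing due timers in due-time order.
  // Timers scheduled by a callback are picked up if they fall inside the jump.
  advance(byMs: number): void {
    const target = this.nowMs + byMs;
    for (;;) {
      const due = this.timers
        .filter((t) => t.dueMs <= target)
        .sort((a, b) => a.dueMs - b.dueMs);
      const next = due[0];
      if (next === undefined) break;
      this.timers = this.timers.filter((t) => t !== next);
      this.nowMs = next.dueMs;
      next.callback();
    }
    this.nowMs = target;
  }
}

// A deploy that goes from "building" to "live" after 2.6 seconds.
const clock = new VirtualClock();
const deploy = { phase: "building" };
clock.setTimeout(() => {
  deploy.phase = "live";
}, 2600);

clock.advance(2599);
const phaseJustBefore = deploy.phase; // still "building"

clock.advance(1);
const phaseAfter = deploy.phase; // "live", with no real time spent waiting
```

Because the test advances time instead of sleeping through it, there is no wait duration to guess, which removes the under-wait failure mode described above.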


05

Where It's Just a Tie (and I Should Say So)

Marketing pages love to imply total dominance. The honest reality is that a large slice of web bugs are things a human can see on screen, and any tool with a JavaScript evaluate can reach those.

Every visually-observable bug I tested (computed style, geometry, occlusion, color, theme, six bugs in total) came back 6 of 6 for all three tools. Iris is more ergonomic there because it is one native call instead of authoring a JS snippet (roughly 117 to 259 fewer tokens per check), but it is not more capable. A tie is a tie.

In a live gpt-4o tool-use loop over five broken-app scenarios, using authoritative model usage tokens rather than a proxy:

5 / 5: Iris (~55k tokens)
4 / 5: Playwright MCP (~30k)
3 / 5: Chrome DevTools MCP (~32k)

Iris caught the most, but it spent roughly 1.7 to 1.8× the tokens to do it on this particular run. That is the trade in a single fresh detection loop. The token advantage I keep talking about only shows up once you start re-running the same flow. One model, one turn budget, five scenarios is a small sample, so treat the accuracy ordering as directional rather than gospel.


06

So Which One Should You Actually Use

Pick by what you are actually doing. Ignore which vendor wrote the blog. Plenty of teams run both: Iris for the inner build-verify loop, Playwright for the cross-browser release gate.

| Your situation | Reach for | Why |
| --- | --- | --- |
| Agent building a React or Next app you own, verifying each edit | Iris | In-loop, ~100 tok per check, sees program state and the source file:line, refuses risky clicks |
| Regression suite you re-run on every commit or in CI | Iris | Deterministic replay: 0% flake, ~47-175 tok per run, 128-2,574× cheaper than re-driving |
| Bug whose truth is in state (UI-vs-store desync, double-submit, side-effects) | Iris | No out-of-page tool can see these; they live in the program |
| Testing a third-party site you do not own or cannot modify | Playwright | Iris must embed a dev-only SDK; it cannot instrument code you do not ship |
| Cross-browser matrix across WebKit, Firefox, Chromium | Playwright | Iris runs on whatever single engine your app runs on |
| Trusted native input: file pickers, drag-drop, real keyboard | Playwright | Iris defaults to synthetic dispatch; real input is opt-in only |
| Pixel or paint visual regression: font-load, paint order, GPU | Playwright | A screenshot is the rendered frame; Iris only reads computed style |
| Protocol-level network or perf debugging on any site | DevTools | DevTools MCP speaks raw CDP; Iris observes app-level network |

The rule of thumb I keep landing on: if you own the app and an agent is building it, Iris is the cheap, deterministic, state-aware inner loop, and the regression suite that does not go flaky. If you need someone else's site, many engines, real input, or true pixels, that is Playwright and DevTools territory.


07

The Full Scorecard

One table, wins and ties and losses stated plainly. Every row comes from a committed harness in the repo, and you can reproduce it with pnpm bench.

The full scorecard across Iris, Playwright MCP, and Chrome DevTools MCP: Iris wins the agent loop, regression cost, store-state desync, and render storms; ties on visual bugs; and loses on pixel regressions, no-install reach, and cross-browser.

| Dimension | Iris | Playwright MCP | DevTools MCP |
| --- | --- | --- | --- |
| Scripted regression detection (10 bugs) | 10 / 10 | 9 / 10 | 9 / 10 |
| Live agent loop, 5 broken scenarios | 5 / 5 (~55k) | 4 / 5 (~30k) | 3 / 5 (~32k) |
| Cost to re-verify a known flow | ~175 tok, 0% flake | ~30,249 (LLM re-drive) | ~32,296 |
| Re-verify a 4-flow suite, per run | ~47 tok (2,574×) | K × ~30,249 | K × ~32,296 |
| Visual / computed-style / theme bugs (6) | 6 / 6 | 6 / 6 | 6 / 6 |
| Pixel / paint regression (CSS re-tint) | MISSED (driven: caught) | CAUGHT | CAUGHT |
| UI-vs-store desync (2) | 2 / 2 | 0 / 2 | 0 / 2 |
| Wasted-render storm | Detected (~50 tok) | no signal | no signal |
| Time-gated flow (2.6s transition) | ~202 ms (clock control) | real wait ≥2,600 ms | real wait ≥2,600 ms |
| Source localization (component + file:line) | stack 4/4, source 4/4 | CSS selector only | CSS selector only |
| Third-party site, no install | impossible | yes | yes |
| Cross-browser (WebKit / Firefox) | no | yes | Chromium only |
| Trusted native input | opt-in CDP only | yes | yes |

The caveats, stated up front

The token numbers use an OpenAI BPE proxy (o200k), which is within about ±20% of Anthropic text tokens. The exception is the live agent loop, which uses authoritative model usage. That loop is one model (gpt-4o), one turn budget, and five scenarios, so the accuracy ordering could shift on a bigger sample. The competitor results were corrected for apparatus artifacts (an MCP init timeout, a nav-badge mismatch, a result-parsing bug) before reporting. And three Iris bugs got fixed while building this benchmark, which is exactly what a benchmark is supposed to surface.


The One Honest Sentence

If you want the whole thing compressed: Iris ties on anything a person can see in the DOM (with a real and growing cost advantage on re-runs), and it wins where the bug requires seeing the program itself, meaning its state and the same flow replayed deterministically. Playwright and DevTools win on true pixels, any-site reach, real input, and cross-browser.

It is not that one of these is better and the rest are obsolete. They are aimed at different moments. A browser driver is the right tool when you are standing outside an app you do not control. An in-app layer is the right tool when an agent is building an app you own and needs to check its own work cheaply on every edit, then re-check it forever without flaking.

I built Iris because that second moment, the agent's inner loop, did not have a tool that read program truth instead of guessing from pixels. The benchmark above is my attempt to show exactly where that bet pays off and exactly where it does not. Every harness behind these numbers is open source at github.com/syrin-labs/iris, so if you find a case where they do not hold, clone it and show me. If the project looks useful, a star on GitHub helps other people building with agents find it.

See it on your own app in one line.

Iris is dev-only, localhost-only, no telemetry, Apache-2.0 SDK. Paste one line into your agent and it sets itself up.

Fun Fact

The Bug That Convinced Me to Build This

The whole project started from one stupid afternoon. My agent built a checkout button, clicked it during its own testing, saw a success toast appear, and reported the feature done. The toast was real. The payment request fired twice.

Nothing on screen was wrong. The DOM was correct, the toast was correct, a screenshot would have looked perfect, and an accessibility-tree dump would have passed. The bug lived entirely in the fact that one network call happened a second time. The only way to catch it was to count the request, and the only way to count it reliably on every future commit was to make "fires exactly once" a declared rule that replays without a model second-guessing it.

That double-submit is now scenario one in the benchmark. The lesson I took from it: the bugs that survive an agent's own review are usually the ones that never make it to the rendered page. If your tool can only see what a screenshot sees, those are the exact bugs it will keep waving through.

Coming next in this series

What 'Verify on Every Edit' Looks Like Inside Claude Code

The benchmark is the static picture. The interesting part is the loop: an agent makes a change, Iris checks it from inside the running app, and either confirms it or hands back the file:line to fix, all before you ever look at the browser. Next post walks through a full session.
