DEV Community

Anton Gulin
I Ate My Own Dog Food: How I Benchmarked AI Skills and Proved Eval-Driven Development Works

I built a tool to test AI skills. Then I used it on my own project. The benchmarks shocked even me.

As a QA architect, I've spent my career building systems that verify software works correctly. At Apple, we tested everything — every interaction, every edge case, every regression. At CooperVision, I built a Playwright/TypeScript framework from scratch that increased test coverage by 300%.

So when I started working with AI agent skills, I noticed something: nobody was testing them.

You write a SKILL.md file. You try it manually once. Maybe it works for your prompt. You ship it.

There's no automated test suite. No regression testing. No CI pipeline that catches when a description change breaks triggering.

That's a QA problem. I built opencode-skill-creator to solve it.

Then I dogfooded it on a real project. Here's what happened.

The Project: AdLoop Skills for Google Ads

AdLoop is a Google Ads MCP (Model Context Protocol) integration — it connects AI agents to Google Ads and GA4 data through a set of tools.

I created 4 skills for AdLoop using opencode-skill-creator:

  1. adloop-planning — Keyword research, competition analysis, and budget forecasting
  2. adloop-read — Performance analysis, campaign reporting, and conversion diagnostics
  3. adloop-write — Campaign creation, ad management, keyword bidding, and budget changes (spends real money)
  4. adloop-tracking — GA4 event validation, conversion tracking diagnosis, and code generation

The Benchmark Results

opencode-skill-creator's benchmark runs each skill through its eval queries in two configurations:

  • With skill loaded — the AI has full domain knowledge, safety rules, and orchestration patterns
  • Without skill — the AI only has bare MCP tool names and descriptions
| Skill | Evals | With Skill | Without Skill | Improvement |
|---|---|---|---|---|
| adloop-write | 8 | 100% | 17% | +83pp |
| adloop-planning | 6 | 100% | 21% | +79pp |
| adloop-read | 8 | 100% | 27% | +73pp |
| adloop-tracking | 6 | 100% | 33% | +67pp |
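To make the two-configuration setup concrete, here is a minimal sketch of what such an eval loop looks like. This is illustrative only — `runAgent`, the `EvalQuery` shape, and the pass/fail predicate are assumptions for this post, not opencode-skill-creator's actual internals.

```typescript
// Sketch of a with-skill vs. without-skill benchmark loop.
// All names here are hypothetical, not the tool's real API.
interface EvalQuery {
  prompt: string;
  // Predicate judging whether the agent's response passes this eval.
  passes: (response: string) => boolean;
}

interface BenchmarkResult {
  withSkill: number;     // pass rate, 0..1
  withoutSkill: number;  // pass rate, 0..1
  improvementPp: number; // percentage-point delta
}

function benchmarkSkill(
  evals: EvalQuery[],
  // Runs the same model, tools, and prompt; the only variable is skillLoaded.
  runAgent: (prompt: string, skillLoaded: boolean) => string,
): BenchmarkResult {
  const passRate = (skillLoaded: boolean): number => {
    let passed = 0;
    for (const e of evals) {
      if (e.passes(runAgent(e.prompt, skillLoaded))) passed++;
    }
    return passed / evals.length;
  };
  const withSkill = passRate(true);
  const withoutSkill = passRate(false);
  return {
    withSkill,
    withoutSkill,
    improvementPp: Math.round((withSkill - withoutSkill) * 100),
  };
}
```

The key design point: everything except the skill flag is held constant, so the percentage-point delta is attributable to the skill alone.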

But the raw numbers only tell part of the story. The failures without skills aren't just wrong answers — they're dangerous actions.

The Scariest Failure: Real Money at Stake

adloop-write manages campaigns, ads, keywords, and budgets — operations that spend real money. Without the skill:

  • Added BROAD match keywords to MANUAL_CPC campaigns — the #1 cause of wasted ad spend
  • Set budget above safety caps ($100 when max is $50) — no guardrail
  • Deleted campaigns irreversibly without warning — no confirmation, no pause alternative
  • Batched multiple changes in one call — bypassing review steps
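The failures above are exactly the kind a guardrail layer can catch mechanically. Here is a hedged sketch of such checks, using the $50 cap and match-type rule from this post; the function, field names, and enum values are illustrative, not the adloop-write tool's actual API.

```typescript
// Hypothetical write-side guardrails for money-spending operations.
const MAX_DAILY_BUDGET_USD = 50; // example safety cap from the post

interface WriteRequest {
  dailyBudgetUsd?: number;
  matchType?: "EXACT" | "PHRASE" | "BROAD";
  biddingStrategy?: "MANUAL_CPC" | "SMART_BIDDING";
  action?: "CREATE" | "PAUSE" | "DELETE";
  confirmed?: boolean;
}

function validateWrite(req: WriteRequest): string[] {
  const violations: string[] = [];
  if (req.dailyBudgetUsd !== undefined && req.dailyBudgetUsd > MAX_DAILY_BUDGET_USD) {
    violations.push(
      `Budget $${req.dailyBudgetUsd} exceeds safety cap of $${MAX_DAILY_BUDGET_USD}`,
    );
  }
  if (req.matchType === "BROAD" && req.biddingStrategy === "MANUAL_CPC") {
    violations.push(
      "BROAD match on a MANUAL_CPC campaign wastes spend; use EXACT/PHRASE or Smart Bidding",
    );
  }
  if (req.action === "DELETE" && !req.confirmed) {
    violations.push("Deletion is irreversible; pause instead, or require explicit confirmation");
  }
  return violations; // empty array means safe to proceed
}
```

A skill encodes these rules as instructions the model follows; an eval suite then verifies the model actually follows them.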

This isn't about "better answers." This is about preventing real financial harm.

GDPR ≠ Broken Tracking

A common scenario: 500 clicks in Google Ads, 180 sessions in GA4. "Is my tracking broken?"

Without the skill, the AI diagnosed this as a tracking issue and offered to investigate.

With the skill, the AI recognized: "A 2.8:1 ratio is normal with GDPR consent banners. Google Ads counts all clicks. GA4 only counts consenting users. Your tracking is fine."

The #1 false positive in digital marketing analytics, prevented by domain knowledge in the skill.
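The domain knowledge here reduces to a simple heuristic. The sketch below shows the shape of it; the 1.5:1 and 4:1 thresholds are assumptions I'm using for illustration, not figures from GA4 documentation or from the skill itself.

```typescript
// Illustrative clicks-vs-sessions diagnosis. Thresholds are assumed values.
function diagnoseClickSessionGap(adClicks: number, ga4Sessions: number): string {
  const ratio = adClicks / ga4Sessions;
  if (ratio <= 1.5) return "Tracking looks healthy.";
  if (ratio <= 4) {
    // Consent banners: Google Ads counts all clicks, GA4 only consenting users.
    return "Gap is consistent with GDPR consent banners. Tracking is likely fine.";
  }
  return "Ratio is unusually high; investigate tag installation and conversion setup.";
}
```

For the post's scenario, 500 clicks against 180 sessions is a 2.8:1 ratio, which falls in the "consistent with consent banners" band rather than the "broken tracking" one.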

Don't Trust Google Blindly

Without the skill, the AI endorsed Google's recommendations at face value: "Raise budget" with zero conversions. "Add BROAD match" without Smart Bidding.

The skill explicitly states: "Google recommendations optimize for Google's revenue, not yours." It cross-references against conversion data first. The 73% improvement comes from teaching critical thinking, not compliance.
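That cross-referencing step can be sketched as a pair of preconditions. This is a hypothetical illustration of the decision logic, not code from the skill; the recommendation names and the `CampaignStats` shape are my own.

```typescript
// Hypothetical cross-check before acting on a Google recommendation.
interface CampaignStats {
  conversions: number;
  usesSmartBidding: boolean;
}

function shouldAcceptRecommendation(
  rec: "RAISE_BUDGET" | "ADD_BROAD_MATCH",
  stats: CampaignStats,
): boolean {
  switch (rec) {
    case "RAISE_BUDGET":
      // More budget with zero conversions just buys more non-converting clicks.
      return stats.conversions > 0;
    case "ADD_BROAD_MATCH":
      // Broad match without Smart Bidding is the wasted-spend failure mode above.
      return stats.usesSmartBidding;
  }
}
```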

Why This Matters

The same AI model. The same tools. The same prompts. The only variable: whether the skill is loaded. The difference is 67–83 percentage points.

Skills do three things bare tool access doesn't:

  1. Inject domain expertise — GDPR mechanics, budget rules, competition levels
  2. Enforce safety guardrails — budget caps, deletion warnings, one-change-at-a-time
  3. Provide orchestration patterns — when to call which tool, in what order, with what validation

Try It Yourself

```shell
npx opencode-skill-creator install --global
```

Free, open source (Apache 2.0). Works with any of OpenCode's 300+ supported models. Pure TypeScript, zero Python dependency.

github.com/antongulin/opencode-skill-creator

Skills are software. Software should be tested.


Anton Gulin is the AI QA Architect — the first person to claim this title on LinkedIn. He builds AI-powered test automation systems where AI agents and human engineers collaborate on quality. Former Apple SDET, current Lead Software Engineer in Test. Find him at anton.qa or on LinkedIn.
