DEV Community

Anton Gulin
I Ate My Own Dog Food: How I Benchmarked AI Skills and Proved Eval-Driven Development Works

I built a tool to test AI skills. Then I used it on my own project. The benchmarks shocked even me.

As a QA architect, I've spent my career building systems that verify software works correctly. At Apple, we tested everything — every interaction, every edge case, every regression. At CooperVision, I built a Playwright/TypeScript framework from scratch that increased test coverage by 300%.

So when I started working with AI agent skills, I noticed something: nobody was testing them.

You write a SKILL.md file. You try it manually once. Maybe it works for your prompt. You ship it.

There's no automated test suite. No regression testing. No CI pipeline that catches when a description change breaks triggering.

That's a QA problem. I built opencode-skill-creator to solve it.

Then I dogfooded it on a real project. Here's what happened.

The Project: AdLoop Skills for Google Ads

AdLoop is a Google Ads MCP (Model Context Protocol) integration — it connects AI agents to Google Ads and GA4 data through a set of tools.

I created 4 skills for AdLoop using opencode-skill-creator:

  1. adloop-planning — Keyword research, competition analysis, and budget forecasting
  2. adloop-read — Performance analysis, campaign reporting, and conversion diagnostics
  3. adloop-write — Campaign creation, ad management, keyword bidding, and budget changes (spends real money)
  4. adloop-tracking — GA4 event validation, conversion tracking diagnosis, and code generation

The Benchmark Results

opencode-skill-creator's benchmark runs each skill through its eval queries in two configurations:

  • With skill loaded — the AI has full domain knowledge, safety rules, and orchestration patterns
  • Without skill — the AI only has bare MCP tool names and descriptions
| Skill | Evals | With Skill | Without Skill | Improvement |
|---|---|---|---|---|
| adloop-write | 8 | 100% | 17% | +83pp |
| adloop-planning | 6 | 100% | 21% | +79pp |
| adloop-read | 8 | 100% | 27% | +73pp |
| adloop-tracking | 6 | 100% | 33% | +67pp |
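To make the two-configuration setup concrete, here is a minimal sketch of what such an eval loop looks like. This is illustrative only — `runAgent`, the `EvalQuery` shape, and the pass/fail predicate are assumptions for this post, not opencode-skill-creator's actual internals.

```typescript
// Sketch of a with-skill vs. without-skill benchmark loop.
// All names here are hypothetical, not the tool's real API.
interface EvalQuery {
  prompt: string;
  // Predicate judging whether the agent's response passes this eval.
  passes: (response: string) => boolean;
}

interface BenchmarkResult {
  withSkill: number;     // pass rate, 0..1
  withoutSkill: number;  // pass rate, 0..1
  improvementPp: number; // percentage-point delta
}

function benchmarkSkill(
  evals: EvalQuery[],
  // Runs the same model, tools, and prompt; the only variable is skillLoaded.
  runAgent: (prompt: string, skillLoaded: boolean) => string,
): BenchmarkResult {
  const passRate = (skillLoaded: boolean): number => {
    let passed = 0;
    for (const e of evals) {
      if (e.passes(runAgent(e.prompt, skillLoaded))) passed++;
    }
    return passed / evals.length;
  };
  const withSkill = passRate(true);
  const withoutSkill = passRate(false);
  return {
    withSkill,
    withoutSkill,
    improvementPp: Math.round((withSkill - withoutSkill) * 100),
  };
}
```

The key design point: everything except the skill flag is held constant, so the percentage-point delta is attributable to the skill alone.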

But the raw numbers only tell part of the story. The failures without skills aren't just wrong answers — they're dangerous actions.

The Scariest Failure: Real Money at Stake

adloop-write manages campaigns, ads, keywords, and budgets — operations that spend real money. Without the skill:

  • Added BROAD match keywords to MANUAL_CPC campaigns — the #1 cause of wasted ad spend
  • Set budget above safety caps ($100 when max is $50) — no guardrail
  • Deleted campaigns irreversibly without warning — no confirmation, no pause alternative
  • Batched multiple changes in one call — bypassing review steps
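The failures above are exactly the kind a guardrail layer can catch mechanically. Here is a hedged sketch of such checks, using the $50 cap and match-type rule from this post; the function, field names, and enum values are illustrative, not the adloop-write tool's actual API.

```typescript
// Hypothetical write-side guardrails for money-spending operations.
const MAX_DAILY_BUDGET_USD = 50; // example safety cap from the post

interface WriteRequest {
  dailyBudgetUsd?: number;
  matchType?: "EXACT" | "PHRASE" | "BROAD";
  biddingStrategy?: "MANUAL_CPC" | "SMART_BIDDING";
  action?: "CREATE" | "PAUSE" | "DELETE";
  confirmed?: boolean;
}

function validateWrite(req: WriteRequest): string[] {
  const violations: string[] = [];
  if (req.dailyBudgetUsd !== undefined && req.dailyBudgetUsd > MAX_DAILY_BUDGET_USD) {
    violations.push(
      `Budget $${req.dailyBudgetUsd} exceeds safety cap of $${MAX_DAILY_BUDGET_USD}`,
    );
  }
  if (req.matchType === "BROAD" && req.biddingStrategy === "MANUAL_CPC") {
    violations.push(
      "BROAD match on a MANUAL_CPC campaign wastes spend; use EXACT/PHRASE or Smart Bidding",
    );
  }
  if (req.action === "DELETE" && !req.confirmed) {
    violations.push("Deletion is irreversible; pause instead, or require explicit confirmation");
  }
  return violations; // empty array means safe to proceed
}
```

A skill encodes these rules as instructions the model follows; an eval suite then verifies the model actually follows them.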

This isn't about "better answers." This is about preventing real financial harm.

GDPR ≠ Broken Tracking

A common scenario: 500 clicks in Google Ads, 180 sessions in GA4. "Is my tracking broken?"

Without the skill, the AI diagnosed this as a tracking issue and offered to investigate.

With the skill, the AI recognized: "A 2.8:1 ratio is normal with GDPR consent banners. Google Ads counts all clicks. GA4 only counts consenting users. Your tracking is fine."

The #1 false positive in digital marketing analytics, prevented by domain knowledge in the skill.
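The domain knowledge here reduces to a simple heuristic. The sketch below shows the shape of it; the 1.5:1 and 4:1 thresholds are assumptions I'm using for illustration, not figures from GA4 documentation or from the skill itself.

```typescript
// Illustrative clicks-vs-sessions diagnosis. Thresholds are assumed values.
function diagnoseClickSessionGap(adClicks: number, ga4Sessions: number): string {
  const ratio = adClicks / ga4Sessions;
  if (ratio <= 1.5) return "Tracking looks healthy.";
  if (ratio <= 4) {
    // Consent banners: Google Ads counts all clicks, GA4 only consenting users.
    return "Gap is consistent with GDPR consent banners. Tracking is likely fine.";
  }
  return "Ratio is unusually high; investigate tag installation and conversion setup.";
}
```

For the post's scenario, 500 clicks against 180 sessions is a 2.8:1 ratio, which falls in the "consistent with consent banners" band rather than the "broken tracking" one.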

Don't Trust Google Blindly

Without the skill, the AI endorsed Google's recommendations at face value: "Raise budget" with zero conversions. "Add BROAD match" without Smart Bidding.

The skill explicitly states: "Google recommendations optimize for Google's revenue, not yours." It cross-references against conversion data first. The 73% improvement comes from teaching critical thinking, not compliance.
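That cross-referencing step can be sketched as a pair of preconditions. This is a hypothetical illustration of the decision logic, not code from the skill; the recommendation names and the `CampaignStats` shape are my own.

```typescript
// Hypothetical cross-check before acting on a Google recommendation.
interface CampaignStats {
  conversions: number;
  usesSmartBidding: boolean;
}

function shouldAcceptRecommendation(
  rec: "RAISE_BUDGET" | "ADD_BROAD_MATCH",
  stats: CampaignStats,
): boolean {
  switch (rec) {
    case "RAISE_BUDGET":
      // More budget with zero conversions just buys more non-converting clicks.
      return stats.conversions > 0;
    case "ADD_BROAD_MATCH":
      // Broad match without Smart Bidding is the wasted-spend failure mode above.
      return stats.usesSmartBidding;
  }
}
```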

Why This Matters

The same AI model. The same tools. The same prompts. The only variable: whether the skill is loaded. The difference is 67–83 percentage points.

Skills do three things bare tool access doesn't:

  1. Inject domain expertise — GDPR mechanics, budget rules, competition levels
  2. Enforce safety guardrails — budget caps, deletion warnings, one-change-at-a-time
  3. Provide orchestration patterns — when to call which tool, in what order, with what validation

Try It Yourself

```shell
npx opencode-skill-creator install --global
```

Free, open source (Apache 2.0). Works with any of OpenCode's 300+ supported models. Pure TypeScript, zero Python dependency.

github.com/antongulin/opencode-skill-creator

Skills are software. Software should be tested.


Anton Gulin is the AI QA Architect — the first person to claim this title on LinkedIn. He builds AI-powered test automation systems where AI agents and human engineers collaborate on quality. Former Apple SDET, current Lead Software Engineer in Test. Find him at anton.qa or on LinkedIn.
