Part 3 of the AI Coding Tools Deep Dive. Parts 1 and 2 covered every tool and how to run them for free. This one asks the question nobody's asking.
I've now tested or researched 30+ AI coding tools. They're all good. Some are great. But they all share the same blind spot.
The Blind Spot
Pick any tool from Parts 1 and 2. Ask it a simple question:
"Is the AI getting better at helping me?"
Not "is the model improving" (that's Google's, Anthropic's, or OpenAI's problem). Is YOUR instance, in YOUR codebase, with YOUR patterns, actually producing better results this week than last week?
No tool can answer this. Not Claude Code. Not Cursor. Not Aider. Not Gemini CLI. Not any of the 30+ tools I surveyed.
They all produce output. None of them measure quality. The AI is always 100% confident, and nobody checks.
What "Getting Better" Would Require
To know if your AI coding assistant is improving, you'd need to track:
What it investigated vs what it assumed. Did it read the code before editing, or pattern-match from training data? There's no log.
Whether its confidence matched reality. It said it understood the module. Did the tests pass? Nobody compares.
What it learned across sessions. After 100 sessions in your codebase, does it make fewer mistakes? There's no measurement.
What it didn't know it didn't know. The most dangerous bugs come from areas the AI never investigated. There's no mechanism to surface these blind spots.
This isn't a feature request. It's a category of infrastructure that doesn't exist in any of these tools.
Why This Matters For You
If you're a solo developer, the cost of an uncalibrated AI is your debugging time. The AI introduces a subtle bug it was "confident" about, and you spend an hour finding it.
If you're a team lead, it's worse. Your AI-assisted PRs look right and pass review, then the regression shows up in production — because nobody measured whether the AI actually understood the code it changed.
If you're building AI-assisted workflows, it's compounding. The AI makes the same class of mistake on day 100 that it made on day 1, because nothing in the system tracks whether its predictions are improving.
What Measurement Would Look Like
Imagine every AI coding session had three checkpoints:
Before (PREFLIGHT): The AI declares what it thinks it knows. "I understand the auth module at 60% confidence. I'm uncertain about the session store."
Gate (CHECK): After investigating, before acting. "I've read the middleware chain, logged what I found, identified two unknowns. My confidence is now 82%. Ready to implement."
After (POSTFLIGHT): The work is measured. Tests pass or fail. Linter reports clean or dirty. Git diff shows what actually changed. The AI's confidence claim is compared against this evidence.
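The three checkpoints reduce to a tiny state machine. This is purely illustrative — the names PREFLIGHT, CHECK, and POSTFLIGHT come from the description above, but the functions, thresholds, and fields are all assumptions, not any tool's real API:

```python
def preflight(claimed: float) -> dict:
    """AI declares its prior confidence before touching anything."""
    return {"claimed_before": claimed}


def check(state: dict, claimed: float, min_confidence: float = 0.75) -> dict:
    """Gate: after investigation, refuse to act below a threshold."""
    state["claimed_after"] = claimed
    state["gate_passed"] = claimed >= min_confidence
    return state


def postflight(state: dict, tests_passed: bool) -> dict:
    """Compare the confidence claim against deterministic evidence."""
    evidence = 1.0 if tests_passed else 0.0
    state["calibration_gap"] = abs(state["claimed_after"] - evidence)
    return state


# The auth-module example: 60% going in, 82% after reading, tests green.
s = postflight(check(preflight(0.60), 0.82), tests_passed=True)
print(s["gate_passed"], round(s["calibration_gap"], 2))  # True 0.18
```

The point isn't this particular code — it's that each checkpoint produces a number you can store and compare later, instead of a vibe.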
The delta between Before and After is the learning. The gap between the AI's claim and the evidence is the calibration score. Over time, both should improve. If they don't, the tool isn't getting better — it's just getting more confident.
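Over many sessions, those two numbers become trends, and the failure mode described above — rising confidence with flat calibration — becomes detectable. A toy check, using made-up numbers purely for illustration:

```python
# Per-session (claimed_confidence, calibration_gap) pairs.
# These values are invented for the example, not real measurements.
history = [(0.70, 0.30), (0.80, 0.30), (0.90, 0.31), (0.95, 0.30)]

confidences = [c for c, _ in history]
gaps = [g for _, g in history]

more_confident = confidences[-1] > confidences[0]
better_calibrated = gaps[-1] < gaps[0]

if more_confident and not better_calibrated:
    print("Confidence is rising but accuracy isn't: not actually improving.")
```

Run against real session logs, a check like this is the difference between a tool that learns your codebase and one that just sounds surer about it.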
My Stack (Honest Assessment)
After six months of testing everything:
- Claude Code (Pro, $20/mo) — for serious multi-file work. Best reasoning, 1M context.
- Gemini CLI (free) — for quick questions and one-shots. 1,000 requests/day is generous.
- Aider (BYOK) — for pair-programming sessions where I want clean git history.
- Ollama + Qwen 2.5 Coder 32B (local, free) — for offline work and privacy-sensitive repos.
- Empirica — for measuring whether any of the above is actually getting better.
That last one is ours. Open source (MIT). It hooks into Claude Code (and eventually any tool with a hook system) to track epistemic vectors, gate actions behind investigation, and verify self-assessments against deterministic evidence.
I'm not going to pitch it here — if you've read this far, you either see the gap or you don't. The code is on GitHub. The technical deep-dive is on Dev.to.
The Bottom Line
The tool wars are over. They're all good enough. The real question isn't which tool to pick — it's whether you're measuring if the tool is actually making you better.
Right now, nobody is. That's the gap.
This completes the AI Coding Tools Deep Dive:
- Every AI Coding CLI in 2026: The Complete Map
- Running AI Coding Agents for Free
- What Every AI Coding Tool Gets Wrong (this article)
For the epistemic measurement deep-dive: Epistemic AI Series (5 parts)
Empirica on GitHub — measurement infrastructure for AI. MIT licensed.