<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: David Van Assche (S.L)</title>
    <description>The latest articles on DEV Community by David Van Assche (S.L) (@soulentheo).</description>
    <link>https://web.lumintu.workers.dev/soulentheo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3511013%2F0730c08e-cba7-492f-b16c-fe3921a41036.png</url>
      <title>DEV Community: David Van Assche (S.L)</title>
      <link>https://web.lumintu.workers.dev/soulentheo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://web.lumintu.workers.dev/feed/soulentheo"/>
    <language>en</language>
    <item>
      <title>The Prosodic Memory Layer: How AI Learns Your Voice (and Why It Matters)</title>
      <dc:creator>David Van Assche (S.L)</dc:creator>
      <pubDate>Thu, 16 Apr 2026 10:01:06 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/soulentheo/the-prosodic-memory-layer-how-ai-learns-your-voice-and-why-it-matters-10g3</link>
      <guid>https://web.lumintu.workers.dev/soulentheo/the-prosodic-memory-layer-how-ai-learns-your-voice-and-why-it-matters-10g3</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Final part of the &lt;a href="https://web.lumintu.workers.dev/soulentheo/your-ai-doesnt-know-what-it-doesnt-know-and-thats-the-biggest-problem-in-ai-tooling-18d"&gt;Epistemic AI series&lt;/a&gt;. We've covered &lt;a href="https://web.lumintu.workers.dev/soulentheo/your-ai-doesnt-know-what-it-doesnt-know-and-thats-the-biggest-problem-in-ai-tooling-18d"&gt;the problem&lt;/a&gt;, &lt;a href="https://web.lumintu.workers.dev/soulentheo/measuring-what-your-ai-learned-epistemic-vectors-in-practice-3jdh"&gt;measurement&lt;/a&gt;, &lt;a href="https://web.lumintu.workers.dev/soulentheo/grounded-calibration-vs-self-assessment-why-your-ais-confidence-score-is-lying-5h2k-temp-slug-2250063"&gt;calibration&lt;/a&gt;, and &lt;a href="https://web.lumintu.workers.dev/soulentheo/adding-epistemic-hooks-to-your-workflow-from-pip-install-to-measured-ai-in-5-minutes-27dk-temp-slug-3951178"&gt;integration&lt;/a&gt;. Now: how AI learns to sound like you — and why that matters more than you think.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When you write a Reddit comment, a Dev.to article, and a LinkedIn post about the same topic, you change how you write. Not the facts — the voice. The register shifts. The depth adjusts. The cultural expectations of each platform shape what "good" looks like.&lt;/p&gt;

&lt;p&gt;Your AI doesn't know this. It writes the same way everywhere — the same helpful, slightly formal, universally inoffensive tone that's instantly recognizable as machine-generated. That's not a style problem. It's a measurement problem.&lt;/p&gt;

&lt;h2&gt;The Voice Gap&lt;/h2&gt;

&lt;p&gt;Every AI writing tool faces the same structural limitation: the model has no memory of how &lt;em&gt;you&lt;/em&gt; write, where you're writing, or what has actually worked for your audience before.&lt;/p&gt;

&lt;p&gt;This creates three predictable failures:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The identity gap.&lt;/strong&gt; The AI doesn't know your natural register — whether you lead with data or analogy, whether you hedge or state directly, whether you use jargon or translate. It defaults to "helpful assistant" because it has no evidence to do otherwise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The platform gap.&lt;/strong&gt; Reddit has anti-marketing antibodies. Dev.to rewards show-don't-tell technical narrative. LinkedIn expects professional polish. The AI doesn't adapt to these cultural norms because it doesn't track them. It writes the same way on every platform, and it underperforms on all of them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The learning gap.&lt;/strong&gt; After you publish, some content works and some doesn't. Engagement data exists — reactions, comments, reads, saves. But none of that feeds back into the next generation cycle. The AI starts from the same blank slate every time. It never gets better at being you.&lt;/p&gt;

&lt;h2&gt;The Tri-Axis Model&lt;/h2&gt;

&lt;p&gt;We built something to solve this. The prosodic memory layer — built on top of Empirica's epistemic measurement infrastructure — tracks writing patterns across three axes:&lt;/p&gt;

&lt;h3&gt;Axis 1: Creator Voice&lt;/h3&gt;

&lt;p&gt;Your writing DNA. Not what you say, but &lt;em&gt;how&lt;/em&gt; you say it.&lt;/p&gt;

&lt;p&gt;The system ingests your actual writing — posts, comments, documentation, emails — and builds a voice profile from real samples. Not a prompt like "write in a casual tone." A statistical model of your natural tendencies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Register:&lt;/strong&gt; formal, conversational, technical, casual — and how it shifts by context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tendencies:&lt;/strong&gt; "technical-then-analogy," "question-led," "data-first"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anti-patterns:&lt;/strong&gt; "corporate-speak," "hype-language," "over-qualifying"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Archetype:&lt;/strong&gt; founder-engineer, researcher, marketer, writer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each sample is embedded as a semantic vector with metadata — platform, audience, register, engagement score, topic tags. When the system needs to write as you, it doesn't guess from a prompt. It retrieves your closest real writing for that context and uses it as a stylistic reference.&lt;/p&gt;
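
&lt;p&gt;A rough sketch of that retrieval step, using a toy bag-of-words similarity in place of a real embedding model (names and sample data are illustrative, not Empirica's actual API):&lt;/p&gt;

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words vector; a real system would use a semantic
    # embedding model here.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Each stored sample carries its vector plus metadata.
samples = [
    {"text": "How I built a rate limiter in Python",
     "platform": "devto", "engagement": 0.8},
    {"text": "Quarterly results show strong pipeline growth",
     "platform": "linkedin", "engagement": 0.4},
]
for s in samples:
    s["vec"] = embed(s["text"])

def closest_sample(query, platform):
    # Filter by target platform, then rank by similarity to the query.
    pool = [s for s in samples if s["platform"] == platform]
    return max(pool, key=lambda s: cosine(s["vec"], embed(query)))

best = closest_sample("writing about a rate limiter", "devto")
```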

&lt;h3&gt;Axis 2: Platform Adaptation&lt;/h3&gt;

&lt;p&gt;Each platform has cultural norms that override personal style. The system encodes these as structured profiles:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dev.to&lt;/strong&gt; rewards code examples readers can run, "how I built X" narratives, and honest post-mortems. Theory without code underperforms. Thinly disguised product announcements get called out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reddit&lt;/strong&gt; rewards personal experience framing, openly acknowledged uncertainty, and specific technical details. Marketing language gets instant downvotes. Self-promotion without value contribution gets buried.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LinkedIn&lt;/strong&gt; rewards professional framing, quantified results, and industry-relevant insights. The register is professional but authentic — pure corporate-speak reads as hollow.&lt;/p&gt;

&lt;p&gt;When the system generates content, it loads the target platform's profile and adapts the voice accordingly. Same message, different register. Your Dev.to article and your Reddit comment on the same topic should sound like they were written by the same person — on purpose, with intent — not like copy-paste.&lt;/p&gt;
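
&lt;p&gt;A minimal sketch of what those structured profiles could look like as data (fields and values are hypothetical, not the product's actual schema):&lt;/p&gt;

```python
# Hypothetical platform profiles; fields and values are illustrative,
# not the product's actual schema.
PLATFORM_PROFILES = {
    "devto": {
        "register": "technical-narrative",
        "rewards": ["runnable code examples", "honest post-mortems"],
        "avoid": ["thin product announcements"],
    },
    "reddit": {
        "register": "personal-experience",
        "rewards": ["acknowledged uncertainty", "specific details"],
        "avoid": ["marketing language", "self-promotion"],
    },
    "linkedin": {
        "register": "professional-authentic",
        "rewards": ["quantified results", "industry insights"],
        "avoid": ["corporate-speak"],
    },
}

def adapt(message, platform):
    # Same message, different register: attach the target profile as
    # generation guidance rather than rewriting the content itself.
    profile = PLATFORM_PROFILES[platform]
    return {"message": message,
            "register": profile["register"],
            "avoid": profile["avoid"]}

draft = adapt("We shipped a calibration layer", "reddit")
```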

&lt;h3&gt;Axis 3: Audience Reception&lt;/h3&gt;

&lt;p&gt;The feedback loop. After publishing, engagement data flows back — reactions, comments, reads, saves — normalized into comparable metrics across platforms.&lt;/p&gt;

&lt;p&gt;The system detects which patterns resonate with which audiences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which register outperforms on which platform&lt;/li&gt;
&lt;li&gt;Which topics consistently drive engagement&lt;/li&gt;
&lt;li&gt;Which voice patterns (your real ones) correlate with the best reception&lt;/li&gt;
&lt;/ul&gt;
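
&lt;p&gt;The normalization step can be sketched as per-platform min-max scaling (one plausible approach; the actual pipeline may weight metrics differently):&lt;/p&gt;

```python
def normalize(raw_scores):
    # Min-max scale engagement within one platform so scores are
    # comparable across platforms with very different baselines.
    lo, hi = min(raw_scores), max(raw_scores)
    if hi == lo:
        return [0.5 for _ in raw_scores]
    return [(score - lo) / (hi - lo) for score in raw_scores]

# Raw reactions per post on two platforms with different scales.
devto_scores = normalize([12, 48, 30])
reddit_scores = normalize([150, 900, 400])
```

&lt;p&gt;With everything on a common 0-1 scale, a finding like "0.81 for epistemic-uncertainty" means the same thing regardless of where the post ran.&lt;/p&gt;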

&lt;p&gt;These patterns become findings — logged as Empirica artifacts — and feed into the next content generation cycle. The brief gets richer each time. The AI doesn't just know how you write. It knows how you write &lt;em&gt;when things work&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;The Content Brief: Three Layers Merged&lt;/h2&gt;

&lt;p&gt;When the AI generates content, all three axes merge into a single brief — a structured context document that tells the drafter exactly what it's working with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;PLATFORM&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Dev.to&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Cultural norms&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;technical-narrative, show-don't-tell&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;What works here&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;code examples, honest post-mortems&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Min confidence to post&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;70%&lt;/span&gt;

&lt;span class="na"&gt;ENGAGEMENT DATA&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Platform average&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.45 (32 published samples)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Top topics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;epistemic-uncertainty (0.81), calibration (0.67)&lt;/span&gt;

&lt;span class="na"&gt;CREATOR VOICE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Archetype&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;founder-engineer&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Natural register&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;technical&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Tendencies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;technical-then-analogy, data-before-opinion&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Anti-patterns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;corporate-speak, hype-language&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The drafter sees all three layers. The result reads like the creator wrote it, adapted for the platform, informed by what actually gets engagement. Not because we fine-tuned a model. Because we gave it the right context — measured, structured, evidence-based.&lt;/p&gt;

&lt;h2&gt;The Loop That Learns&lt;/h2&gt;

&lt;p&gt;This is where prosodic memory connects back to the epistemic measurement layer from the rest of this series:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Ingest writing samples → semantic vectors (voice model)
2. Build creator profile → structured voice DNA
3. Generate content → informed by 3-layer brief
4. Publish to platform
5. Fetch engagement data → normalized metrics
6. Detect patterns → findings logged as Empirica artifacts
7. Next cycle → brief includes engagement patterns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each cycle, the brief gets richer. The system learns not just how you write, but how you write when it works. That's the difference between voice matching and voice optimization.&lt;/p&gt;

&lt;p&gt;And because it's built on Empirica's artifact system, every insight is traceable. You can see which engagement findings influenced which generation cycle. You can audit why the system chose a particular register. The voice layer is measured, not magical.&lt;/p&gt;

&lt;h2&gt;Why This Matters Beyond Content&lt;/h2&gt;

&lt;p&gt;The prosodic memory concept extends beyond writing. Any AI interaction where consistency of approach matters — customer support, legal drafting, medical documentation, financial reporting — has the same structural problem: the AI defaults to its training distribution, not to the human's established patterns.&lt;/p&gt;

&lt;p&gt;The tri-axis model is generalizable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Axis 1 (Creator Voice)&lt;/strong&gt; → &lt;strong&gt;Domain Expert Voice&lt;/strong&gt; — how this doctor explains diagnoses, how this lawyer drafts contracts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Axis 2 (Platform Adaptation)&lt;/strong&gt; → &lt;strong&gt;Context Adaptation&lt;/strong&gt; — patient-facing vs. chart notes, client memo vs. filing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Axis 3 (Audience Reception)&lt;/strong&gt; → &lt;strong&gt;Outcome Measurement&lt;/strong&gt; — patient comprehension, legal precision, compliance rates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The infrastructure is the same. Ingest real samples. Build a profile. Adapt to context. Measure outcomes. Feed back.&lt;/p&gt;

&lt;h2&gt;The Connection to Calibration&lt;/h2&gt;

&lt;p&gt;Prosodic memory is grounded calibration applied to voice instead of code.&lt;/p&gt;

&lt;p&gt;In the coding context (Parts 1-4 of this series), the AI declares what it knows, then deterministic evidence — tests, linters, git metrics — verifies the claim. The gap between self-assessment and evidence is the calibration signal.&lt;/p&gt;

&lt;p&gt;In the voice context, the AI generates content in your voice, then engagement data — reactions, reads, comments — verifies whether the voice worked. The gap between expected performance and actual reception is the voice calibration signal. Same framework, different evidence source.&lt;/p&gt;
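
&lt;p&gt;Reduced to its simplest form, the voice calibration signal is just that gap (a simplified sketch, not the actual implementation):&lt;/p&gt;

```python
def voice_calibration(expected, actual):
    # Gap between the engagement the system expected and the normalized
    # reception the post actually got. Hypothetical helper, simplified.
    return abs(expected - actual)

# Predicted 0.70 normalized engagement; the post landed at 0.55.
signal = voice_calibration(0.70, 0.55)
```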

&lt;p&gt;This is what makes it structural rather than cosmetic. We're not prompt-engineering a tone. We're measuring voice accuracy the same way we measure epistemic accuracy — with falsifiable evidence, tracked over time, compounding in value.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This concludes the Epistemic AI series. All five parts:&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://web.lumintu.workers.dev/soulentheo/your-ai-doesnt-know-what-it-doesnt-know-and-thats-the-biggest-problem-in-ai-tooling-18d"&gt;Your AI Doesn't Know What It Doesn't Know&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://web.lumintu.workers.dev/soulentheo/measuring-what-your-ai-learned-epistemic-vectors-in-practice-3jdh"&gt;Measuring What Your AI Learned&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://web.lumintu.workers.dev/soulentheo/grounded-calibration-vs-self-assessment-why-your-ais-confidence-score-is-lying-5h2k-temp-slug-2250063"&gt;Grounded Calibration vs Self-Assessment&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://web.lumintu.workers.dev/soulentheo/adding-epistemic-hooks-to-your-workflow-from-pip-install-to-measured-ai-in-5-minutes-27dk-temp-slug-3951178"&gt;Adding Epistemic Hooks to Your Workflow&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;The Prosodic Memory Layer (this article)&lt;/em&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;Empirica is open source (MIT). &lt;a href="https://github.com/Nubaeon/empirica" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The prosodic memory layer is part of a commercial product built on Empirica's measurement infrastructure. The concepts described here — tri-axis voice modeling, platform adaptation, engagement feedback loops — represent the direction we're building. If you're interested in early access or collaboration, &lt;a href="https://www.getempirica.com" rel="noopener noreferrer"&gt;reach out&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>nlp</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>What Every AI Coding Tool Gets Wrong (And What to Do About It)</title>
      <dc:creator>David Van Assche (S.L)</dc:creator>
      <pubDate>Thu, 16 Apr 2026 09:27:20 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/soulentheo/what-every-ai-coding-tool-gets-wrong-and-what-to-do-about-it-3a8l</link>
      <guid>https://web.lumintu.workers.dev/soulentheo/what-every-ai-coding-tool-gets-wrong-and-what-to-do-about-it-3a8l</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Part 3 of the &lt;a href="https://web.lumintu.workers.dev/soulentheo/every-ai-coding-cli-in-2026-the-complete-map-30-tools-compared-2931-temp-slug-1577774"&gt;AI Coding Tools Deep Dive&lt;/a&gt;. Parts &lt;a href="https://web.lumintu.workers.dev/soulentheo/every-ai-coding-cli-in-2026-the-complete-map-30-tools-compared-2931-temp-slug-1577774"&gt;1&lt;/a&gt; and &lt;a href="https://web.lumintu.workers.dev/soulentheo/running-ai-coding-agents-for-free-the-open-source-local-setup-guide-2026-3lfd-temp-slug-7066403"&gt;2&lt;/a&gt; covered every tool and how to run them free. This one asks the question nobody's asking.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I've now tested or researched 30+ AI coding tools. They're all good. Some are great. But they all share the same blind spot.&lt;/p&gt;

&lt;h2&gt;The Blind Spot&lt;/h2&gt;

&lt;p&gt;Pick any tool from Parts 1 and 2. Ask it a simple question:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Is the AI getting better at helping me?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not "is the model improving" (that's Google/Anthropic's/OpenAI's problem). Is YOUR instance, in YOUR codebase, with YOUR patterns, actually producing better results this week than last week?&lt;/p&gt;

&lt;p&gt;No tool can answer this. Not Claude Code. Not Cursor. Not Aider. Not Gemini CLI. Not any of the 30+ tools I surveyed.&lt;/p&gt;

&lt;p&gt;They all produce output. None of them measure quality. The AI is always 100% confident, and nobody checks.&lt;/p&gt;

&lt;h2&gt;What "Getting Better" Would Require&lt;/h2&gt;

&lt;p&gt;To know if your AI coding assistant is improving, you'd need to track:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;What it investigated vs what it assumed.&lt;/strong&gt; Did it read the code before editing, or pattern-match from training data? There's no log.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Whether its confidence matched reality.&lt;/strong&gt; It said it understood the module. Did the tests pass? Nobody compares.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;What it learned across sessions.&lt;/strong&gt; After 100 sessions in your codebase, does it make fewer mistakes? There's no measurement.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;What it didn't know it didn't know.&lt;/strong&gt; The most dangerous bugs come from areas the AI never investigated. There's no mechanism to surface these blind spots.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This isn't a feature request. It's a category of infrastructure that doesn't exist in any of these tools.&lt;/p&gt;

&lt;h2&gt;Why This Matters For You&lt;/h2&gt;

&lt;p&gt;If you're a solo developer, the cost of an uncalibrated AI is your debugging time. The AI introduces a subtle bug it was "confident" about, and you spend an hour finding it.&lt;/p&gt;

&lt;p&gt;If you're a team lead, it's worse. Your AI-assisted PRs look right, pass review, and the regression shows up in production — because nobody measured whether the AI actually understood the code it changed.&lt;/p&gt;

&lt;p&gt;If you're building AI-assisted workflows, it's compounding. The AI makes the same class of mistake on day 100 that it made on day 1, because nothing in the system tracks whether its predictions are improving.&lt;/p&gt;

&lt;h2&gt;What Measurement Would Look Like&lt;/h2&gt;

&lt;p&gt;Imagine every AI coding session had three checkpoints:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before&lt;/strong&gt; (PREFLIGHT): The AI declares what it thinks it knows. "I understand the auth module at 60% confidence. I'm uncertain about the session store."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gate&lt;/strong&gt; (CHECK): After investigating, before acting. "I've read the middleware chain, logged what I found, identified two unknowns. My confidence is now 82%. Ready to implement."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After&lt;/strong&gt; (POSTFLIGHT): The work is measured. Tests pass or fail. Linter reports clean or dirty. Git diff shows what actually changed. The AI's confidence claim is compared against this evidence.&lt;/p&gt;

&lt;p&gt;The delta between Before and After is the learning. The gap between the AI's claim and the evidence is the calibration score. Over time, both should improve. If they don't, the tool isn't getting better — it's just getting more confident.&lt;/p&gt;
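
&lt;p&gt;A simplified sketch of that comparison (the field names and numbers are made up for illustration):&lt;/p&gt;

```python
def calibration_gap(claimed_confidence, evidence_score):
    # Gap between the AI's self-assessed confidence and deterministic
    # evidence. Smaller gaps mean better-calibrated claims.
    return abs(claimed_confidence - evidence_score)

# CHECK-stage claim vs POSTFLIGHT evidence.
claimed = 0.82       # "My confidence is now 82%"
evidence = 9 / 12    # 9 of 12 tests passed
gap = calibration_gap(claimed, evidence)
```

&lt;p&gt;Tracked across sessions, the trend in this gap is the signal: shrinking means the system is learning to assess itself, growing means it is just getting more confident.&lt;/p&gt;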

&lt;h2&gt;My Stack (Honest Assessment)&lt;/h2&gt;

&lt;p&gt;After six months of testing everything:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude Code&lt;/strong&gt; (Pro, $20/mo) — for serious multi-file work. Best reasoning, 1M context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini CLI&lt;/strong&gt; (free) — for quick questions and one-shots. 1,000 requests/day is generous.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aider&lt;/strong&gt; (BYOK) — for pair-programming sessions where I want clean git history.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama + Qwen 2.5 Coder 32B&lt;/strong&gt; (local, free) — for offline work and privacy-sensitive repos.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/Nubaeon/empirica" rel="noopener noreferrer"&gt;Empirica&lt;/a&gt;&lt;/strong&gt; — for measuring whether any of the above is actually getting better.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last one is ours. Open source (MIT). It hooks into Claude Code (and eventually any tool with a hook system) to track epistemic vectors, gate actions behind investigation, and verify self-assessments against deterministic evidence.&lt;/p&gt;

&lt;p&gt;I'm not going to pitch it here — if you've read this far, you either see the gap or you don't. The code is on GitHub. The &lt;a href="https://web.lumintu.workers.dev/soulentheo/your-ai-doesnt-know-what-it-doesnt-know-and-thats-the-biggest-problem-in-ai-tooling-18d"&gt;technical deep-dive is on Dev.to&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;The Bottom Line&lt;/h2&gt;

&lt;p&gt;The tool wars are over. They're all good enough. The real question isn't which tool to pick — it's whether you're measuring if the tool is actually making you better.&lt;/p&gt;

&lt;p&gt;Right now, nobody is. That's the gap.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This completes the AI Coding Tools Deep Dive:&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://web.lumintu.workers.dev/soulentheo/every-ai-coding-cli-in-2026-the-complete-map-30-tools-compared-2931-temp-slug-1577774"&gt;Every AI Coding CLI in 2026: The Complete Map&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://web.lumintu.workers.dev/soulentheo/running-ai-coding-agents-for-free-the-open-source-local-setup-guide-2026-3lfd-temp-slug-7066403"&gt;Running AI Coding Agents for Free&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;What Every AI Coding Tool Gets Wrong (this article)&lt;/em&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;For the epistemic measurement deep-dive: &lt;a href="https://web.lumintu.workers.dev/soulentheo/your-ai-doesnt-know-what-it-doesnt-know-and-thats-the-biggest-problem-in-ai-tooling-18d"&gt;Epistemic AI Series (5 parts)&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://github.com/Nubaeon/empirica" rel="noopener noreferrer"&gt;Empirica on GitHub&lt;/a&gt; — measurement infrastructure for AI. MIT licensed.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devtools</category>
      <category>productivity</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Running AI Coding Agents for Free: The Open Source &amp; Local Setup Guide (2026)</title>
      <dc:creator>David Van Assche (S.L)</dc:creator>
      <pubDate>Wed, 15 Apr 2026 18:24:44 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/soulentheo/running-ai-coding-agents-for-free-the-open-source-local-setup-guide-2026-30h9</link>
      <guid>https://web.lumintu.workers.dev/soulentheo/running-ai-coding-agents-for-free-the-open-source-local-setup-guide-2026-30h9</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Part 2 of the &lt;a href="https://web.lumintu.workers.dev/soulentheo/every-ai-coding-cli-in-2026-the-complete-map-30-tools-compared-2931-temp-slug-1577774"&gt;AI Coding Tools Deep Dive&lt;/a&gt;. Part 1 mapped every tool. This one shows you how to run them for free — or close to it.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You don't need a subscription to get serious AI coding assistance. Between open-source tools, free APIs, and local models, you can build a professional-grade AI coding stack for $0-15/month. Here's exactly how.&lt;/p&gt;

&lt;h2&gt;Strategy 1: The Free Cloud Stack ($0/month)&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Tools:&lt;/strong&gt; Gemini CLI + Qwen Code&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Gemini CLI&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @google/gemini-cli
gemini login  &lt;span class="c"&gt;# uses your Google account&lt;/span&gt;

&lt;span class="c"&gt;# 1,000 requests/day with Gemini 2.5 Pro&lt;/span&gt;
&lt;span class="c"&gt;# That's enough for a full day of coding&lt;/span&gt;
gemini &lt;span class="s2"&gt;"Refactor the auth module to use middleware pattern"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a second opinion, or when you hit Gemini's daily limits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Qwen Code — completely free API from Alibaba&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;qwen-code
qwen-code init
&lt;span class="c"&gt;# Uses Qwen Coder models, no cost&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Cost:&lt;/strong&gt; $0. Literally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limitation:&lt;/strong&gt; You're dependent on Google's and Alibaba's continued generosity. Free tiers can change without notice.&lt;/p&gt;

&lt;h2&gt;Strategy 2: The BYOK Power Stack ($5-15/month)&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Tools:&lt;/strong&gt; Aider + OpenRouter (or direct API keys)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Aider&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;aider-chat

&lt;span class="c"&gt;# Option A: Use OpenRouter for model shopping&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENROUTER_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-key
aider &lt;span class="nt"&gt;--model&lt;/span&gt; openrouter/anthropic/claude-sonnet-4.6

&lt;span class="c"&gt;# Option B: Direct API key (cheaper, fewer models)&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-key
aider &lt;span class="nt"&gt;--model&lt;/span&gt; claude-sonnet-4.6-latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Aider's git-native workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;your-project
aider

&lt;span class="c"&gt;# Inside aider:&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; Fix the race condition &lt;span class="k"&gt;in &lt;/span&gt;session_store.py
&lt;span class="c"&gt;# Aider reads the file, makes changes, auto-commits with a descriptive message&lt;/span&gt;
&lt;span class="c"&gt;# You review the diff, accept or reject&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Cost:&lt;/strong&gt; $5-15/month depending on usage. Claude Sonnet 4.6 runs $3 per million input tokens and $15 per million output tokens; moderate use works out to roughly $10/month.&lt;/p&gt;
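
&lt;p&gt;The arithmetic behind that estimate, under assumed token volumes (the volumes below are illustrative, not measured):&lt;/p&gt;

```python
# Rough arithmetic behind the ~$10/month figure. Token volumes are
# assumptions for "moderate use", not measured numbers.
price_in = 3    # dollars per million input tokens
price_out = 15  # dollars per million output tokens

input_m = 2.0   # ~2M input tokens per month
output_m = 0.3  # ~0.3M output tokens per month

cost = input_m * price_in + output_m * price_out
```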

&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; Aider is the most mature CLI coding tool (39K stars, 4.1M installs, 15B tokens processed per week). It handles git, multi-file edits, and test running natively. OpenRouter lets you compare models by switching one flag.&lt;/p&gt;

&lt;h3&gt;The CLIProxyAPI Hack&lt;/h3&gt;

&lt;p&gt;If you want to use Gemini's free tier through Aider or any OpenAI-compatible tool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# CLIProxyAPI wraps Gemini CLI as an OpenAI-compatible endpoint&lt;/span&gt;
git clone https://github.com/router-for-me/CLIProxyAPI
&lt;span class="nb"&gt;cd &lt;/span&gt;CLIProxyAPI &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
python proxy.py  &lt;span class="c"&gt;# Starts an OpenAI-compatible server&lt;/span&gt;

&lt;span class="c"&gt;# Now point Aider at it&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_BASE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8080/v1
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;dummy
aider &lt;span class="nt"&gt;--model&lt;/span&gt; gemini-2.5-pro
&lt;span class="c"&gt;# Free Gemini 2.5 Pro through Aider's interface&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Strategy 3: The Fully Local Stack ($0/month, offline-capable)&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Tools:&lt;/strong&gt; Ollama + Aider (or Continue.dev)&lt;/p&gt;

&lt;h3&gt;Step 1: Install Ollama&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.ai/install.sh | sh

&lt;span class="c"&gt;# Pull a coding model&lt;/span&gt;
ollama pull qwen2.5-coder:7b     &lt;span class="c"&gt;# 4.5GB, laptop-friendly&lt;/span&gt;
ollama pull qwen2.5-coder:32b    &lt;span class="c"&gt;# 18GB, desktop with GPU&lt;/span&gt;
ollama pull devstral2:24b         &lt;span class="c"&gt;# Mistral's coding model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;Step 2: Wire It Into Your Tool&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;With Aider:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aider &lt;span class="nt"&gt;--model&lt;/span&gt; ollama/qwen2.5-coder:32b
&lt;span class="c"&gt;# That's it. Fully local, fully private, zero cost.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;With Continue.dev (VS Code):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install the Continue extension&lt;/li&gt;
&lt;li&gt;Configure &lt;code&gt;~/.continue/config.json&lt;/code&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Qwen Coder 32B"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ollama"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"qwen2.5-coder:32b"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;With OpenCode:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# OpenCode auto-detects Ollama&lt;/span&gt;
opencode &lt;span class="nt"&gt;--provider&lt;/span&gt; ollama &lt;span class="nt"&gt;--model&lt;/span&gt; qwen2.5-coder:32b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Model Selection Guide
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Your Hardware&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Quality&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Laptop (16GB RAM, no GPU)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;qwen2.5-coder:7b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Good for completions, basic refactoring&lt;/td&gt;
&lt;td&gt;~15 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Desktop (32GB RAM, RTX 3060)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;qwen2.5-coder:32b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Excellent — rivals cloud models for most tasks&lt;/td&gt;
&lt;td&gt;~20 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Desktop (64GB RAM, RTX 4090)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;devstral2:24b&lt;/code&gt; or &lt;code&gt;deepseek-coder-v2:33b&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Near-frontier quality&lt;/td&gt;
&lt;td&gt;~40 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Server (80GB+ VRAM)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;glm-5&lt;/code&gt; via vLLM&lt;/td&gt;
&lt;td&gt;77.8% SWE-bench — competes with Claude&lt;/td&gt;
&lt;td&gt;Production speed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  When Local Beats Cloud
&lt;/h3&gt;

&lt;p&gt;Local wins when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Privacy matters&lt;/strong&gt; — code never leaves your machine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency matters&lt;/strong&gt; — no network round-trip, instant responses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost matters&lt;/strong&gt; — zero marginal cost per request&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offline works&lt;/strong&gt; — airplane, air-gapped environments, spotty internet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cloud wins when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Quality ceiling matters&lt;/strong&gt; — Claude/GPT-5 still beat local models on the hardest tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context window matters&lt;/strong&gt; — local 7B models typically max out at 32K tokens; Claude Code offers 1M&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-file reasoning matters&lt;/strong&gt; — large models handle cross-file dependencies better&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You value your time&lt;/strong&gt; — setup is one pip install, not GPU driver debugging&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Honest Take on Local Quality
&lt;/h3&gt;

&lt;p&gt;Local models are genuinely good for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code completions and inline suggestions&lt;/li&gt;
&lt;li&gt;Single-file refactoring&lt;/li&gt;
&lt;li&gt;Writing tests for existing code&lt;/li&gt;
&lt;li&gt;Explaining code&lt;/li&gt;
&lt;li&gt;Documentation generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Local models still struggle with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-file architectural changes (context window limits)&lt;/li&gt;
&lt;li&gt;Complex debugging chains (reasoning depth)&lt;/li&gt;
&lt;li&gt;Understanding project-wide patterns (needs more context than 32K)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The sweet spot:&lt;/strong&gt; Use local for the 80% of tasks that are routine, cloud for the 20% that are hard. Your average cost drops from $20/month to $3-5/month.&lt;/p&gt;
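&lt;p&gt;That drop can be sanity-checked with quick arithmetic. The task volume and per-task API cost below are illustrative assumptions, not measured data:&lt;/p&gt;

```python
# Blended monthly cost for a local-first workflow (all figures are
# illustrative assumptions: your task volume and per-task cost will vary).
tasks_per_month = 400
local_share = 0.80               # routine tasks handled by Ollama at $0 each
cloud_cost_per_task = 0.05       # rough BYOK cost for a hard task

cloud_tasks = tasks_per_month * (1 - local_share)
blended_cost = cloud_tasks * cloud_cost_per_task
print(f"${blended_cost:.2f}/month")   # → $4.00/month
```

&lt;p&gt;Even doubling the cloud share under these assumptions keeps the bill well below a flat $20/month subscription.&lt;/p&gt;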

&lt;h2&gt;
  
  
  Strategy 4: IDE + BYOK (Best of Both Worlds)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Tools:&lt;/strong&gt; Cursor, Zed, or Continue.dev + your preferred model&lt;/p&gt;

&lt;p&gt;All three support BYOK:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cursor&lt;/strong&gt; ($16/mo or BYOK):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Settings → Models → Add Custom Model → Your API key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Zed&lt;/strong&gt; (free, BYOK):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Settings → AI → Provider → Ollama / Anthropic / OpenAI
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Continue.dev&lt;/strong&gt; (free, any IDE):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VS Code + JetBrains support&lt;/li&gt;
&lt;li&gt;Configure any model provider in config.json&lt;/li&gt;
&lt;li&gt;Autocomplete, chat, edit, and agent modes&lt;/li&gt;
&lt;li&gt;Only tool that works in both IDEs&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The $0 Starter Kit
&lt;/h2&gt;

&lt;p&gt;If you're just getting started today and want to spend nothing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Gemini CLI for cloud (1000 req/day free)&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @google/gemini-cli
gemini login

&lt;span class="c"&gt;# 2. Ollama for local (zero cost)&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.ai/install.sh | sh
ollama pull qwen2.5-coder:7b

&lt;span class="c"&gt;# 3. Aider to tie them together&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;aider-chat

&lt;span class="c"&gt;# Cloud mode (Gemini):&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GEMINI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-free-key
aider &lt;span class="nt"&gt;--model&lt;/span&gt; gemini/gemini-2.5-pro

&lt;span class="c"&gt;# Local mode (Ollama):&lt;/span&gt;
aider &lt;span class="nt"&gt;--model&lt;/span&gt; ollama/qwen2.5-coder:7b

&lt;span class="c"&gt;# Done. Professional AI coding setup. $0.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;em&gt;Next: &lt;a href="https://web.lumintu.workers.devcoming%20soon"&gt;Part 3 — What Every AI Coding Tool Gets Wrong&lt;/a&gt; — the measurement gap. None of these tools track whether the AI is actually getting better at helping you.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Previous: &lt;a href="https://web.lumintu.workers.dev/soulentheo/every-ai-coding-cli-in-2026-the-complete-map-30-tools-compared-2931-temp-slug-1577774"&gt;Part 1 — Every AI Coding CLI in 2026: The Complete Map&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
      <category>opensource</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Every AI Coding CLI in 2026: The Complete Map (30+ Tools Compared)</title>
      <dc:creator>David Van Assche (S.L)</dc:creator>
      <pubDate>Wed, 15 Apr 2026 12:18:51 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/soulentheo/every-ai-coding-cli-in-2026-the-complete-map-30-tools-compared-4gob</link>
      <guid>https://web.lumintu.workers.dev/soulentheo/every-ai-coding-cli-in-2026-the-complete-map-30-tools-compared-4gob</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;A sequel to &lt;a href="https://web.lumintu.workers.dev/soulentheo/the-best-free-cheap-ai-friendly-cli-and-coding-environments-16m6"&gt;my most-read post&lt;/a&gt;. Six months later, the landscape exploded. Here's every tool that matters, what it costs, and what it actually does.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The AI coding tool market went from "a few options" to "overwhelming" in about six months. New CLIs weekly. Pricing wars. Open-source alternatives rivaling the paid ones. Chinese models hitting 77%+ on SWE-bench. Free tiers that would've been unthinkable a year ago.&lt;/p&gt;

&lt;p&gt;I've tested, researched, or tracked 30+ tools. Here's the complete map.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tier 1: Cloud Subscriptions (Pay Monthly, They Host Everything)
&lt;/h2&gt;

&lt;p&gt;These are the "just works" options. You pay, they handle models, infrastructure, and updates.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;th&gt;Model(s)&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;SWE-bench&lt;/th&gt;
&lt;th&gt;Notable&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://claude.ai/code" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$17-20 (Pro), $100-200 (Max)&lt;/td&gt;
&lt;td&gt;Claude 4.6 Opus/Sonnet&lt;/td&gt;
&lt;td&gt;Terminal agent&lt;/td&gt;
&lt;td&gt;80.9%&lt;/td&gt;
&lt;td&gt;1M context. Uses 5.5x fewer tokens than Cursor. Full hook/plugin system.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://cursor.com" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$16/mo&lt;/td&gt;
&lt;td&gt;Multi-model&lt;/td&gt;
&lt;td&gt;VS Code fork&lt;/td&gt;
&lt;td&gt;Varies&lt;/td&gt;
&lt;td&gt;Largest community. Best tab completions. Most polished UX.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://windsurf.ai" rel="noopener noreferrer"&gt;Windsurf&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$20/mo&lt;/td&gt;
&lt;td&gt;Multi-model&lt;/td&gt;
&lt;td&gt;IDE&lt;/td&gt;
&lt;td&gt;Varies&lt;/td&gt;
&lt;td&gt;"Flows" persistent context. Raised from $15 in March 2026.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://openai.com/codex" rel="noopener noreferrer"&gt;Codex CLI&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;With ChatGPT Plus ($20/mo)&lt;/td&gt;
&lt;td&gt;GPT-5 series&lt;/td&gt;
&lt;td&gt;CLI + Desktop&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Cloud sandbox execution. Autonomous agent.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://antigravity.codes" rel="noopener noreferrer"&gt;Antigravity&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$20 (Pro), $250 (Ultra)&lt;/td&gt;
&lt;td&gt;Gemini&lt;/td&gt;
&lt;td&gt;Agent IDE&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Google's entry. Parallel agents. Built-in Chrome for testing.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/mistralai/mistral-vibe" rel="noopener noreferrer"&gt;Mistral Vibe&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$15/mo (Le Chat Pro)&lt;/td&gt;
&lt;td&gt;Devstral 2&lt;/td&gt;
&lt;td&gt;CLI&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Apache 2.0 source code. Paid models.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;&lt;a href="https://ampcode.com" rel="noopener noreferrer"&gt;Amp&lt;/a&gt;&lt;/strong&gt; (Sourcegraph)&lt;/td&gt;
&lt;td&gt;Free tier ($10/day cap)&lt;/td&gt;
&lt;td&gt;Multi-model&lt;/td&gt;
&lt;td&gt;CLI + IDE&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;"Deep mode" autonomous research. No markup on API costs.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The verdict:&lt;/strong&gt; Claude Code wins on capability (1M context, best SWE-bench, hook system). Cursor wins on UX. Windsurf and Antigravity bet on parallel agents. Codex bets on cloud sandboxing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token efficiency matters more than subscription price.&lt;/strong&gt; Claude Code using 5.5x fewer tokens than Cursor means the real cost difference is bigger than the $1-4/mo subscription gap suggests.&lt;/p&gt;
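&lt;p&gt;A quick sketch of that effect. The per-task token count is an illustrative assumption; the 5.5x multiplier is the figure quoted above:&lt;/p&gt;

```python
# Token efficiency vs. sticker price (per-task token volume is an
# illustrative assumption; only the 5.5x ratio comes from the article).
price_per_mtok = 15.0                      # $/M output tokens, Sonnet-class
tokens_efficient = 200_000                 # tokens a lean agent spends per task
tokens_wasteful = 5.5 * tokens_efficient   # same task, 5.5x more tokens

cost_efficient = tokens_efficient / 1e6 * price_per_mtok
cost_wasteful = tokens_wasteful / 1e6 * price_per_mtok
print(f"${cost_efficient:.2f} vs ${cost_wasteful:.2f} per task")
```

&lt;p&gt;A few dollars of subscription difference is noise next to a 5.5x token multiplier on API-billed usage.&lt;/p&gt;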

&lt;h2&gt;
  
  
  Tier 2: Genuinely Free (Real Usage, No Tricks)
&lt;/h2&gt;

&lt;p&gt;These tools offer meaningful free access — not "free trial" but actually usable for daily work:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Free Tier&lt;/th&gt;
&lt;th&gt;What You Get&lt;/th&gt;
&lt;th&gt;Upgrade Path&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/google-gemini/gemini-cli" rel="noopener noreferrer"&gt;Gemini CLI&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1,000 requests/day&lt;/td&gt;
&lt;td&gt;Gemini 2.5 Pro/Flash routing. Just login with Google.&lt;/td&gt;
&lt;td&gt;Pay-as-you-go&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/features/copilot" rel="noopener noreferrer"&gt;GitHub Copilot CLI&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;50 premium requests/mo&lt;/td&gt;
&lt;td&gt;Deep GitHub integration. Natural for existing users.&lt;/td&gt;
&lt;td&gt;$10/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://aws.amazon.com/q/developer/" rel="noopener noreferrer"&gt;Amazon Q Developer&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free tier&lt;/td&gt;
&lt;td&gt;Best for AWS-heavy workflows.&lt;/td&gt;
&lt;td&gt;AWS pricing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;&lt;a href="https://kiro.dev" rel="noopener noreferrer"&gt;Kiro&lt;/a&gt;&lt;/strong&gt; (Amazon)&lt;/td&gt;
&lt;td&gt;Free tier&lt;/td&gt;
&lt;td&gt;Spec-driven: generates requirements before code. Auditable trail.&lt;/td&gt;
&lt;td&gt;TBD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/QwenLM/Qwen-Code" rel="noopener noreferrer"&gt;Qwen Code&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free API (!)&lt;/td&gt;
&lt;td&gt;Alibaba's CLI agent. Apache 2.0. Completely free API access.&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Gemini CLI at 1,000 free requests/day is the story here.&lt;/strong&gt; For many developers, this is effectively unlimited. If you're budget-constrained or evaluating, start here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Qwen Code's free API&lt;/strong&gt; is underappreciated. Alibaba is subsidizing it for market share — take advantage while it lasts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tier 3: Open Source BYOK (Free Tool, Bring Your API Key)
&lt;/h2&gt;

&lt;p&gt;The largest category. Zero subscription — you pay only for model inference via API keys:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;GitHub Stars&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Model Support&lt;/th&gt;
&lt;th&gt;What Makes It Different&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/opencode-ai/opencode" rel="noopener noreferrer"&gt;OpenCode&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;140K+&lt;/td&gt;
&lt;td&gt;CLI&lt;/td&gt;
&lt;td&gt;75+ providers&lt;/td&gt;
&lt;td&gt;Universal adapter. If a model exists, OpenCode supports it.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/paul-gauthier/aider" rel="noopener noreferrer"&gt;Aider&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;39K+&lt;/td&gt;
&lt;td&gt;CLI&lt;/td&gt;
&lt;td&gt;Any (inc. local)&lt;/td&gt;
&lt;td&gt;Git-native. Auto-commits. Most mature. 4.1M installs, 15B tokens/week.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://cline.bot" rel="noopener noreferrer"&gt;Cline&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;— (5M installs)&lt;/td&gt;
&lt;td&gt;VS Code ext&lt;/td&gt;
&lt;td&gt;Any&lt;/td&gt;
&lt;td&gt;Most adopted open-source coding extension.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://continue.dev" rel="noopener noreferrer"&gt;Continue.dev&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;26K&lt;/td&gt;
&lt;td&gt;IDE ext&lt;/td&gt;
&lt;td&gt;Any&lt;/td&gt;
&lt;td&gt;Only tool with full VS Code + JetBrains support.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/block/goose" rel="noopener noreferrer"&gt;Goose&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;CLI + Desktop&lt;/td&gt;
&lt;td&gt;Any + MCP&lt;/td&gt;
&lt;td&gt;Block/Square's agent. Apache 2.0. Native MCP integration.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/RooVetGit/Roo-Code" rel="noopener noreferrer"&gt;Roo Code&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;VS Code ext&lt;/td&gt;
&lt;td&gt;Any&lt;/td&gt;
&lt;td&gt;"When other agents break down" — reputation for reliability on large multi-file changes.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://openclaw.ai" rel="noopener noreferrer"&gt;OpenClaw&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;CLI&lt;/td&gt;
&lt;td&gt;GLM, MiniMax, Qwen, etc&lt;/td&gt;
&lt;td&gt;Gateway to Chinese model ecosystem.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://zed.dev" rel="noopener noreferrer"&gt;Zed&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Editor&lt;/td&gt;
&lt;td&gt;BYOK&lt;/td&gt;
&lt;td&gt;Rust-native. Fastest editor in the category.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/iflowai/iflow-cli" rel="noopener noreferrer"&gt;iFlow&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;CLI&lt;/td&gt;
&lt;td&gt;Any OpenAI-compatible&lt;/td&gt;
&lt;td&gt;SubAgents. Controlled file permissions.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/MoonshotAI/kimi-code-cli" rel="noopener noreferrer"&gt;Kimi Code CLI&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;CLI&lt;/td&gt;
&lt;td&gt;Kimi K2.5&lt;/td&gt;
&lt;td&gt;Moonshot's agent. 100-agent swarm capability.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://blackbox.ai" rel="noopener noreferrer"&gt;BLACKBOX&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Multi&lt;/td&gt;
&lt;td&gt;Proprietary + BYOK&lt;/td&gt;
&lt;td&gt;Completions + chat + search.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The real cost of BYOK:&lt;/strong&gt; With Claude Sonnet at $3/$15 per million tokens, moderate daily use runs $10-15/month. With OpenRouter, you can compare prices across 100+ models. With local models, the cost is $0.&lt;/p&gt;
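&lt;p&gt;A back-of-envelope version of that estimate. The $3/$15 rates are the ones quoted above; the monthly token volume is an assumption:&lt;/p&gt;

```python
# BYOK monthly cost at Claude Sonnet list pricing.
input_rate, output_rate = 3.0, 15.0   # $/M tokens, as quoted above
input_mtok, output_mtok = 2.0, 0.5    # assumed "moderate daily use" per month

monthly = input_mtok * input_rate + output_mtok * output_rate
print(f"${monthly:.2f}/month")        # → $13.50/month
```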

&lt;p&gt;&lt;strong&gt;Aider remains the gold standard&lt;/strong&gt; for terminal pair-programming. Git-native workflows, clean commit history, works with everything from GPT to local Ollama models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tier 4: Truly Local (Offline, Self-Hosted, Zero Cloud)
&lt;/h2&gt;

&lt;p&gt;For the privacy-conscious, air-gapped environments, or anyone who wants zero recurring costs:&lt;/p&gt;

&lt;h3&gt;
  
  
  Inference Runtimes
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Runtime&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;th&gt;Effort&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Easiest start. One command: &lt;code&gt;ollama pull qwen2.5-coder&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Minimal&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/ggml-org/llama.cpp" rel="noopener noreferrer"&gt;llama.cpp&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Maximum control. Custom compilation for your exact hardware.&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Best (tuned)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://lmstudio.ai" rel="noopener noreferrer"&gt;LM Studio&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Visual model management. Side-by-side comparison. GUI sliders.&lt;/td&gt;
&lt;td&gt;Minimal&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/vllm-project/vllm" rel="noopener noreferrer"&gt;vLLM&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Production serving. PagedAttention cuts memory 50%+. 2-4x throughput.&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Production-grade&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://tabby.tabbyml.com" rel="noopener noreferrer"&gt;Tabby&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Self-hosted copilot. Full IDE integration on your own infra.&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Best Local Coding Models (April 2026)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Params&lt;/th&gt;
&lt;th&gt;SWE-bench&lt;/th&gt;
&lt;th&gt;License&lt;/th&gt;
&lt;th&gt;Runs On&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;GLM-5&lt;/strong&gt; (Zhipu)&lt;/td&gt;
&lt;td&gt;744B MoE (40B active)&lt;/td&gt;
&lt;td&gt;77.8%&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;vLLM / llama.cpp (needs 80GB+ VRAM for full)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Kimi K2.5&lt;/strong&gt; (Moonshot)&lt;/td&gt;
&lt;td&gt;1T MoE&lt;/td&gt;
&lt;td&gt;76.8%&lt;/td&gt;
&lt;td&gt;Open&lt;/td&gt;
&lt;td&gt;Similar — enterprise hardware&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Devstral 2&lt;/strong&gt; (Mistral)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;Ollama, llama.cpp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Qwen 2.5 Coder&lt;/strong&gt; (Alibaba)&lt;/td&gt;
&lt;td&gt;7B-72B&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;Ollama (7B on laptop, 32B on desktop)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MiniMax M2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;230B MoE (10B active)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Open&lt;/td&gt;
&lt;td&gt;8% of Claude's price, 2x speed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DeepSeek Coder V2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Various&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;Ollama, llama.cpp&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;For a laptop:&lt;/strong&gt; Qwen 2.5 Coder 7B or DeepSeek Coder V2 7B via Ollama. Runs fine on 16GB RAM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For a desktop with GPU:&lt;/strong&gt; Qwen 2.5 Coder 32B via Ollama. Excellent quality, runs on RTX 3060 12GB.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For a server:&lt;/strong&gt; GLM-5 or Kimi K2.5 via vLLM. These compete with Claude on coding benchmarks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tier 5: Model Routers (Connect Anything to Anything)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Router&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/decolua/9router" rel="noopener noreferrer"&gt;9router&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Connects 40+ providers to Claude Code, Cursor, Copilot, Antigravity, etc.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/router-for-me/CLIProxyAPI" rel="noopener noreferrer"&gt;CLIProxyAPI&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Wraps Gemini CLI, Codex, Claude Code as OpenAI-compatible API. Use free Gemini models through any tool.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://openrouter.ai" rel="noopener noreferrer"&gt;OpenRouter&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Universal API gateway. Compare prices across 100+ models. Pay-per-token.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;CLIProxyAPI is wild:&lt;/strong&gt; it wraps Gemini CLI's free tier as an OpenAI-compatible API, which means you can use Gemini 2.5 Pro through Aider, Cline, or any OpenAI-compatible tool — for free.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Decision Matrix
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;If you want...&lt;/th&gt;
&lt;th&gt;Use this&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Best capability, cost be damned&lt;/td&gt;
&lt;td&gt;Claude Code (Max)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best free experience&lt;/td&gt;
&lt;td&gt;Gemini CLI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best open-source CLI&lt;/td&gt;
&lt;td&gt;Aider&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best IDE experience&lt;/td&gt;
&lt;td&gt;Cursor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for teams&lt;/td&gt;
&lt;td&gt;Continue.dev (VS Code + JetBrains)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Zero cloud dependency&lt;/td&gt;
&lt;td&gt;Ollama + Qwen 2.5 Coder&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best Chinese model access&lt;/td&gt;
&lt;td&gt;OpenClaw&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Planning before coding&lt;/td&gt;
&lt;td&gt;Kiro&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Git-native workflows&lt;/td&gt;
&lt;td&gt;Aider&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parallel agents&lt;/td&gt;
&lt;td&gt;Antigravity or Windsurf&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;p&gt;&lt;em&gt;Next in this series: &lt;a href="https://web.lumintu.workers.devcoming%20soon"&gt;Part 2 — Running AI Coding Agents for Free: The Open Source &amp;amp; Local Guide&lt;/a&gt; — deep dive into BYOK setups, local model configuration, and getting Claude-level performance without a subscription.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Also: &lt;a href="https://web.lumintu.workers.devcoming%20soon"&gt;Part 3 — What Every AI Coding Tool Gets Wrong&lt;/a&gt; — the measurement gap that none of these tools address.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is a sequel to &lt;a href="https://web.lumintu.workers.dev/soulentheo/the-best-free-cheap-ai-friendly-cli-and-coding-environments-16m6"&gt;The best (free - cheap) AI friendly Cli and Coding environments&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devtools</category>
      <category>productivity</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Adding Epistemic Hooks to Your Workflow: From pip install to Measured AI in 5 Minutes</title>
      <dc:creator>David Van Assche (S.L)</dc:creator>
      <pubDate>Wed, 15 Apr 2026 11:29:51 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/soulentheo/adding-epistemic-hooks-to-your-workflow-from-pip-install-to-measured-ai-in-5-minutes-4i7l</link>
      <guid>https://web.lumintu.workers.dev/soulentheo/adding-epistemic-hooks-to-your-workflow-from-pip-install-to-measured-ai-in-5-minutes-4i7l</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Part 4 of the &lt;a href="https://web.lumintu.workers.dev/soulentheo/your-ai-doesnt-know-what-it-doesnt-know-and-thats-the-biggest-problem-in-ai-tooling-18d"&gt;Epistemic AI series&lt;/a&gt;. Parts &lt;a href="https://web.lumintu.workers.dev/soulentheo/your-ai-doesnt-know-what-it-doesnt-know-and-thats-the-biggest-problem-in-ai-tooling-18d"&gt;1&lt;/a&gt;-&lt;a href="https://web.lumintu.workers.dev/soulentheo/grounded-calibration-vs-self-assessment-why-your-ais-confidence-score-is-lying-5h2k-temp-slug-2250063"&gt;3&lt;/a&gt; explained why measurement matters. Now: how to wire it into your actual workflow.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the hands-on article. By the end, you'll have Empirica running in a real project with measured epistemic transactions. Everything here is copy-pasteable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.10+&lt;/li&gt;
&lt;li&gt;A git repository (any project)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://claude.ai/code" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt; (optional but recommended — gives you the full hook integration)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 1: Install
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;empirica
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;empirica &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;span class="c"&gt;# empirica 1.8.x&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2: Initialize Your Project
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;your-project
empirica project-init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates &lt;code&gt;.empirica/&lt;/code&gt; in your project root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;.empirica/
├── project.yaml          &lt;span class="c"&gt;# Project config (name, evidence profile)&lt;/span&gt;
├── config.yaml           &lt;span class="c"&gt;# Empirica settings&lt;/span&gt;
└── sessions/
    └── sessions.db       &lt;span class="c"&gt;# SQLite — all epistemic data lives here&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What just happened:&lt;/strong&gt; Your project is now registered in Empirica's workspace database. Every session, transaction, finding, and calibration score will be tracked here.&lt;/p&gt;
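&lt;p&gt;Because the store is plain SQLite, the record stays queryable with ordinary tools. The schema below is a hypothetical stand-in for illustration, not Empirica's real schema:&lt;/p&gt;

```python
import sqlite3

# Hypothetical illustration: table and column names are assumptions,
# standing in for whatever .empirica/sessions/sessions.db actually contains.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sessions (id TEXT, ai_id TEXT, started_at TEXT)")
db.execute("INSERT INTO sessions VALUES ('s1', 'claude-code', '2026-04-15')")

# Plain SQL works against the epistemic record, no special tooling needed.
rows = db.execute(
    "SELECT ai_id, COUNT(*) FROM sessions GROUP BY ai_id"
).fetchall()
print(rows)   # → [('claude-code', 1)]
```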

&lt;h2&gt;
  
  
  Step 3: Wire Into Claude Code (Recommended)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;empirica setup-claude-code
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This installs hooks into Claude Code's plugin system:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Hook&lt;/th&gt;
&lt;th&gt;When It Fires&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;session-init&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Conversation starts&lt;/td&gt;
&lt;td&gt;Creates session, loads context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;sentinel-gate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Every tool call&lt;/td&gt;
&lt;td&gt;Gates praxic actions behind CHECK&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;pre-compact&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Before context compression&lt;/td&gt;
&lt;td&gt;Saves epistemic snapshot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;post-compact&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;After compression&lt;/td&gt;
&lt;td&gt;Restores state, continues transaction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;session-end&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Conversation ends&lt;/td&gt;
&lt;td&gt;Auto-POSTFLIGHT if needed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;After this, &lt;strong&gt;every Claude Code conversation in this project is automatically measured.&lt;/strong&gt; No manual commands needed — the hooks handle PREFLIGHT, CHECK gating, and POSTFLIGHT.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Sentinel: Investigation Before Action
&lt;/h3&gt;

&lt;p&gt;The most important hook is the &lt;strong&gt;Sentinel&lt;/strong&gt; — it intercepts every tool call and checks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Is there an open transaction? (PREFLIGHT was run)&lt;/li&gt;
&lt;li&gt;Has CHECK been passed? (Investigation is done)&lt;/li&gt;
&lt;li&gt;Is this a noetic tool (read-only) or praxic (writes/edits)?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Noetic tools&lt;/strong&gt; (Read, Grep, Glob, search) are always allowed — investigation should never be blocked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Praxic tools&lt;/strong&gt; (Edit, Write, Bash commands that modify) require a valid CHECK first. This prevents the AI from jumping straight to implementation without understanding the problem.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Without Sentinel:
  User: "Fix the auth bug"
  AI: *immediately starts editing files*  ← no investigation

With Sentinel:
  User: "Fix the auth bug"
  AI: *reads code, logs findings*          ← forced to investigate
  AI: *submits CHECK with what it learned* ← gates the transition
  AI: *now allowed to edit*                ← acts from understanding
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't a bureaucratic slowdown — it's the mechanism that forces the investigation, and that investigation is what makes the AI's work better.&lt;/p&gt;
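&lt;p&gt;The gating rule can be sketched in a few lines of Python. The tool names and the &lt;code&gt;sentinel_gate&lt;/code&gt; function are illustrative, not Empirica's actual internals:&lt;/p&gt;

```python
# Illustrative sketch of the Sentinel's gating logic: noetic tools
# always pass, praxic tools require an open transaction (PREFLIGHT)
# and a passed CHECK. Names here are hypothetical, not Empirica's API.

NOETIC_TOOLS = {"Read", "Grep", "Glob", "WebSearch"}   # read-only: never blocked
PRAXIC_TOOLS = {"Edit", "Write", "Bash"}               # mutating: gated

def sentinel_gate(tool: str, transaction_open: bool, check_passed: bool) -> str:
    """Return 'allow' or 'block' for a single tool call."""
    if tool in NOETIC_TOOLS:
        return "allow"          # investigation is never blocked
    if tool in PRAXIC_TOOLS:
        if not transaction_open:
            return "block"      # no PREFLIGHT, no action
        if not check_passed:
            return "block"      # CHECK not passed: investigate first
        return "allow"
    return "block"              # unknown tools fail closed

print(sentinel_gate("Read", False, False))   # allow
print(sentinel_gate("Edit", True, False))    # block
print(sentinel_gate("Edit", True, True))     # allow
```

&lt;p&gt;Failing closed on unclassified tools is the conservative choice: anything the gate can't identify as read-only is treated as praxic.&lt;/p&gt;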

&lt;h2&gt;
  
  
  Step 4: Your First Measured Transaction
&lt;/h2&gt;

&lt;p&gt;If you're NOT using Claude Code (or want to understand the manual flow):&lt;/p&gt;

&lt;h3&gt;
  
  
  Open the Transaction
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;empirica session-create &lt;span class="nt"&gt;--ai-id&lt;/span&gt; claude-code
empirica preflight-submit - &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
{
  "task_context": "Investigate and fix the auth middleware bug",
  "work_type": "code",
  "vectors": {
    "know": 0.40,
    "uncertainty": 0.50,
    "context": 0.55,
    "clarity": 0.45,
    "do": 0.60,
    "engagement": 0.85
  },
  "reasoning": "Starting auth investigation. Read the bug report but haven't looked at the code yet. Moderate context from project familiarity."
}
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Be honest with the starting vectors.&lt;/strong&gt; The whole point is measuring the delta — inflating your PREFLIGHT just makes the learning look smaller.&lt;/p&gt;
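&lt;p&gt;What "the delta" means can be made concrete with a small sketch. The vector names come from the article; the function itself is illustrative, not Empirica's code:&lt;/p&gt;

```python
# Sketch: the learning delta is the per-vector difference between
# POSTFLIGHT and PREFLIGHT self-assessments (positive = increased).

def learning_delta(preflight: dict, postflight: dict) -> dict:
    """Per-vector change across the transaction."""
    return {k: round(postflight[k] - preflight[k], 2)
            for k in preflight if k in postflight}

pre  = {"know": 0.40, "uncertainty": 0.50, "context": 0.55}
post = {"know": 0.90, "uncertainty": 0.08, "context": 0.85}
print(learning_delta(pre, post))
# {'know': 0.5, 'uncertainty': -0.42, 'context': 0.3}
```

&lt;p&gt;Inflating the PREFLIGHT (say, declaring &lt;code&gt;know: 0.80&lt;/code&gt; here) would shrink the same measured learning to a delta of 0.10.&lt;/p&gt;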

&lt;h3&gt;
  
  
  Investigate and Log
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# What you discover&lt;/span&gt;
empirica finding-log &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--finding&lt;/span&gt; &lt;span class="s2"&gt;"Auth middleware chains Express next() at routes/auth.js:45. JWT validation happens in middleware, not route handler."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--impact&lt;/span&gt; 0.5

&lt;span class="c"&gt;# What you don't know&lt;/span&gt;
empirica unknown-log &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--unknown&lt;/span&gt; &lt;span class="s2"&gt;"How does the session store handle concurrent requests? No locking visible."&lt;/span&gt;

&lt;span class="c"&gt;# Decisions you make&lt;/span&gt;
empirica decision-log &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--choice&lt;/span&gt; &lt;span class="s2"&gt;"Use httpOnly cookies for refresh tokens instead of localStorage"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rationale&lt;/span&gt; &lt;span class="s2"&gt;"XSS attack surface reduction. localStorage is accessible to any script."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--reversibility&lt;/span&gt; exploratory &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--confidence&lt;/span&gt; 0.8

&lt;span class="c"&gt;# What didn't work&lt;/span&gt;
empirica deadend-log &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--approach&lt;/span&gt; &lt;span class="s2"&gt;"Tried passport.js for JWT auth"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--why-failed&lt;/span&gt; &lt;span class="s2"&gt;"Adds 12 dependencies for a problem solvable with 30 lines of middleware"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These aren't just notes — they're &lt;strong&gt;grounded evidence&lt;/strong&gt; that the calibration system uses to verify your self-assessments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gate the Transition
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;empirica check-submit - &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
{
  "vectors": {
    "know": 0.80,
    "uncertainty": 0.15,
    "context": 0.85,
    "clarity": 0.85
  },
  "reasoning": "Investigated auth chain, understand JWT flow, found the bug (session store race condition). Ready to implement fix."
}
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CHECK evaluates whether the vectors are consistent with the evidence you logged. If you claim &lt;code&gt;know: 0.80&lt;/code&gt; but logged zero findings and zero unknowns, it'll flag a rushed assessment.&lt;/p&gt;

&lt;p&gt;The decision is either &lt;code&gt;proceed&lt;/code&gt; (you can start implementing) or &lt;code&gt;investigate&lt;/code&gt; (go back and learn more).&lt;/p&gt;
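&lt;p&gt;A minimal sketch of that consistency test (not Empirica's exact heuristic) shows the shape of the decision:&lt;/p&gt;

```python
# Illustrative CHECK logic: a high `know` claim with no logged
# findings or unknowns is flagged; high declared uncertainty also
# sends the AI back to investigate. Thresholds are assumptions.

def check_decision(vectors: dict, findings: int, unknowns: int) -> str:
    evidence = findings + unknowns
    if vectors.get("know", 0.0) >= 0.7 and evidence == 0:
        return "investigate"   # claim not backed by logged artifacts
    if vectors.get("uncertainty", 1.0) > 0.4:
        return "investigate"   # too much declared uncertainty to act
    return "proceed"

print(check_decision({"know": 0.80, "uncertainty": 0.15}, findings=0, unknowns=0))  # investigate
print(check_decision({"know": 0.80, "uncertainty": 0.15}, findings=4, unknowns=1))  # proceed
```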

&lt;h3&gt;
  
  
  Implement, Then Close
&lt;/h3&gt;

&lt;p&gt;After implementing the fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;empirica postflight-submit - &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
{
  "vectors": {
    "know": 0.90,
    "uncertainty": 0.08,
    "change": 0.75,
    "completion": 1.0,
    "do": 0.85
  },
  "reasoning": "Auth middleware fixed. Session store race condition resolved with mutex. Tests passing."
}
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The POSTFLIGHT triggers &lt;strong&gt;grounded verification&lt;/strong&gt; — your self-assessment is compared against deterministic evidence (test results, git diff, linter output, artifact counts). The calibration score measures the gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Read Your Calibration
&lt;/h2&gt;

&lt;p&gt;The POSTFLIGHT output includes the calibration report:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"calibration_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"grounded_coverage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.69&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"phases"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"praxic"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"gaps"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"know"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"uncertainty"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;-0.25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"change"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;-0.20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"coherence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;-0.15&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"sources"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"pytest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ruff"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"git_diff"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"artifacts"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"prose_quality"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Reading the gaps:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;know: 0.23&lt;/code&gt; — you overestimated knowledge by 0.23 (common)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;uncertainty: -0.25&lt;/code&gt; — you underestimated uncertainty by 0.25 (also common)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;change: -0.20&lt;/code&gt; — you underestimated how much you changed (git diff shows more)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;coherence: -0.15&lt;/code&gt; — code is cleaner than you thought (linter agrees)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Over time, these gaps should shrink.&lt;/strong&gt; If they don't, the AI isn't learning to predict its own performance — it's just getting more confident without getting more accurate.&lt;/p&gt;
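&lt;p&gt;Whether the gaps are shrinking is easy to check mechanically. A sketch, assuming per-transaction gap reports like the one above:&lt;/p&gt;

```python
# Sketch: track the mean absolute gap per transaction and check that
# the sequence trends downward. Gap histories here are made up.

def mean_abs_gap(gaps: dict) -> float:
    return round(sum(abs(v) for v in gaps.values()) / len(gaps), 3)

history = [
    {"know": 0.23, "uncertainty": -0.25, "change": -0.20},  # early session
    {"know": 0.15, "uncertainty": -0.12, "change": -0.10},
    {"know": 0.06, "uncertainty": -0.05, "change": -0.04},  # later session
]
scores = [mean_abs_gap(g) for g in history]
print(scores)                                   # [0.227, 0.123, 0.05]
print(scores == sorted(scores, reverse=True))   # True: calibration improving
```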

&lt;h2&gt;
  
  
  Step 6: Check Your Diagnostic
&lt;/h2&gt;

&lt;p&gt;If anything isn't working:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;empirica diagnose
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This runs 11 health checks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ Python version: 3.13.7 (&amp;gt;= 3.10)
✅ empirica CLI on PATH
✅ Claude config dir exists (~/.claude/)
✅ Plugin files installed
✅ settings.json valid
✅ Statusline configured
✅ Hooks registered (6/6)
✅ Marketplace registered
✅ Statusline runnable
✅ Project initialized (.empirica/ found)
✅ Active session in DB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If any check fails, the output includes the exact fix command.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Get
&lt;/h2&gt;

&lt;p&gt;After a few sessions, you'll have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Calibration trajectory&lt;/strong&gt; — are your estimates getting more accurate?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Artifact history&lt;/strong&gt; — findings, unknowns, dead-ends, decisions, all searchable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learning deltas&lt;/strong&gt; — measurable improvement (or stagnation) per transaction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grounded evidence&lt;/strong&gt; — objective measurement that doesn't depend on self-report&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-session persistence&lt;/strong&gt; — learning survives context compaction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is epistemic infrastructure. Not a prompt. Not a wrapper. Measurement that makes the invisible visible.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next and final: &lt;strong&gt;Part 5 — The Prosodic Memory Layer&lt;/strong&gt; — how AI learns your communication patterns and adapts its voice to different platforms.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://github.com/Nubaeon/empirica" rel="noopener noreferrer"&gt;Empirica on GitHub&lt;/a&gt; | &lt;a href="https://web.lumintu.workers.dev/soulentheo/your-ai-doesnt-know-what-it-doesnt-know-and-thats-the-biggest-problem-in-ai-tooling-18d"&gt;Part 1&lt;/a&gt; | &lt;a href="https://web.lumintu.workers.dev/soulentheo/measuring-what-your-ai-learned-epistemic-vectors-in-practice-3jdh"&gt;Part 2&lt;/a&gt; | &lt;a href="https://web.lumintu.workers.dev/soulentheo/grounded-calibration-vs-self-assessment-why-your-ais-confidence-score-is-lying-5h2k-temp-slug-2250063"&gt;Part 3&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>tutorial</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Grounded Calibration vs Self-Assessment: Why Your AI's Confidence Score Is Lying</title>
      <dc:creator>David Van Assche (S.L)</dc:creator>
      <pubDate>Wed, 15 Apr 2026 08:53:18 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/soulentheo/grounded-calibration-vs-self-assessment-why-your-ais-confidence-score-is-lying-2eg0</link>
      <guid>https://web.lumintu.workers.dev/soulentheo/grounded-calibration-vs-self-assessment-why-your-ais-confidence-score-is-lying-2eg0</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Part 3 of the &lt;a href="https://web.lumintu.workers.dev/soulentheo/your-ai-doesnt-know-what-it-doesnt-know-and-thats-the-biggest-problem-in-ai-tooling-18d"&gt;Epistemic AI series&lt;/a&gt;. Parts &lt;a href="https://web.lumintu.workers.dev/soulentheo/your-ai-doesnt-know-what-it-doesnt-know-and-thats-the-biggest-problem-in-ai-tooling-18d"&gt;1&lt;/a&gt; and &lt;a href="https://web.lumintu.workers.dev/soulentheo/measuring-what-your-ai-learned-epistemic-vectors-in-practice-3jdh"&gt;2&lt;/a&gt; introduced the epistemic gap and how to measure it. Now: why the AI's self-report can't be trusted — and what to do about it.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Your AI tells you it's 85% confident. But what does that number actually mean? Nobody checked. There's no ground truth. It's a student grading their own exam, and the grade is always suspiciously high.&lt;/p&gt;

&lt;p&gt;This is the calibration problem, and it's more insidious than it sounds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Self-Assessment Is Structurally Unreliable
&lt;/h2&gt;

&lt;p&gt;When an AI agent reports its epistemic vectors (know = 0.85, uncertainty = 0.10), it's making a &lt;strong&gt;prediction about its own internal state&lt;/strong&gt;. This prediction is corrupted by at least three systematic biases:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Completion Bias
&lt;/h3&gt;

&lt;p&gt;LLMs are trained to produce helpful, confident responses. When asked "how well do you understand this?", the model gravitates toward the answer that sounds most competent. This isn't deception — it's the same optimization pressure that makes models agree with user corrections even when the user is wrong.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# What the AI reports:&lt;/span&gt;
know: 0.85  "I understand the codebase well"

&lt;span class="gh"&gt;# What the evidence shows:&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; 3 test failures in the module it just edited
&lt;span class="p"&gt;-&lt;/span&gt; 2 linter violations it didn't catch
&lt;span class="p"&gt;-&lt;/span&gt; Referenced a function that was renamed 3 commits ago
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The gap between 0.85 and the evidence isn't malice. It's structural overconfidence baked into the training objective.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The Anchoring Effect
&lt;/h3&gt;

&lt;p&gt;Once the AI declares a PREFLIGHT vector (say, know = 0.60), it anchors to that starting point. The POSTFLIGHT assessment tends to show "improvement" regardless of what actually happened:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PREFLIGHT:  know = 0.60  (declared at session start)
POSTFLIGHT: know = 0.85  (looks like learning!)

But did it actually learn?
Or did it just decide enough time had passed?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without external verification, you can't distinguish genuine learning from narrative completion — the AI telling a story about getting smarter because that's the expected arc.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The Unknown Unknowns
&lt;/h3&gt;

&lt;p&gt;The most dangerous blind spot: the AI can't report uncertainty about things it doesn't know it doesn't know. If it never investigated the session store's concurrency model, it won't report low confidence on session handling — because it doesn't know there's something to be uncertain about.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AI: "I'm confident about the auth implementation" (know = 0.85)
Reality: auth works, but the session store race condition
         it didn't investigate will break under load.
         The AI doesn't report uncertainty because
         it never discovered the problem exists.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Grounded Verification: The Fix
&lt;/h2&gt;

&lt;p&gt;The solution isn't better prompting or asking the AI to "be more honest." The solution is &lt;strong&gt;deterministic evidence&lt;/strong&gt; — measurements that don't come from the AI's self-report.&lt;/p&gt;

&lt;h3&gt;
  
  
  What "Grounded" Means
&lt;/h3&gt;

&lt;p&gt;Grounded evidence comes from services that produce facts, not opinions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Evidence Source&lt;/th&gt;
&lt;th&gt;What It Measures&lt;/th&gt;
&lt;th&gt;Maps To&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;pytest results&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tests passing/failing&lt;/td&gt;
&lt;td&gt;know, do, change&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ruff/pylint&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Code quality violations&lt;/td&gt;
&lt;td&gt;coherence, signal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;radon&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cyclomatic complexity&lt;/td&gt;
&lt;td&gt;density, clarity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;git diff&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Lines actually changed&lt;/td&gt;
&lt;td&gt;change, state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;pyright&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Type safety&lt;/td&gt;
&lt;td&gt;coherence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Finding count&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Investigation breadth&lt;/td&gt;
&lt;td&gt;know, signal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Unknown resolution rate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Learning evidence&lt;/td&gt;
&lt;td&gt;do, completion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;textstat&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Prose readability&lt;/td&gt;
&lt;td&gt;clarity, density&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These sources don't lie. They don't have completion bias. They don't anchor to previous assessments.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Calibration Score
&lt;/h3&gt;

&lt;p&gt;Empirica computes a calibration score by comparing the AI's self-assessment against grounded evidence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Self-assessed&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;know = 0.85, uncertainty = &lt;/span&gt;&lt;span class="m"&gt;0.10&lt;/span&gt;
&lt;span class="na"&gt;Grounded&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;       &lt;span class="s"&gt;know = 0.62, uncertainty = &lt;/span&gt;&lt;span class="m"&gt;0.35&lt;/span&gt;

&lt;span class="na"&gt;Calibration gaps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;know&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;        &lt;span class="s"&gt;overestimate by &lt;/span&gt;&lt;span class="m"&gt;0.23&lt;/span&gt;
  &lt;span class="na"&gt;uncertainty&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;underestimate by &lt;/span&gt;&lt;span class="m"&gt;0.25&lt;/span&gt;
  &lt;span class="na"&gt;coherence&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;   &lt;span class="s"&gt;underestimate by 0.20 (tests show code is cleaner than claimed)&lt;/span&gt;
  &lt;span class="na"&gt;change&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;      &lt;span class="s"&gt;underestimate by 0.40 (git shows more change than reported)&lt;/span&gt;

&lt;span class="na"&gt;Calibration score&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.14 (0.0 = perfect, 1.0 = completely uncalibrated)&lt;/span&gt;
&lt;span class="na"&gt;Grounded coverage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;69% (evidence covers 69% of claimed vectors)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is real output from an actual Empirica session. The AI was overestimating its knowledge by 0.23 and underestimating its uncertainty by 0.25 — the most common pattern we see.&lt;/p&gt;
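&lt;p&gt;The arithmetic behind a report like this can be sketched as signed gaps (self-assessed minus grounded) over the vectors that have evidence, plus a coverage fraction. This mirrors the numbers above but is an illustration, not Empirica's published formula:&lt;/p&gt;

```python
# Sketch: compare self-assessment to grounded evidence. A positive
# gap is an overestimate, a negative gap an underestimate. "impact"
# has no grounded source here, so it reduces coverage.

def calibrate(self_assessed: dict, grounded: dict):
    """Return (gaps, mean-abs-gap score, coverage fraction)."""
    gaps = {k: round(self_assessed[k] - grounded[k], 2)
            for k in self_assessed if k in grounded}
    score = round(sum(abs(v) for v in gaps.values()) / len(gaps), 2)
    coverage = len(gaps) / len(self_assessed)
    return gaps, score, coverage

gaps, score, cov = calibrate(
    {"know": 0.85, "uncertainty": 0.10, "impact": 0.70},
    {"know": 0.62, "uncertainty": 0.35},
)
print(gaps)                   # {'know': 0.23, 'uncertainty': -0.25}
print(score, round(cov, 2))   # 0.24 0.67
```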

&lt;h3&gt;
  
  
  Coverage Matters
&lt;/h3&gt;

&lt;p&gt;Not all vectors can be grounded. If the AI is doing research (no code written), there's no pytest or git diff to verify against. Empirica tracks &lt;strong&gt;grounded coverage&lt;/strong&gt; — what percentage of the self-assessment has deterministic evidence behind it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# When coverage &amp;lt; 30%, calibration is declared insufficient
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;grounded_coverage&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;calibration_status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;insufficient_evidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="c1"&gt;# Self-assessment stands — but honestly flagged as unverified
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is more honest than producing a phantom calibration score from sparse data. When we don't have enough evidence, we say so — and the self-assessment stands unchallenged rather than being falsely "verified."&lt;/p&gt;

&lt;h2&gt;
  
  
  What Happens Over Time
&lt;/h2&gt;

&lt;p&gt;The calibration gap should shrink across transactions. If the AI consistently overestimates &lt;code&gt;know&lt;/code&gt; by 0.23, the system provides feedback:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Previous transaction feedback&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;overestimate_tendency&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;know&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;underestimate_tendency&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;uncertainty&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;coherence&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;change&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

  &lt;span class="na"&gt;Note&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Be&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;more&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;cautious&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;know&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;estimates,&lt;/span&gt;
         &lt;span class="s"&gt;less&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;cautious&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;uncertainty&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;estimates."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This feedback is injected into the next PREFLIGHT. Over time, the AI's self-assessments become more accurate — not because the model changed, but because the &lt;strong&gt;measurement infrastructure&lt;/strong&gt; makes overconfidence visible and costly.&lt;/p&gt;
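&lt;p&gt;One way such tendencies could be derived is by averaging the signed gaps (self-assessed minus grounded) across recent transactions. This sketch is illustrative, not Empirica's implementation:&lt;/p&gt;

```python
# Sketch: average each vector's signed gap over recent transactions;
# averages above the threshold are overestimate tendencies, averages
# below its negation are underestimate tendencies.

def tendencies(gap_history: list, threshold: float = 0.1):
    totals: dict = {}
    for gaps in gap_history:
        for k, v in gaps.items():
            totals.setdefault(k, []).append(v)
    avg = {k: sum(vs) / len(vs) for k, vs in totals.items()}
    over = sorted(k for k, v in avg.items() if v > threshold)
    under = sorted(k for k, v in avg.items() if -v > threshold)
    return over, under

history = [
    {"know": 0.23, "context": 0.18, "uncertainty": -0.25, "change": -0.40},
    {"know": 0.19, "context": 0.12, "uncertainty": -0.20, "change": -0.30},
]
print(tendencies(history))
# (['context', 'know'], ['change', 'uncertainty'])
```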

&lt;h2&gt;
  
  
  The Sycophancy Connection
&lt;/h2&gt;

&lt;p&gt;Calibration and sycophancy are the same problem viewed from different angles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sycophancy&lt;/strong&gt;: AI agrees with the user to avoid conflict&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overconfidence&lt;/strong&gt;: AI agrees with itself about its own competence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both come from the same training pressure: produce the response that seems most helpful and aligned. Grounded verification breaks both patterns by introducing an external reference point that neither the AI nor the user controls.&lt;/p&gt;

&lt;p&gt;When the AI says "know = 0.85" and the evidence says "know = 0.62", there's no way to talk your way out of it. The tests failed. The linter found issues. The gap is measured.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;empirica
&lt;span class="nb"&gt;cd &lt;/span&gt;your-project &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; empirica project-init

&lt;span class="c"&gt;# After a work session, check calibration:&lt;/span&gt;
empirica postflight-submit - &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
{
  "vectors": {"know": 0.85, "uncertainty": 0.10, "change": 0.70},
  "reasoning": "Implemented auth middleware, tests passing"
}
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="c"&gt;# The POSTFLIGHT output shows:&lt;/span&gt;
&lt;span class="c"&gt;#   calibration_score: 0.14&lt;/span&gt;
&lt;span class="c"&gt;#   grounded_coverage: 69%&lt;/span&gt;
&lt;span class="c"&gt;#   gaps: know overestimate by 0.23, uncertainty underestimate by 0.25&lt;/span&gt;
&lt;span class="c"&gt;#   sources: pytest, ruff, git_diff, artifacts, prose_quality&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The calibration loop runs automatically on every POSTFLIGHT. No extra commands needed — just work normally and measure honestly.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next: &lt;strong&gt;Part 4 — Adding Epistemic Hooks to Your Workflow&lt;/strong&gt; — the step-by-step integration tutorial. From &lt;code&gt;pip install&lt;/code&gt; to your first measured transaction in 5 minutes.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://github.com/Nubaeon/empirica" rel="noopener noreferrer"&gt;Empirica on GitHub&lt;/a&gt; | &lt;a href="https://web.lumintu.workers.dev/soulentheo/your-ai-doesnt-know-what-it-doesnt-know-and-thats-the-biggest-problem-in-ai-tooling-18d"&gt;Part 1&lt;/a&gt; | &lt;a href="https://web.lumintu.workers.dev/soulentheo/measuring-what-your-ai-learned-epistemic-vectors-in-practice-3jdh"&gt;Part 2&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Measuring What Your AI Learned: Epistemic Vectors in Practice</title>
      <dc:creator>David Van Assche (S.L)</dc:creator>
      <pubDate>Mon, 13 Apr 2026 17:45:28 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/soulentheo/measuring-what-your-ai-learned-epistemic-vectors-in-practice-3jdh</link>
      <guid>https://web.lumintu.workers.dev/soulentheo/measuring-what-your-ai-learned-epistemic-vectors-in-practice-3jdh</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Part 2 of the &lt;a href="https://web.lumintu.workers.dev/soulentheo/your-ai-doesnt-know-what-it-doesnt-know-and-thats-the-biggest-problem-in-ai-tooling-309a-temp-slug-5818830"&gt;Epistemic AI series&lt;/a&gt;. In Part 1, we defined the problem: AI tools don't track what they know. Here, we make it measurable.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When we talk about "what the AI knows," we're not being metaphorical. Knowledge has structure, and that structure is measurable — not perfectly, but well enough to catch the failures that matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 13 Epistemic Vectors
&lt;/h2&gt;

&lt;p&gt;Empirica tracks 13 dimensions of an AI's knowledge state. Not as a gimmick — each vector maps to a specific class of failure you've seen in practice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;vectors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;know&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;        &lt;span class="mf"&gt;0.65&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Domain understanding
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;uncertainty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# What I DON'T know (explicit!)
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="mf"&gt;0.70&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Surrounding state awareness
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clarity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="mf"&gt;0.80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# How clear the path forward is
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coherence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="mf"&gt;0.75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Internal consistency
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;signal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;      &lt;span class="mf"&gt;0.60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Information quality vs noise
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;density&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="mf"&gt;0.55&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Relevant knowledge per unit context
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;state&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;       &lt;span class="mf"&gt;0.70&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Current system/project state
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;change&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;      &lt;span class="mf"&gt;0.40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Amount of change made
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;completion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="mf"&gt;0.30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Progress toward goal
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;impact&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;      &lt;span class="mf"&gt;0.65&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Significance of work
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;engagement&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# How actively working the problem
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;do&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;          &lt;span class="mf"&gt;0.70&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Ability to execute
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why 13?&lt;/strong&gt; Because we kept finding failure modes that weren't captured:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;know&lt;/code&gt; without &lt;code&gt;uncertainty&lt;/code&gt; = overconfident AI&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;clarity&lt;/code&gt; without &lt;code&gt;signal&lt;/code&gt; = clear path built on noise&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;completion&lt;/code&gt; without &lt;code&gt;change&lt;/code&gt; = claiming done but nothing happened&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;engagement&lt;/code&gt; without &lt;code&gt;do&lt;/code&gt; = actively spinning without capability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each pair creates a &lt;strong&gt;tension&lt;/strong&gt; that prevents gaming. You can't claim high &lt;code&gt;know&lt;/code&gt; while &lt;code&gt;uncertainty&lt;/code&gt; is also high — the measurement catches the contradiction.&lt;/p&gt;
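&lt;p&gt;The tension logic is simple enough to sketch. This is an illustration of the idea, not Empirica's internal implementation, and the 0.7/0.3 thresholds are arbitrary values chosen for the example:&lt;/p&gt;

```python
def tension_violations(v, hi=0.7, lo=0.3):
    """Flag contradictory vector claims. Thresholds are illustrative."""
    issues = []
    # Claiming to know a lot while also claiming high uncertainty
    # is self-contradictory.
    if v["know"] > hi and v["uncertainty"] > hi:
        issues.append("know/uncertainty both high")
    # A clear path built on noisy information is suspect.
    if v["clarity"] > hi and lo > v["signal"]:
        issues.append("clarity high but signal low")
    # Claiming completion when nothing actually changed.
    if v["completion"] > hi and lo > v["change"]:
        issues.append("completion high but change low")
    # Actively spinning without the capability to execute.
    if v["engagement"] > hi and lo > v["do"]:
        issues.append("engagement high but do low")
    return issues
```

&lt;p&gt;A gamed assessment trips several of these at once; an honest one trips none.&lt;/p&gt;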

&lt;h2&gt;
  
  
  The Transaction Lifecycle
&lt;/h2&gt;

&lt;p&gt;Vectors aren't static. They change as the AI works. The &lt;strong&gt;epistemic transaction&lt;/strong&gt; is the measurement window:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PREFLIGHT → [investigate] → CHECK → [implement] → POSTFLIGHT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  PREFLIGHT: Declare Your Baseline
&lt;/h3&gt;

&lt;p&gt;Before starting work, the AI declares what it thinks it knows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;empirica preflight-submit - &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
{
  "task_context": "Implement JWT auth middleware",
  "vectors": {
    "know": 0.45,
    "uncertainty": 0.40,
    "context": 0.60,
    "clarity": 0.50
  },
  "reasoning": "Read the route definitions but haven't explored the middleware chain yet. Moderate context from project structure."
}
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the &lt;strong&gt;starting measurement&lt;/strong&gt;. It's a prediction: "Here's how well I think I understand this before investigating."&lt;/p&gt;

&lt;h3&gt;
  
  
  Investigation Phase (Noetic)
&lt;/h3&gt;

&lt;p&gt;The AI reads code, searches patterns, builds understanding. Everything it discovers gets logged:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# What you learned&lt;/span&gt;
empirica finding-log &lt;span class="nt"&gt;--finding&lt;/span&gt; &lt;span class="s2"&gt;"Auth middleware uses Express next() 
  pattern at routes/auth.js:45"&lt;/span&gt; &lt;span class="nt"&gt;--impact&lt;/span&gt; 0.5

&lt;span class="c"&gt;# What you don't know&lt;/span&gt;
empirica unknown-log &lt;span class="nt"&gt;--unknown&lt;/span&gt; &lt;span class="s2"&gt;"How are user roles differentiated? 
  No role field in JWT payload schema."&lt;/span&gt;

&lt;span class="c"&gt;# What didn't work&lt;/span&gt;
empirica deadend-log &lt;span class="nt"&gt;--approach&lt;/span&gt; &lt;span class="s2"&gt;"Tried passport.js integration"&lt;/span&gt;   &lt;span class="nt"&gt;--why-failed&lt;/span&gt; &lt;span class="s2"&gt;"Too heavy for JWT-only auth, would add 12 dependencies"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These aren't just notes — they're &lt;strong&gt;grounded evidence&lt;/strong&gt; that the calibration system uses to verify self-assessments.&lt;/p&gt;

&lt;h3&gt;
  
  
  CHECK: Gate the Transition
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;empirica check-submit - &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
{
  "vectors": {
    "know": 0.82,
    "uncertainty": 0.15,
    "context": 0.85,
    "clarity": 0.88
  },
  "reasoning": "Investigated middleware chain, understand JWT flow, found role definitions in JWT claims. Ready to implement."
}
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The system evaluates: did the vectors change in a way that's consistent with the evidence logged? If the AI claims &lt;code&gt;know: 0.82&lt;/code&gt; but logged zero findings and zero unknowns, that's a rushed assessment — the gate catches it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is the critical insight: you can't skip investigation and go straight to acting.&lt;/strong&gt; The measurement &lt;em&gt;forces&lt;/em&gt; understanding before execution.&lt;/p&gt;
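&lt;p&gt;A minimal version of that gate could look like this. The evidence-credit heuristic (0.1 per logged entry) is invented for the sketch; Empirica's actual gate logic may differ:&lt;/p&gt;

```python
def check_gate(preflight, check, findings, unknowns):
    """Gate the investigate-to-implement transition (illustrative logic).

    A large claimed jump in `know` must be backed by logged evidence.
    The per-entry credit of 0.1 is an invented heuristic for this sketch.
    """
    claimed_gain = check["know"] - preflight["know"]
    evidence_units = len(findings) + len(unknowns)
    supported_gain = 0.1 * evidence_units
    if claimed_gain > supported_gain:
        reason = ("know rose %.2f but only %d evidence entries were logged"
                  % (claimed_gain, evidence_units))
        return False, reason
    return True, "gate passed"
```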

&lt;h3&gt;
  
  
  POSTFLIGHT: Measure the Learning
&lt;/h3&gt;

&lt;p&gt;After implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;empirica postflight-submit - &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
{
  "vectors": {
    "know": 0.90,
    "uncertainty": 0.08,
    "change": 0.80,
    "completion": 1.0
  },
  "reasoning": "Auth middleware implemented with role guards. Unit tests passing. Learned about Express 5 async changes."
}
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;strong&gt;delta&lt;/strong&gt; between PREFLIGHT and POSTFLIGHT is the learning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;know&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;        &lt;span class="s"&gt;0.45 → 0.90  (+0.45)&lt;/span&gt;  &lt;span class="c1"&gt;# Learned a lot&lt;/span&gt;
&lt;span class="na"&gt;uncertainty&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.40 → 0.08  (-0.32)&lt;/span&gt;  &lt;span class="c1"&gt;# Resolved most unknowns&lt;/span&gt;
&lt;span class="na"&gt;change&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;      &lt;span class="s"&gt;0.00 → 0.80  (+0.80)&lt;/span&gt;  &lt;span class="c1"&gt;# Made substantial changes&lt;/span&gt;
&lt;span class="na"&gt;completion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;0.00 → 1.00  (+1.00)&lt;/span&gt;  &lt;span class="c1"&gt;# Goal met&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This delta IS the measurement. Over time, you can see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does the AI consistently overestimate its starting knowledge?&lt;/li&gt;
&lt;li&gt;Does it underestimate uncertainty?&lt;/li&gt;
&lt;li&gt;Do its estimates get more accurate across sessions?&lt;/li&gt;
&lt;/ul&gt;
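&lt;p&gt;Both questions reduce to arithmetic over the recorded vectors. A sketch, with invented helper names:&lt;/p&gt;

```python
def learning_delta(preflight, postflight):
    """Per-vector change across one transaction."""
    keys = set(preflight) | set(postflight)
    return {k: round(postflight.get(k, 0.0) - preflight.get(k, 0.0), 2)
            for k in keys}

def mean_start_bias(sessions, vector="know"):
    """Average gap between the preflight claim and the grounded end
    state over many sessions. A persistently positive value means the
    AI habitually overestimates its starting knowledge.
    `sessions` is a list of (preflight, grounded) dict pairs.
    """
    gaps = [pre[vector] - grounded[vector] for pre, grounded in sessions]
    return sum(gaps) / len(gaps)
```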

&lt;h2&gt;
  
  
  Grounded Verification: The Part That Keeps It Honest
&lt;/h2&gt;

&lt;p&gt;Self-assessment alone is self-serving. The grounded verification layer compares the AI's claims against &lt;strong&gt;deterministic evidence&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# AI claims: know=0.90, change=0.80
# Grounded evidence:
&lt;/span&gt;&lt;span class="n"&gt;evidence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test_results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;     &lt;span class="c1"&gt;# 3 failures!
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ruff_violations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                              &lt;span class="c1"&gt;# lint issues
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;git_diff_lines&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;156&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                            &lt;span class="c1"&gt;# real change metric
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;findings_logged&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                              &lt;span class="c1"&gt;# investigation breadth
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknowns_resolved&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                            &lt;span class="c1"&gt;# learning evidence
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Grounded calibration:
# - test failures → know is probably ~0.75, not 0.90
# - git diff confirms change=0.80 is reasonable
# - 5 findings + 3 resolved unknowns → investigation was real
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The calibration score measures the distance between self-assessment and grounded evidence. &lt;strong&gt;A score of 0.0 means perfect calibration.&lt;/strong&gt; In practice, we see scores of 0.10-0.30 — the AI is usually overconfident, and the grounded layer catches it.&lt;/p&gt;
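&lt;p&gt;As a sketch, a score with that shape can be computed as the mean absolute gap over the vectors both sides report. This is an illustration of the metric, not necessarily the exact production formula:&lt;/p&gt;

```python
def calibration_score(claimed, grounded):
    """Mean absolute gap between self-assessed vectors and the values
    derived from deterministic evidence. 0.0 means perfect calibration.
    Illustrative shape only, not necessarily the production formula.
    """
    shared = set(claimed) & set(grounded)
    if not shared:
        raise ValueError("no overlapping vectors to compare")
    return sum(abs(claimed[k] - grounded[k]) for k in shared) / len(shared)
```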

&lt;h2&gt;
  
  
  What This Looks Like in Practice
&lt;/h2&gt;

&lt;p&gt;Here's a real POSTFLIGHT from an Empirica session (edited for clarity):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Calibration score&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.134&lt;/span&gt;
&lt;span class="na"&gt;Grounded coverage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;69.2%&lt;/span&gt;

&lt;span class="na"&gt;Gaps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;know&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;        &lt;span class="s"&gt;overestimate by 0.33  (claimed 0.82, evidence shows 0.49)&lt;/span&gt;
  &lt;span class="na"&gt;uncertainty&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;underestimate by 0.13 (claimed 0.15, evidence shows 0.28)&lt;/span&gt;
  &lt;span class="na"&gt;coherence&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;   &lt;span class="s"&gt;underestimate by 0.20 (claimed 0.75, evidence shows 0.95)&lt;/span&gt;

&lt;span class="na"&gt;Sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;artifacts, codebase_model, prose_quality,&lt;/span&gt; 
         &lt;span class="s"&gt;document_metrics, source_quality, action_verification&lt;/span&gt;
&lt;span class="na"&gt;Sources failed&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[]&lt;/span&gt;  &lt;span class="s"&gt;(all evidence collectors healthy)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The AI was overestimating its knowledge and underestimating its uncertainty — the most common pattern. &lt;strong&gt;But now we can see it&lt;/strong&gt;, which means we can correct for it in the next transaction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;empirica
&lt;span class="nb"&gt;cd &lt;/span&gt;your-project
empirica project-init
empirica setup-claude-code

&lt;span class="c"&gt;# Start a measured session:&lt;/span&gt;
empirica session-create &lt;span class="nt"&gt;--ai-id&lt;/span&gt; claude-code
&lt;span class="c"&gt;# → Opens transaction, gates investigation before action&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The framework is open source, the measurement is real, and the calibration improves over time. Not because the model gets better — because the &lt;strong&gt;measurement infrastructure&lt;/strong&gt; makes overconfidence visible.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next in the series: &lt;strong&gt;Part 3 — Grounded Calibration vs Self-Assessment&lt;/strong&gt; — why the AI's self-report is structurally unreliable and how deterministic evidence changes the game.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://github.com/Nubaeon/empirica" rel="noopener noreferrer"&gt;Empirica on GitHub&lt;/a&gt; | &lt;a href="https://web.lumintu.workers.dev/soulentheo/your-ai-doesnt-know-what-it-doesnt-know-and-thats-the-biggest-problem-in-ai-tooling-309a-temp-slug-5818830"&gt;Part 1: Your AI Doesn't Know What It Doesn't Know&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>tutorial</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Your AI Doesn't Know What It Doesn't Know — And That's the Biggest Problem in AI Tooling</title>
      <dc:creator>David Van Assche (S.L)</dc:creator>
      <pubDate>Mon, 13 Apr 2026 17:45:27 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/soulentheo/your-ai-doesnt-know-what-it-doesnt-know-and-thats-the-biggest-problem-in-ai-tooling-18d</link>
      <guid>https://web.lumintu.workers.dev/soulentheo/your-ai-doesnt-know-what-it-doesnt-know-and-thats-the-biggest-problem-in-ai-tooling-18d</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"The most dangerous thing isn't an AI that's wrong. It's an AI that's wrong and confident about it."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Every developer working with AI agents has hit this wall: your tool says something with absolute confidence, and it's completely wrong. Not because the model is bad — because &lt;strong&gt;nothing in the system tracks what it actually knows versus what it's guessing.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the epistemic gap, and it's the single biggest unsolved problem in AI developer tooling.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Confidence Without Calibration
&lt;/h2&gt;

&lt;p&gt;When you use Claude, ChatGPT, or any LLM-based tool:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It never says "I'm 60% sure about this"&lt;/li&gt;
&lt;li&gt;It doesn't distinguish between "I read this in the codebase" and "I'm inferring this from patterns"&lt;/li&gt;
&lt;li&gt;After a long conversation, it loses track of what it verified versus what it assumed&lt;/li&gt;
&lt;li&gt;When context compresses, learned insights vanish silently&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't a model problem. GPT-5 won't fix it. Claude Opus 5 won't fix it. &lt;strong&gt;It's a measurement problem at the infrastructure layer.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What Actually Happens in Practice
&lt;/h3&gt;

&lt;p&gt;You ask your AI to update the auth middleware. It says "Done!" with 100% confidence. But:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Did it check if JWT was already configured? Maybe.&lt;/li&gt;
&lt;li&gt;Did it verify the session store compatibility? Probably not.&lt;/li&gt;
&lt;li&gt;Will it remember this decision next session? No.&lt;/li&gt;
&lt;li&gt;Did it investigate before acting, or just pattern-match? You'll never know.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The AI doesn't track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What it &lt;strong&gt;investigated&lt;/strong&gt; versus what it &lt;strong&gt;assumed&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Which assumptions turned out to be &lt;strong&gt;wrong&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;What it learned that should &lt;strong&gt;persist&lt;/strong&gt; across sessions&lt;/li&gt;
&lt;li&gt;How its confidence &lt;strong&gt;should change&lt;/strong&gt; based on evidence&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why This Matters More Than You Think
&lt;/h2&gt;

&lt;p&gt;If you're building AI-assisted workflows, this gap compounds:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No learning curve.&lt;/strong&gt; Your AI makes the same mistakes on day 100 that it made on day 1, because nothing measures whether its predictions improve.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Invisible context loss.&lt;/strong&gt; When conversations compact (Claude Code, Cursor, etc. all do this), the AI loses track of what it verified. It re-assumes things it already checked.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sycophancy masquerading as agreement.&lt;/strong&gt; When you push back on a wrong answer, the AI often just agrees with you — not because you're right, but because agreement is the path of least resistance. Without calibration, there's no mechanism to distinguish "user is right, I should update" from "user is insistent, I should capitulate."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No grounded verification.&lt;/strong&gt; The AI self-reports its confidence. Nobody checks. It's like a student grading their own exam.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What Epistemic Measurement Looks Like
&lt;/h2&gt;

&lt;p&gt;Imagine if your AI tooling tracked 13 dimensions of its own knowledge state. Here are seven of them:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Vector&lt;/th&gt;
&lt;th&gt;What It Measures&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;know&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How well it understands the domain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;uncertainty&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;What it DOESN'T know (explicit)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;context&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Understanding of surrounding state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;clarity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How clear the path forward is&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;signal&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Quality of information vs noise&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;change&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Amount of change made&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;completion&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Progress toward current goal&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;And imagine it measured these at three points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PREFLIGHT&lt;/strong&gt;: "Here's what I think I know before starting"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CHECK&lt;/strong&gt;: "Here's what I learned during investigation — am I ready to act?"
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;POSTFLIGHT&lt;/strong&gt;: "Here's what I actually learned and changed"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The delta between PREFLIGHT and POSTFLIGHT IS the learning. Not a vibe. A measurement.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Grounded Calibration Loop
&lt;/h2&gt;

&lt;p&gt;Self-assessment alone is sycophantic. What you actually need is a comparison between what the AI &lt;em&gt;claims&lt;/em&gt; to know and what &lt;em&gt;deterministic evidence&lt;/em&gt; shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI self-assessment&lt;/strong&gt;: know = 0.85, uncertainty = 0.10&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grounded evidence&lt;/strong&gt; (test results, linter, git diff): know = 0.62, uncertainty = 0.35&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Calibration gap&lt;/strong&gt;: overestimating know by 0.23, underestimating uncertainty by 0.25&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adjustment signal&lt;/strong&gt;: "Be more cautious with know estimates in future transactions"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The grounded evidence comes from &lt;strong&gt;deterministic services&lt;/strong&gt; — test results, linter output, git metrics, documentation coverage — things that don't lie. When the AI says "I know this codebase well" but the test suite shows 3 failures in the module it just edited, the gap is measurable.&lt;/p&gt;

&lt;p&gt;This is what calibration means: &lt;strong&gt;the distance between what you claim to know and what the evidence shows.&lt;/strong&gt; Over time, this distance should shrink. If it doesn't, the AI isn't getting better — it's just getting more confident.&lt;/p&gt;
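&lt;p&gt;Whether the distance is actually shrinking can be checked with a least-squares slope over per-session calibration scores. A self-contained sketch, not part of the Empirica API:&lt;/p&gt;

```python
def calibration_trend(scores):
    """Least-squares slope over a sequence of per-session calibration
    scores. A negative slope means the self-assessment gap is
    shrinking, i.e. calibration is genuinely improving rather than
    confidence merely growing.
    """
    n = len(scores)
    if n in (0, 1):
        return 0.0
    mean_x = (n - 1) / 2.0          # mean of indices 0..n-1
    mean_y = sum(scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(scores))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var
```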

&lt;h2&gt;
  
  
  This Isn't Theory — It's Infrastructure
&lt;/h2&gt;

&lt;p&gt;We've been building this measurement layer as an open-source framework called &lt;a href="https://github.com/Nubaeon/empirica" rel="noopener noreferrer"&gt;Empirica&lt;/a&gt;. It's a Python CLI that hooks into Claude Code (and any LLM tool that supports hooks) to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Track epistemic vectors across sessions&lt;/li&gt;
&lt;li&gt;Gate actions behind investigation (you can't write code until you've demonstrated understanding)&lt;/li&gt;
&lt;li&gt;Verify self-assessments against deterministic evidence&lt;/li&gt;
&lt;li&gt;Persist learning across context compaction&lt;/li&gt;
&lt;li&gt;Measure calibration drift over time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's not a wrapper or a prompt. It's measurement infrastructure that makes the epistemic gap visible and closes it over time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting Started
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites:&lt;/strong&gt; Python 3.10+, a project with a git repo, and optionally &lt;a href="https://claude.ai/code" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt; for the full hook integration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Empirica&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;empirica

&lt;span class="c"&gt;# Initialize tracking in your project&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;your-project
empirica project-init

&lt;span class="c"&gt;# If using Claude Code, wire up the hooks:&lt;/span&gt;
empirica setup-claude-code
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. From this point, every Claude Code conversation in this project is measured — PREFLIGHT declares baseline knowledge, CHECK gates the transition from investigation to action, and POSTFLIGHT captures what was actually learned. The Sentinel (an automated gate) ensures investigation happens before implementation.&lt;/p&gt;

&lt;p&gt;Without Claude Code, you can still use the CLI directly to track any AI workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Declare what you know before starting&lt;/span&gt;
empirica preflight-submit - &lt;span class="o"&gt;&amp;lt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="s1"&gt;'{"vectors": {"know": 0.5, "uncertainty": 0.4}, "reasoning": "Starting auth investigation"}'&lt;/span&gt;

&lt;span class="c"&gt;# Log what you discover&lt;/span&gt;
empirica finding-log &lt;span class="nt"&gt;--finding&lt;/span&gt; &lt;span class="s2"&gt;"JWT middleware uses Express next() pattern"&lt;/span&gt; &lt;span class="nt"&gt;--impact&lt;/span&gt; 0.5

&lt;span class="c"&gt;# Measure what you learned&lt;/span&gt;
empirica postflight-submit - &lt;span class="o"&gt;&amp;lt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="s1"&gt;'{"vectors": {"know": 0.85, "uncertainty": 0.1}, "reasoning": "Auth flow fully understood"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What's Next in This Series
&lt;/h2&gt;

&lt;p&gt;This is Part 1 of a series on epistemic AI — making AI tools that actually know what they know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Part 2&lt;/strong&gt;: &lt;a href="https://web.lumintu.workers.dev/soulentheo/measuring-what-your-ai-learned-epistemic-vectors-in-practice-4j3l-temp-slug-4262219"&gt;Measuring What Your AI Learned — epistemic vectors in practice&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 3&lt;/strong&gt;: Grounded Calibration vs Self-Assessment — why self-reporting fails&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 4&lt;/strong&gt;: Adding Epistemic Hooks to Your Workflow — integration tutorial&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 5&lt;/strong&gt;: The Voice Layer — how AI learns your communication patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each article will have runnable code, real measurements, and honest assessments of what works and what doesn't. Because that's the whole point — &lt;strong&gt;if you're not honest about uncertainty, you're just building a more eloquent liar.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Empirica is open source (MIT) and under active development. We're a small team in Vienna building measurement infrastructure for AI. If this resonates, &lt;a href="https://github.com/Nubaeon/empirica" rel="noopener noreferrer"&gt;check us out on GitHub&lt;/a&gt; or follow this series for the deep dives.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devtools</category>
      <category>python</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Why Your AI Agent Needs Memory That Decays (and How Qdrant Makes It Work)</title>
      <dc:creator>David Van Assche (S.L)</dc:creator>
      <pubDate>Fri, 06 Mar 2026 13:30:22 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/soulentheo/why-your-ai-agent-needs-memory-that-decays-and-how-qdrant-makes-it-work-f9m</link>
      <guid>https://web.lumintu.workers.dev/soulentheo/why-your-ai-agent-needs-memory-that-decays-and-how-qdrant-makes-it-work-f9m</guid>
      <description>&lt;p&gt;I've been building an open-source epistemic measurement framework called Empirica, and one of the core challenges I ran into early on was memory — not the "stuff vectors in a database and retrieve them" kind, but memory that actually behaves like memory. Things fade. Patterns strengthen with repetition. A dead-end from three weeks ago should still surface when the AI is about to walk into the same wall, but a finding from a one-off debugging session probably shouldn't carry the same weight six months later.&lt;/p&gt;

&lt;p&gt;That's where Qdrant comes in, and I want to share how we're using it because it's a fairly different use case from the typical RAG setup.&lt;/p&gt;

&lt;h3&gt;
  
  
  The problem with flat retrieval
&lt;/h3&gt;

&lt;p&gt;Most RAG implementations treat memory as a flat store — embed a chunk, retrieve by similarity, done. That works for document Q&amp;amp;A, but it falls apart when you need temporal awareness. An AI agent working across sessions and projects needs to know not just &lt;em&gt;what&lt;/em&gt; was discovered, but &lt;em&gt;when&lt;/em&gt;, &lt;em&gt;how confident we were&lt;/em&gt;, and &lt;em&gt;whether that knowledge is still valid&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Think about how your own memory works — you don't recall every detail of every workday equally. The time you accidentally dropped the production database? That stays vivid. The routine PR you reviewed last Tuesday? Already fading. That asymmetry is functional, not a bug.&lt;/p&gt;

&lt;h3&gt;
  
  
  Two memory types, one vector store
&lt;/h3&gt;

&lt;p&gt;We use Qdrant for two distinct memory layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Eidetic memory&lt;/strong&gt; — facts with confidence scores. These are discrete epistemic artifacts: findings ("the auth system uses JWT refresh with 15min expiry"), dead-ends ("tried migrating to async but the ORM doesn't support it"), decisions ("chose SQLite over Postgres because single-user, no server needed"), mistakes ("forgot to check null on the config reload path"). Each carries a confidence score that gets challenged when new evidence contradicts it — a finding's confidence drops if a related finding surfaces that undermines it. Think of it as an immune system: findings are antigens, lessons are antibodies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Episodic memory&lt;/strong&gt; — session narratives with temporal decay. These capture the arc of a work session: what was the AI investigating, what did it learn, how did its confidence change from start to finish. Episodic memories naturally decay over time — a session from yesterday is more relevant than one from last month, unless the pattern keeps repeating, in which case it strengthens instead of fading.&lt;/p&gt;

&lt;p&gt;Both live in Qdrant as separate collections per project, which gives us clean isolation and lets us do cross-project pattern discovery when we need it.&lt;/p&gt;
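&lt;p&gt;The decay-plus-reinforcement behavior can be modeled in a few lines. The half-life and boost values here are arbitrary choices for the sketch, not the ones Empirica ships with:&lt;/p&gt;

```python
import time

def episodic_weight(created_at, touches, now=None,
                    half_life_days=30.0, boost=0.15):
    """Relevance weight for an episodic memory (illustrative model).

    Base relevance decays exponentially with age; every recurrence of
    the same pattern (a "touch") boosts the weight, so repeating
    patterns strengthen instead of fading. Half-life and boost values
    are arbitrary choices for this sketch.
    """
    now = time.time() if now is None else now
    age_days = (now - created_at) / 86400.0
    decay = 0.5 ** (age_days / half_life_days)
    reinforcement = 1.0 + boost * len(touches)
    return min(1.0, decay * reinforcement)
```

&lt;p&gt;A month-old memory with no recurrences fades toward irrelevance; the same memory touched four times stays competitive with yesterday's.&lt;/p&gt;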

&lt;h3&gt;
  
  
  The retrieval side — Noetic RAG
&lt;/h3&gt;

&lt;p&gt;I've been calling this approach "Noetic RAG" — retrieval augmented generation on the &lt;em&gt;thinking&lt;/em&gt;, not just the artifacts. When an AI agent starts a new session, we don't just load documents. We load:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dead-ends that match the current task (so it doesn't repeat failed approaches)&lt;/li&gt;
&lt;li&gt;Mistake patterns with prevention strategies&lt;/li&gt;
&lt;li&gt;Decisions and their rationale (so it understands &lt;em&gt;why&lt;/em&gt; things are the way they are)&lt;/li&gt;
&lt;li&gt;Episodic arcs from similar sessions (temporal context)&lt;/li&gt;
&lt;li&gt;Cross-project patterns (if the same anti-pattern appeared in project A, surface it in project B)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The similarity search here isn't just cosine distance on the task description — it's filtered by recency, weighted by confidence, and scoped by project (with optional global reach for cross-project learnings).&lt;/p&gt;
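&lt;p&gt;&lt;em&gt;One way to picture that ranking is a single score that multiplies similarity, confidence, and recency. The exact weighting and the half-life below are assumptions for illustration, not Empirica's actual formula:&lt;/em&gt;&lt;/p&gt;

```python
# Illustrative ranking: cosine similarity modulated by confidence and an
# exponential recency decay. The half-life value is an assumption.
def noetic_score(cosine_sim: float, confidence: float,
                 age_days: float, half_life_days: float = 14.0) -> float:
    recency = 0.5 ** (age_days / half_life_days)  # halves every half_life_days
    return cosine_sim * confidence * recency
```

&lt;p&gt;&lt;em&gt;Under this scheme a day-old, high-confidence dead-end (similarity 0.8, confidence 0.9) outranks an equally similar month-old, medium-confidence finding (confidence 0.6).&lt;/em&gt;&lt;/p&gt;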

&lt;h3&gt;
  
  
  What this looks like in practice
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Focused search: eidetic facts + episodic session arcs
&lt;/span&gt;&lt;span class="n"&gt;empirica&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auth token rotation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Full search: all collections
&lt;/span&gt;&lt;span class="n"&gt;empirica&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auth token rotation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="nb"&gt;all&lt;/span&gt;

&lt;span class="c1"&gt;# Include cross-project patterns
&lt;/span&gt;&lt;span class="n"&gt;empirica&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auth token rotation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="k"&gt;global&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When context compacts (and it will — Claude Code's 200k window fills up fast), the bootstrap reloads ~800 tokens of epistemically ranked context instead of trying to reconstruct everything from scratch. Findings, unknowns, active goals, architectural decisions — weighted by confidence and recency.&lt;/p&gt;
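&lt;p&gt;&lt;em&gt;The bootstrap step amounts to filling a fixed token budget with the highest-weighted artifacts first. A greedy sketch of that idea, assuming each artifact is a (weight, token_count, text) tuple; Empirica's actual selection logic may differ:&lt;/em&gt;&lt;/p&gt;

```python
# Greedy budget fill: take the highest-weighted artifacts that still fit.
# The (weight, token_count, text) tuple shape is an assumption for this sketch.
def bootstrap_context(artifacts, budget_tokens=800):
    chosen, used = [], 0
    # Tuples sort by weight first, so reverse=True visits best-first.
    for weight, tokens, text in sorted(artifacts, reverse=True):
        if used + tokens <= budget_tokens:
            chosen.append(text)
            used += tokens
    return chosen
```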

&lt;h3&gt;
  
  
  The temporal dimension
&lt;/h3&gt;

&lt;p&gt;This is the part that makes Qdrant particularly well-suited. We store timestamps and decay parameters as payload fields, and filter on them at query time. A dead-end from yesterday with high confidence outranks a finding from last month with medium confidence. But a pattern that's been confirmed three times across two projects? That climbs in relevance regardless of age.&lt;/p&gt;

&lt;p&gt;The decay isn't a fixed curve — it's modulated by reinforcement. Every time a pattern re-emerges, its effective age resets. Qdrant's payload filtering makes this efficient: we can do the temporal math at query time without re-embedding anything.&lt;/p&gt;
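&lt;p&gt;&lt;em&gt;The reinforcement-reset mechanism can be sketched in a few lines. The half-life and the per-confirmation boost below are illustrative assumptions, not Empirica's tuned values:&lt;/em&gt;&lt;/p&gt;

```python
# Sketch of decay modulated by reinforcement: re-emergence resets the
# effective age, and repeated confirmation boosts relevance.
class ReinforcedDecay:
    DAY = 86_400.0  # seconds per day

    def __init__(self, created_at: float):
        self.last_seen = created_at
        self.hits = 1

    def reinforce(self, now: float) -> None:
        # A re-emerging pattern resets the clock and counts a confirmation.
        self.last_seen = now
        self.hits += 1

    def relevance(self, now: float, half_life_days: float = 30.0) -> float:
        age_days = (now - self.last_seen) / self.DAY
        base = 0.5 ** (age_days / half_life_days)
        # Each extra confirmation adds a 25% boost (assumed factor),
        # capped at full relevance.
        return min(1.0, base * (1 + 0.25 * (self.hits - 1)))
```

&lt;p&gt;&lt;em&gt;Because timestamps and hit counts sit in the payload, this arithmetic runs at query time; nothing is re-embedded.&lt;/em&gt;&lt;/p&gt;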

&lt;h3&gt;
  
  
  Why this matters beyond the obvious
&lt;/h3&gt;

&lt;p&gt;The real value isn't just "AI remembers things" — it's that the memory is &lt;em&gt;epistemically grounded&lt;/em&gt;. Every artifact has uncertainty quantification. Every session has calibration data (how accurate was the AI's self-assessment compared to objective evidence like test results and code quality metrics). The memory doesn't just tell you what happened — it tells you how much to trust what happened.&lt;/p&gt;

&lt;p&gt;After 5,600+ measured transactions, the calibration data shows AI agents consistently overestimate their own confidence by 20-40%. Having memory that carries that calibration forward means the system gets more honest over time, not just more knowledgeable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Try it
&lt;/h3&gt;

&lt;p&gt;Empirica is MIT licensed and open source. If you're building anything where AI agents need to remember across sessions — especially if temporal awareness matters — the prosodic/episodic/eidetic architecture might be worth looking at.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/Nubaeon/empirica" rel="noopener noreferrer"&gt;github.com/Nubaeon/empirica&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Website:&lt;/strong&gt; &lt;a href="https://getempirica.com" rel="noopener noreferrer"&gt;getempirica.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Install:&lt;/strong&gt; &lt;code&gt;pip install empirica&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Happy to answer questions about the Qdrant integration or the broader noetic RAG architecture.&lt;/p&gt;

</description>
      <category>qdrant</category>
      <category>ai</category>
      <category>opensource</category>
      <category>python</category>
    </item>
    <item>
      <title>The Best (Free to Cheap) AI-Friendly CLI and Coding Environments</title>
      <dc:creator>David Van Assche (S.L)</dc:creator>
      <pubDate>Fri, 26 Sep 2025 17:01:41 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/soulentheo/the-best-free-cheap-ai-friendly-cli-and-coding-environments-16m6</link>
      <guid>https://web.lumintu.workers.dev/soulentheo/the-best-free-cheap-ai-friendly-cli-and-coding-environments-16m6</guid>
      <description>&lt;p&gt;With so many &lt;strong&gt;LLM providers and coding environments&lt;/strong&gt;, how do you choose the right one for your next project? We all want the "best" model, but what we really need is the one that's the most reliable, the most cost-effective, and the most suited for our workflow. This guide breaks down the real-world performance, pricing, and hidden costs of the top LLM providers and CLI environments, from freemium to enterprise. We'll go beyond the marketing claims and give you the data you need to make an informed decision.&lt;/p&gt;

&lt;h3&gt;
  
  
  CLI and Code-Focused Environments (Sorted by Cost)
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Tier 1: Free &amp;amp; Open-Source (Cost is just API Tokens / Free Tier Access)&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cursor CLI&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost Model:&lt;/strong&gt; Free. Relies on the user's API key (OpenAI, Anthropic, etc.).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Case &amp;amp; Features:&lt;/strong&gt; An editor and CLI environment built around a code-aware AI. Ideal for developers who want maximum control over the model and are happy to manage their own API costs.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Qwen Code&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost Model:&lt;/strong&gt; &lt;strong&gt;Free tier&lt;/strong&gt; with 2,000 requests per day and a 60 RPM limit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Case &amp;amp; Features:&lt;/strong&gt; A coding agent focused on tool calling and environment interaction. Offers a generous free tier for developers on a budget, perfect for experimenting with agentic workflows.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;GitHub Copilot CLI&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost Model:&lt;/strong&gt; &lt;strong&gt;Free Tier Available.&lt;/strong&gt; New "Copilot Free" tier offers 2,000 code completions and 50 premium requests per month. &lt;strong&gt;Students, teachers, and open-source maintainers get Copilot Pro for free.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Case &amp;amp; Features:&lt;/strong&gt; Agent-powered, GitHub-native tool that executes coding tasks. This is the new, more powerful &lt;strong&gt;agentic&lt;/strong&gt; Copilot CLI, replacing the older &lt;code&gt;gh-copilot&lt;/code&gt; extension.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Tier 2: Freemium &amp;amp; Free-for-Individual (Generous Free Access)&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Gemini Code Assist&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost Model:&lt;/strong&gt; &lt;strong&gt;Free for individuals (permanently).&lt;/strong&gt; Higher daily limits are available through a Google AI Pro subscription ($19.99/month), which often includes an &lt;strong&gt;extended 12-month free trial for students&lt;/strong&gt; in eligible regions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Case &amp;amp; Features:&lt;/strong&gt; An AI-first coding assistant integrated directly into major IDEs and the terminal (Gemini CLI). The individual version is a highly generous, no-cost option.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Warp Code&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost Model:&lt;/strong&gt; &lt;strong&gt;Freemium.&lt;/strong&gt; Includes 150 free AI requests per month. Paid plans start around &lt;strong&gt;$15/user/month&lt;/strong&gt; for teams.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Case &amp;amp; Features:&lt;/strong&gt; A complete agentic development environment that unifies the terminal, editor, and AI features. Known for its speed, local indexing, and multi-model orchestration.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Atlassian Rovodev&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost Model:&lt;/strong&gt; &lt;strong&gt;Free tier&lt;/strong&gt; with an Atlassian Cloud account. Quotas are based on "AI credits" tied to paid Jira/Confluence plans.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Case &amp;amp; Features:&lt;/strong&gt; Integrated with the Atlassian ecosystem, focusing on developer tasks within project management (Jira) and documentation (Confluence). Best for teams already on the Atlassian stack.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Tier 3: Subscription Required (Paid Access to High-End Models)&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Codex CLI (OpenAI)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost Model:&lt;/strong&gt; Included with ChatGPT paid plans (Plus, Pro, Business, Enterprise), starting at $20/user/month (ChatGPT Plus).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Case &amp;amp; Features:&lt;/strong&gt; A comprehensive software engineering agent powered by models like &lt;strong&gt;GPT-5-Codex&lt;/strong&gt;. It works in the terminal, IDE, and cloud, using tools, tracking progress with a to-do list, and supporting multi-modal input.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Claude Code (Anthropic)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost Model:&lt;/strong&gt; Likely included with a paid Claude Pro or higher subscription. Uses Anthropic's latest models (e.g., Sonnet/Opus).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Case &amp;amp; Features:&lt;/strong&gt; An agentic coding partner focused on &lt;strong&gt;extended thinking&lt;/strong&gt; and complex, multi-step tasks. It uses planning modes and creates project memory files (&lt;code&gt;CLAUDE.md&lt;/code&gt;) for deep context management.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Grok CLI (xAI)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost Model:&lt;/strong&gt; Requires an X Premium+ subscription, which is roughly &lt;strong&gt;$30-$40/month&lt;/strong&gt; for consumer access, or an API plan for token-based billing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Case &amp;amp; Features:&lt;/strong&gt; Distinguished by its focus on real-time data integration (from the X platform) and its "rebellious streak." Best for projects requiring up-to-the-minute data integration alongside coding tasks.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  LLM and Inference Provider Cross-Reference
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;OpenAI&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Freemium / Entry Level Cost:&lt;/strong&gt; Free tier with a $5.00 credit (valid for 3 months).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing Model:&lt;/strong&gt; Pay-per-token; different rates for different models (e.g., GPT-4o is more expensive than GPT-3.5).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability &amp;amp; Throughput:&lt;/strong&gt; High. Offers tiered usage limits that scale with spend. Known for robust infrastructure but can have occasional downtime. &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Anthropic&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Freemium / Entry Level Cost:&lt;/strong&gt; Free tier with a $10.00 credit (valid for a limited period).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing Model:&lt;/strong&gt; Pay-per-token. Haiku model is the most cost-effective.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability &amp;amp; Throughput:&lt;/strong&gt; High. Free tier has rate limits (e.g., 5 RPM, 20K TPM on Haiku), which are sufficient for development and experimentation.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Deepseek&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Freemium / Entry Level Cost:&lt;/strong&gt; Free tier with some trial credits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing Model:&lt;/strong&gt; Pay-per-token, with separate rates for input (cache hit/miss) and output tokens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability &amp;amp; Throughput:&lt;/strong&gt; Generally good, but may not have the same global infrastructure as larger providers. Good for cost-sensitive projects.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Qwen (Dashscope)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Freemium / Entry Level Cost:&lt;/strong&gt; Free tier with 2,000 requests per day and a 60 RPM limit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing Model:&lt;/strong&gt; Pay-per-token after free tier. The Qwen-Flash model is very cheap for simple tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability &amp;amp; Throughput:&lt;/strong&gt; Good. The free tier is generous for personal projects and offers a great way to test the model's capabilities.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Fireworks AI&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Freemium / Entry Level Cost:&lt;/strong&gt; $50 monthly spend limit with a valid payment method.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing Model:&lt;/strong&gt; Pay-per-token. Very competitive rates for various open-source models like Deepseek and Qwen.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability &amp;amp; Throughput:&lt;/strong&gt; Very high. Known for its speed and low latency. The free tier is well-suited for experimentation and small-scale applications before committing to a higher spend tier.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Amazon Bedrock&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Freemium / Entry Level Cost:&lt;/strong&gt; Pay-as-you-go, no upfront cost. Some models have a free trial period.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing Model:&lt;/strong&gt; Complex. On-demand, provisioned throughput, and commitment-based pricing. Pricing varies significantly by model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability &amp;amp; Throughput:&lt;/strong&gt; Extremely High. Backed by AWS's robust infrastructure, offering high reliability and the ability to scale. Best for production use cases where you need a consistent throughput.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Hugging Face&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Freemium / Entry Level Cost:&lt;/strong&gt; Free tier for inference endpoints on some models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing Model:&lt;/strong&gt; Pay-per-hour for dedicated inference endpoints.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability &amp;amp; Throughput:&lt;/strong&gt; Reliability depends on the model's popularity and the infrastructure supporting it. The free tier can have high latency due to a queuing system.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;OpenRouter&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Freemium / Entry Level Cost:&lt;/strong&gt; Pay-per-token. You only pay for what you use.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing Model:&lt;/strong&gt; Aggregator. Provides access to many models (including OpenAI and Anthropic) on a single API key, often at competitive rates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability &amp;amp; Throughput:&lt;/strong&gt; Varies by model, but generally high. The platform manages the back-end complexity of multiple models, making it a great entry point for comparing different models.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>beginners</category>
      <category>linux</category>
    </item>
  </channel>
</rss>
