Pascal CESCATO read my SEO audit agent piece and left this in the comments:
"You don't need an LLM for this. Everything you're sending to Claude can be done directly in Python — zero cost, fully deterministic, no hallucination risk."
He was right. And wrong. And the conversation that followed is the reason I rebuilt the entire thing.
What Pascal Actually Said
The audit agent I published checks title length, meta description length, H1 count, and canonical tags. Pascal's point: those are character counts and presence checks. A regex does that. You don't pay $0.006 per URL for a regex.
I pushed back. The flags array requires judgment — "title reads like a navigation label rather than a page description" isn't a character count. Pascal conceded, then reframed:
"Two-pass makes more sense. Deterministic Python for binary checks, model call only on pages that pass the mechanical audit but need a second look. You pay per genuinely ambiguous case, not per URL."
That's a better architecture than what I shipped. I said so publicly.
Then Julian Oczkowski extended it:
"Deterministic rules first, lightweight models for triage, larger models reserved for genuinely ambiguous edge cases. Keeps latency low, costs predictable, reduces unnecessary LLM dependency."
Three people in a comment thread had just designed something I hadn't thought to name. Pascal called it two-pass. Julian called it tiered. I called it the cost curve — a sliding scale from free to expensive, routed by what the task actually requires.
The Cost Curve
Tier 1 — Deterministic Python. Cost: $0.
Title >60 characters? FAIL. Description missing? FAIL. H1 count == 0? FAIL. These are not judgment calls. A model that can reason about Shakespeare does not need to be invoked to count to 60.
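In code, Tier 1 is nothing more than this (a sketch; the snapshot keys are illustrative, not necessarily the repo's actual schema):

```python
# Tier 1: pure-Python checks on a scraped page snapshot. Zero model
# calls, zero cost. The dict keys here are illustrative.

def tier1_checks(snapshot: dict) -> list[str]:
    """Return a list of hard failures. An empty list means the page
    passes every mechanical check."""
    failures = []

    title = snapshot.get("title") or ""
    if not title:
        failures.append("TITLE_MISSING")
    elif len(title) > 60:
        failures.append("TITLE_TOO_LONG")

    if not snapshot.get("meta_description"):
        failures.append("DESCRIPTION_MISSING")

    if snapshot.get("h1_count", 0) == 0:
        failures.append("H1_MISSING")

    if not snapshot.get("canonical"):
        failures.append("CANONICAL_MISSING")

    return failures
```

Calling `tier1_checks({"title": "A" * 70, "h1_count": 1, "canonical": "https://example.com"})` returns `["TITLE_TOO_LONG", "DESCRIPTION_MISSING"]`. No regex is even needed; `len()` counts to 60 for free.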
Tier 2 — Haiku. Cost: ~$0.0001 per URL.
Title present but 4 characters long. Description present but 30 characters. Status code is a redirect. These pass the mechanical audit but something is off. Haiku is cheap enough that calling it for ambiguous cases costs less than the time you'd spend debugging why the deterministic check missed something.
Tier 3 — Sonnet. Cost: ~$0.006 per URL.
Pages Haiku flags as needing semantic judgment. "This title passes length but reads like a navigation label." "This description duplicates the title verbatim." Sonnet earns its cost here. Not everywhere.
The insight is routing. Most pages on a typical agency site have mechanical issues — missing descriptions, long titles, no canonical. Those never need a model. The interesting cases — pages that pass every binary check but still feel wrong — are where the model earns its place.
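A sketch of that routing, with both model calls stubbed out (in a real build, `haiku_triage` and `sonnet_review` would be Anthropic API calls to Haiku and Sonnet; everything below is illustrative, not the repo's actual `audit_url` implementation):

```python
# Cost-curve router sketch. Model calls are stubbed as placeholders.

def mechanical_failures(snapshot: dict) -> list[str]:
    """Tier 1: deterministic checks, $0."""
    failures = []
    if len(snapshot.get("title", "")) > 60:
        failures.append("TITLE_TOO_LONG")
    if not snapshot.get("meta_description"):
        failures.append("DESCRIPTION_MISSING")
    return failures

def looks_ambiguous(snapshot: dict) -> bool:
    """Cheap heuristics deciding whether Haiku should look at all:
    present-but-suspicious fields escalate, clean pages do not."""
    title = snapshot.get("title", "")
    desc = snapshot.get("meta_description", "")
    return (0 < len(title) < 10) or (0 < len(desc) < 40)

def haiku_triage(snapshot: dict) -> bool:
    """Placeholder for a Haiku call (~$0.0001/URL): True if the page
    needs semantic judgment. Stubbed as a constant here."""
    return True

def sonnet_review(snapshot: dict) -> list[str]:
    """Placeholder for a Sonnet call (~$0.006/URL) returning flags."""
    return ["NEEDS_HUMAN_STYLE_REVIEW"]

def audit(snapshot: dict) -> dict:
    failures = mechanical_failures(snapshot)
    if failures:                       # Tier 1 verdict, no model call
        return {"tier": 1, "issues": failures}
    if not looks_ambiguous(snapshot):  # clean page, resolved for free
        return {"tier": 1, "issues": []}
    if not haiku_triage(snapshot):     # Tier 2: cheap triage only
        return {"tier": 2, "issues": []}
    return {"tier": 3, "issues": sonnet_review(snapshot)}
```

The shape of the function is the argument: most pages return before a model is ever imported, and Sonnet sits behind two gates.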
On my last run of 50 URLs, 8 reached Sonnet. The rest resolved at Tier 1. Total cost dropped from ~$0.30 to ~$0.05. The 8 that hit Sonnet were the ones worth paying for.
What I Actually Built
I restructured the entire repo around this architecture.
core/ stays flat and MIT licensed. The original seven modules untouched. Anyone who cloned v1 still runs python core/index.py and gets identical behavior.
premium/ adds four modules.

cost_curve.py handles the tier routing — audit_url(snapshot, tiered=True) runs Tier 1 first, escalates to Haiku if something fails, escalates to Sonnet only if Haiku flags semantic ambiguity.

multi_client.py manages project folders — --project acme reads and writes from projects/acme/ with isolated state, input, and reports.

enhanced_reporter.py generates WeasyPrint PDFs with per-URL screenshots, issues sorted by severity, and suggested fixes.

rewrite_agent.py is the one Pascal didn't anticipate — after the audit, --rewrite generates improvement suggestions using the same cost curve: Tier 1 truncates titles deterministically, Haiku writes meta description suggestions, Sonnet rewrites opening paragraphs. Pass --voice-sample ./my-writing.txt and the prompt includes a sample of your writing. The suggestions sound like you, not like Claude.
main.py is the unified entry point. Free users run python main.py and get v1 behavior. Pro users add --pro, set SEO_AGENT_LICENSE, unlock the premium layer.
A full pro run looks like this:
python main.py --project client-x --pro --tiered --rewrite
That single command: reads from projects/client-x/input.csv, routes every URL through the cost curve, generates rewrite suggestions for failing pages, writes a PDF report with screenshots and severity levels, and appends a run record to the audit history.
The Architecture Decision That Matters
core/ imports nothing from premium/. Ever.
This isn't just clean code. It's a trust contract with anyone who forks the repo. The MIT-licensed core is the public good — auditable, forkable, accepts PRs. The premium layer is proprietary and closed, but it builds on a foundation anyone can inspect.
Mads Hansen in the comments named the right question: how do you prevent the auditor from developing blind spots on recurring patterns? The answer I didn't have then: run history in state.json. Each completed run appends a record — timestamps, pass counts, fail counts, report path. Over time, "this page has failed description length for six consecutive runs" is a different signal than "this page failed today." The audit becomes a monitor. Not just a snapshot.
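The run-history mechanic is simple enough to sketch. The record fields mirror the ones named above; the `consecutive_failures` helper is an illustrative addition, not necessarily the repo's API:

```python
import json
from pathlib import Path

def append_run(state_path: Path, record: dict) -> None:
    """Append one completed-run record to state.json."""
    state = {"runs": []}
    if state_path.exists():
        state = json.loads(state_path.read_text())
    state.setdefault("runs", []).append(record)
    state_path.write_text(json.dumps(state, indent=2))

def consecutive_failures(state_path: Path, url: str, check: str) -> int:
    """How many consecutive recent runs has `url` failed `check`?
    A six-run streak is a different signal than a one-off failure."""
    runs = json.loads(state_path.read_text()).get("runs", [])
    streak = 0
    for run in reversed(runs):
        if [url, check] in run.get("failures", []):
            streak += 1
        else:
            break
    return streak
```

A record might look like `{"timestamp": 1718000000, "passed": 42, "failed": 8, "report": "reports/run-01.pdf", "failures": [["https://site/pricing", "DESCRIPTION_LENGTH"]]}` — once the streak counter exists, the audit stops being a snapshot and starts being a monitor.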
What the Comment Thread Cost
Nothing. And produced a better architecture than I would have built alone.
Pascal's pushback separated the deterministic from the semantic. Julian's production framing gave me the three-tier structure. Apex Stack's 89K-page site showed me where the orphan detection problem lives. Mads Hansen named the blind spot question I hadn't asked.
None of that was in the original article. All of it is in the repo now.
The public comment thread is the architecture review I didn't schedule. That's what happens when you publish the honest version — the demo that fails on your own content — instead of the staged one.
The staged demo would have passed. The honest one compounded.
Full repo: dannwaneri/seo-agent. Core is MIT. Premium requires a license key. The freeCodeCamp tutorial covers the v1 build in detail: How to Build a Local SEO Audit Agent with Browser Use and Claude API.
Top comments (23)
Good piece. The cost curve holds, and "8 out of 50 reached Sonnet" is exactly the data that makes the argument. The voice-sample flag is new scope — but that's your pattern, so I'm not surprised.
The 8/50 number is the one I was most uncertain about including — specific enough to be checkable, general enough that it might not hold on other sites. Figured the honest version was worth the risk of someone running it and getting 12/50.
The voice-sample scope was yours too, indirectly. "Cheapest model that solves the problem" applied to rewrites means Sonnet only when the output actually needs to sound like a specific person. That's where the cost justifies itself.
The 8/50 being checkable is exactly why it belongs in the piece. A staged number would have been smoother and less useful. And yes — voice-sample is the cost curve applied correctly: Sonnet earns it when the output needs to sound like a specific person, not before.
The part I'm still testing: whether the 8/50 ratio holds across site types, or if it's a property of the agency portfolio I was running. Editorial sites might route more to Sonnet. E-commerce with templated descriptions might route fewer. The routing logic stays the same but the cost ceiling moves.
"He was right" is a good way to open a post that could easily have been defensive. The pattern of recognizing when an LLM is solving a problem that deterministic code handles better is genuinely underappreciated — people reach for AI for things that are structurally rule-based and then wonder why results are inconsistent.
The useful generalization here: LLMs add value when the task requires judgment under ambiguity or synthesis across unstructured inputs. When the input is structured and the transformation rules are knowable, deterministic code is faster, cheaper, and more reliable. For SEO audits specifically, most of the high-value checks (missing meta, broken links, heading structure, page speed signals) are fully deterministic — the places where an LLM adds something real are interpretation and prioritization of findings, not the audit itself.
What ended up being the actual breakdown — what percentage of your audit logic stayed as LLM calls vs moved to Python?
The breakdown on the last run: 8 of 50 URLs reached Sonnet. The rest resolved at Tier 1 — pure Python, zero model calls. So roughly 84% of the audit logic moved to deterministic code, 16% genuinely needed judgment. That ratio will shift by site type — programmatic SEO with inconsistent templates skews higher, clean agency portfolios skew lower but the direction holds.
Your framing is exactly right: interpretation and prioritization of findings is where the model earns it, not the audit itself. "This page has a missing description" is a Python job. "This page passes every check but the title reads like a navigation label for an audience that came from a transactional query" is a model job. The hard part was resisting the urge to use the model for both because the demo looked cleaner that way.
This is a great breakdown of where LLMs actually make sense vs where they don’t
feels like a lot of people jump straight to AI even for things that are fully deterministic
the “cost curve” idea is really solid — routing based on complexity instead of defaulting to a model for everything
The default-to-model instinct is expensive to unlearn. Routing by what the task actually requires is the whole argument...
that’s a really good way to put it
feels like a lot of systems default to “just call a model” without asking if the task even needs one
This cost curve framing is the cleanest way I've seen anyone articulate tiered AI processing. The core insight — that routing is the architecture, not the model — is something most people building with LLMs miss entirely.
I'm the "89K-page site" mentioned in the article, and the ratio question Daniel raises at the end is exactly where I'm stuck. My site is multilingual (12 languages), so the Tier 1 → Tier 3 distribution would look very different from an agency portfolio. Things like English content accidentally rendering on a Japanese page, or hreflang tags pointing to wrong canonicals — those pass every deterministic check but are fundamentally broken for their audience. I'd estimate 30-40% of my pages would escalate to Sonnet on any given run.
The --voice-sample flag is an underrated addition. For programmatic SEO sites where you're generating content at scale, keeping a consistent voice across thousands of pages is a real challenge. Having the rewrite agent match a voice sample instead of defaulting to generic Claude output solves a problem I've been thinking about for weeks.

The open-core licensing model (MIT core, proprietary premium) is also smart business design. Builds trust with the community while protecting the value-add.
The hreflang-to-wrong-canonical case is the one that breaks my Tier 1 assumptions cleanest. Deterministic check says canonical is present — PASS. The model looking at rendered context catches that the canonical points to the English version while the hreflang declares Japanese. That's not a missing tag, it's a logic error between two valid tags. No regex finds that. And at 30-40% Sonnet escalation across 89K pages, the cost curve math looks very different from my 8/50 agency run.
The voice-sample use case you named — programmatic SEO at scale, consistent voice across thousands of generated pages — is one I hadn't written up explicitly. The flag exists because "cheapest model that sounds like a specific person" is Sonnet, but the problem you're describing is upstream: how do you maintain voice consistency when generation is happening at volume, not post-hoc? Is the sample you'd pass a single reference document, or are you thinking per-language samples?
It is indeed, well, interesting how much work folks are pushing to LLMs, and then paying for, that could reasonably be done in code or even with a local SLM implementation. While not a solution for every situation, I think there is a shift happening from "send everything to Sonnet" to a more nuanced approach of "does it need to go there?"
The shift you're describing is the one that actually changes cost structures. "Does it need to go there?" sounds obvious until you're three months into a system where everything went to Sonnet by default and the bill is the first signal that something's wrong. Local SLM for the middle tier is the piece I haven't tested yet — Haiku is cheap enough that I haven't needed to, but at serious scale that changes.
the regex vs llm split you landed on is where most people get stuck - not because they can't see the difference, but because they scaffolded the whole thing around llm first and retrofitting feels wasteful. rebuilding was the right call.
This “cost curve” framing is really smart. Most people jump straight to LLMs for everything, but routing by ambiguity makes way more sense.
I’ve seen the same pattern: 70–80% of SEO issues are deterministic (missing meta, duplicate titles, broken canonicals), and the model is only useful for the “this technically passes but feels wrong” cases.
Also interesting that your biggest savings came from routing, not prompt optimization. That's a good reminder that architecture usually beats micro-optimizations.
Have you tried pushing the deterministic layer even further? For example clustering similar pages first and auditing templates instead of URLs one by one.
The template clustering question is the right next step and someone in another thread handed me the framing for it: if a bug lives in a template, it affects every page using that template — fix one, fix all. So the audit unit stops being a URL and becomes a failure mode. Sample one page per template per locale, catch the systemic issues at a fraction of the full-site cost, then only run individual URL audits on pages that deviate from their template's expected pattern.
I haven't shipped this yet but it's on the roadmap. The missing piece is the classification layer — you need something that groups pages by template reliably before you can sample from them. For a CMS with clean URL patterns that's straightforward. For programmatic SEO sites with 89K pages across 12 languages it gets more complex fast. What does your page inventory look like — consistent enough that URL pattern matching would cluster templates cleanly?
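For the clean-URL-pattern case, the clustering step is small enough to sketch (the normalization rules below are illustrative; a real site needs its own):

```python
import re
from collections import defaultdict

def template_key(url: str) -> str:
    """Collapse a URL path into a template signature: numeric IDs and
    hyphenated slugs become wildcards, so /products/42 and /products/97
    land in the same cluster. Rules are illustrative, not exhaustive."""
    path = re.sub(r"^https?://[^/]+", "", url)
    segments = []
    for seg in path.strip("/").split("/"):
        if seg.isdigit():
            segments.append("<id>")
        elif re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)+", seg):
            segments.append("<slug>")
        else:
            segments.append(seg)
    return "/" + "/".join(segments)

def sample_per_template(urls: list[str]) -> dict[str, str]:
    """One representative URL per template cluster: audit that page,
    and only revisit cluster members that deviate from it."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[template_key(url)].append(url)
    return {key: members[0] for key, members in clusters.items()}
```

On a site with clean patterns, 89K URLs might collapse to a few dozen template keys, and the full-site audit cost drops to the per-template sample plus the deviants.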
The tiered cost curve concept is brilliant and applies way beyond SEO audits. Pascal's two-pass reframe is the kind of feedback that makes public building worthwhile.
We applied essentially the same pattern to bootstrapping a product on $0 budget: Cloudflare free tier for hosting, Workers, and CDN. Resend free tier for transactional email. Google Gemini free tier for automation. Only escalate to paid when the free tier hits a real wall, not a hypothetical one.
The trap is the same as your SEO agent: it's tempting to throw the expensive tool at everything because it can handle it. But 'can' and 'should' diverge fast when you're paying per request. Character counts don't need Claude. Most bootstrapping tasks don't need paid infrastructure.
The "only escalate when the free tier hits a real wall, not a hypothetical one" is the discipline that's hardest to maintain. The temptation is always to provision for the ceiling case before you've hit it, which means you pay for scale you haven't earned yet. Your stack is the cost curve applied to infrastructure: same logic, different layer. The failure mode in both cases is the same: optimizing for capability instead of routing by what the task actually requires. Curious where Gemini free tier actually hit a wall for you — that's usually where the architecture gets interesting...
The cost curve framing is really useful. I do AI visibility audits for small businesses and hit the same realization - most of the checks that actually matter (NAP consistency across directories, GBP completeness, schema markup presence) are binary. Either it's there or it isn't.
The expensive part isn't the audit. It's explaining what the results mean to a business owner who doesn't know what a canonical tag is. That's where the model earns its cost - translating "your H1 is missing" into "when someone asks ChatGPT for a plumber in your area, it can't figure out what page to recommend because your homepage doesn't clearly say what you do."
Pascal's two-pass insight applies beyond SEO tooling. Any time you're running an LLM on structured data that has deterministic answers, you're paying for confidence theater.
"Translating H1 missing into what ChatGPT can't do for your business" is the use case that justifies Sonnet on the output layer even when the audit itself is deterministic. The check is binary. The explanation isn't. That's a clean separation I hadn't written up explicitly — the cost curve applies to both the audit and the reporting, and they route differently. A missing canonical is a Tier 1 find and a Tier 3 explanation.
"Confidence theater" is the sharpest way I've seen the problem named. You're paying the model to sound certain about something a regex already knew. The cost isn't just dollars — it's latency and the hallucination surface on data that had a deterministic answer. What does your output layer look like for the business owner translation — templated prompts per issue type, or does the model generate the explanation from the raw audit result each time?
That's perfect! Most of us get so used to the ease of using AI for everything that we sometimes forget how we could do it all for free.
The "free first" instinct is the one worth building back. Regex doesn't hallucinate either.
Win win win