Last week I was rebuilding parts of JobSearch — an open-source job-search tool I built
for my own job hunt. I had a list of 6 things I wanted to ship: a few features, a couple of tech-debt fixes, and one risky
refactor (migrating Anthropic API calls from text-based JSON parsing to tool use).
I expected to get through 2, maybe 3, of them. Instead I shipped 13 PRs to main, all green CI, all live in production, in
a single afternoon — pair-programming with Claude.
This is what worked, what didn't, and the technical lesson that surprised me the most.
## The setup
JobSearch is a FastAPI + PostgreSQL + Redis monolith deployed on Render. Standard Python stack: 394 tests, mypy strict,
9-stage GitHub Actions CI (ruff, format, bandit, pip-audit, stylelint, CodeQL, pytest, docker build, deploy). Real production
app, used daily by exactly one person — me.
The session started simple: I asked Claude to plan the work in /ultraplan mode, and we got back a 6-PR roadmap ordered by
risk. Then we just… executed it.
## The "onion" workflow
The thing that worked best wasn't speed — it was discipline. Every PR was:
- New branch from latest main
- Single concern (no mixing tech debt with features)
- Tests added
- Push → wait for CI green → merge → delete branch
- Pull main → start next PR
Sounds basic. But the discipline of "no overlapping PRs, no shortcuts, never merge without 9/9 green checks" is what kept the
velocity sustainable for 6 hours straight. We never had to roll back. Production never broke.
The reason I could keep this pace: Claude handled the typing and the verifications, I handled the decisions and the
priorities. Classic pair programming, just with one of the pair being an LLM.
## The risky one: tool use migration
The PR I was most worried about was eliminating ~250 LOC of fragile JSON parsing.
Before: every Claude API call returned text. We'd run that text through 7 parsing strategies (_clean_json_text,
_fix_unescaped_newlines, _fix_single_quotes, _strip_markdown_wrapper, _extract_and_parse_json, then a fallback that
asks Claude to fix its own broken JSON).
This worked but it was a band-aid. Every few hundred analyses we'd hit an edge case the parsers didn't cover.
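To make the fragility concrete, here's a minimal sketch of what a cascading-repair parser like that looks like. The strategy names in the comments come from the article; the bodies are simplified stand-ins, not the real implementations:

```python
import json
from typing import Any


def parse_model_json(text: str) -> dict[str, Any]:
    """Try increasingly aggressive repairs until one parses (simplified sketch)."""
    repairs = (
        lambda t: t.strip(),  # roughly _clean_json_text
        lambda t: t.removeprefix("```json").removesuffix("```").strip(),  # _strip_markdown_wrapper
        lambda t: t.replace("'", '"'),  # _fix_single_quotes (naive version)
    )
    cleaned = text
    for repair in repairs:
        cleaned = repair(cleaned)
        try:
            return json.loads(cleaned)
        except json.JSONDecodeError:
            continue
    # the real pipeline would fall back to asking the model to fix its own JSON
    raise ValueError("all parsing strategies failed")
```

Each repair is a guess about how the model broke the JSON, and each guess can corrupt output that a different strategy would have handled, which is exactly why edge cases kept slipping through.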
After: every API call uses Anthropic's tool_use with a forced tool_choice. The schema comes from the Pydantic model's
model_json_schema(). The SDK gives us back a parsed dict directly. Zero local parsing.
```python
from typing import Any, cast

import anthropic

# module-level client; reads ANTHROPIC_API_KEY from the environment
client = anthropic.Anthropic()


def _call_api_with_tool(
    system_prompt: str,
    user_prompt: str,
    model_id: str,
    max_tokens: int,
    tool_name: str,
    tool_description: str,
    input_schema: dict[str, Any],
) -> tuple[dict[str, Any], anthropic.types.Usage]:
    message = client.messages.create(
        model=model_id,
        max_tokens=max_tokens,
        system=[{"type": "text", "text": system_prompt,
                 "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": user_prompt}],
        tools=[{"name": tool_name,
                "description": tool_description,
                "input_schema": input_schema}],
        tool_choice={"type": "tool", "name": tool_name},
    )
    # tool_choice is forced, so exactly one tool_use block is expected
    for block in message.content:
        if getattr(block, "type", None) != "tool_use":
            continue
        tool_input = getattr(block, "input", None)
        if isinstance(tool_input, dict):
            return cast(dict[str, Any], tool_input), message.usage
    raise RuntimeError(f"Expected tool_use block for {tool_name!r}")
```
Net delta: −370 lines across anthropic_client.py + 5 obsolete test classes that no longer made sense. The "JSON parse
failure" bug class disappeared from the codebase entirely.
The surprising lesson came the next morning.
## The bug I didn't see coming
A day later, I was looking at production data and noticed something off. My job-quality filter wasn't flagging P.IVA (Italian
self-employment) job ads correctly. Listings with salary_info: "€2.500-3.000/mese P.IVA" were coming back with is_freelance:
false, exactly the opposite of what I wanted.
The prompt rule was technically still there. It just stopped working.
After a few minutes of digging, I realized: the rule had a permissive fallback — "if the ad mentions both 'employee or
P.IVA', set is_freelance=false (the user can choose employee)". With the old text-parsing pipeline, this fallback rarely
triggered because the model's output had enough noise that it usually committed to one answer.
With tool use, the model follows the schema much more literally. The fallback now triggered too often, because the
model treated even a tangential mention of "P.IVA" as an ambiguous case.
The fix was to harden the prompt rule with explicit precedence: "strong triggers in salary_info (€/day, €/hour, €/month +
P.IVA) override every fallback. The exception requires ALL three explicit conditions."
The takeaway:

> When you migrate from text parsing to tool use, your interpretive prompt rules need to be tightened, not loosened. The model follows the schema more faithfully, but also more literally: ambiguity in the rule becomes ambiguity in the output.
## The pair-programming bit
A few observations from 6 hours of this:
1. Plan first, execute second. The /ultraplan step at the start was worth more than any single PR. Without it we would have
drifted into rabbit holes.
2. Reviews stay with the human. Claude wrote the code, but I read every diff before merging. On 2 PRs I rejected Claude's
first proposal because it scope-crept.
3. Wakeups for CI > polling. Instead of asking "is CI green yet?" every 30 seconds, we scheduled a wakeup for the typical
4-minute CI window. Less context churn, faster perceived workflow.
4. Trust but verify. When Claude said "401/401 tests passing," I didn't take it at face value — I ran pytest tests/ -q myself
before pushing. Twice it had skipped a slow test that was actually failing. Catching that took 10 seconds, not catching it
would have cost a CI cycle.
5. No big-bang refactors. Every PR was small enough to revert in one click. The riskiest one (tool use migration, −370 LOC)
still went out alone, behind its own version flag.
## What's in the repo
github.com/MK023/JobSearch — MIT, FastAPI + Anthropic Claude tool use, 401 tests, 9-stage CI, zero-downtime Alembic
migrations, weekly cleanup cron via GitHub Actions, persisted user preferences with JSONB whitelist.
It's built for one user (me, hunting a DevSecOps / Cloud / Python role in Italy 🇮🇹), but the architecture and the patterns
are reusable. If anything in here is useful for your own AI-integrated production app, take it.
Happy to discuss in comments — especially the tool-use migration, which I think is the right default now for any LLM call
where you control the schema.