# How I shipped 13 PRs in one afternoon pair-programming with Claude (and what I learned)

Marco
Last week I was rebuilding parts of JobSearch — an open-source job-search tool I built for my own job hunt. I had a list of 6 things I wanted to ship: a few features, a couple of tech-debt fixes, and one risky refactor (migrating Anthropic API calls from text-based JSON parsing to tool use).

I expected to get through 2, maybe 3 of them. Instead I shipped 13 PRs to main, all with green CI, all live in production, in a single afternoon of pair-programming with Claude.

This is what worked, what didn't, and the technical lesson that surprised me the most.

## The setup

JobSearch is a FastAPI + PostgreSQL + Redis monolith deployed on Render. Standard Python stack: 394 tests, mypy strict, 9-stage GitHub Actions CI (ruff, format, bandit, pip-audit, stylelint, CodeQL, pytest, docker build, deploy). A real production app, used daily by exactly one person — me.

The session started simple: I asked Claude to plan the work in /ultraplan mode, and we got back a 6-PR roadmap ordered by risk. Then we just… executed it.

## The "onion" workflow

The thing that worked best wasn't speed — it was discipline. Every PR was:

  1. New branch from latest main
  2. Single concern (no mixing tech debt with features)
  3. Tests added
  4. Push → wait for CI green → merge → delete branch
  5. Pull main → start next PR

Sounds basic. But the discipline of "no overlapping PRs, no shortcuts, never merge without 9/9 green checks" is what kept the
velocity sustainable for 6 hours straight. We never had to roll back. Production never broke.
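The loop above can be sketched as a tiny driver script. This is an illustration, not the actual tooling from the session; the branch name, the `DRY_RUN` toggle, and the `gh` flags are my own assumptions:

```python
import subprocess

# Illustrative sketch of the per-PR loop. With DRY_RUN on, it only
# prints the commands it would run (hypothetical helper, not repo code).
DRY_RUN = True
LOG: list[list[str]] = []


def run(cmd: list[str]) -> None:
    """Record and print the command; execute it only when DRY_RUN is off."""
    LOG.append(cmd)
    print("$", " ".join(cmd))
    if not DRY_RUN:
        subprocess.run(cmd, check=True)


def ship_pr(branch: str) -> None:
    run(["git", "checkout", "main"])
    run(["git", "pull", "--ff-only"])           # start from latest main
    run(["git", "checkout", "-b", branch])      # 1. new branch
    # ... single-concern changes + tests go here (steps 2-3) ...
    run(["git", "push", "-u", "origin", branch])
    run(["gh", "pr", "create", "--fill"])       # 4. open PR, wait for CI
    run(["gh", "pr", "merge", "--squash", "--auto", "--delete-branch"])


ship_pr("feat/tool-use-migration")
```

`gh pr merge --auto` only merges once all required checks pass, which matches the "never merge without green CI" rule.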

The reason I could keep this pace: Claude handled the typing and the verifications, I handled the decisions and the priorities. Classic pair programming, just with one of the pair being an LLM.

## The risky one: tool use migration

The PR I was most worried about was eliminating ~250 LOC of fragile JSON parsing.

Before: every Claude API call returned text. We'd run that text through 7 parsing strategies (_clean_json_text, _fix_unescaped_newlines, _fix_single_quotes, _strip_markdown_wrapper, _extract_and_parse_json, then a fallback that asks Claude to fix its own broken JSON).

This worked but it was a band-aid. Every few hundred analyses we'd hit an edge case the parsers didn't cover.
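For a sense of what that pipeline looked like, here is a minimal sketch of a layered repair chain. The helper names echo the ones above but are simplified stand-ins, not the repo's actual code:

```python
import json
import re
from typing import Any, Callable


def _strip_markdown_wrapper(text: str) -> str:
    """Remove ```json ... ``` fences if the model wrapped its output."""
    return re.sub(r"^```(?:json)?\s*|\s*```$", "", text.strip())


def _extract_json_object(text: str) -> str:
    """Fall back to the first {...} span inside free-form text."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    return match.group(0) if match else text


def parse_model_output(text: str) -> dict[str, Any]:
    """Try each repair strategy in order until one yields valid JSON."""
    strategies: list[Callable[[str], str]] = [
        lambda t: t,              # maybe it is already valid JSON
        _strip_markdown_wrapper,
        _extract_json_object,
    ]
    last_error: Exception | None = None
    for strategy in strategies:
        try:
            return json.loads(strategy(text))
        except (json.JSONDecodeError, ValueError) as exc:
            last_error = exc
    raise ValueError(f"All parsing strategies failed: {last_error}")
```

Every new edge case means another strategy in the list — which is exactly why it's a band-aid.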

After: every API call uses Anthropic's tool_use with a forced tool_choice. The schema comes from a Pydantic model's model_json_schema(). The SDK gives us back a parsed dict directly. Zero local parsing.


```python
from typing import Any, cast

import anthropic

client = anthropic.Anthropic()


def _call_api_with_tool(
    system_prompt: str,
    user_prompt: str,
    model_id: str,
    max_tokens: int,
    tool_name: str,
    tool_description: str,
    input_schema: dict[str, Any],
) -> tuple[dict[str, Any], anthropic.types.Usage]:
    message = client.messages.create(
        model=model_id,
        max_tokens=max_tokens,
        system=[{"type": "text", "text": system_prompt,
                 "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": user_prompt}],
        tools=[{"name": tool_name,
                "description": tool_description,
                "input_schema": input_schema}],
        tool_choice={"type": "tool", "name": tool_name},
    )
    # tool_choice forces a tool_use block; find it and return its parsed input.
    for block in message.content:
        if getattr(block, "type", None) != "tool_use":
            continue
        tool_input = getattr(block, "input", None)
        if isinstance(tool_input, dict):
            return cast(dict[str, Any], tool_input), message.usage
    raise RuntimeError(f"Expected tool_use block for {tool_name!r}")
```

Net delta: −370 lines across anthropic_client.py, plus 5 obsolete test classes that no longer made sense. The "JSON parse failure" bug class disappeared from the codebase entirely.

The surprising lesson came the next morning.

## The bug I didn't see coming

A day later, I was looking at production data and noticed something off. My job-quality filter wasn't flagging P.IVA (Italian self-employment) job ads correctly. Ads with salary_info: "€2.500-3.000/mese P.IVA" were coming back with is_freelance: false — exactly the opposite of what I wanted.

The prompt rule was technically still there. It just stopped working.

After a few minutes of digging, I realized the rule had a permissive fallback: "if the ad mentions both 'employee or P.IVA', set is_freelance=false (the user can choose employee)". With the old text-parsing pipeline, this fallback rarely triggered, because the model's output had enough noise that it usually committed to one answer.

With tool use, the model follows the schema much more literally. The fallback now triggered too often, because the model treated even a tangential mention of "P.IVA" as an ambiguous case.

The fix was to harden the prompt rule with explicit precedence: "strong triggers in salary_info (€/day, €/hour, €/month + P.IVA) override every fallback. The exception requires ALL three explicit conditions."
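One way to make that precedence enforceable outside the prompt is a deterministic post-check on the model's verdict. This is a hypothetical guard, not code from the repo, and the trigger regex is illustrative:

```python
import re

# Illustrative "strong trigger" pattern: a euro rate with a per-period unit
# followed by P.IVA in the same salary string. Hypothetical, not the
# actual rule set from JobSearch.
STRONG_FREELANCE_TRIGGERS = re.compile(
    r"€\s*[\d.,\-]+\s*/?\s*(?:giorno|ora|mese|day|hour|month)[^€]*p\.?\s*iva",
    re.IGNORECASE,
)


def enforce_freelance_precedence(salary_info: str, is_freelance: bool) -> bool:
    """Override the model's verdict when salary_info contains strong triggers."""
    if STRONG_FREELANCE_TRIGGERS.search(salary_info):
        return True  # strong triggers beat any "employee or P.IVA" fallback
    return is_freelance


print(enforce_freelance_precedence("€2.500-3.000/mese P.IVA", False))  # True
```

A guard like this turns the interpretive rule into something a unit test can pin down, independent of how literally the model reads the prompt.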

The takeaway:

> When you migrate from text parsing to tool use, your interpretive prompt rules need to be tightened, not loosened. The model follows the schema more faithfully, but also more literally — ambiguity in the rule becomes ambiguity in the output.

## The pair-programming bit

A few observations from 6 hours of this:

1. Plan first, execute second. The /ultraplan step at the start was worth more than any single PR. Without it we would have drifted into rabbit holes.
2. Reviews stay with the human. Claude wrote the code, but I read every diff before merging. On 2 PRs I rejected Claude's first proposal because it scope-crept.
3. Wakeups for CI > polling. Instead of asking "is CI green yet?" every 30 seconds, we scheduled a wakeup for the typical 4-minute CI window. Less context churn, faster perceived workflow.
4. Trust but verify. When Claude said "401/401 tests passing," I didn't take it at face value — I ran pytest tests/ -q myself before pushing. Twice it had skipped a slow test that was actually failing. Catching that took 10 seconds; not catching it would have cost a CI cycle.
5. No big-bang refactors. Every PR was small enough to revert in one click. The riskiest one (the tool use migration, −370 LOC) still went out alone, behind its own version flag.
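The "version flag" in point 5 can be as simple as an environment toggle routing between the two code paths. A minimal sketch, assuming a hypothetical USE_TOOL_USE_API flag and stubbed handlers (not the repo's actual names):

```python
import os
from typing import Any


def _analyze_via_tool_use(text: str) -> dict[str, Any]:
    """New path: the SDK hands back a parsed dict (stubbed here)."""
    return {"path": "tool_use", "text": text}


def _analyze_via_text_parsing(text: str) -> dict[str, Any]:
    """Legacy path: parse JSON out of free-form output (stubbed here)."""
    return {"path": "legacy", "text": text}


def analyze(text: str) -> dict[str, Any]:
    """Route on a flag so the risky migration can be reverted in one click."""
    if os.environ.get("USE_TOOL_USE_API", "1") == "1":
        return _analyze_via_tool_use(text)
    return _analyze_via_text_parsing(text)
```

Flipping the flag back is a config change, not a revert commit — which is what makes the migration safe to ship alone.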

## What's in the repo

github.com/MK023/JobSearch — MIT, FastAPI + Anthropic Claude tool use, 401 tests, 9-stage CI, zero-downtime Alembic migrations, weekly cleanup cron via GitHub Actions, persisted user preferences with a JSONB whitelist.

It's built for one user (me, hunting a DevSecOps / Cloud / Python role in Italy 🇮🇹), but the architecture and the patterns are reusable. If anything in here is useful for your own AI-integrated production app, take it.

Happy to discuss in the comments — especially the tool-use migration, which I think is the right default now for any LLM call where you control the schema.
