
DEV Community

syamaner


Part 1: The Architecture & The Agent - Spec-Driven ML Development With Warp/Oz

Last year I built a prototype coffee first-crack detector and wrote about it in a three-part series. The prototype works; I have been running it on my own roasts since November. But it carries the technical debt of something built to prove a concept rather than to last.

This series is the production rebuild. I built a system that detects coffee first crack in real-time using an AI model running on a Raspberry Pi, and I didn't write most of the code. The first stable rebuild snapshot reached 97.4% accuracy with zero false positives on its original test split. The current public baseline, dataset size, and Raspberry Pi 5 benchmarks now live in the model and dataset cards. The full pipeline, covering data preparation, training, evaluation, ONNX INT8 export, edge validation, and a Gradio UI, shipped in two evenings.

I didn't build this by hand-coding every part of the codebase myself. I acted as the engineering lead and ML reviewer, while Warp and its AI agent, Oz, handled much of the implementation from inside my terminal. My responsibilities were primarily architectural and scientific:

  • Designing the workflow: Setting the strict rules of engagement between the agent and the codebase.
  • Defining the science: Dictating the specs, testing strategy, evaluation metrics, and dataset annotation approach.
  • Directing the execution: Guiding the agent through the implementation and reviewing the output.

Operating this way over the weekend, I tracked the rebuild as an 18-story (at the time) epic across 10 pull requests. Warp/Oz handled much of the execution: 46 commits were explicitly co-authored by the agent, and Copilot reviewed the PRs across 26 review batches. The model is published on Hugging Face, the dataset is open-sourced, and the source is on GitHub.

This post is about the system that made that possible, not the model itself. The ML science comes in Posts 2 and 3. Here, I want to show the exact architecture I used to direct an AI agent through a complex, multi-phase ML project without losing control of the engineering decisions that matter.

Before the agent could train anything, I had to build the training data from scratch. I could not find a suitable public audio dataset for coffee-roasting first-crack detection, so I recorded roasting sessions, annotated them in Label Studio, and built a recording-level pipeline to avoid leakage between train, validation, and test recordings. The full data engineering story is in Post 2.
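The recording-level discipline can be sketched in a few lines (a simplified, hypothetical helper with an assumed `recording_id` schema; the production pipeline is covered in Post 2):

```python
import random

def recording_level_split(chunks, seed=42, val_frac=0.15, test_frac=0.15):
    """Split chunk records into train/val/test by recording, not by chunk.

    `chunks` is a list of dicts with a "recording_id" key (hypothetical
    schema). Every chunk from a given recording lands in exactly one split,
    so audio from a test recording can never leak into training.
    """
    recordings = sorted({c["recording_id"] for c in chunks})
    rng = random.Random(seed)  # seeded so the split is reproducible
    rng.shuffle(recordings)

    n = len(recordings)
    n_test = max(1, int(n * test_frac))
    n_val = max(1, int(n * val_frac))
    test_ids = set(recordings[:n_test])
    val_ids = set(recordings[n_test:n_test + n_val])

    splits = {"train": [], "val": [], "test": []}
    for c in chunks:
        rid = c["recording_id"]
        split = "test" if rid in test_ids else "val" if rid in val_ids else "train"
        splits[split].append(c)
    return splits
```

The key property is that the unit of shuffling is the recording, so near-identical neighbouring chunks from one roast cannot straddle the train/test boundary.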

From Prototype to Production

The prototype had accumulated real technical debt. The code was monolithic, the model had no reusable packaging, the MCP server architecture had flaws I had been working around, and nothing ran on edge hardware. I had to use my laptop for every roast.

This series covers the production rebuild. Same domain, completely new architecture:

  • A standalone, Hugging Face-native training repository.
  • Strict data engineering to prevent audio leakage.
  • ONNX INT8 quantization for Raspberry Pi 5 edge deployment.
  • A live Gradio Space for public inference.

The Director/Coder Dynamic

The core pattern was a strict, enforced separation of concerns between three actors:

I (the human) owned:

  • The Architecture: Defining repository structure, module boundaries, and enforcing Hugging Face's save_pretrained/from_pretrained as the standard packaging contract.
  • The ML Science: Model selection (AST over CNN), data split strategy (recording-level to prevent leakage), class weighting, and hyperparameter choices.
  • The Workflow Constraints: Defining the project rules, writing the parameterised skills, and managing the state of the epic.
  • The Quality Gates: Reviewing every PR, interpreting the evaluation metrics, and deciding when to retrain versus when to ship.

Oz (Warp's terminal-native agent) owned:

  • Terminal Execution: Running training loops, evaluations, ONNX exports, and SSH sessions directly on the Raspberry Pi.
  • Code Generation: Writing the boilerplate WeightedLossTrainer subclasses, CLI argument parsers, pytest scaffolds, and audio data loaders.
  • Playbook Invocation: Executing repo-local SKILL.md playbooks (for example under .claude/skills/) that encode exact command sequences and validation checks.
  • State Management: Reading the epic document, updating context, and checking off stories after completing a phase.

GitHub Copilot owned:

  • Async Code Review: Flagging type safety issues, API misuse, missing error handling, and dependency hygiene across all 10 PRs.
  • The Reality Check: Copilot never once caught a machine learning logic error. Every data leakage fix, hyperparameter correction, and precision/recall tradeoff decision came from Oz and me. Copilot acts as an aggressive linter for code, not a reviewer for ML science.

This three-way split was not just a convention; it was encoded into the project via an AGENTS.md file. In Warp, project rules from AGENTS.md are applied automatically when the agent is working inside the repository, so Oz started each task inside that rulebook.

The Agentic Setup: AGENTS.md, Epics, and Skills

Three files controlled the entire project.

1. AGENTS.md: The Rulebook

This file sits at the repository root. In Warp, project rules in AGENTS.md are applied automatically when the agent is working in the repository. It contains the project rules, quick commands, codebase architecture, and platform-specific constraints. Here is the exact rules section from this project:

## Rules

- Python 3.11+ with full type hints on all public functions and methods
- Google-style docstrings
- `ruff check` and `ruff format` must pass before marking code complete
- `pyright` must pass with no errors on new code
- All dependencies declared in `pyproject.toml` - never install ad-hoc
- Large files (WAV, checkpoints, ONNX models) go to Hugging Face Hub - never commit to git
- `data/`, `experiments/`, and `exports/` are `.gitignore`'d - keep them that way
- Seed all RNG using `configs/default.yaml` seed value
- One PR per story, branch: `feature/{issue-number}-{slug}`
- Before starting a task: read `docs/state/registry.md` → open epic file → check GitHub issue

That last line is the critical one. In practice, it creates a state-reading loop before code changes begin. On a long-running epic, that was the difference between working from current project state and working from stale context.

The file also includes a codebase architecture map, quick commands for every operation (training, evaluation, export, benchmarking), and platform-specific notes for MPS, CUDA, and the RPi5. The full file is on GitHub.

2. Epic State Management: The Checklist

A registry file (docs/state/registry.md) points to the active epic. The epic file itself (docs/state/epics/coffee-first-crack-detection.md) contains 18 stories grouped into 6 phases, each linked to a GitHub issue. Before and after every task, the agent reads the epic state and updates it according to this protocol:

Before starting any task:
1. Read docs/state/registry.md to find the active epic
2. Open the epic file - check story status
3. Open the GitHub story issue - read comments for latest requirements
4. Work on a branch: feature/{issue-number}-{slug}

After completing a story:
1. Check off the story in the epic doc
2. Update Active Context section with what was built
3. Comment on the GitHub story issue, then close it
4. Tick the checkbox in GitHub epic issue #1
5. Open a PR referencing the story issue

This is how 18 stories were delivered without losing track of what was done, what was next, or what had changed. More precisely, the project kept an externalised state in files and issue links that the agent could read and update, rather than relying on short-lived conversational context alone.
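A tiny sketch of what "externalised state the agent can read" means in practice, assuming GitHub-style task lists in the epic file (the real epic format may differ):

```python
import re

def unchecked_stories(epic_markdown: str) -> list[str]:
    """Return story titles whose checkbox is still open.

    Assumes GitHub-style task lists (`- [ ]` / `- [x]`) in the epic file.
    """
    return [
        m.group(1).strip()
        for m in re.finditer(r"^- \[ \] (.+)$", epic_markdown, re.MULTILINE)
    ]

epic = """\
## Phase 1
- [x] S1: Scaffold repository
- [x] S2: Data loader
## Phase 2
- [ ] S5: Training pipeline
- [ ] S6: Evaluation harness
"""
# unchecked_stories(epic) → ["S5: Training pipeline", "S6: Evaluation harness"]
```

Because the state lives in a file, any fresh agent session can recover "what is done, what is next" without relying on conversation history.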

Here is Oz running the earlier phase-6 data preparation pipeline, chunking 973 audio segments, performing the recording-level split, and then invoking the /train-model skill:

Oz Train model skill invocation

3. Parameterised Skills: The Playbooks

These repo-local SKILL.md playbooks live under .claude/skills/ and encode exact command sequences for common operations. Each playbook defines the prerequisites, the commands, and the validation steps. I wrote four:

  • train-model/SKILL.md: End-to-end training with data validation and checkpoint saving.
  • evaluate-model/SKILL.md: Test-set evaluation with metrics report generation.
  • export-onnx/SKILL.md: ONNX export (FP32 + INT8) with size and latency benchmarking.
  • push-to-hub/SKILL.md: Publish model and dataset to the Hugging Face Hub.

When I told Oz to "train the model," it didn't improvise. It read the skill file and followed the exact sequence I defined. This eliminated an entire class of errors where the agent guesses at flags, skips validation steps, or forgets to save the feature extractor configuration alongside the model weights.
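For illustration, here is a stripped-down playbook in the same spirit. The content and commands are hypothetical; the project's actual skill files live in the repo under .claude/skills/:

```markdown
# Skill: train-model

## Prerequisites
- Environment synced and `data/processed/` exists
- `configs/default.yaml` seed and paths reviewed

## Steps
1. Validate the dataset before touching the GPU
2. Run training with the checked-in config (never ad-hoc flags)
3. Confirm a checkpoint and the feature-extractor config were saved together

## Validation
- Training loss decreased across epochs
- `save_pretrained` output loads cleanly via `from_pretrained`
```

The point is that prerequisites and validation checks travel with the commands, so the agent cannot "succeed" by running the happy path alone.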

Here is Oz chaining the /export-onnx and /push-to-hub skills to export the model and publish everything to Hugging Face Hub in a single sequence:

Oz Export ONNX and Push to HF Hub Skill

A Generalised AGENTS.md Template

Here is a stripped-down version you can drop into any project. Replace the placeholders with your domain-specific rules.

This file is not documentation for humans. It is a system prompt for your codebase. Every rule you omit is a decision the agent will make on its own, and it will make it differently every time.

# AGENTS.md - [Project Name]

Project rules and context for AI coding agents.

## Rules
- [Language] [version]+ with [typing/linting requirements]
- [Formatter] and [linter] must pass before marking code complete
- All dependencies declared in [manifest file] - never install ad-hoc
- Large files go to [remote storage] - never commit to git
- Before starting a task: read `docs/state/registry.md` → open epic → check issue

## Quick Commands
### Setup
[environment setup commands]

### Build / Test / Deploy
[the exact commands for each operation]

## Codebase Architecture
[directory tree with one-line descriptions per module]

## Epic State Management
Before starting any task:
1. Read docs/state/registry.md
2. Check story status in the epic file
3. Read the GitHub issue for latest requirements
4. Branch: feature/{issue-number}-{slug}

After completing a story:
1. Check off the story in the epic doc
2. Update Active Context
3. Close the GitHub issue
4. Open a PR

The Build & The Fails

The first commit after the initial scaffold was feat(S5/S6/S8): implement train.py, evaluate.py, inference.py. In a single pass, Oz generated the training pipeline, evaluation harness, and sliding-window inference module. It followed the AGENTS.md rules, used the correct base model (MIT/ast-finetuned-audioset-10-10-0.4593), and wired up the WeightedLossTrainer subclass with class-weighted CrossEntropyLoss exactly as I specified.
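The class-weighting idea itself is small enough to show in plain Python. This is an illustrative helper using the common inverse-frequency formula; the project feeds weights like these into the class-weighted CrossEntropyLoss inside the WeightedLossTrainer:

```python
from collections import Counter

def inverse_frequency_weights(labels: list[int], num_classes: int) -> list[float]:
    """Per-class weights proportional to 1 / class frequency.

    A rare positive class ("first crack") gets a larger weight, so the
    loss penalises missed cracks more heavily than the abundant
    background class.
    """
    counts = Counter(labels)
    total = len(labels)
    # weight_c = total / (num_classes * count_c), the common sklearn-style formula
    return [total / (num_classes * counts[c]) for c in range(num_classes)]

# 90 background chunks vs 10 first-crack chunks:
weights = inverse_frequency_weights([0] * 90 + [1] * 10, num_classes=2)
# → weights ≈ [0.556, 5.0]: the minority class is weighted ~9x higher
```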

Then training failed.

The input_features vs input_values Bug

Oz wrote the dataset adapter to return input_features as the tensor key, a plausible mistake if you have seen other Hugging Face audio pipelines. But for AST the expected key is input_values, not input_features, so the model input contract was wrong until I corrected it.

Here is the exact diff from the fix commit (75bbb4b):

# src/coffee_first_crack/train.py - _HFDatasetAdapter.__getitem__
-            "input_features": inputs["input_features"].squeeze(0),
+            "input_values": inputs["input_values"].squeeze(0),

It was a one-line bug, but a consequential one. Oz was pattern-matching from a common Hugging Face audio convention where input_features is the right key; for AST it is input_values. More broadly, the Hugging Face audio stack still exposes inconsistent input naming across model families, so model-specific checks matter even when the surrounding pipeline looks standard.
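One defensive pattern that would have caught this: Hugging Face feature extractors expose a `model_input_names` attribute (for example `["input_values"]` for AST, `["input_features"]` for Whisper), so an adapter can check the expected key instead of hard-coding one. A sketch, not the project's actual adapter:

```python
def normalise_input_key(features: dict, model_input_names: list[str]) -> dict:
    """Rename the audio tensor to whatever key the model actually expects.

    `features` stands in for the dict returned by a feature extractor;
    `model_input_names` comes from the extractor itself, so the adapter
    follows the model family instead of guessing.
    """
    expected = model_input_names[0]
    if expected in features:
        return features
    # Fall back to the two conventions used across HF audio models
    for candidate in ("input_values", "input_features"):
        if candidate in features:
            features[expected] = features.pop(candidate)
            return features
    raise KeyError(f"no audio tensor key found for expected '{expected}'")
```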

The same commit also added accelerate>=0.26.0 to pyproject.toml, which exposed a missing runtime dependency in the training stack. That fits the current Hugging Face training stack as well, since Trainer is powered by Accelerate under the hood. Oz didn't catch it during code generation because it never triggered an ImportError until actual training.

The earlier 15-roast baseline was evaluated on a Raspberry Pi 5 (191 test samples, INT8 quantised, 4 threads, via SSH from Warp).
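For context, the deployed inference module classifies fixed 10-second windows sliding over the roast audio; the windowing itself is simple (the hop length here is illustrative):

```python
def sliding_windows(samples: list[float], sample_rate: int,
                    window_s: float = 10.0, hop_s: float = 5.0):
    """Yield (start_time_s, window) pairs over a roast recording.

    Each window is `window_s` seconds of audio; `hop_s` controls how
    often the model is invoked on the Pi. A recording shorter than one
    window yields a single short window.
    """
    win = int(window_s * sample_rate)
    hop = int(hop_s * sample_rate)
    for start in range(0, max(len(samples) - win, 0) + 1, hop):
        yield start / sample_rate, samples[start:start + win]
```

With the published per-window latency, the hop length is the knob that trades detection responsiveness against CPU load on the device.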

This is what the validation loop looks like in practice. Oz hits a pyright failure, diagnoses the type issues, fixes them, then runs the full `ruff check` → `ruff format` → `pyright` → `pytest` chain until all checks pass:

Static code checking using Pyright and Ruff

Copilot as the Third Actor

Across the 10 PRs in this project, Copilot submitted 26 review batches containing 110 individual comments. Here is how they broke down by PR:

  • PR #23 (RPi5 ONNX validation): 36 comments across 6 review rounds, the most reviewed PR by far.
  • PR #17 (Export, scripts, tests): 26 comments across 5 rounds.
  • PR #27 (Data prep + mic-2 expansion): 16 comments across 3 rounds.
  • PR #16 (Train, eval, inference): 10 comments.
  • PR #28 (Gradio Space): 9 comments across 3 rounds.

The pattern was consistent. Copilot caught:

  • Type safety: Missing type hints, incorrect return types, untyped function signatures.
  • Unused imports: Dead code left behind after refactoring.
  • API misuse: Deprecated parameters, missing synchronisation calls, incorrect exception handling.
  • Dependency hygiene: Missing explicit dependencies, version pinning issues.
  • Docs and copy: Misleading docstrings, inaccurate UI text in the Gradio Space.

However, Copilot did not catch the core machine learning logic issues. To be fair, this is largely because my workflow required me to intercept them before they ever reached a PR:

  • The input_features vs input_values key mismatch: This was fixed locally during the active dev loop before opening the PR.
  • Data leakage from chunk-level splitting: This is the biggest ML risk in the project, but it was addressed architecturally during the setup phase.
  • Hyperparameter choices: Overfitting issues were identified and corrected interactively by reading the local training logs.
  • The precision/recall tradeoff: The class weighting strategy was a deliberate human decision delivered prior to code review.

This is not a criticism of Copilot. It is doing exactly what it should: catching code-level defects at review time. But if you are relying on AI code review to validate your ML pipeline logic, you will ship broken models with clean code.
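For readers newer to the tradeoff: once you have the confusion counts, precision and recall are mechanical, and "zero false positives" is exactly what produces 100% precision (the counts below are illustrative, not the project's):

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision: of all predicted cracks, how many were real.
    Recall: of all real cracks, how many were caught."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Zero false positives means perfect precision even with one missed crack:
p, r = precision_recall(tp=37, fp=0, fn=1)
# → p == 1.0, r ≈ 0.974
```

Class weighting shifts where a model lands on this curve, which is why I treated it as a human decision rather than something to delegate to review.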

By the Numbers

Editorial note: the table below reflects the initial rebuild snapshot described in this post. The current public baseline, dataset counts, and Raspberry Pi benchmarks are maintained in the Hugging Face model and dataset cards.

| Metric | Result |
| --- | --- |
| Wall-clock time | Two evenings (Fri → Sat) |
| Stories completed | 18 across 6 phases |
| Pull requests | 10 merged |
| Total commits | 61 (52 non-merge) |
| Oz co-authored | 46 commits |
| Lines of code | 11,087 insertions across 75 files |
| Copilot reviews | 26 batches, 110 individual comments |
| Model accuracy | 97.4% test / 100% precision on the initial 15-roast rebuild snapshot |
| Edge latency | 2.09 s per 10 s window on the initial 15-roast rebuild snapshot (RPi5, INT8, 4 threads) |
| Dataset | Initial release: 973 chunks / 15 roasts (see the dataset card for the current public count) |

The live model, dataset, and source are all public. I now treat the model card and dataset card as the canonical source for the latest metrics, dataset counts, and Raspberry Pi benchmark details.

Next up: Post 2, The Data, covers how I built the dataset for coffee-roasting first-crack detection, how I handled annotation and recording-level splits, and how those decisions shaped the precision/recall tradeoff.
