DEV Community

Joseph Thomas

How I Test My MCP Agent Without Burning Tokens

Last month I shipped an MCP agent that triages GitHub issues. It worked great, right up until it silently broke and nobody noticed.

Here are the last three bugs I hit:

  1. I tweaked the system prompt. The agent stopped calling create_issue and just summarised the bug report in plain text. CI didn't catch it — CI tests the code, not the agent behavior.

  2. I swapped Sonnet for Haiku to save cost. The agent started calling list_issues four times before each create_issue. Integration tests still passed. Token bill tripled.

  3. GitHub rate-limited me mid-test. The entire pytest suite went red. I rolled back a perfectly good change because I couldn't tell flake from regression.

Every one of those would have been caught by a tool that tests the agent's trajectory — which tools it picks, in what order, with what arguments — against a fast, hermetic mock.

That tool is mcptest. Here's how I use it.


The Scenario

I have an agent that reads a bug report and decides whether to:

  • Open a new issue if the bug is novel
  • Comment on an existing issue if a similar one is already filed
  • Do nothing if the report is spam or unclear

The agent is about thirty lines of Python wrapping an LLM call and the MCP client:

# agent.py
import asyncio, json, os, sys
from mcp import ClientSession
from mcp.client.stdio import StdioServerParameters, stdio_client

SYSTEM_PROMPT = """You are an issue triage assistant for acme/api.
Before filing a new issue, ALWAYS check list_issues for duplicates.
If a duplicate exists, call add_comment instead."""

async def main():
    fx = json.loads(os.environ["MCPTEST_FIXTURES"])[0]
    params = StdioServerParameters(
        command=sys.executable,
        args=["-m", "mcptest.mock_server", fx],
        env=os.environ.copy(),
    )
    user = sys.stdin.read()
    async with stdio_client(params) as (r, w):
        async with ClientSession(r, w) as session:
            await session.initialize()
            tools = await session.list_tools()
            # Your real agent calls an LLM here with SYSTEM_PROMPT + tools
            # and executes whatever tool_calls it returns
            plan = your_llm_plan(SYSTEM_PROMPT, user, tools)
            for call in plan:
                await session.call_tool(call.name, arguments=call.args)

if __name__ == "__main__":
    asyncio.run(main())

That's it. The rest of this post is about testing what this agent does — which tools it picks, in what order — without ever calling a real LLM.
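The post deliberately leaves `your_llm_plan` abstract, since in the real agent it wraps an LLM call. For local smoke-testing you can stand in a deterministic rule-based planner. This sketch is entirely my own assumption — the `ToolCall` shape and the triage rules are illustrative, and a real agent would branch on the `list_issues` result rather than pre-commit to a plan:

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)

def your_llm_plan(system_prompt: str, user: str, tools=None) -> list:
    """Toy deterministic planner standing in for the real LLM call.

    Rules are illustrative only; a real agent decides create_issue vs.
    add_comment based on what list_issues actually returns.
    """
    if "spam" in user.lower():
        return []  # do nothing for obvious spam
    repo = "acme/api"
    return [
        # Honor the system prompt: always look for duplicates first.
        ToolCall("list_issues", {"repo": repo, "query": user[:40]}),
        ToolCall("create_issue", {"repo": repo, "title": user[:80], "body": user}),
    ]
```

A stub like this is enough to exercise the full MCP round-trip against the mock server while you build out the fixtures and tests below.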

I want to verify four things:

  1. For a clear bug report, it calls create_issue exactly once.
  2. It checks list_issues before create_issue (duplicate check).
  3. When the server rate-limits it, it recovers gracefully.
  4. It never calls delete_issue. Ever.

Step 1 — Mock the GitHub MCP Server in YAML

No code required. Declare the tools, canned responses, and error scenarios:

# fixtures/github.yaml
server:
  name: mock-github
  version: "1.0"

tools:
  - name: list_issues
    input_schema:
      type: object
      properties:
        repo: { type: string }
        query: { type: string }
      required: [repo]
    responses:
      - match: { query: "login 500" }
        return:
          issues: []                   # No duplicate → should open a new one
      - match: { query: "dark mode" }
        return:
          issues: [{ number: 12, title: "Add dark mode" }]
      - default: true
        return: { issues: [] }

  - name: create_issue
    input_schema:
      type: object
      properties:
        repo: { type: string }
        title: { type: string }
        body: { type: string }
      required: [repo, title]
    responses:
      - match: { repo: "acme/api" }
        return: { number: 42, url: "https://github.com/acme/api/issues/42" }
      - default: true
        error: rate_limited            # Simulate real-world flake

  - name: add_comment
    input_schema:
      type: object
      properties:
        repo: { type: string }
        number: { type: integer }
        body: { type: string }
      required: [repo, number, body]
    responses:
      - default: true
        return: { ok: true }

  - name: delete_issue
    responses:
      - default: true
        return: { deleted: true }

errors:
  - name: rate_limited
    error_code: -32000
    message: "GitHub API rate limit exceeded"

That's a real MCP server. It speaks MCP over stdio, just like the real GitHub server. Your agent connects to it the same way.
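The `responses` list reads top-down: the first `match` block whose keys all equal the incoming call arguments wins, and `default: true` catches everything else. I haven't read mcptest's source, but the dispatch logic is roughly this:

```python
def pick_response(responses: list, args: dict) -> dict:
    """Return the first response entry whose `match` keys all equal
    the call arguments; fall back to the `default: true` entry."""
    for entry in responses:
        match = entry.get("match")
        if match is not None and all(args.get(k) == v for k, v in match.items()):
            return entry
    for entry in responses:
        if entry.get("default"):
            return entry
    raise KeyError("no response matched and no default declared")
```

Note that `match` only needs to constrain the arguments you care about — the `create_issue` fixture above matches on `repo` alone and ignores `title` and `body`.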


Step 2 — Write the Tests

# tests/test_triage.yaml
name: triage agent
fixtures:
  - ../fixtures/github.yaml
agent:
  command: python agent.py
cases:

  - name: opens a new issue for a novel bug
    input: "login page returns 500 on Safari"
    assertions:
      - tool_called: create_issue
      - tool_call_count: { tool: create_issue, count: 1 }
      - param_matches:
          tool: create_issue
          param: repo
          value: "acme/api"
      - no_errors: true

  - name: checks duplicates before creating
    input: "login page returns 500 on Safari"
    assertions:
      - tool_order:
          - list_issues
          - create_issue

  - name: comments on an existing duplicate instead of creating
    input: "add dark mode support"
    assertions:
      - tool_called: add_comment
      - tool_not_called: create_issue

  - name: never deletes anything
    input: "spam: buy crypto now!!!"
    assertions:
      - tool_not_called: delete_issue

  - name: recovers from rate-limit gracefully
    input: "file bug for org: wrongorg/wrongrepo"
    assertions:
      - tool_called: create_issue
      - error_handled: "rate limit"

Five test cases. Every one maps directly to a bug I've actually hit in production.
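The assertion I lean on most is `tool_order`. As I read it, it checks order, not adjacency: the expected tools must appear as a subsequence of the trajectory, with other calls allowed in between (so a retry of `list_issues` wouldn't fail the duplicate-check case). My paraphrase of that semantics in plain Python:

```python
def tool_order_holds(trajectory: list, expected: list) -> bool:
    """True if `expected` tool names appear in order within `trajectory`,
    with arbitrary other calls allowed in between (subsequence check)."""
    it = iter(trajectory)
    # `name in it` advances the iterator, so each expected tool must be
    # found strictly after the previous one.
    return all(name in it for name in expected)
```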


Step 3 — Run It

$ mcptest run

                        mcptest results
┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃ Suite          ┃ Case                               ┃ Status ┃
┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ triage agent   │ opens a new issue for a novel bug  │  PASS  │
│ triage agent   │ checks duplicates before creating  │  PASS  │
│ triage agent   │ comments on existing duplicate     │  PASS  │
│ triage agent   │ never deletes anything             │  PASS  │
│ triage agent   │ recovers from rate-limit gracefully│  PASS  │
└────────────────┴────────────────────────────────────┴────────┘

5 passed, 0 failed (5 total)
⏱  1.6s

1.6 seconds. Zero tokens. No GitHub API calls. No rate-limit flake.


Step 4 — The Regression-Diff Trick

This is where the tool earns its keep.

First, snapshot the current (known-good) agent trajectories as baselines:

$ mcptest snapshot
✓ saved baseline for triage agent::opens a new issue... (2 tool calls)
✓ saved baseline for triage agent::checks duplicates... (2 tool calls)
✓ saved baseline for triage agent::comments on duplicate (2 tool calls)
...

Now go tweak the system prompt. Something innocuous — change this:

"You are a GitHub issue triage assistant. Check for duplicates before filing."

To this:

"You are a helpful assistant that handles GitHub issue reports."

No error in the code. All unit tests still pass. The linter is happy.

$ mcptest diff --ci

✗ triage agent::checks duplicates before creating
  tool_order REGRESSION:
    baseline: list_issues → create_issue
    current:  create_issue          ← list_issues was dropped!

✗ triage agent::comments on existing duplicate
  tool_called REGRESSION:
    baseline: add_comment was called
    current:  add_comment was never called

2 regression(s) across 5 case(s)
Exit: 1

Exit code 1 — CI blocks the merge. The agent silently lost its duplicate-check behavior because of a one-sentence prompt change.

This is the bug that cost me a weekend. I never want to hit it again.
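Conceptually, the diff needs nothing fancier than comparing each case's recorded tool-call sequence against its saved baseline. A minimal sketch of that comparison (my reconstruction, not mcptest's actual output format):

```python
def diff_trajectories(baseline: list, current: list):
    """Return a human-readable regression line, or None if the
    trajectories match exactly."""
    if baseline == current:
        return None
    dropped = [t for t in baseline if t not in current]
    note = f"  ← {', '.join(dropped)} dropped" if dropped else ""
    return (f"baseline: {' → '.join(baseline)}\n"
            f"current:  {' → '.join(current) or '(no tool calls)'}{note}")
```

The point is that the baseline captures behavior, not code, so a prompt or model swap shows up here even when every unit test stays green.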


Step 5 — Wire It into CI

# .github/workflows/agent-tests.yml
name: Agent tests
on: [pull_request]

jobs:
  mcptest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }

      - run: pip install mcp-agent-test

      - name: Run tests
        run: mcptest run

      - name: Diff against baselines
        run: mcptest diff --ci

      - name: Post PR summary
        if: always()
        run: mcptest github-comment
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

Now every PR that changes the prompt, the model, or the agent code gets its trajectories diffed against main. If behavior shifts, a comment lands on the PR with the exact tool-order delta. Reviewers see agent behavior changed as clearly as they see code changed.


Why MCP-Specific Matters

Eval tooling is consolidating fast — independent evaluation startups keep getting folded into model-provider platforms. That's useful if you're all-in on one vendor. It's a lock-in risk if you're not.

mcptest is independent, MIT-licensed, and specifically shaped for MCP agents. The tool_called / tool_order / error_handled primitives exist because that's what an MCP trajectory actually looks like — not because someone ported a generic LLM-eval DSL.

MCP agent testing is particularly underserved. DeepEval is great for prompt evaluation. Inspect AI is great for benchmarks. Neither gives you "run your agent against a mock GitHub server and assert it didn't call delete_issue."

mcptest does.


Try It

pip install mcp-agent-test

# Scaffold a new project
mcptest init

# Or clone the quickstart
git clone https://github.com/josephgec/mcptest
cd mcptest/examples/quickstart
mcptest run

# Or install a pre-built fixture pack
mcptest install-pack github ./my-project
mcptest install-pack slack ./my-project
mcptest install-pack filesystem ./my-project

Six packs ship out of the box: GitHub, Slack, filesystem, database, HTTP, and git. Each one is a realistic mock with error scenarios baked in and tests that actually assert something.
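A pack's fixture should slot into a test suite like any hand-written one. The path below is an assumption on my part — check what `install-pack` actually writes into your project before copying this:

```yaml
# tests/test_with_pack.yaml
name: triage agent (pack fixture)
fixtures:
  - ../fixtures/github.yaml   # path assumed; use whatever install-pack created
agent:
  command: python agent.py
cases:
  - name: smoke test against the pack's mock
    input: "login page returns 500 on Safari"
    assertions:
      - tool_called: create_issue
```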

📦 Source: github.com/josephgec/mcptest
📦 PyPI: pypi.org/project/mcp-agent-test


If you're building an MCP agent and haven't started writing tests yet, you're on track to hit the same three bugs I did. Start with one fixture and one test case. Catch the first prompt-change regression. Then you'll understand why MCP agents need this.

If this was useful, a ⭐ on the repo helps others find it.
