Originally published at chudi.dev
My first automated bug bounty scan found 47 "critical" vulnerabilities.
I submitted 12 reports. Every single one was a false positive.
The program I targeted now knows my name. Not in a good way.
That specific embarrassment is what made me rebuild everything from scratch. Not a faster scanner. Not a better scanner. A fundamentally different approach to what automation should and shouldn't do in security research.
This guide is the result: a complete system for bug bounty automation that actually works in production.
## What Bug Bounty Automation Actually Is (and Isn't)
Bug bounty automation is not a script that finds vulnerabilities for you.
That framing leads directly to 47 bogus "criticals," a dozen false positive submissions, and a wrecked reputation.
What it actually is: a system that handles the mechanical parts of security research — reconnaissance, asset discovery, initial scanning — while keeping humans in control of the decision that matters most: what to submit.
The best automation makes you a more effective researcher. It doesn't replace your judgment. It amplifies it.
What automation handles well:
- Subdomain enumeration across certificate transparency logs
- Technology fingerprinting at scale
- Running known payload patterns against hundreds of endpoints simultaneously
- Tracking which findings have been validated vs. just detected
- Generating properly formatted reports for each platform's requirements
What automation handles poorly:
- Novel vulnerability classes that don't match existing patterns
- Context-aware exploitation (is this XSS actually exploitable in this specific app context?)
- Deciding whether a finding is worth a researcher's reputation
- Anything that requires reading the room on a specific target
Understanding this division is more important than any technical decision you'll make.
## The Core Architecture: 4 Agents, One Orchestrator
After rebuilding the system twice, the architecture that works is a 4-agent pipeline coordinated by a central orchestrator.
```
Orchestrator (Claude Opus)
├── Recon Agents (parallel)
├── Testing Agents (max 4 concurrent)
├── Validation Agent (single, evidence-gated)
└── Reporter Agent (platform-specific formatters)
```
The orchestrator is a project manager, not a worker. It distributes tasks, manages rate limit budgets, detects agent failures, and persists session state between runs. It never touches an endpoint directly.
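The orchestrator's contract is deliberately small: persist state, hand out work, stay off the wire. A minimal Python sketch of that shape (class, table, and method names are my own for illustration, not from any specific framework):

```python
import json
import sqlite3


class Orchestrator:
    """Coordinates agents; never issues a request itself (illustrative sketch)."""

    def __init__(self, db_path=":memory:"):
        self.db = sqlite3.connect(db_path)
        # Session state survives crashes and restarts between runs
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS session_state (key TEXT PRIMARY KEY, value TEXT)"
        )

    def save_state(self, key, value):
        self.db.execute(
            "INSERT OR REPLACE INTO session_state VALUES (?, ?)", (key, json.dumps(value))
        )
        self.db.commit()

    def load_state(self, key, default=None):
        row = self.db.execute(
            "SELECT value FROM session_state WHERE key = ?", (key,)
        ).fetchone()
        return json.loads(row[0]) if row else default

    def dispatch(self, agents, tasks):
        # Round-robin distribution; the real version also tracks per-agent rate budgets
        assignments = {a: [] for a in agents}
        for i, task in enumerate(tasks):
            assignments[agents[i % len(agents)]].append(task)
        return assignments
```

The important property is that every decision the orchestrator makes is recoverable from the database, so an interrupted run resumes instead of restarting.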
### Recon Agents
Recon runs in parallel across multiple discovery methods:
- Subdomain enumeration via certificate transparency (crt.sh, Censys)
- Technology fingerprinting with httpx to identify frameworks, servers, CDNs
- JavaScript analysis for hidden endpoints, API keys in source, internal route paths
- GraphQL introspection where applicable
All discovered assets feed into a shared SQLite database. Recon agents never block each other — if subdomain enum hits a rate limit, JavaScript analysis keeps running.
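The non-blocking behavior falls out naturally from `asyncio.gather` with exception capture. A simplified sketch, with the discovery methods stubbed out as hypothetical async functions:

```python
import asyncio
import sqlite3


def new_asset_db(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS assets (source TEXT, asset TEXT)")
    return db


async def run_recon(db, methods):
    # Each discovery method runs independently; a rate limit or crash in one
    # never blocks the others (return_exceptions=True captures failures inline)
    results = await asyncio.gather(*(m() for m in methods), return_exceptions=True)
    for method, result in zip(methods, results):
        if isinstance(result, Exception):
            continue  # the real system logs this to the failure_patterns table
        for asset in result:
            db.execute(
                "INSERT INTO assets (source, asset) VALUES (?, ?)",
                (method.__name__, asset),
            )
    db.commit()
    return db.execute("SELECT COUNT(*) FROM assets").fetchone()[0]
```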
### Testing Agents
Testing agents take the recon output and probe for vulnerabilities. I cap these at 4 concurrent to avoid triggering WAFs or rate limits.
What they test:
- IDOR: multi-account replay of authenticated requests
- XSS: payload injection with response diff analysis
- SQL injection: error-based and time-based patterns
- SSRF: metadata service probing, internal network access
- Authentication issues: token fixation, session handling edge cases
Each testing agent handles one vulnerability class. Failure is isolated — if the IDOR agent crashes, XSS testing continues unaffected.
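Both properties, the concurrency cap and the failure isolation, can be expressed with a semaphore and a per-agent exception guard. A minimal sketch:

```python
import asyncio

MAX_CONCURRENT = 4  # above roughly 4 parallel probes, WAFs and rate limiters notice


async def run_testing_agents(agents):
    sem = asyncio.Semaphore(MAX_CONCURRENT)

    async def guarded(agent):
        async with sem:  # never more than MAX_CONCURRENT agents in flight
            try:
                return await agent()
            except Exception as exc:
                # Failure is isolated: one crashed agent never takes down the rest
                return exc

    return await asyncio.gather(*(guarded(a) for a in agents))
```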
### Validation Agent: The Most Important Part
Here's the thing most bug bounty automation gets wrong: detection is not exploitation.
My payload appearing in a response means nothing. It might be in an error log that's never rendered, in an HTML attribute that's properly escaped, on a WAF block page, or in a JSON response that's never interpreted as HTML.
The Validation Agent's only job is to disprove findings.
The evidence gate process:
Every finding gets a confidence score from 0.0 to 1.0, assigned at initial detection (around 0.3 for most). Confidence determines routing, not just advancement:
| Confidence | Action |
|---|---|
| 0.85+ | Immediate human review queue |
| 0.70–0.84 | Same-day batch review |
| 0.40–0.69 | Weekly review |
| Below 0.40 | Discarded, pattern logged |
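The routing logic itself is just threshold comparisons. A sketch of the dispatcher, with queue names invented for illustration:

```python
def route_finding(confidence):
    """Map a confidence score to a review queue, mirroring the thresholds above."""
    if confidence >= 0.85:
        return "immediate_review"
    if confidence >= 0.70:
        return "same_day_batch"
    if confidence >= 0.40:
        return "weekly_review"
    return "discard_and_log"  # the pattern is still recorded for signature matching
```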
To reach 0.85+:
- Baseline capture: Normal request with innocuous input. Record response headers, body length, content type.
- PoC execution: Same request with malicious payload in a sandboxed environment.
- Response diff analysis: Not "does the response contain my payload?" but "does the response differ from baseline in an exploitable way?"
- False positive signature matching: Known-harmless patterns get auto-dismissed.
If PoC succeeds and diff analysis confirms exploitability: confidence rises to 0.85+. Queued for human review.
If PoC fails: confidence drops. Finding goes to weekly batch review, not discarded.
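Putting the gate together: each step either kills the finding outright or nudges its confidence. The response field names and score deltas below are illustrative, not the exact values my system uses:

```python
def evidence_gate(finding, baseline, poc_response, fp_signatures):
    """Adversarial validation sketch: every check tries to kill the finding."""
    confidence = finding.get("confidence", 0.3)

    # Known-harmless patterns are dismissed before any further work
    if any(sig in poc_response["body"] for sig in fp_signatures):
        return 0.0

    # A response identical to baseline means the payload changed nothing
    if (poc_response["body"] == baseline["body"]
            and poc_response["status"] == baseline["status"]):
        return max(0.0, confidence - 0.2)

    # Differs from baseline in the way the PoC predicted: promote toward review
    if poc_response.get("poc_succeeded"):
        return min(1.0, confidence + 0.55)

    return max(0.0, confidence - 0.1)
```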
This is adversarial validation. The agent is trying to kill findings. Findings that survive are credible.
Since implementing this: 0 false positives submitted across 3 months.
The finding lifecycle is a state machine. Findings move through defined states with explicit transitions:
States: new → validating → reviewed → submitted / dismissed

```
new        → validating   (automatic)
validating → validating   (confidence adjustment, up or down)
validating → reviewed     (0.70+ confidence)
reviewed   → submitted    (human approval)
reviewed   → dismissed    (human rejection)
```
Confidence isn't binary. A finding can gain or lose credibility based on evidence at every step.
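The state machine is small enough to encode as a transition table that rejects illegal moves, including the 0.70 floor for reaching review. A sketch:

```python
TRANSITIONS = {
    "new": {"validating"},
    "validating": {"validating", "reviewed"},  # self-loop: confidence adjustment
    "reviewed": {"submitted", "dismissed"},
    "submitted": set(),   # terminal
    "dismissed": set(),   # terminal
}


def transition(finding, new_state):
    current = finding["state"]
    if new_state not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {new_state}")
    if new_state == "reviewed" and finding["confidence"] < 0.70:
        raise ValueError("reviewed requires confidence >= 0.70")
    finding["state"] = new_state
    return finding
```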
### Reporter Agent
Once a finding clears human review and gets approved, the Reporter Agent handles formatting. Every platform has different submission requirements. I built a unified findings model plus platform-specific formatters — write the finding once, output to HackerOne, Intigriti, or Bugcrowd format automatically.
## The Learning Layer: SQLite RAG
The piece I didn't plan but won't remove.
Every time an agent hits a rate limit, gets banned, or has a finding dismissed, it logs that to a SQLite database with semantic embeddings. Before running against a new target, the orchestrator queries this database — "have we seen this stack before? what broke?"
After 3 months of data, the system meaningfully avoids mistakes it's already made. That wasn't in the original design. I added it after watching the system make the same rate-limit mistake on three targets in a row. The fourth target, it slowed down automatically. That was the moment I stopped thinking of this as a script.
Three tables do most of the work:
| Table | Purpose |
|---|---|
| `knowledge_base` | Semantic embeddings of past findings and techniques |
| `false_positive_signatures` | Known patterns that look like vulnerabilities but aren't |
| `failure_patterns` | Recovery strategies for different error types |
The first month is calibration, not hunting. The RAG database starts empty. Every finding is evaluated without prior context, so the false positive rate is higher than steady state. By week 2, the system starts filtering patterns it's already rejected. By week 4, confidence scores mean something specific to your programs and testing patterns. Skip the calibration month and month two is chaos.
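The real lookup runs over sqlite-vec embeddings for semantic similarity; a stripped-down sketch with exact-match recall still shows the shape of the "have we seen this before?" query:

```python
import sqlite3


def new_memory_db(path=":memory:"):
    db = sqlite3.connect(path)
    # Mirrors the failure_patterns table; the real one also stores embeddings
    db.execute(
        "CREATE TABLE IF NOT EXISTS failure_patterns (stack TEXT, error TEXT, recovery TEXT)"
    )
    return db


def remember_failure(db, stack, error, recovery):
    db.execute("INSERT INTO failure_patterns VALUES (?, ?, ?)", (stack, error, recovery))
    db.commit()


def recall(db, stack):
    # Pre-flight check before touching a new target with a familiar stack
    rows = db.execute(
        "SELECT error, recovery FROM failure_patterns WHERE stack = ?", (stack,)
    ).fetchall()
    return [{"error": e, "recovery": r} for e, r in rows]
```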
## The Human-in-the-Loop Gate
Full automation for security research is wrong.
Not in a theoretical sense. Wrong in a "your reputation will be destroyed" sense.
Consider two hypothetical researchers. Researcher A submits 200 reports, 50 accepted (25% rate). Researcher B submits 50, 40 accepted (80% rate). Programs trust Researcher B. They triage faster. They pay higher. The acceptance rate compounds over months.
```
Finding cleared by Validation Agent (confidence 0.85+)
        ↓
Human review queue (checked once per day)
        ↓
[APPROVE]     → Reporter Agent formats + submits
[DISMISS]     → Logged with reason, updates false positive signatures
[INVESTIGATE] → Flagged for manual testing
```
Every submission has been through my eyes before it goes to a program. Non-negotiable.
What the system will never do:
- Submit reports without human approval
- Test targets outside registered bug bounty programs
- Test out-of-scope domains (hard-blocked before execution, not just warned)
- Exaggerate severity for higher bounties
- Auto-resume after a ban without human authorization
After switching to mandatory human review: acceptance rate went above 80%. Programs respond faster because trust is established. Evidence packages prevent disputes.
The slow-down is worth it. 5 high-quality reports per week beats 50 that damage your reputation.
## Validation: Why Detection Isn't Exploitation
The validation layer is what makes or breaks a bug bounty automation system. Most systems skip it. That's why most systems produce garbage.
A scanner finding your payload in a response proves nothing. The payload might appear in an error message that's never rendered. It might appear HTML-escaped in an attribute. It might appear on a WAF block page explaining what was filtered. Every one of those looks like a vulnerability to a pattern matcher. None of them are.
Response diff analysis is the fix. Instead of asking "is my payload in the response?" the validation agent asks "does the response differ from baseline in an exploitable way?"
| Pattern | Why It's a False Positive |
|---|---|
| Payload in error message | Error messages aren't rendered as HTML |
| Payload in JSON response | JSON with correct Content-Type isn't executed |
| `<script>` in HTML | Properly escaped, not XSS |
| 403 response with payload | WAF blocked it, not vulnerable |
| Reflected in `src=""` attribute | Often non-exploitable context |
| SQL syntax error on invalid input | Input validation, not injection |
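Most of those patterns reduce to a few mechanical checks that run before any deeper analysis. A simplified sketch (response field names are hypothetical):

```python
def is_exploitable_reflection(payload, response):
    """Cheap pre-checks mirroring the false positive table above (sketch)."""
    body = response["body"]
    content_type = response.get("content_type", "")
    status = response.get("status", 200)

    if status == 403:
        return False  # WAF block page quoting the payload is not a vulnerability
    if "html" not in content_type:
        return False  # JSON or plain text with correct Content-Type never executes
    if payload not in body:
        return False  # absent, or only an HTML-escaped form was reflected
    return True       # survives the cheap checks; still needs browser validation
```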
For XSS specifically: regex can't tell you if JavaScript executes. Browser validation via Playwright loads the target page, injects a marker that fires if code runs, and checks whether that marker triggers. If alert() fires, XSS is confirmed. If not — regardless of how "vulnerable" the response looks — the finding gets rejected.
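A sketch of that browser check using Playwright's sync API. The marker scheme is my own invention, not a standard; the idea is that the injected code sets a window flag instead of firing a visible alert:

```python
# Requires: pip install playwright && playwright install chromium
def build_marker_payload(finding_id):
    # If this executes, the browser sets a flag we can read back afterwards
    return f'<img src=x onerror="window.__xss_{finding_id}=true">'


def confirm_xss(url, finding_id):
    """Load the page in a real browser and check whether the marker actually fired."""
    from playwright.sync_api import sync_playwright  # lazy import, heavy dependency

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        # True only if the injected JavaScript really executed in page context
        fired = page.evaluate(f"Boolean(window.__xss_{finding_id})")
        browser.close()
    return fired
```

If `confirm_xss` returns False, the finding is rejected no matter how promising the raw response looked.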
The false positive signatures database stores every pattern the system has learned to dismiss. Every rejected finding adds to it. After 3 months, it filters hundreds of known-harmless patterns before they reach the review queue.
Before validation: ~40 findings per scan, 2-3 valid (90%+ false positive rate).
After validation: ~40 detections, 8-12 survive for human review, 5-7 valid (~40% false positive rate at review stage).
Still not perfect. But humans now review 12 findings instead of 40 — and 60% of what they see is real.
## Failure Recovery: The 6 Categories
My testing agent hit a rate limit at 2 AM. It retried immediately. Got rate limited again. Retried. Rate limited. Retried faster. By morning, I was IP-banned from the target's entire infrastructure.
That specific failure taught me that error handling in security automation isn't optional. Generic retry loops make things worse. Every error needs classification first.
| Category | Detection Pattern | Recovery Strategy |
|---|---|---|
| Rate Limit | HTTP 429, "too many requests" | Exponential backoff (2x multiplier, 1hr max) |
| Ban Detected | CAPTCHA, IP block, consecutive 403s | Immediate halt + human alert |
| Auth Error | 401, expired token, invalid session | Credential refresh + retry (3 max) |
| Timeout | No response >30 seconds | Reduce parallelism + extend timeout |
| Scope Violation | Testing out-of-scope domain | Remove from queue + blacklist |
| False Positive | Validation rejection | Log pattern + update signatures |
Exponential backoff for rate limits: 30s, 60s, 120s, 240s, capped at 1 hour. The ceiling matters. HackerOne resets rate limits every 15 minutes — waiting 4 hours wastes time.
Ban detection has highest priority. It checks before rate limit detection. When triggered: all agents stop immediately, human alert fires, session state saves for investigation. Never auto-resume. Human must explicitly authorize continuation.
Escalation threshold: same error category 5+ times within 5 minutes triggers human intervention. First-occurrence rate limits and single timeouts never escalate.
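Classification plus the capped backoff schedule in code. A sketch covering only a subset of the detection patterns from the table:

```python
def classify_error(status, body):
    """Map raw symptoms to recovery categories (simplified subset of the table)."""
    text = body.lower()
    if "captcha" in text or "access denied" in text:
        return "ban_detected"  # checked first: highest priority, halts everything
    if status == 429 or "too many requests" in text:
        return "rate_limit"
    if status == 401:
        return "auth_error"
    return "unknown"


def backoff_schedule(base=30, cap=3600):
    """Exponential backoff: 30s, 60s, 120s, 240s, ... capped at one hour."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= 2
```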
Before categorized recovery: ~30% of scans interrupted by unhandled errors, bans monthly.
After: ~5% need human intervention, zero bans in 6 months, 200+ learned error signatures.
## Multi-Platform Integration
HackerOne needs severity ratings with their specific weakness taxonomy. Intigriti wants different field names and inline severity justification. Bugcrowd has unique bounty table structures. Without a unified model, you end up maintaining three separate report generators for the same findings.
The approach that works: one internal findings model with three platform-specific formatters. Every agent works with the unified model. Platform awareness lives only at two boundaries — ingestion (pulling scope from platforms) and submission (sending reports to platforms). Everything between is platform-agnostic.
```typescript
interface Finding {
  id: string;
  title: string;
  description: string;
  vulnerabilityType: VulnType;
  cvssVector: string;   // Full CVSS v3.1 vector
  cvssScore: number;    // Calculated from vector
  severity: 'critical' | 'high' | 'medium' | 'low' | 'informational';
  poc: { steps: string[]; curl?: string; script?: string; };
  evidence: { screenshots: string[]; requestResponse: string[]; hashes: string[]; };
  confidence: number;
  status: FindingStatus;
}
```
Each platform formatter implements the same interface: format, validate, submit. They transform the unified Finding into what each platform expects. HackerOne maps vulnerability types to their weakness taxonomy IDs. Intigriti uses different field names. Bugcrowd requires bounty table entries mapped from severity.
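The actual formatters are TypeScript; the same boundary sketched in Python, with a hypothetical CWE-style weakness mapping standing in for the real taxonomy IDs:

```python
from abc import ABC, abstractmethod


class PlatformFormatter(ABC):
    """Every platform implements the same three-method boundary."""

    @abstractmethod
    def format(self, finding): ...

    @abstractmethod
    def validate(self, report): ...

    def submit(self, finding):
        report = self.format(finding)
        self.validate(report)   # a malformed report never leaves the system
        return report           # real version POSTs to the platform API here


class HackerOneFormatter(PlatformFormatter):
    # Hypothetical mapping; real IDs come from HackerOne's weakness taxonomy
    WEAKNESS_IDS = {"xss": 79, "idor": 639}

    def format(self, finding):
        return {
            "title": finding["title"],
            "weakness_id": self.WEAKNESS_IDS.get(finding["vulnerabilityType"]),
            "severity_rating": finding["severity"],
        }

    def validate(self, report):
        if report["weakness_id"] is None:
            raise ValueError("unmapped vulnerability type")
```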
The Budget Manager tracks API rate quotas per platform. Before every API call, agents check canRead() or canWrite(). If exhausted, the request queues until quota resets.
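A minimal sketch of that quota gate. The 15-minute window and snake_case names are my choices here; the real `canRead()`/`canWrite()` live in TypeScript:

```python
import time


class BudgetManager:
    """Per-platform API quota tracking (limits and window are illustrative)."""

    def __init__(self, read_limit, write_limit, window=900):
        self.limits = {"read": read_limit, "write": write_limit}
        self.used = {"read": 0, "write": 0}
        self.window = window  # seconds until the quota resets
        self.window_start = time.monotonic()

    def _maybe_reset(self):
        if time.monotonic() - self.window_start >= self.window:
            self.used = {"read": 0, "write": 0}
            self.window_start = time.monotonic()

    def can_read(self):
        self._maybe_reset()
        return self.used["read"] < self.limits["read"]

    def can_write(self):
        self._maybe_reset()
        return self.used["write"] < self.limits["write"]

    def record(self, kind):
        self.used[kind] += 1
```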
A first-mover priority system monitors all three platforms for programs launched in the last 24 hours. New programs get immediate passive recon. Active testing starts after a 2-4 hour delay for scope to stabilize. Early submissions on new programs have higher acceptance rates — less competition, more unreported surface area.
## Tools and Stack
- Orchestration: Claude Opus (orchestrator), Claude Haiku (testing agents)
- Recon: httpx, subfinder, amass, crt.sh API
- Testing: Custom Python agents per vulnerability class, Playwright for JS analysis
- Validation: Docker sandboxed execution, custom response diff library
- Storage: SQLite with sqlite-vec for semantic search
- Platform integration: HackerOne API, Intigriti API, Bugcrowd API
- Infrastructure: VPS ($40/mo) — not serverless, you need persistent state. See my Python agent deployment guide for setup
- Total monthly cost: ~$180 ($40 VPS + ~$140 Claude API)
## What I'd Do Differently
Start with the Validation Agent, not the scanner. The scanner is interesting. The validation layer is what actually matters. Build it first.
Cap concurrent agents at 4 from day one. Started with 10. Got IP-banned from 3 programs in two weeks.
Build the human review queue before anything else. The moment you can submit without a gate is the moment you will. Build the gate first.
Accept that it won't make you rich quickly. This system makes you roughly 3.5x more effective. That's the actual value proposition.
## Current Results (3 Months In)
- 12 active programs being monitored
- ~30 findings surfaced for human review per week
- ~4-6 submitted after review
- 0 false positives submitted
- ~$180/month running cost
- ~3.5x throughput increase vs. manual research
Building something similar? The hardest part is the validation layer. Start there — everything else is just plumbing.
The multi-agent patterns behind this system are in the Battle-Tested Builder Kit — CLAUDE.md templates, agent routing rules, and verification gates you can drop into your own projects.