Originally published at chudi.dev
My first automated bug bounty scan found 47 "critical" vulnerabilities.
I submitted 12 reports. Every single one was a false positive.
The program I targeted now knows my name. Not in a good way.
That specific embarrassment is what made me rebuild everything from scratch. Not a faster scanner. Not a better scanner. A fundamentally different approach to what automation should and shouldn't do in security research.
This guide is the result: a complete system for bug bounty automation that actually works in production.
## What Bug Bounty Automation Actually Is (and Isn't)
Bug bounty automation is not a script that finds vulnerabilities for you.
That framing leads directly to 47 bogus "criticals," a dozen false positive submissions, and a wrecked reputation.
What it actually is: a system that handles the mechanical parts of security research — reconnaissance, asset discovery, initial scanning — while keeping humans in control of the decision that matters most: what to submit.
The best automation makes you a more effective researcher. It doesn't replace your judgment. It amplifies it.
What automation handles well:
- Subdomain enumeration across certificate transparency logs
- Technology fingerprinting at scale
- Running known payload patterns against hundreds of endpoints simultaneously
- Tracking which findings have been validated vs. just detected
- Generating properly formatted reports for each platform's requirements
What automation handles poorly:
- Novel vulnerability classes that don't match existing patterns
- Context-aware exploitation (is this XSS actually exploitable in this specific app context?)
- Deciding whether a finding is worth a researcher's reputation
- Anything that requires reading the room on a specific target
Understanding this division is more important than any technical decision you'll make.
## The Core Architecture: 4 Agents, One Orchestrator
After rebuilding the system twice, the architecture that works is a 4-agent pipeline coordinated by a central orchestrator.
```
Orchestrator (Claude Opus)
├── Recon Agents (parallel)
├── Testing Agents (max 4 concurrent)
├── Validation Agent (single, evidence-gated)
└── Reporter Agent (platform-specific formatters)
```
The orchestrator is a project manager, not a worker. It distributes tasks, manages rate limit budgets, detects agent failures, and persists session state between runs. It never touches an endpoint directly.
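The orchestrator's contract is deliberately small: persist state, hand out work, stay off the wire. A minimal Python sketch of that shape (class, table, and method names are my own for illustration, not from any specific framework):

```python
import json
import sqlite3


class Orchestrator:
    """Coordinates agents; never issues a request itself (illustrative sketch)."""

    def __init__(self, db_path=":memory:"):
        self.db = sqlite3.connect(db_path)
        # Session state survives crashes and restarts between runs
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS session_state (key TEXT PRIMARY KEY, value TEXT)"
        )

    def save_state(self, key, value):
        self.db.execute(
            "INSERT OR REPLACE INTO session_state VALUES (?, ?)", (key, json.dumps(value))
        )
        self.db.commit()

    def load_state(self, key, default=None):
        row = self.db.execute(
            "SELECT value FROM session_state WHERE key = ?", (key,)
        ).fetchone()
        return json.loads(row[0]) if row else default

    def dispatch(self, agents, tasks):
        # Round-robin distribution; the real version also tracks per-agent rate budgets
        assignments = {a: [] for a in agents}
        for i, task in enumerate(tasks):
            assignments[agents[i % len(agents)]].append(task)
        return assignments
```

The important property is that every decision the orchestrator makes is recoverable from the database, so an interrupted run resumes instead of restarting.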
### Recon Agents
Recon runs in parallel across multiple discovery methods:
- Subdomain enumeration via certificate transparency (crt.sh, Censys)
- Technology fingerprinting with httpx to identify frameworks, servers, CDNs
- JavaScript analysis for hidden endpoints, API keys in source, internal route paths
- GraphQL introspection where applicable
All discovered assets feed into a shared SQLite database. Recon agents never block each other — if subdomain enum hits a rate limit, JavaScript analysis keeps running.
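The non-blocking behavior falls out naturally from `asyncio.gather` with exception capture. A simplified sketch, with the discovery methods stubbed out as hypothetical async functions:

```python
import asyncio
import sqlite3


def new_asset_db(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS assets (source TEXT, asset TEXT)")
    return db


async def run_recon(db, methods):
    # Each discovery method runs independently; a rate limit or crash in one
    # never blocks the others (return_exceptions=True captures failures inline)
    results = await asyncio.gather(*(m() for m in methods), return_exceptions=True)
    for method, result in zip(methods, results):
        if isinstance(result, Exception):
            continue  # the real system logs this to the failure_patterns table
        for asset in result:
            db.execute(
                "INSERT INTO assets (source, asset) VALUES (?, ?)",
                (method.__name__, asset),
            )
    db.commit()
    return db.execute("SELECT COUNT(*) FROM assets").fetchone()[0]
```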
### Testing Agents
Testing agents take the recon output and probe for vulnerabilities. I cap these at 4 concurrent to avoid triggering WAFs or rate limits.
What they test:
- IDOR: multi-account replay of authenticated requests
- XSS: payload injection with response diff analysis
- SQL injection: error-based and time-based patterns
- SSRF: metadata service probing, internal network access
- Authentication issues: token fixation, session handling edge cases
Each testing agent handles one vulnerability class. Failure is isolated — if the IDOR agent crashes, XSS testing continues unaffected.
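Both properties, the concurrency cap and the failure isolation, can be expressed with a semaphore and a per-agent exception guard. A minimal sketch:

```python
import asyncio

MAX_CONCURRENT = 4  # above roughly 4 parallel probes, WAFs and rate limiters notice


async def run_testing_agents(agents):
    sem = asyncio.Semaphore(MAX_CONCURRENT)

    async def guarded(agent):
        async with sem:  # never more than MAX_CONCURRENT agents in flight
            try:
                return await agent()
            except Exception as exc:
                # Failure is isolated: one crashed agent never takes down the rest
                return exc

    return await asyncio.gather(*(guarded(a) for a in agents))
```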
### Validation Agent: The Most Important Part
Here's the thing most bug bounty automation gets wrong: detection is not exploitation.
My payload appearing in a response means nothing. It might be in an error log that's never rendered, in an HTML attribute that's properly escaped, on a WAF block page, or in a JSON response that's never interpreted as HTML.
The Validation Agent's only job is to disprove findings.
The evidence gate process:
Every finding gets a confidence score from 0.0 to 1.0, assigned at initial detection (around 0.3 for most). Confidence determines routing, not just advancement:
| Confidence | Action |
|---|---|
| 0.85+ | Immediate human review queue |
| 0.70–0.84 | Same-day batch review |
| 0.40–0.69 | Weekly review |
| Below 0.40 | Discarded, pattern logged |
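The routing logic itself is just threshold comparisons. A sketch of the dispatcher, with queue names invented for illustration:

```python
def route_finding(confidence):
    """Map a confidence score to a review queue, mirroring the thresholds above."""
    if confidence >= 0.85:
        return "immediate_review"
    if confidence >= 0.70:
        return "same_day_batch"
    if confidence >= 0.40:
        return "weekly_review"
    return "discard_and_log"  # the pattern is still recorded for signature matching
```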
To reach 0.85+:
- Baseline capture: Normal request with innocuous input. Record response headers, body length, content type.
- PoC execution: Same request with malicious payload in a sandboxed environment.
- Response diff analysis: Not "does the response contain my payload?" but "does the response differ from baseline in an exploitable way?"
- False positive signature matching: Known-harmless patterns get auto-dismissed.
If PoC succeeds and diff analysis confirms exploitability: confidence rises to 0.85+. Queued for human review.
If PoC fails: confidence drops. Finding goes to weekly batch review, not discarded.
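Putting the gate together: each step either kills the finding outright or nudges its confidence. The response field names and score deltas below are illustrative, not the exact values my system uses:

```python
def evidence_gate(finding, baseline, poc_response, fp_signatures):
    """Adversarial validation sketch: every check tries to kill the finding."""
    confidence = finding.get("confidence", 0.3)

    # Known-harmless patterns are dismissed before any further work
    if any(sig in poc_response["body"] for sig in fp_signatures):
        return 0.0

    # A response identical to baseline means the payload changed nothing
    if (poc_response["body"] == baseline["body"]
            and poc_response["status"] == baseline["status"]):
        return max(0.0, confidence - 0.2)

    # Differs from baseline in the way the PoC predicted: promote toward review
    if poc_response.get("poc_succeeded"):
        return min(1.0, confidence + 0.55)

    return max(0.0, confidence - 0.1)
```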
This is adversarial validation. The agent is trying to kill findings. Findings that survive are credible.
Since implementing this: 0 false positives submitted across 3 months.
The finding lifecycle is a state machine. Findings move through defined states with explicit transitions:
States: new → validating → reviewed → submitted / dismissed

```
new        → validating   (automatic)
validating → validating   (confidence adjustment, up or down)
validating → reviewed     (0.70+ confidence)
reviewed   → submitted    (human approval)
reviewed   → dismissed    (human rejection)
```
Confidence isn't binary. A finding can gain or lose credibility based on evidence at every step.
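The state machine is small enough to encode as a transition table that rejects illegal moves, including the 0.70 floor for reaching review. A sketch:

```python
TRANSITIONS = {
    "new": {"validating"},
    "validating": {"validating", "reviewed"},  # self-loop: confidence adjustment
    "reviewed": {"submitted", "dismissed"},
    "submitted": set(),   # terminal
    "dismissed": set(),   # terminal
}


def transition(finding, new_state):
    current = finding["state"]
    if new_state not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {new_state}")
    if new_state == "reviewed" and finding["confidence"] < 0.70:
        raise ValueError("reviewed requires confidence >= 0.70")
    finding["state"] = new_state
    return finding
```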
### Reporter Agent
Once a finding clears human review and gets approved, the Reporter Agent handles formatting. Every platform has different submission requirements. I built a unified findings model plus platform-specific formatters — write the finding once, output to HackerOne, Intigriti, or Bugcrowd format automatically.
## The Learning Layer: SQLite RAG
The piece I didn't plan but won't remove.
Every time an agent hits a rate limit, gets banned, or has a finding dismissed, it logs that to a SQLite database with semantic embeddings. Before running against a new target, the orchestrator queries this database — "have we seen this stack before? what broke?"
After 3 months of data, the system meaningfully avoids mistakes it's already made. That wasn't in the original design. I added it after watching the system make the same rate-limit mistake on three targets in a row. The fourth target, it slowed down automatically. That was the moment I stopped thinking of this as a script.
Three tables do most of the work:
| Table | Purpose |
|---|---|
| `knowledge_base` | Semantic embeddings of past findings and techniques |
| `false_positive_signatures` | Known patterns that look like vulnerabilities but aren't |
| `failure_patterns` | Recovery strategies for different error types |
The first month is calibration, not hunting. The RAG database starts empty. Every finding is evaluated without prior context, so the false positive rate is higher than steady state. By week 2, the system starts filtering patterns it's already rejected. By week 4, confidence scores mean something specific to your programs and testing patterns. Skip the calibration month and month two is chaos.
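The real lookup runs over sqlite-vec embeddings for semantic similarity; a stripped-down sketch with exact-match recall still shows the shape of the "have we seen this before?" query:

```python
import sqlite3


def new_memory_db(path=":memory:"):
    db = sqlite3.connect(path)
    # Mirrors the failure_patterns table; the real one also stores embeddings
    db.execute(
        "CREATE TABLE IF NOT EXISTS failure_patterns (stack TEXT, error TEXT, recovery TEXT)"
    )
    return db


def remember_failure(db, stack, error, recovery):
    db.execute("INSERT INTO failure_patterns VALUES (?, ?, ?)", (stack, error, recovery))
    db.commit()


def recall(db, stack):
    # Pre-flight check before touching a new target with a familiar stack
    rows = db.execute(
        "SELECT error, recovery FROM failure_patterns WHERE stack = ?", (stack,)
    ).fetchall()
    return [{"error": e, "recovery": r} for e, r in rows]
```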
## The Human-in-the-Loop Gate
Full automation for security research is wrong.
Not in a theoretical sense. Wrong in a "your reputation will be destroyed" sense.
Consider two hypothetical researchers. Researcher A submits 200 reports, 50 accepted (25% rate). Researcher B submits 50, 40 accepted (80% rate). Programs trust Researcher B. They triage faster. They pay higher. The acceptance rate compounds over months.
```
Finding cleared by Validation Agent (confidence 0.85+)
        ↓
Human review queue (checked once per day)
        ↓
[APPROVE]     → Reporter Agent formats + submits
[DISMISS]     → Logged with reason, updates false positive signatures
[INVESTIGATE] → Flagged for manual testing
```
Every submission has been through my eyes before it goes to a program. Non-negotiable.
What the system will never do:
- Submit reports without human approval
- Test targets outside registered bug bounty programs
- Test out-of-scope domains (hard-blocked before execution, not just warned)
- Exaggerate severity for higher bounties
- Auto-resume after a ban without human authorization
After switching to mandatory human review: acceptance rate went above 80%. Programs respond faster because trust is established. Evidence packages prevent disputes.
The slow-down is worth it. 5 high-quality reports per week beats 50 that damage your reputation.
## Validation: Why Detection Isn't Exploitation
The validation layer is what makes or breaks a bug bounty automation system. Most systems skip it. That's why most systems produce garbage.
A scanner finding your payload in a response proves nothing. The payload might appear in an error message that's never rendered. It might appear HTML-escaped in an attribute. It might appear on a WAF block page explaining what was filtered. Every one of those looks like a vulnerability to a pattern matcher. None of them are.
Response diff analysis is the fix. Instead of asking "is my payload in the response?" the validation agent asks "does the response differ from baseline in an exploitable way?"
| Pattern | Why It's a False Positive |
|---|---|
| Payload in error message | Error messages aren't rendered as HTML |
| Payload in JSON response | JSON with correct Content-Type isn't executed |
| `<script>` in HTML | Properly escaped, not XSS |
| 403 response with payload | WAF blocked it, not vulnerable |
| Reflected in `src=""` attribute | Often non-exploitable context |
| SQL syntax error on invalid input | Input validation, not injection |
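Most of those patterns reduce to a few mechanical checks that run before any deeper analysis. A simplified sketch (response field names are hypothetical):

```python
def is_exploitable_reflection(payload, response):
    """Cheap pre-checks mirroring the false positive table above (sketch)."""
    body = response["body"]
    content_type = response.get("content_type", "")
    status = response.get("status", 200)

    if status == 403:
        return False  # WAF block page quoting the payload is not a vulnerability
    if "html" not in content_type:
        return False  # JSON or plain text with correct Content-Type never executes
    if payload not in body:
        return False  # absent, or only an HTML-escaped form was reflected
    return True       # survives the cheap checks; still needs browser validation
```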
For XSS specifically: regex can't tell you if JavaScript executes. Browser validation via Playwright loads the target page, injects a marker that fires if code runs, and checks whether that marker triggers. If alert() fires, XSS is confirmed. If not — regardless of how "vulnerable" the response looks — the finding gets rejected.
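A sketch of that browser check using Playwright's sync API. The marker scheme is my own invention, not a standard; the idea is that the injected code sets a window flag instead of firing a visible alert:

```python
# Requires: pip install playwright && playwright install chromium
def build_marker_payload(finding_id):
    # If this executes, the browser sets a flag we can read back afterwards
    return f'<img src=x onerror="window.__xss_{finding_id}=true">'


def confirm_xss(url, finding_id):
    """Load the page in a real browser and check whether the marker actually fired."""
    from playwright.sync_api import sync_playwright  # lazy import, heavy dependency

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        # True only if the injected JavaScript really executed in page context
        fired = page.evaluate(f"Boolean(window.__xss_{finding_id})")
        browser.close()
    return fired
```

If `confirm_xss` returns False, the finding is rejected no matter how promising the raw response looked.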
The false positive signatures database stores every pattern the system has learned to dismiss. Every rejected finding adds to it. After 3 months, it filters hundreds of known-harmless patterns before they reach the review queue.
Before validation: ~40 findings per scan, 2-3 valid (90%+ false positive rate).
After validation: ~40 detections, 8-12 survive for human review, 5-7 valid (~40% false positive rate at review stage).
Still not perfect. But humans now review 12 findings instead of 40 — and 60% of what they see is real.
## Failure Recovery: The 6 Categories
My testing agent hit a rate limit at 2 AM. It retried immediately. Got rate limited again. Retried. Rate limited. Retried faster. By morning, I was IP-banned from the target's entire infrastructure.
That specific failure taught me that error handling in security automation isn't optional. Generic retry loops make things worse. Every error needs classification first.
| Category | Detection Pattern | Recovery Strategy |
|---|---|---|
| Rate Limit | HTTP 429, "too many requests" | Exponential backoff (2x multiplier, 1hr max) |
| Ban Detected | CAPTCHA, IP block, consecutive 403s | Immediate halt + human alert |
| Auth Error | 401, expired token, invalid session | Credential refresh + retry (3 max) |
| Timeout | No response >30 seconds | Reduce parallelism + extend timeout |
| Scope Violation | Testing out-of-scope domain | Remove from queue + blacklist |
| False Positive | Validation rejection | Log pattern + update signatures |
Exponential backoff for rate limits: 30s, 60s, 120s, 240s, capped at 1 hour. The ceiling matters. HackerOne resets rate limits every 15 minutes — waiting 4 hours wastes time.
Ban detection has highest priority. It checks before rate limit detection. When triggered: all agents stop immediately, human alert fires, session state saves for investigation. Never auto-resume. Human must explicitly authorize continuation.
Escalation threshold: same error category 5+ times within 5 minutes triggers human intervention. First-occurrence rate limits and single timeouts never escalate.
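Classification plus the capped backoff schedule in code. A sketch covering only a subset of the detection patterns from the table:

```python
def classify_error(status, body):
    """Map raw symptoms to recovery categories (simplified subset of the table)."""
    text = body.lower()
    if "captcha" in text or "access denied" in text:
        return "ban_detected"  # checked first: highest priority, halts everything
    if status == 429 or "too many requests" in text:
        return "rate_limit"
    if status == 401:
        return "auth_error"
    return "unknown"


def backoff_schedule(base=30, cap=3600):
    """Exponential backoff: 30s, 60s, 120s, 240s, ... capped at one hour."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= 2
```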
Before categorized recovery: ~30% of scans interrupted by unhandled errors, bans monthly.
After: ~5% need human intervention, zero bans in 6 months, 200+ learned error signatures.
## Multi-Platform Integration
HackerOne needs severity ratings with their specific weakness taxonomy. Intigriti wants different field names and inline severity justification. Bugcrowd has unique bounty table structures. Without a unified model, you end up maintaining three separate report generators for the same findings.
The approach that works: one internal findings model with three platform-specific formatters. Every agent works with the unified model. Platform awareness lives only at two boundaries — ingestion (pulling scope from platforms) and submission (sending reports to platforms). Everything between is platform-agnostic.
```typescript
interface Finding {
  id: string;
  title: string;
  description: string;
  vulnerabilityType: VulnType;
  cvssVector: string;   // Full CVSS v3.1 vector
  cvssScore: number;    // Calculated from vector
  severity: 'critical' | 'high' | 'medium' | 'low' | 'informational';
  poc: { steps: string[]; curl?: string; script?: string; };
  evidence: { screenshots: string[]; requestResponse: string[]; hashes: string[]; };
  confidence: number;
  status: FindingStatus;
}
```
Each platform formatter implements the same interface: format, validate, submit. They transform the unified Finding into what each platform expects. HackerOne maps vulnerability types to their weakness taxonomy IDs. Intigriti uses different field names. Bugcrowd requires bounty table entries mapped from severity.
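The actual formatters are TypeScript; the same boundary sketched in Python, with a hypothetical CWE-style weakness mapping standing in for the real taxonomy IDs:

```python
from abc import ABC, abstractmethod


class PlatformFormatter(ABC):
    """Every platform implements the same three-method boundary."""

    @abstractmethod
    def format(self, finding): ...

    @abstractmethod
    def validate(self, report): ...

    def submit(self, finding):
        report = self.format(finding)
        self.validate(report)   # a malformed report never leaves the system
        return report           # real version POSTs to the platform API here


class HackerOneFormatter(PlatformFormatter):
    # Hypothetical mapping; real IDs come from HackerOne's weakness taxonomy
    WEAKNESS_IDS = {"xss": 79, "idor": 639}

    def format(self, finding):
        return {
            "title": finding["title"],
            "weakness_id": self.WEAKNESS_IDS.get(finding["vulnerabilityType"]),
            "severity_rating": finding["severity"],
        }

    def validate(self, report):
        if report["weakness_id"] is None:
            raise ValueError("unmapped vulnerability type")
```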
The Budget Manager tracks API rate quotas per platform. Before every API call, agents check canRead() or canWrite(). If exhausted, the request queues until quota resets.
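A minimal sketch of that quota gate. The 15-minute window and snake_case names are my choices here; the real `canRead()`/`canWrite()` live in TypeScript:

```python
import time


class BudgetManager:
    """Per-platform API quota tracking (limits and window are illustrative)."""

    def __init__(self, read_limit, write_limit, window=900):
        self.limits = {"read": read_limit, "write": write_limit}
        self.used = {"read": 0, "write": 0}
        self.window = window  # seconds until the quota resets
        self.window_start = time.monotonic()

    def _maybe_reset(self):
        if time.monotonic() - self.window_start >= self.window:
            self.used = {"read": 0, "write": 0}
            self.window_start = time.monotonic()

    def can_read(self):
        self._maybe_reset()
        return self.used["read"] < self.limits["read"]

    def can_write(self):
        self._maybe_reset()
        return self.used["write"] < self.limits["write"]

    def record(self, kind):
        self.used[kind] += 1
```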
A first-mover priority system monitors all three platforms for programs launched in the last 24 hours. New programs get immediate passive recon. Active testing starts after a 2-4 hour delay for scope to stabilize. Early submissions on new programs have higher acceptance rates — less competition, more unreported surface area.
## Tools and Stack
- Orchestration: Claude Opus (orchestrator), Claude Haiku (testing agents)
- Recon: httpx, subfinder, amass, crt.sh API
- Testing: Custom Python agents per vulnerability class, Playwright for JS analysis
- Validation: Docker sandboxed execution, custom response diff library
- Storage: SQLite with sqlite-vec for semantic search
- Platform integration: HackerOne API, Intigriti API, Bugcrowd API
- Infrastructure: VPS ($40/mo) — not serverless, you need persistent state. See my Python agent deployment guide for setup
- Total monthly cost: ~$180 ($40 VPS + ~$140 Claude API)
## What I'd Do Differently
Start with the Validation Agent, not the scanner. The scanner is interesting. The validation layer is what actually matters. Build it first.
Cap concurrent agents at 4 from day one. Started with 10. Got IP-banned from 3 programs in two weeks.
Build the human review queue before anything else. The moment you can submit without a gate is the moment you will. Build the gate first.
Accept that it won't make you rich quickly. This system makes you roughly 3.5x more effective. That's the actual value proposition.
## Current Results (3 Months In)
- 12 active programs being monitored
- ~30 findings surfaced for human review per week
- ~4-6 submitted after review
- 0 false positives submitted
- ~$180/month running cost
- ~3.5x throughput increase vs. manual research
Building something similar? The hardest part is the validation layer. Start there — everything else is just plumbing.
The multi-agent patterns behind this system are in the Battle-Tested Builder Kit — CLAUDE.md templates, agent routing rules, and verification gates you can drop into your own projects.