## Hook
We needed to classify AI systems under the EU AI Act — a legal framework where the same input must always produce the same output. We were using Claude as the backbone. Claude is a language model. Language models are probabilistic by design.
That's the problem. Here's how we solved it without giving up LLM capability.
## Context
We're building Complyance, a compliance management tool for companies selling AI into the EU. Under the EU AI Act, each AI system gets a risk classification: UNACCEPTABLE, HIGH, LIMITED, or MINIMAL. This classification has legal consequences — it determines what documentation you must produce, what audits you face, what you're liable for.
That means our classifier can't be "usually right." It has to be reproducible, auditable, and explainable. The same system description must produce the same result, every time, so users can show regulators a consistent record.
We chose Claude Sonnet as our LLM. But we couldn't just pass the user's description to Claude and return whatever it said. We needed a pipeline.
## The problem in detail
Three issues with naive LLM classification:
1. Non-determinism. Even at temperature=0, large models can produce slightly different outputs across runs due to hardware floating-point differences. We needed documented, rule-based overrides for the cases where the law is clear.
2. Hallucinated structure. Ask an LLM to return JSON and it will — until it doesn't. Missing fields, wrong types, values outside the valid enum. In production, any of these breaks your application silently.
3. Confidence without calibration. The model says HIGH risk with 0.92 confidence. But does that confidence mean anything? Without validation, you're shipping a number that looks authoritative but isn't.
## Naive approach / what didn't work
First version: single prompt, JSON mode, parse the output.
```typescript
const result = await claude.messages.create({
  model: "claude-sonnet-4-5",
  temperature: 0,
  messages: [{ role: "user", content: buildPrompt(systemData) }],
});

const classification = JSON.parse(result.content[0].text);
// 🚨 crashes when JSON is malformed
// 🚨 no validation of field values
// 🚨 no audit trail for why we got this result
```
This worked during development. It failed in production when:
- A user described their system in a language other than English (the model sometimes responded in that language, breaking JSON)
- The model returned a `riskLevel` of `"High"` instead of `"HIGH"` (Zod enum mismatch)
- Confidence came back as the string `"0.85"` instead of the number `0.85`
## Actual solution
We built a three-stage pipeline: rule-based pre-filter → LLM → validation.
### Stage 1: Rule-based pre-filter
Before touching the LLM, we apply hard rules derived directly from the Act text. These override LLM output.
```typescript
function applyHardRules(input: ClassificationInput): HardRuleResult | null {
  // Article 5 — Unacceptable risk (non-negotiable)
  if (input.useCase === "social_scoring" && input.deployedBy === "government") {
    return { riskLevel: "UNACCEPTABLE", rule: "Article 5(1)(c)", confidence: 1.0 };
  }

  // Annex III override — profiling always HIGH or above
  if (input.profilesUsers === true) {
    return { riskLevel: "HIGH", rule: "Annex III override", confidence: 1.0 };
  }

  return null; // no hard rule matched, proceed to LLM
}
```
If a hard rule fires, we skip the LLM entirely. The result is deterministic, instantly explainable, and carries a reference to the exact article.
### Stage 2: LLM classification with structured output
For cases the hard rules don't cover, we send to Claude with a Zod schema enforcing the output shape.
```typescript
const ClassificationSchema = z.object({
  riskLevel: z.enum(["UNACCEPTABLE", "HIGH", "LIMITED", "MINIMAL"]),
  annexIIICategory: z.string().optional(),
  confidence: z.number().min(0).max(1),
  reasoning: z.string().min(20),
  flags: z.array(z.string()),
});

const prompt = buildClassificationPrompt(input); // structured, deterministic prompt
const raw = await callClaude(prompt, { temperature: 0 });
const parsed = JSON.parse(extractJSON(raw)); // strip any prose wrapper
const result = ClassificationSchema.parse(parsed); // throws if invalid
```
`extractJSON` handles the common failure mode where Claude wraps JSON in a markdown code block or adds a sentence before the opening brace.
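We don't reproduce our full helper here, but a minimal sketch of what it does looks like this (the implementation details are illustrative; our production version also attempts a YAML fallback and logs what it stripped for the audit trail):

```typescript
// Sketch of an extractJSON helper: prefer a fenced code block if present,
// then trim any prose before the first brace and after the last one.
const FENCE = "`".repeat(3); // written this way to avoid literal backticks here

function extractJSON(raw: string): string {
  const fenced = raw.match(new RegExp(FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE));
  const candidate = fenced ? fenced[1] : raw;

  const start = candidate.indexOf("{");
  const end = candidate.lastIndexOf("}");
  if (start === -1 || end < start) {
    throw new Error("No JSON object found in model output");
  }
  return candidate.slice(start, end + 1);
}
```

Whatever shape yours takes, the key property is that it either returns a string `JSON.parse` can handle or throws loudly — never passes prose through to the parser.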
### Stage 3: Validation and confidence gating
```typescript
if (result.confidence < 0.7) {
  // Flag for human review rather than returning a definitive answer
  await flagForReview(systemId, result, "low_confidence");
  return { ...result, requiresReview: true };
}

// Sanity check: if input has profiling signals but the LLM returned MINIMAL,
// the pre-filter should have caught it. Something is wrong.
if (input.profilesUsers && result.riskLevel === "MINIMAL") {
  throw new ClassificationValidationError(
    "LLM contradicts hard rule: profiling system cannot be MINIMAL risk",
    { input, result }
  );
}
```
The full pipeline:
```
Input
 └─ Hard rules (Article 5, Annex III overrides)
     ├─ Match found → return immediately, confidence=1.0
     └─ No match → LLM (Claude Sonnet, temp=0)
         └─ Parse JSON → Zod validation
             ├─ Confidence < 0.7 → flag for review
             └─ Sanity checks pass → return result
```
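Condensed into a single function, the dispatch logic looks roughly like this. This is a self-contained sketch, not our production code: types are trimmed, schema validation is elided, and `callClaude` is stubbed with a canned response so the example runs standalone.

```typescript
type RiskLevel = "UNACCEPTABLE" | "HIGH" | "LIMITED" | "MINIMAL";

interface Input { useCase: string; deployedBy: string; profilesUsers: boolean; description: string; }
interface Result { riskLevel: RiskLevel; confidence: number; source: "rule" | "llm"; rule?: string; requiresReview?: boolean; }

// Stage 1: hard rules, condensed from above
function applyHardRules(input: Input): Result | null {
  if (input.useCase === "social_scoring" && input.deployedBy === "government") {
    return { riskLevel: "UNACCEPTABLE", confidence: 1.0, source: "rule", rule: "Article 5(1)(c)" };
  }
  if (input.profilesUsers) {
    return { riskLevel: "HIGH", confidence: 1.0, source: "rule", rule: "Annex III override" };
  }
  return null;
}

// Stand-in for the real Claude call — returns a canned JSON string here
async function callClaude(_prompt: string): Promise<string> {
  return '{"riskLevel":"LIMITED","confidence":0.65}';
}

async function classify(input: Input): Promise<Result> {
  // Stage 1: deterministic short-circuit — the LLM is never consulted
  const hard = applyHardRules(input);
  if (hard) return hard;

  // Stage 2: LLM + parse (Zod validation elided in this sketch)
  const parsed = JSON.parse(await callClaude(input.description));
  const result: Result = { riskLevel: parsed.riskLevel, confidence: parsed.confidence, source: "llm" };

  // Stage 3: confidence gate
  if (result.confidence < 0.7) return { ...result, requiresReview: true };
  return result;
}
```

The ordering is the point: the cheapest, most deterministic stage runs first, and every later stage can only narrow or reject what the earlier ones produced.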
## What we learned
**`temperature=0` is necessary but not sufficient for determinism.** It eliminates sampling variance, but the model can still produce structurally different outputs. You need schema validation regardless.

**Hard rules are a feature, not a workaround.** The EU AI Act has cases where the law is unambiguous. Don't use LLM judgment for those. Encode them explicitly and cite the article.

**Confidence thresholds are audit artifacts.** When a result gets flagged for review because confidence is 0.62, that flag is a compliance record. Store it.
**JSON extraction is its own problem.** Build a robust `extractJSON` helper. The model will wrap JSON in code fences, add preamble, and occasionally return YAML. Handle all of these before handing off to your parser.

**Zod enum values are case-sensitive.** Your prompt must use the exact strings your schema expects. Document this explicitly in the prompt. We wasted a day on `"High"` vs `"HIGH"`.

**Validate in both directions.** Not just "did the LLM return valid JSON?" but "does this result make sense given the inputs?" Cross-checking LLM output against rule-based expectations caught two model regression bugs before they reached users.
## What's next
The current pipeline runs inline on the web process. For the next version, we're moving classification to a BullMQ worker so long-running requests don't block the HTTP thread. We're also exploring confidence calibration — checking whether our 0.7 threshold actually correlates with classification accuracy on a labeled test set.
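A first, crude version of that calibration check is just comparing accuracy above and below the threshold on labeled data. A sketch, assuming a hypothetical `LabeledExample` array of past classifications with human-verified ground truth:

```typescript
interface LabeledExample { confidence: number; correct: boolean; }

// If the 0.7 threshold is meaningful, accuracy above it should be
// clearly higher than accuracy below it on a labeled test set.
function thresholdAccuracy(labeled: LabeledExample[], threshold: number) {
  const above = labeled.filter((e) => e.confidence >= threshold);
  const below = labeled.filter((e) => e.confidence < threshold);
  const acc = (xs: LabeledExample[]) =>
    xs.length === 0 ? null : xs.filter((e) => e.correct).length / xs.length;
  return {
    aboveAccuracy: acc(above),
    belowAccuracy: acc(below),
    aboveCount: above.length,
    belowCount: below.length,
  };
}
```

Proper calibration (reliability diagrams, expected calibration error) goes further, but even this tells you whether the threshold is separating good answers from bad ones at all.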
The harder open question: how do you handle legislative updates? The EU AI Act has implementing acts still being written. When Article 6 gets amended, how do you reclassify 500 existing systems without rerunning the LLM for all of them? We don't have a clean answer yet.
## Community questions
How do you handle structured output reliability from LLMs in production? Are you using native tool-use / function calling, or prompt engineering + schema validation?
Has anyone built confidence calibration for a domain-specific LLM classifier — and if so, how did you construct your test set?
What's your approach to "legislative drift" — keeping rule-based systems current as the underlying regulation evolves?
This classifier powers Complyance — if you're building AI systems for the EU market, the free classifier is at complyance.app. No account required.