Testing deterministic code is straightforward: give it an input, check the output. Testing an autonomous agent is a different problem. When agents graded their own work, they were wrong nearly 1 in 5 times and missed 40% of actual failures. 👀 Here's how we built an independent "Trust Layer" that can hit 100% accuracy, without brittle scripts or black-box judgments. 🔍 https://lnkd.in/dKqAhp7P
"We don’t need black-box models to judge other black-box models." 👆 You just described human judgement. We are literally building agents in our image. How do we validate human output?
GitHub’s latest Trust Layer work confirms the exact problem Keystone was built to solve: AI and automation cannot be trusted on self-report alone. Keystone attacks the pipeline-proof side of that problem with deterministic, replayable, tamper-evident verification.
Literally a dumb idea. You cannot be both probabilistic and 100% accurate.
For stochastic agents, reliability cannot be grounded in stated intent; it has to be grounded in inspection that runs independently of the actor’s permission.
The agent didn't fail. The validation did. That reframe is the whole paper in one sentence. This problem exists because of two years of marketing structured workflows as deterministic. They're not. They're probabilistic inference with a deterministic-looking shell. The testing framework failing on a successful run is what happens when you build validation on path-adherence instead of outcome-truth. Dominator analysis is the honest fix: define what must be true for success to be real, and ignore the rest. More tooling should reason this way at the architecture level, not just the test layer. Been solving the same problem at the memory layer in an open-source agent runtime, separating essential facts from incidental state using the same conceptual structure: https://github.com/theshovonsaha/shovsOS
🙏 GitHub


One of the hardest shifts in AI is realizing that generation and governance are different problems. Models can produce an answer. Determining whether that answer should be trusted, acted on, or escalated is a separate layer entirely. As systems become more autonomous, validation may become more important than generation.