🎯 Practical: Testing deterministic code is straightforward: give it an input, check the output. Testing an autonomous agent is a different problem. When agents graded their own work, they were wrong nearly 1 in 5… | GitHub

6,203,114 followers

Testing deterministic code is straightforward: give it an input, check the output. Testing an autonomous agent is a different problem. When agents graded their own work, they were wrong nearly 1 in 5 times and missed 40% of actual failures. 👀 Here's how we built an independent "Trust Layer" that can hit 100% accuracy, without brittle scripts or black-box judgments. 🔍 https://lnkd.in/dKqAhp7P

9 Comments

Jaci Turner 12m

One of the hardest shifts in AI is realizing that generation and governance are different problems. Models can produce an answer. Determining whether that answer should be trusted, acted on, or escalated is a separate layer entirely. As systems become more autonomous, validation may become more important than generation.

Adriana Garcia 18m

"We don’t need black-box models to judge other black-box models." 👆 You just described human judgement. We are literally building agents in our image. How do we validate human output?

Robert Lazaravitch 2h

GitHub’s latest Trust Layer work confirms the exact problem Keystone was built to solve: AI and automation cannot be trusted by self-report. Keystone attacks the pipeline-proof side of that problem with deterministic, replayable, tamper-evident verification

Mohammad Namakshenas 2h

Literally a dumb idea. You cannot be both probabilistic and 100 accurate.

7 Reactions

Warren Carrigan 1h

For stochastic agents, reliability cannot be grounded in stated intent; it has to be grounded in inspection that runs independently of the actor’s permission.

Shovon S. 1h

The agent didn't fail. The validation did. That reframe is the whole paper in one sentence. This problem exists because two years of marketing structured workflows as deterministic. They're not. They're probabilistic inference with a deterministic-looking shell. The testing framework failing on a successful run is what happens when you build validation on path-adherence instead of outcome-truth. Dominator analysis is the honest fix, define what must be true for success to be real, ignore the rest. More tooling should reason this way at the architecture level, not just the test layer.Been solving the same problem at the memory layer in an open-source agent runtime, separating essential facts from incidental state using the same conceptual structure: https://github.com/theshovonsaha/shovsOS

Guclu Kazan 2h

🙏 GitHub

See more comments

To view or add a comment, sign in

Apr	MAY	Jun
	30
2025	2026	2027

GitHub’s Post

More from this author

My magic moment using GitHub Copilot CLI

Your workflow doesn’t live in one place anymore

Coding still matters. But it’s no longer what separates skilled developers.

Explore content categories