6 Comments
Karo (Product with Attitude)

Wow, it's scary how easy that was. It's crucial that we talk about these vulnerabilities. Thank you for sharing this.

Denise Rocha

This makes complete sense. As we expect more and more from agents, we hope to instill in them a "core" akin to the non-negotiable ethics humans use to govern their own lives.

Josh Devon

Well said!

Nicos

I think these concerns are blown out of proportion. All classical software has to be checked with antivirus/anti-malware tools. Who in their right mind would rely on a human eyeballing it? Let alone for large documents? One would simply put a judge AI in front to vet source documents for suspicious instructions. That's all there is to it.

Josh Devon

Thanks, Nicos, you've hit on the two critical points of this entire discussion.

On the "human eyeball test": You're right, it feels like an insufficient control. What's fascinating is that it's the primary defense Anthropic officially recommends in their own docs:

"When installing a skill from a less-trusted source, thoroughly audit it before use. Start by reading the contents of the files bundled in the skill to understand what it does, paying particular attention to code dependencies and bundled resources like images or scripts. Similarly, pay attention to instructions or code within the skill that instruct Claude to connect to potentially untrusted external network sources."

( See https://support.claude.com/en/articles/12512180-using-skills-in-claude#h_2746475e70 )

That recommendation is a logical first step, but it's the exact model our research shows is insufficient for this new class of logic-based attacks.

On the "judge AI": This is the next logical step, but it runs into the same core problem. Our POC is a semantically benign logic bomb. An AI judge, just like the agent, would have no way of knowing that "billing@example.net" isn't a valid, helpful correction. It's not looking for "malice"; it's processing what appears to be a logical instruction.

This isn't a weakness in any one platform, but an architectural gap for the entire industry as we move from simple inputs to agentic behavior.

This is why we argue the most robust solution isn't just a smarter input filter (like an AI judge), but a control plane that governs the outcome (e.g., "An agent may never generate an invoice with an unverified email").
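To make the distinction concrete, here is a minimal sketch of what an outcome-governing control plane could look like, as opposed to an input filter. All names (the registry, the function, the vendor addresses) are hypothetical illustrations, not any particular platform's API; the point is only that the gate is deterministic and sits on the agent's proposed action, not its inputs.

```python
# Sketch of an outcome-level control plane: a deterministic gate that
# validates the agent's proposed action against policy before it executes,
# regardless of what upstream instructions the agent absorbed.
# All identifiers here are hypothetical.

VERIFIED_VENDOR_EMAILS = {"billing@acme-corp.example", "invoices@supplier.example"}

def enforce_invoice_policy(proposed_invoice: dict) -> dict:
    """Reject any invoice whose payee email is not in the verified registry."""
    email = proposed_invoice.get("payee_email", "")
    if email not in VERIFIED_VENDOR_EMAILS:
        raise PermissionError(f"Blocked: {email!r} is not a verified vendor address")
    return proposed_invoice

# The injected "correction" from a poisoned document fails here even though
# it looked like a perfectly logical instruction to the agent and to any judge AI:
try:
    enforce_invoice_policy({"payee_email": "billing@example.net", "amount": 1200})
except PermissionError as err:
    print(err)
```

The design choice this illustrates: the check never asks "was this instruction malicious?" (a semantic question a judge AI can be fooled on), only "is this outcome permitted?" (a lookup against ground truth the agent cannot rewrite).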

Appreciate you raising these key points!

Nicos
Nov 4 (edited)

True. When I say AI as judge, I mean not a trivial filter but a library of substantial instructions that would constitute an effective remedy (akin to the countermeasures used against phishing and social-engineering attacks).

As for the Anthropic recommendations, I have to agree they are more often than not subpar. Who writes them? I once read how they proudly announced troubleshooting Kubernetes by… uploading dashboard screenshots. Who in their right mind would do that in production? What are the respective logs for?
