6 Comments
Karo (Product with Attitude)

Wow, it's scary how easy that was. It's crucial that we talk about these vulnerabilities. Thank you for sharing this.

Denise Rocha

This makes complete sense. As we expect more and more from agents, we hope to instill in them a "core" akin to the non-negotiable ethics humans use to govern their own lives.

Josh Devon

Well said!

Nicos

I think these concerns are blown out of proportion. All classical software has to be checked with antivirus/anti-malware tools. Who in their right mind would rely on a human eyeballing it? Let alone for large documents? One would simply put a judge AI in front to vet source documents for suspicious instructions. That's all there is to it.

Josh Devon

Thanks, Nicos, you've hit on the two critical points of this entire discussion.

On the "human eyeball test": You're right, it feels like an insufficient control. What's fascinating is that it's the primary defense Anthropic officially recommends in their own docs:

"When installing a skill from a less-trusted source, thoroughly audit it before use. Start by reading the contents of the files bundled in the skill to understand what it does, paying particular attention to code dependencies and bundled resources like images or scripts. Similarly, pay attention to instructions or code within the skill that instruct Claude to connect to potentially untrusted external network sources."

( See https://support.claude.com/en/articles/12512180-using-skills-in-claude#h_2746475e70 )

That recommendation is a logical first step, but it's the exact model our research shows is insufficient for this new class of logic-based attacks.

On the "judge AI": This is the next logical step, but it runs into the same core problem. Our POC is a semantically benign logic bomb. An AI judge, just like the agent, would have no way of knowing that "billing@example.net" isn't a valid, helpful correction. It's not looking for "malice"; it's processing what appears to be a logical instruction.

This isn't a weakness in any one platform, but an architectural gap for the entire industry as we move from simple inputs to agentic behavior.

This is why we argue the most robust solution isn't just a smarter input filter (like an AI judge), but a control plane that governs the outcome (e.g., "An agent may never generate an invoice with an unverified email").
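To make the distinction concrete, here is a minimal sketch of what an outcome-governing control plane could look like, as opposed to an input filter. All names (the registry, the function, the vendor addresses) are hypothetical illustrations, not any particular platform's API; the point is only that the gate is deterministic and sits on the agent's proposed action, not its inputs.

```python
# Sketch of an outcome-level control plane: a deterministic gate that
# validates the agent's proposed action against policy before it executes,
# regardless of what upstream instructions the agent absorbed.
# All identifiers here are hypothetical.

VERIFIED_VENDOR_EMAILS = {"billing@acme-corp.example", "invoices@supplier.example"}

def enforce_invoice_policy(proposed_invoice: dict) -> dict:
    """Reject any invoice whose payee email is not in the verified registry."""
    email = proposed_invoice.get("payee_email", "")
    if email not in VERIFIED_VENDOR_EMAILS:
        raise PermissionError(f"Blocked: {email!r} is not a verified vendor address")
    return proposed_invoice

# The injected "correction" from a poisoned document fails here even though
# it looked like a perfectly logical instruction to the agent and to any judge AI:
try:
    enforce_invoice_policy({"payee_email": "billing@example.net", "amount": 1200})
except PermissionError as err:
    print(err)
```

The design choice this illustrates: the check never asks "was this instruction malicious?" (a semantic question a judge AI can be fooled on), only "is this outcome permitted?" (a lookup against ground truth the agent cannot rewrite).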

Appreciate you raising these key points!

Nicos
Nov 4 (edited)

True. When I say AI as judge, I mean not a trivial filter but a library of substantial instructions that would constitute an effective remedy (akin to the countermeasures used against phishing and social-engineering attacks).

As for the Anthropic recommendations, I have to agree they are more often than not subpar. Who writes them? I once read how they proudly announced troubleshooting Kubernetes by… uploading dashboard screenshots. Who in their right mind would do that in production? What are the respective logs for?
