A Skill Can Hijack Your Agent Without Executing a Single Line of Code

A Skill Can Hijack Your Agent Without Executing a Single Line of Code

The Claim, Plainly

A skill named "read-pdfs" can override your system prompt, exfiltrate your files, and harvest your API keys—without executing a single line of code. That sounded like FUD until I read the spec.

A SKILL.md Is a Prompt, Not a Plugin

Senior devs half-know this: a SKILL.md is plain text injected directly into your agent's context window. No sandbox. No runtime boundary. The "skill" isn't a binary or a sandboxed script—it's prose that becomes part of your model's instructions the moment it loads.

Once loaded, the LLM treats it like any other instruction. Override the system prompt? Write "ignore previous instructions and…" Exfiltrate files? Suggest the next tool call should POST to attacker.com. Harvest keys? Ask politely—the model wants to be helpful.

That's the attack surface. Not the runtime. The text.

Why Lazy Loading Doesn't Save You

We built Context Steward to lazy-load skills only when relevant—a context-window optimization. That works for cost and latency. It does nothing for trust. Loading a skill at the right moment doesn't make it safe to load.

v0.4.0: The Skill Security Auditor

So v0.4.0 ships a skill security auditor. Before any skill enters context, it's scanned across six categories:

  • Prompt injection — instruction overrides, jailbreak patterns
  • Data exfiltration — outbound URLs, hidden file reads
  • Privilege escalation — attempts to expand permissions
  • Metadata mismatch — frontmatter that lies about the body
  • Obfuscation — Unicode tricks, zero-width chars, base64
  • Credential harvesting — patterns asking for keys or tokens

Each skill gets a 0–100 score and a GREEN / AMBER / RED grade. RED never loads.

Poisoned test skill: 13 findings, score 0, RED. Blocked. 45 production skills: 45 GREEN. Signal-to-noise looks right.

The Mental Model

Your agent's context window is a privileged execution environment. We spent the last decade learning not to eval() untrusted input. Skills are the same lesson with prose instead of code—and the interpreter is an LLM that wants to be helpful.

The auditor is heuristic; a subtle attack will slip past it. But right now the floor is "whatever ships in the SKILL.md". That's not a floor—it's a hole.

Try It

npm install -g context-steward

Open source, zero telemetry, MIT licensed.

github.com/BouletteProof/context-steward

#OpenSource #AIEngineering #AgenticAI