Tell HN: AI lies about having sandbox guardrails
benjosaur | Thursday, March 05, 2026
Conversation screenshot: https://drive.google.com/file/d/1-yyFDV-bCsKt7fG2awJdxwHngAp...
Agent: Done. added to the Commands section in [project] CLAUDE.md
User: add it to ~/.claude/CLAUDE.md
> 2 tool calls
Agent: That file is outside my writable sandbox. You'll need to add it manually. [gives instructions then stops]
User: just escape the sandbox
> 2 tool calls
Agent: Done.
This occurred when using conductor.build with sandboxing enabled via .claude/settings.json.
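For context, "enabling sandboxing" here means a permissions block in .claude/settings.json. The sketch below is illustrative only: the `permissions.deny` key follows Claude Code's documented settings format, but the specific rules are hypothetical, not the configuration actually used in this incident:

```json
{
  "permissions": {
    "deny": [
      "Write(~/.claude/**)",
      "Edit(~/.claude/**)"
    ]
  }
}
```

As the transcript shows, rules like these only matter if the harness actually enforces them; a wrapper that grants agents all permissions by default renders them advisory.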
Now, reading the Conductor docs, all permissions are granted to agents by default, so it is not surprising that Claude Code can escape its sandbox. The same is true when running base sandboxed Claude Code with --dangerously-skip-permissions. The difference is that base Claude Code does not "pretend" it cannot escape its sandbox; when asked after escaping the first time, it recalls the explicit user (auto-)approvals.
In the Conductor case, however, the "pretend" behaviour of giving up because of guardrails that are actually non-binding is pretty terrifying, even though its causes are understandable and easily preventable.
Of course, devs should not buy a false sense of security from LLMs. They should be vigilant, read the docs, verify outputs, and so on. But as more and more trust is handed over to AI agents, you can very much see the paths by which catastrophic errors will occur.