Show HN: OmoiOS – 190K lines of Python to stop babysitting AI agents (Apache 2.0)
kanddle · Thursday, March 05, 2026

AI coding agents generate decent code. The problem is everything around the code - checking progress, catching drift, deciding whether it's actually done. I spent months trying to make autonomous agents work. The bottleneck was always me.
Attempt 1 - Claude/GPT directly: works for small stuff, but you re-explain context endlessly.
Attempt 2 - Copilot/Cursor: great autocomplete, still doing 95% of the thinking.
Attempt 3 - continuous agents: they keep working without prompting, but "no errors" doesn't mean "the feature works."
Attempt 4 - parallel agents: faster wall-clock, but now you're manually reviewing even more output.
The common failure: somebody still has to verify that the output satisfies the goal. That somebody was always me. So I automated that job.
OmoiOS is a spec-driven orchestration system. You describe a feature, and it:
1. Runs a multi-phase spec pipeline (Explore > Requirements > Design > Tasks) with LLM evaluators scoring each phase. Retry on failure, advance on pass. By the time agents code, requirements have machine-checkable acceptance criteria.
2. Spawns isolated cloud sandboxes per task. Your local env is untouched. Agents get ephemeral containers with full git access.
3. Validates continuously - a separate validator agent checks each task against acceptance criteria. Failures feed back for retry. No human in the loop between steps.
4. Discovers new work - validation can spawn new tasks when agents find missing edge cases. The task graph grows as agents learn.
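The gate-and-retry mechanic in step 1 can be sketched as a small loop. This is an illustrative sketch, not OmoiOS's actual API - the names (`run_pipeline`, `PhaseResult`, the threshold and retry count) are made up, and the generate/evaluate callables stand in for real LLM calls:

```python
from dataclasses import dataclass
from typing import Callable, List

PASS_THRESHOLD = 0.8  # assumed score cutoff; the real gate may differ
MAX_RETRIES = 3

@dataclass
class PhaseResult:
    phase: str
    output: str
    score: float

def run_pipeline(phases: List[str],
                 generate: Callable[[str, str], str],
                 evaluate: Callable[[str, str], float],
                 goal: str) -> List[PhaseResult]:
    """Run phases in order; retry each phase until its evaluator passes it."""
    results = []
    context = goal
    for phase in phases:
        for _attempt in range(MAX_RETRIES):
            output = generate(phase, context)   # LLM drafts the phase artifact
            score = evaluate(phase, output)     # a second LLM scores it
            if score >= PASS_THRESHOLD:
                results.append(PhaseResult(phase, output, score))
                context = output  # the next phase builds on the accepted output
                break
        else:
            raise RuntimeError(f"{phase} failed after {MAX_RETRIES} attempts")
    return results

# Toy stand-ins so the sketch runs without an LLM:
def fake_generate(phase: str, context: str) -> str:
    return f"{phase} spec derived from: {context[:20]}"

def fake_evaluate(phase: str, output: str) -> float:
    return 0.9  # a real evaluator would return a judged score

results = run_pipeline(["Explore", "Requirements", "Design", "Tasks"],
                       fake_generate, fake_evaluate, "add CSV export")
```

The point of the structure is that a failed evaluation never reaches the next phase - retries happen inside the gate, so agents only start coding from artifacts that passed.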
What's hard (honest):
- Spec quality is the bottleneck. Vague spec = agents spinning.
- Validation is domain-specific. API correctness is easy. UI quality is not.
- Discovery branching can grow the task graph unexpectedly.
- Sandbox overhead adds latency per task. Worth it, but a tradeoff.
- Merging parallel branches with real conflicts is the hardest problem.
- Guardian monitoring (per-agent trajectory analysis) still has rough edges.
Stack: Python/FastAPI, PostgreSQL+pgvector, Redis (~190K lines). Next.js 15 + React Flow (~83K lines TS). Claude Agent SDK + Daytona Cloud. 686 commits since Nov 2025, built solo. Apache 2.0.
I keep coming back to the same problem: structured spec generation that produces genuinely machine-checkable acceptance criteria. Has anyone found an approach that works for non-trivial features, or is this just fundamentally hard?
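For concreteness, here's one shape of "machine-checkable" I've experimented with (this is a hypothetical schema, not what OmoiOS ships): force every criterion into "command + expected observation", so the validator runs it in the sandbox and checks output with zero judgment calls.

```python
import subprocess
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Criterion:
    description: str
    command: List[str]                  # command to run in the sandbox
    expect_exit: int = 0                # required exit code
    expect_stdout: Optional[str] = None # substring that must appear in stdout

def check(criterion: Criterion) -> bool:
    """Pass/fail with no LLM involved: run the command, compare observations."""
    proc = subprocess.run(criterion.command, capture_output=True, text=True)
    if proc.returncode != criterion.expect_exit:
        return False
    if criterion.expect_stdout is not None and criterion.expect_stdout not in proc.stdout:
        return False
    return True

# Example criterion (toy command standing in for the feature under test):
c = Criterion(
    description="CLI prints version string",
    command=["python", "-c", "print('omoi 0.1.0')"],
    expect_stdout="0.1.0",
)
```

The hard part, which is exactly the question above, is getting an LLM to reliably *produce* criteria in this shape for non-trivial features instead of prose like "the UI should feel responsive."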
GitHub: https://github.com/kivo360/OmoiOS Live: https://omoios.dev