Story

Show HN: CRTX – AI code gen that tests and fixes its own output (OSS)

johnnycash926 Saturday, February 21, 2026

We built an open-source CLI that generates code, runs tests, fixes failures, and gets an independent AI review, all before you see the output.

We started with a multi-model pipeline where different AI models handled different stages (architect, implement, refactor, verify). We assumed more models meant better code. Then we benchmarked it: 39% average quality score at $4.85 per run. A single model scored 94% at $0.36. Our pipeline was actively making things worse.

So we killed it and rebuilt around what developers actually do when they get AI-generated code: run it, test it, fix what breaks. The Loop generates code, runs pytest automatically, feeds failures back for targeted fixes, and repeats until all tests pass. Then an independent Arbiter (always a different model than the generator) reviews the final output.

Latest benchmark across three tasks (simple CLI, REST API, async multi-agent system):

- Single Sonnet: 94% avg, 10 min dev time, $0.36
- Single o3: 81% avg, 4 min dev time, $0.44
- Multi-model: 88% avg, 9 min dev time, $5.59
- CRTX Loop: 99% avg, 2 min dev time, $1.80

"Dev time" estimates how long a developer would spend debugging the output before it's production-ready. The Loop's hardest prompt produced 127 passing tests with zero failures.

When the Loop hits a test it can't fix, it has a three-tier escalation: diagnose the root cause before patching, strip context down to just the failing test and source file, then bring in a different model for a second opinion. The goal is zero dev time on every run.

Model-agnostic: works with Claude, GPT, o3, Gemini, Grok, and DeepSeek. Bring your own API keys. Apache 2.0.

pip install crtx

https://github.com/CRTXAI/crtx

We published the benchmark tool too. Run crtx benchmark --quick to reproduce our results with your own keys. Curious what scores people get on different providers and tasks.
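To make the mechanics concrete, here is a minimal Python sketch of the generate-test-fix cycle described above. It is an illustration under assumptions, not CRTX's actual code: the generator, fixer, and arbiter callables, the file layout, and the fix budget are all hypothetical.

# Minimal sketch of a generate -> test -> fix loop (illustration only;
# not CRTX's actual implementation).
import subprocess
import tempfile
from pathlib import Path

MAX_FIX_ATTEMPTS = 5  # assumed iteration budget


def write_files(project_dir: Path, files: dict[str, str]) -> None:
    # Write generated source and test files into a scratch project.
    for name, contents in files.items():
        path = project_dir / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(contents)


def run_pytest(project_dir: Path) -> tuple[bool, str]:
    # Run pytest and return (all tests passed, combined output).
    result = subprocess.run(
        ["pytest", "-q", "--tb=short"],
        cwd=project_dir,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stdout + result.stderr


def loop(task: str, generator, fixer, arbiter) -> tuple[dict[str, str], str]:
    # `generator` and `fixer` are hypothetical callables wrapping model
    # backends that return {filename: contents}; `arbiter` wraps a
    # different model and returns a review of the final output.
    project_dir = Path(tempfile.mkdtemp())
    files = generator(task)              # initial code plus tests
    write_files(project_dir, files)

    for _ in range(MAX_FIX_ATTEMPTS):
        passed, report = run_pytest(project_dir)
        if passed:
            # All tests pass: get an independent review from a different model.
            return files, arbiter(task, files)
        # Feed only the failure report back for a targeted fix.
        files = fixer(task, files, report)
        write_files(project_dir, files)

    # A real system would escalate here (root-cause diagnosis, trimmed
    # context, second-opinion model); this sketch just gives up.
    raise RuntimeError("tests still failing after fix budget")

Plugging in three thin wrappers around any supported provider would exercise the same shape of loop; per the post, CRTX itself layers the three-tier escalation and the benchmark tooling on top of this basic cycle.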

Summary
CRTX is an open-source, model-agnostic CLI that generates code, runs pytest automatically, feeds test failures back for targeted fixes, and has an independent Arbiter model review the final output before the developer sees it. In the authors' benchmark across three tasks, the CRTX Loop averaged 99% quality at $1.80 per run, versus 94% at $0.36 for a single Sonnet model and 88% at $5.59 for their earlier multi-model pipeline.