Show HN: Treni – single-binary GPU runtime for uncertainty-aware agents 5ms TTFT
andrewmonostate | Monday, February 23, 2026

We built Treni, a C/CUDA runtime where routing, tokenization, tool models, and state run in one GPU process.
Most agent stacks only ever see serialized tool strings coming back. Treni exposes execution signals in-process (entropy, logprobs, retrieval distance, route confidence), so the agent can branch before committing bad output.
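To make the idea concrete, here is a minimal sketch of branching on such signals before committing output. The signal names and thresholds are illustrative assumptions, not Treni's actual API:

```python
# Hypothetical sketch: branch on in-process execution signals instead of
# only seeing serialized tool strings. Field names and thresholds are
# made up for illustration; they are NOT Treni's real interface.
from dataclasses import dataclass

@dataclass
class StepSignals:
    entropy: float             # token-distribution entropy of the last step
    route_confidence: float    # router's confidence in the chosen tool path
    retrieval_distance: float  # distance of the nearest retrieved context

def decide(sig: StepSignals) -> str:
    # Branch before emitting: re-route, re-retrieve, resample, or commit.
    if sig.route_confidence < 0.5:
        return "re-route"        # router unsure -> pick a different tool path
    if sig.retrieval_distance > 0.8:
        return "re-retrieve"     # context too far away -> fetch again
    if sig.entropy > 3.0:
        return "sample-again"    # model is guessing -> resample
    return "commit"

print(decide(StepSignals(entropy=1.2, route_confidence=0.9, retrieval_distance=0.3)))
# -> commit
```

The point is that these checks run before any output is committed, rather than after a bad tool string has already been returned.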
Canonical A10G (G5), token-parity vs vLLM (max_tokens=48):

- TTFT: 5.130 ms (Treni) vs 84.837 ms (vLLM) -> 16.537x
- Full request: 316.403 ms vs 1232.660 ms -> 3.896x
- Cold total first response: 1320.240 ms vs 28937.430 ms -> 21.918x

Steady state:

- Warm mean: 80.602 ms
- Warm p99: 90.350 ms

Additional checks:

- Frontend A/B repeatability (warm_fixed + mixed_churn, repeats=3): custom path wins all tracked metrics
- Numerical parity vs PyTorch (strict mode): 0 failures

Separate OpenAI routing-overhead test (a different question; not engine-vs-engine):

- Same model endpoint on both sides (gpt-5.2)
- Internal path: client -> OpenAI
- External path: client -> controller/tool hop -> same OpenAI endpoint
- Fairness-hardened local controls (runs=8):
  - model-only: near parity (int = 0.971x)
  - tool-only: external slower (int = 1.038x)

Docs + raw artifacts:
- https://treni-docs.pages.dev/docs/
- https://treni-docs.pages.dev/docs/objectives-and-thesis
- https://treni-docs.pages.dev/docs/leaderboard
- https://treni-docs.pages.dev/docs/trackb-claim-safe-table
- https://treni-docs.pages.dev/docs/raw-artifacts
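For anyone checking the headline numbers: the speedup factors above are plain latency ratios (vLLM time divided by Treni time), reproduced here from the figures in the post:

```python
# Sanity check: the reported speedups are simple ratios of the
# latencies quoted above (baseline ms / Treni ms).
ttft = 84.837 / 5.130          # TTFT, reported as 16.537x
full = 1232.660 / 316.403      # full request, reported as 3.896x
cold = 28937.430 / 1320.240    # cold first response, reported as 21.918x

print(f"TTFT {ttft:.3f}x, full {full:.3f}x, cold {cold:.3f}x")
# -> TTFT 16.537x, full 3.896x, cold 21.918x
```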