Show HN: Treni – single-binary GPU runtime for uncertainty-aware agents 5ms TTFT
andrewmonostate | Monday, February 23, 2026

We built Treni, a C/CUDA runtime where routing, tokenization, tool models, and state run in one GPU process.
Most agent stacks only ever see serialized tool strings coming back. Treni exposes execution signals in-process (entropy, logprobs, retrieval distance, route confidence), so the agent can branch before committing bad output.
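To make the idea concrete, here is a minimal sketch of branching on such signals before committing output. The signal names and thresholds are illustrative assumptions, not Treni's actual API:

```python
# Hypothetical sketch: branch on in-process execution signals instead of
# only seeing serialized tool strings. Field names and thresholds are
# made up for illustration; they are NOT Treni's real interface.
from dataclasses import dataclass

@dataclass
class StepSignals:
    entropy: float             # token-distribution entropy of the last step
    route_confidence: float    # router's confidence in the chosen tool path
    retrieval_distance: float  # distance of the nearest retrieved context

def decide(sig: StepSignals) -> str:
    # Branch before emitting: re-route, re-retrieve, resample, or commit.
    if sig.route_confidence < 0.5:
        return "re-route"        # router unsure -> pick a different tool path
    if sig.retrieval_distance > 0.8:
        return "re-retrieve"     # context too far away -> fetch again
    if sig.entropy > 3.0:
        return "sample-again"    # model is guessing -> resample
    return "commit"

print(decide(StepSignals(entropy=1.2, route_confidence=0.9, retrieval_distance=0.3)))
# -> commit
```

The point is that these checks run before any output is committed, rather than after a bad tool string has already been returned.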
Canonical A10G (G5), token-parity vs vLLM (max_tokens=48):

- TTFT: 5.130 ms (Treni) vs 84.837 ms (vLLM) -> 16.537x
- Full request: 316.403 ms vs 1232.660 ms -> 3.896x
- Cold total first response: 1320.240 ms vs 28937.430 ms -> 21.918x

Steady state:

- Warm mean: 80.602 ms
- Warm p99: 90.350 ms

Additional checks:

- Frontend A/B repeatability (warm_fixed + mixed_churn, repeats=3): custom path wins all tracked metrics
- Numerical parity vs PyTorch (strict mode): 0 failures

Separate OpenAI routing-overhead test (a different question; not engine-vs-engine):

- Same model endpoint on both sides (gpt-5.2)
- Internal path: client -> OpenAI
- External path: client -> controller/tool hop -> same OpenAI endpoint
- Fairness-hardened local controls (runs=8):
  - model-only: near parity (int = 0.971x)
  - tool-only: external slower (int = 1.038x)

Docs + raw artifacts:
- https://treni-docs.pages.dev/docs/
- https://treni-docs.pages.dev/docs/objectives-and-thesis
- https://treni-docs.pages.dev/docs/leaderboard
- https://treni-docs.pages.dev/docs/trackb-claim-safe-table
- https://treni-docs.pages.dev/docs/raw-artifacts
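For anyone checking the headline numbers: the speedup factors above are plain latency ratios (vLLM time divided by Treni time), reproduced here from the figures in the post:

```python
# Sanity check: the reported speedups are simple ratios of the
# latencies quoted above (baseline ms / Treni ms).
ttft = 84.837 / 5.130          # TTFT, reported as 16.537x
full = 1232.660 / 316.403      # full request, reported as 3.896x
cold = 28937.430 / 1320.240    # cold first response, reported as 21.918x

print(f"TTFT {ttft:.3f}x, full {full:.3f}x, cold {cold:.3f}x")
# -> TTFT 16.537x, full 3.896x, cold 21.918x
```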