Show HN: Yardstiq – Compare LLM outputs side-by-side in your terminal

stanleycyang Tuesday, March 03, 2026

Hey HN,

I built yardstiq because I got tired of the copy-paste workflow for comparing LLM responses when developing apps. Every time I wanted to see how Claude vs GPT vs Gemini handled the same prompt, I'd open three tabs, paste the same thing, and try to eyeball the differences. It's 2026 and we have 40+ models worth considering — that doesn't scale.

yardstiq is a CLI tool that sends one prompt to multiple models simultaneously and streams the responses side-by-side in your terminal. It also tracks performance metrics (time to first token, tokens/sec, cost) and optionally runs an AI judge to score the outputs.

```
npx yardstiq "Explain quicksort in 3 sentences" -m claude-sonnet -m gpt-4o
```
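On the metrics side, TTFT and throughput fall out of the stream timing. Here's a minimal sketch of how they could be computed from token arrival timestamps — the types and function names are illustrative, not yardstiq's actual internals:

```typescript
// Hypothetical metrics helper -- illustrates the math, not yardstiq's code.
interface StreamMetrics {
  ttftMs: number;       // time from request start to first streamed token
  tokensPerSec: number; // throughput over the streaming window
  totalTokens: number;
}

function computeMetrics(
  requestStartMs: number,
  tokenTimestampsMs: number[], // arrival time of each streamed token
): StreamMetrics {
  if (tokenTimestampsMs.length === 0) {
    return { ttftMs: 0, tokensPerSec: 0, totalTokens: 0 };
  }
  const first = tokenTimestampsMs[0];
  const last = tokenTimestampsMs[tokenTimestampsMs.length - 1];
  // Throughput is measured over the streaming window (first to last token),
  // so a slow TTFT doesn't drag down the tok/s figure.
  const streamMs = Math.max(last - first, 1); // guard against divide-by-zero
  return {
    ttftMs: first - requestStartMs,
    tokensPerSec: (tokenTimestampsMs.length / streamMs) * 1000,
    totalTokens: tokenTimestampsMs.length,
  };
}
```

Measuring throughput from first token to last (rather than from request start) keeps the two numbers independent: a model can have a slow TTFT but high tok/s, or vice versa.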

What it does:

- Streams responses from multiple models in parallel, rendered in columns
- Shows TTFT, throughput (tok/s), token counts, and cost per request
- AI judge mode: have a model evaluate and score the responses
- Export to JSON, Markdown, or self-contained HTML reports
- Run YAML-defined benchmark suites across models with aggregate scoring
- Works with Ollama for local model comparisons (zero API cost)
- Supports 40+ models via direct provider keys or Vercel AI Gateway
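To give a feel for the benchmark suites, a suite file might look something like this — the field names here are illustrative, not yardstiq's actual schema, so check the repo for the real format:

```yaml
# Hypothetical suite definition -- see the repo for the actual schema.
name: summarization-check
models:
  - claude-sonnet
  - gpt-4o
judge: gpt-4o
cases:
  - prompt: "Summarize this changelog entry in one sentence."
    criteria: "Single sentence, factually faithful to the input."
  - prompt: "Explain quicksort in 3 sentences."
    criteria: "Exactly three sentences, technically correct."
```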

I built this mostly for my own workflow — picking models for different tasks, testing prompt variations, and running quick benchmarks without setting up a whole evaluation framework. It's not trying to replace serious eval platforms, just make the "which model is better for X?" question answerable in 10 seconds.

MIT licensed, written in TypeScript: https://github.com/stanleycyang/yardstiq

Happy to answer questions about the architecture or benchmarking approach.
