Show HN: WatchLLM – Semantic caching to cut LLM API costs by 70%
Hey HN! I just shipped WatchLLM - a semantic caching layer for LLM APIs that sits between your app and providers like OpenAI/Claude/Groq.
The problem: LLM API costs add up fast, especially when users ask similar questions in different ways ("how do I reset my password" vs "I forgot my password").
The solution: Semantic caching. WatchLLM vectorizes each prompt, checks it against previously seen queries (95%+ similarity), and returns the cached response in about 50ms on a hit. On a miss, we forward the request to the actual API and cache the response for next time.
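If it helps to picture the flow, here's a rough sketch of the hit/miss path in TypeScript. `embed` and `callUpstream` are stand-ins for whichever embedding model and provider you use, and the in-memory array is just for illustration - the real service runs on Workers with D1/Redis:

    type CacheEntry = { embedding: number[]; response: string };

    const cache: CacheEntry[] = [];
    const SIMILARITY_THRESHOLD = 0.95;

    function cosineSimilarity(a: number[], b: number[]): number {
      let dot = 0, normA = 0, normB = 0;
      for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
      }
      return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    async function cachedCompletion(
      prompt: string,
      embed: (text: string) => Promise<number[]>,
      callUpstream: (prompt: string) => Promise<string>,
    ): Promise<string> {
      const embedding = await embed(prompt);

      // Cache hit: any stored prompt whose embedding clears the threshold.
      for (const entry of cache) {
        if (cosineSimilarity(embedding, entry.embedding) >= SIMILARITY_THRESHOLD) {
          return entry.response;
        }
      }

      // Cache miss: forward to the real provider and store the result.
      const response = await callUpstream(prompt);
      cache.push({ embedding, response });
      return response;
    }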
Built in 3 days with Node.js, TypeScript, React, Cloudflare Workers (edge deployment), D1, and Redis. Just added prompt normalization today to boost cache hit rates even further.
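By normalization I mean cleanup along these lines before embedding (a toy version - the actual rules in WatchLLM are more involved and may differ):

    // Normalize a prompt so near-duplicates map to the same cache key.
    function normalizePrompt(prompt: string): string {
      return prompt
        .toLowerCase()
        .normalize("NFKC")        // fold unicode variants
        .replace(/\s+/g, " ")     // collapse runs of whitespace
        .replace(/[?!.]+$/g, "")  // drop trailing punctuation
        .trim();
    }

    // "  How do I RESET my password?? " and "how do i reset my password"
    // now normalize to the same string, so they share a cache entry.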
It's drop-in - literally just change your baseURL and keep using your existing OpenAI/Claude SDKs. No other code changes needed.
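With the official OpenAI Node SDK the switch looks roughly like this (the baseURL and env var below are placeholders, not the real endpoint - use the values from your WatchLLM dashboard):

    import OpenAI from "openai";

    // Point the existing OpenAI SDK at WatchLLM instead of api.openai.com.
    const client = new OpenAI({
      apiKey: process.env.WATCHLLM_API_KEY,       // placeholder env var
      baseURL: "https://api.watchllm.example/v1", // placeholder URL
    });

    const completion = await client.chat.completions.create({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: "How do I reset my password?" }],
    });

    console.log(completion.choices[0].message.content);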
Currently in beta with a generous free tier (50K requests/month). Would love feedback from anyone building LLM apps - especially on the semantic similarity threshold and normalization strategies.
Live demo on the site shows real-time cache hits and savings.