Show HN: WatchLLM – Semantic caching to cut LLM API costs by 70%
Hey HN! I just shipped WatchLLM - a semantic caching layer for LLM APIs that sits between your app and providers like OpenAI/Claude/Groq.
The problem: LLM API costs add up fast, especially when users ask similar questions in different ways ("how do I reset my password" vs "I forgot my password").
The solution: Semantic caching. WatchLLM vectorizes each prompt, checks it against previously seen queries (95%+ similarity), and returns the cached response in about 50ms on a hit. On a miss, we forward the request to the actual API and cache the response for next time.
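If it helps to picture the flow, here's a rough sketch of the hit/miss path in TypeScript. `embed` and `callUpstream` are stand-ins for whichever embedding model and provider you use, and the in-memory array is just for illustration - the real service runs on Workers with D1/Redis:

    type CacheEntry = { embedding: number[]; response: string };

    const cache: CacheEntry[] = [];
    const SIMILARITY_THRESHOLD = 0.95;

    function cosineSimilarity(a: number[], b: number[]): number {
      let dot = 0, normA = 0, normB = 0;
      for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
      }
      return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    async function cachedCompletion(
      prompt: string,
      embed: (text: string) => Promise<number[]>,
      callUpstream: (prompt: string) => Promise<string>,
    ): Promise<string> {
      const embedding = await embed(prompt);

      // Cache hit: any stored prompt whose embedding clears the threshold.
      for (const entry of cache) {
        if (cosineSimilarity(embedding, entry.embedding) >= SIMILARITY_THRESHOLD) {
          return entry.response;
        }
      }

      // Cache miss: forward to the real provider and store the result.
      const response = await callUpstream(prompt);
      cache.push({ embedding, response });
      return response;
    }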
Built in 3 days with Node.js, TypeScript, React, Cloudflare Workers (edge deployment), D1, and Redis. Just added prompt normalization today to boost cache hit rates even further.
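By normalization I mean cleanup along these lines before embedding (a toy version - the actual rules in WatchLLM are more involved and may differ):

    // Normalize a prompt so near-duplicates map to the same cache key.
    function normalizePrompt(prompt: string): string {
      return prompt
        .toLowerCase()
        .normalize("NFKC")        // fold unicode variants
        .replace(/\s+/g, " ")     // collapse runs of whitespace
        .replace(/[?!.]+$/g, "")  // drop trailing punctuation
        .trim();
    }

    // "  How do I RESET my password?? " and "how do i reset my password"
    // now normalize to the same string, so they share a cache entry.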
It's drop-in - literally just change your baseURL and keep using your existing OpenAI/Claude SDKs. No other code changes needed.
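With the official OpenAI Node SDK the switch looks roughly like this (the baseURL and env var below are placeholders, not the real endpoint - use the values from your WatchLLM dashboard):

    import OpenAI from "openai";

    // Point the existing OpenAI SDK at WatchLLM instead of api.openai.com.
    const client = new OpenAI({
      apiKey: process.env.WATCHLLM_API_KEY,       // placeholder env var
      baseURL: "https://api.watchllm.example/v1", // placeholder URL
    });

    const completion = await client.chat.completions.create({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: "How do I reset my password?" }],
    });

    console.log(completion.choices[0].message.content);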
Currently in beta with a generous free tier (50K requests/month). Would love feedback from anyone building LLM apps - especially on the semantic similarity threshold and normalization strategies.
Live demo on the site shows real-time cache hits and savings.