Story

Ask HN: Is anyone losing sleep over retry storms or partial API outages?

rjpruitt16 Tuesday, February 03, 2026

I’m working on infrastructure to solve retry storms and outages. Before I go further, I want to understand what people are actually doing today. Compare solutions and maybe help someone see potential solutions. The problems:

Retry storms - API fails, your entire fleet retries independently, thundering herd makes it worse.

Partial outages - API is “up” but degraded (slow, intermittent 500s). Health checks pass, requests suffer.

What I’m curious about: ∙ What’s your current solution? (circuit breakers, queues, custom coordination, service mesh, something else?) ∙ How well does it work? What are the gaps? ∙ What scale are you at? (company size, # of instances, requests/sec)

I’d love to hear what’s working, what isn’t, and what you wish existed.

2 2
Read on Hacker News Comments 2