Ask HN: As a developer, am I wrong to think monitoring alerts are mostly noise?
yansoki | Monday, October 20, 2025

I'm a solo developer working on a new tool, and I need a reality check from the ops and infrastructure experts here. My background is in software development, not SRE.

From my perspective, the monitoring alerts that bubble up from our infrastructure have always felt like a massive distraction. I'll get paged for "High CPU" on a service, spend an hour digging through logs and dashboards, and find out it was just a temporary traffic spike, not a real issue. That feels like a huge waste of developer time.

My hypothesis is that the tools we use rely too heavily on static thresholds (e.g., "CPU > 80%") and lack the context to tell us what is actually an anomaly. I've been exploring a different approach based on peer-group comparisons (e.g., is api-server-5 behaving differently from its peers, api-server-1 through api-server-4?); there's a rough sketch of the idea at the end of this post. But I'm coming at this from a dev perspective, and I'm very aware that I might be missing the bigger picture. I'd love to learn from the people who live and breathe this stuff.

1. How much developer time is lost at your company to investigating "false positive" infrastructure alerts?
2. Do you think the current tools (Datadog, Prometheus, etc.) create a significant burden for dev teams?
3. Is the idea of "peer-group context" a sensible direction, or are there better ways to solve this that I'm not seeing?

I haven't built much yet because I want to be sure I'm solving a real problem first. Any brutal feedback or insights would be incredibly valuable.
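To make the peer-group idea concrete, here is a minimal sketch in Python. It assumes you can already pull a recent per-host CPU reading (e.g., scraped from Prometheus) and that hosts are grouped into peer groups by naming convention; the robust_z_scores and peer_outliers helpers and the 3.5 cutoff are illustrative choices on my part, not part of any existing tool.

```python
# Minimal sketch: flag a host only if it deviates from its peer group,
# using a robust z-score (deviation from the peer median, scaled by MAD).
from statistics import median
from typing import Dict, List

def robust_z_scores(readings: Dict[str, float]) -> Dict[str, float]:
    """Score each host by how far it sits from its peers' median,
    scaled by the median absolute deviation (MAD)."""
    values = list(readings.values())
    med = median(values)
    mad = median(abs(v - med) for v in values) or 1e-9  # avoid divide-by-zero
    # 0.6745 makes the MAD comparable to a standard deviation for normal data
    return {host: 0.6745 * (v - med) / mad for host, v in readings.items()}

def peer_outliers(readings: Dict[str, float], threshold: float = 3.5) -> List[str]:
    """Return hosts whose metric deviates from the peer group, regardless of
    whether the absolute value crosses a static line like 80% CPU."""
    return [h for h, z in robust_z_scores(readings).items() if abs(z) > threshold]

# Example: the whole group is busy (a traffic spike), so nobody is flagged,
# even though a static "CPU > 80%" rule would have paged on all five hosts.
cpu = {"api-server-1": 82.0, "api-server-2": 85.5, "api-server-3": 83.1,
       "api-server-4": 84.7, "api-server-5": 84.0}
print(peer_outliers(cpu))  # -> []
```

The point of the example is the shift in what counts as "abnormal": if api-server-5 alone were running hot while its peers idled, it would be flagged, but a group-wide spike (which a static threshold pages on) would not.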