Ask HN: How do you do store-and-forward telemetry at the edge?

I’m researching patterns for edge / gateway telemetry where the network is unreliable (remote sites, industrial, fleets, etc.) and you need offline buffering + bounded disk + replay once connectivity returns.
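To make the shape concrete, here is a toy sketch of what I mean by "bounded buffer + replay". Everything in it is made up for illustration: `BoundedBuffer`, `max_bytes`, and the `send` callback are hypothetical names, not from any real tool. It uses an in-memory deque as a stand-in for an on-disk log and assumes a drop-oldest policy, which is only one of several reasonable choices.

```python
from collections import deque


class BoundedBuffer:
    """Toy store-and-forward buffer: byte-bounded, drop-oldest, in-order replay."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes  # hypothetical budget, stands in for a disk quota
        self.used = 0
        self.dropped = 0
        self.events = deque()  # stand-in for an on-disk append-only log

    def append(self, payload: bytes) -> None:
        if len(payload) > self.max_bytes:
            self.dropped += 1  # can never fit, so count it and move on
            return
        # Evict the oldest events until the new one fits in the budget.
        while self.used + len(payload) > self.max_bytes:
            old = self.events.popleft()
            self.used -= len(old)
            self.dropped += 1
        self.events.append(payload)
        self.used += len(payload)

    def replay(self, send) -> int:
        """Send buffered events oldest-first; stop at the first failure."""
        sent = 0
        while self.events:
            if not send(self.events[0]):
                break  # link is down again; keep the rest for later
            self.used -= len(self.events.popleft())
            sent += 1
        return sent
```

An event is removed only after `send` reports success, so a crash between the send and the removal would resend it on the next replay. That is at-least-once delivery, which is where the duplicates question below comes from.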

Questions for folks running this in production:

What do you use today? (MQTT broker + ??, Kafka/Redpanda/NATS, Redis Streams, custom log files, embedded DB, etc.)

Where do you buffer during outages: append-only log, SQLite/RocksDB, queue-on-disk, something else?

How do you handle backpressure when disk is near full? (drop policy, compression, sampling, prioritization)

What’s your failure nightmare: corruption, replay storms, duplicates, “stuck” consumer offsets, disk-full, clock skew?

What guarantees do you actually need: zero-loss vs “best effort” (and where do you draw that line)?

What metrics/alerts matter most on gateways? (queue depth, replay rate, oldest event age, fsync latency, disk usage, etc.)

I’d love to learn what works, what breaks, and what you wish existing tools did better.
