Show HN: Zagora, Distributed fine-tuning platform on mixed GPUs over internet

miyamotomusashi Sunday, March 01, 2026

I built Zagora, a distributed fine-tuning platform that turns a fragmented, mixed fleet of GPUs into a unified training cluster over a standard internet connection (1 Gbps).

The problem:

Most distributed training assumes homogeneous GPUs and high-bandwidth interconnects (NVLink/InfiniBand). On heterogeneous fleets over standard internet, tensor/data parallel approaches become communication-bound and fragile.

What Zagora does under the hood:

- Uses pipeline-style parallelism instead of heavy tensor synchronization.

- Passes only boundary activations between stages rather than full parameter sync.

- Assigns layers proportionally to GPU capability to reduce straggler idle time.

- Uses checkpoint-based recovery to tolerate worker crashes.

- Supports adapter-based fine-tuning (e.g., QLoRA) to reduce memory pressure.
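To make the pipeline idea above concrete, here is a minimal sketch (my own illustration, not Zagora's actual code) of stages that each hold a contiguous slice of layers and exchange only the boundary activation, never the parameters themselves:

```python
# Toy pipeline parallelism: each Stage owns a slice of the model and
# forwards only its output activation to the next stage. The layers and
# scaling "weights" here are hypothetical stand-ins for real blocks.

def make_layer(w):
    # Toy "layer": scale the activation by w. In a real run each stage
    # would hold transformer blocks; only the output crosses the network.
    return lambda x: x * w

class Stage:
    def __init__(self, layers):
        self.layers = layers  # this worker's contiguous slice of the model

    def forward(self, boundary_activation):
        # Runs locally; the result is the only tensor sent downstream.
        x = boundary_activation
        for layer in self.layers:
            x = layer(x)
        return x

# Split a 4-layer model across two workers.
model = [make_layer(w) for w in (2.0, 3.0, 0.5, 4.0)]
stage_a, stage_b = Stage(model[:2]), Stage(model[2:])

x = stage_a.forward(1.0)  # worker A sends the boundary activation (6.0)
x = stage_b.forward(x)    # worker B continues from it
print(x)                  # 12.0
```

The bandwidth cost per step is one activation tensor per stage boundary, which is why this tolerates a 1 Gbps link far better than all-reducing full gradients would.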

Zagora currently supports managed runs (we provision GPUs in-region) and a BYOC mode where users run workers on their own infrastructure.

Limitations:

- Full-parameter fine-tuning is not supported yet.

- It won't beat an NVLink cluster on raw throughput.

- Cross-region training is still latency-sensitive.

- Scheduling across heterogeneous nodes is an ongoing tuning problem.
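On that last point, the basic capability-proportional split is simple even if tuning it isn't. A hypothetical sketch (the throughput numbers and function are my own, not Zagora's scheduler): give each GPU a contiguous layer count proportional to its measured throughput, handing rounding leftovers to the GPUs with the largest fractional remainders.

```python
# Assign num_layers contiguous layers across GPUs in proportion to their
# (assumed, pre-measured) relative throughputs.

def assign_layers(num_layers, throughputs):
    total = sum(throughputs)
    # Floor of each GPU's ideal fractional share...
    counts = [int(num_layers * t / total) for t in throughputs]
    # ...then give leftover layers to the largest remainders first.
    remainders = [num_layers * t / total - c
                  for t, c in zip(throughputs, counts)]
    for i in sorted(range(len(counts)),
                    key=lambda i: remainders[i], reverse=True):
        if sum(counts) == num_layers:
            break
        counts[i] += 1
    # Convert counts to contiguous (start, end) layer ranges per stage.
    ranges, start = [], 0
    for c in counts:
        ranges.append((start, start + c))
        start += c
    return ranges

# Example: 32 layers over three cards with relative speeds 2.0 : 1.3 : 0.7.
print(assign_layers(32, [2.0, 1.3, 0.7]))  # [(0, 16), (16, 26), (26, 32)]
```

The hard part in practice is that throughput isn't static (thermals, co-tenancy, link quality), so the real straggler problem is re-balancing these ranges mid-run, not computing them once.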

IMPORTANT:

I'm currently running jobs manually, so it may take some time before training starts. However, I will run every submitted job.

Link: app.zagora.ai

I'd be interested in feedback from people who've worked on distributed training at scale.

Happy to answer technical questions.
