Show HN: 4x faster Deep Learning training – we replaced the DataLoader with Rust
Hi HN, we built a drop-in replacement for torch.utils.data.DataLoader in Rust.
The Problem: Python's multiprocessing isolates workers, meaning every batch incurs IPC and pickling overhead. Even on a modest GPU like a T4, the CPU becomes the bottleneck and the GPU sits idle.
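For reference, the snippet below is the stock PyTorch path we measured against (dataset path and transforms are illustrative). Every worker is a separate process, so every decoded batch is pickled and shipped back over IPC:

    import torch
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    # Stock pipeline: each worker is a forked process, and every decoded
    # batch is pickled and sent back to the main process over IPC.
    dataset = datasets.ImageFolder(
        "imagewoof/train",  # illustrative path
        transform=transforms.Compose([
            transforms.RandomResizedCrop(224),
            transforms.ToTensor(),
        ]),
    )
    loader = DataLoader(dataset, batch_size=64, num_workers=8, pin_memory=True)

    for images, labels in loader:
        ...  # the GPU idles here whenever the CPU workers fall behind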
The Solution: We bypass Python's data plane entirely.

- Rust backend: native threads instead of forked worker processes (no GIL, no IPC).
- Zero-copy: a memory-mapped custom format (.kt) creates views into tensors without deserialization.
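To make the zero-copy claim concrete, here is the general technique in plain numpy/PyTorch terms. This is a sketch of memory-mapped tensor views, not the .kt layout or our actual API; the file name and shape are assumptions:

    import numpy as np
    import torch

    # Assume a preprocessed file storing uint8 CHW images back to back.
    # np.memmap maps the file into the address space; no bytes are read
    # from disk until a page is actually touched.
    n, c, h, w = 10_000, 3, 224, 224
    images = np.memmap("images.bin", dtype=np.uint8, mode="c", shape=(n, c, h, w))
    # mode="c" (copy-on-write) keeps the array writable, so torch.from_numpy
    # does not complain about a read-only buffer.

    # Slicing the memmap and wrapping it with torch.from_numpy yields a
    # tensor view over the mapped pages: no copy, no deserialization.
    batch = torch.from_numpy(images[:64])  # shares memory with the mapping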
Benchmarks (ResNet-18 / ImageWoof, Tesla T4, batch=64). Throughput = total images / time per epoch.

    Loader               Throughput   Speedup
    PyTorch ImageFolder  116 img/s    1.0x
    MosaicML Streaming   179 img/s    1.5x
    NVIDIA DALI          246 img/s    2.1x
    Kuattree (ours)      512 img/s    4.4x
That works out to 2.08x faster than DALI and 4.4x faster than stock PyTorch. The trade-off: you have to pre-convert your dataset to .kt. The conversion is similar to writing TFRecord or WebDataset shards, but the format is designed for random access, and conversion runs about 60x faster than MosaicML sharding.
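For intuition about what pre-conversion buys you, here is a toy packed format with an offset index. This only illustrates the random-access idea; it is not the actual .kt layout:

    import numpy as np

    def pack(samples, data_path, index_path):
        # Write fixed-dtype samples back to back, recording byte offsets so
        # sample i can be located directly instead of by scanning the file
        # (the scan is what sequential formats like TFRecord require).
        offsets = [0]
        with open(data_path, "wb") as f:
            for s in samples:
                f.write(s.tobytes())
                offsets.append(offsets[-1] + s.nbytes)
        np.save(index_path, np.asarray(offsets, dtype=np.int64))

    samples = [np.zeros((3, 64, 64), dtype=np.uint8) for _ in range(100)]
    pack(samples, "toy.bin", "toy_idx.npy")

    # Random access: memory-map the data file and slice by offset --
    # a zero-copy byte view of exactly one sample.
    offsets = np.load("toy_idx.npy")
    data = np.memmap("toy.bin", dtype=np.uint8, mode="r")
    sample_7 = data[offsets[7]:offsets[8]].reshape(3, 64, 64)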
Not open source yet, but we're running a private beta if you want to verify on your hardware.
Happy to answer questions.