Show HN: OpenGraviton – Run 500B+ parameter models on a consumer Mac Mini
fatihturker | Saturday, March 07, 2026

Hi HN,
I built OpenGraviton, an open-source AI inference engine designed to push the limits of running extremely large models on consumer hardware.
The system combines several techniques to drastically reduce memory and compute requirements:
• 1.58-bit ternary quantization ({-1, 0, +1}) for ~10x compression
• dynamic sparsity with Top-K pruning and MoE routing
• mmap-based layer streaming to load weights directly from NVMe SSDs
• speculative decoding to improve generation throughput
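To make the Top-K sparsity idea concrete, here's a minimal NumPy sketch (not OpenGraviton's actual code; `topk_prune` and the magnitude-threshold approach are just one common way to do it):

```python
import numpy as np

def topk_prune(x: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest-magnitude activations, zero out the rest."""
    if k >= x.size:
        return x.copy()
    # The k-th largest absolute value becomes the keep threshold.
    thresh = np.partition(np.abs(x).ravel(), -k)[-k]
    return np.where(np.abs(x) >= thresh, x, 0.0)

x = np.array([0.1, -2.0, 0.05, 3.0, -0.5, 0.8])
print(topk_prune(x, 3))  # keeps -2.0, 3.0, 0.8; everything else is zeroed
```

Only the surviving activations then need to participate in the following matmul, which is where the compute savings come from.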
Combined, these techniques let models far larger than system RAM run locally.
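Speculative decoding, the throughput piece, can be sketched as a draft/verify loop. This is a toy greedy-verification variant, not the engine's implementation: `draft_next` and `target_next` are stand-ins for a small draft model and the large target model, and the deterministic rules are made up so the example is runnable.

```python
def draft_next(ctx):
    """Cheap 'draft model' stand-in: a deterministic toy rule."""
    return ctx[-1] * 2 % 101

def target_next(ctx):
    """Expensive 'target model' stand-in; mostly agrees with the draft."""
    return ctx[-1] * 2 % 101 if ctx[-1] % 7 else (ctx[-1] + 1) % 101

def speculative_step(ctx, gamma=4):
    """One speculative step: draft gamma tokens cheaply, verify with the
    target, keep the accepted prefix plus one corrected token on mismatch."""
    draft, c = [], list(ctx)
    for _ in range(gamma):
        t = draft_next(c)
        draft.append(t)
        c.append(t)
    out = list(ctx)
    for t in draft:
        want = target_next(out)  # in a real engine this is one batched pass
        if t == want:
            out.append(t)        # draft token accepted
        else:
            out.append(want)     # mismatch: take the target's token and stop
            break
    return out

print(speculative_step([3]))  # -> [3, 6, 12, 24, 48]: all 4 draft tokens accepted
```

The win is that when the draft agrees with the target, several tokens are committed per expensive target-model pass instead of one.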
In early benchmarks, ternary quantization reduced TinyLlama-1.1B from ~2.05GB (FP16) to ~0.24GB. Synthetic stress tests at the 140B-parameter scale show that models which would normally need ~280GB in FP16 can fit in ~35GB once packed into the ternary format (about 2 bits per weight after packing).
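For intuition on where those numbers come from, here's a sketch of BitNet-style absmean ternary quantization with 2-bit packing (four weights per byte). This is not OpenGraviton's actual on-disk format; `pack_ternary`/`unpack_ternary` are illustrative, but 2 bits/weight matches the ~8x FP16 ratio above:

```python
import numpy as np

def ternary_quantize(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Absmean quantization: scale by mean |w|, round, clip to {-1, 0, +1}."""
    scale = np.abs(w).mean() + 1e-8
    q = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    return q, scale

def pack_ternary(q: np.ndarray) -> np.ndarray:
    """Pack four {-1, 0, +1} values into one byte (2 bits each)."""
    u = (q.ravel() + 1).astype(np.uint8)        # map {-1,0,1} -> {0,1,2}
    u = np.pad(u, (0, (-u.size) % 4)).reshape(-1, 4)
    return u[:, 0] | (u[:, 1] << 2) | (u[:, 2] << 4) | (u[:, 3] << 6)

def unpack_ternary(packed: np.ndarray, n: int) -> np.ndarray:
    vals = np.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1)
    return vals.ravel()[:n].astype(np.int8) - 1

w = np.random.default_rng(0).normal(size=4096).astype(np.float32)
q, scale = ternary_quantize(w)
packed = pack_ternary(q)
print(f"{w.nbytes / packed.nbytes:.0f}x smaller")  # 16x vs FP32, i.e. 8x vs FP16
```

Getting closer to the theoretical 1.58 bits/weight (log2 3) requires denser base-3 packing, e.g. five ternary values per byte.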
The project is optimized for Apple Silicon and currently uses custom Metal and C++ kernels for tensor unpacking.
Benchmarks, architecture, and details: https://opengraviton.github.io
GitHub: https://github.com/opengraviton