I accidentally made probabilistic programming 30-200x faster
Aeowulf, Friday, January 23, 2026

I'm a web developer who stumbled onto GPU-native probabilistic programming while working on an unrelated hobby project.
By "GPU-native" I mean the entire inference algorithm runs inside GPU kernels with no CPU coordination - no Python overhead, no kernel launch latency between steps.
I benchmarked against NumPyro, JAX, and GPyTorch on 15 different inference algorithms. I don't have a statistics background, but I tried my best to track the metrics experts care about.
My R-hat values are 0.9999-1.0003 (should be ~1.0), and ESS/second is up to 600x better on HMC. Some quality metrics favor the baseline implementations - I'm not claiming this beats everything on every dimension, just that it's significantly faster with comparable quality.
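For anyone who wants to sanity-check what those numbers mean: R-hat here refers to the standard split-chain Gelman-Rubin statistic. A plain NumPy version of the textbook formula (reference math, not my benchmark code):

```python
import numpy as np

def split_rhat(chains):
    """Split-chain Gelman-Rubin R-hat for `chains` shaped (n_chains, n_draws)."""
    half = chains.shape[1] // 2
    # split each chain in two so within-chain drift also inflates R-hat
    splits = np.concatenate([chains[:, :half], chains[:, half:2 * half]])
    m, n = splits.shape
    chain_means = splits.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = splits.var(axis=1, ddof=1).mean()    # within-chain variance
    var_hat = (n - 1) / n * W + B / n        # pooled variance estimate
    return np.sqrt(var_hat / W)

# toy check: independent normal draws should give R-hat ~= 1.0
rng = np.random.default_rng(0)
print(split_rhat(rng.normal(size=(4, 5000))))
```

ESS/second is effective sample size divided by wall-clock sampling time; ArviZ's `az.ess` is a standard estimator for the numerator.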
Tested on an RTX 4060 Laptop GPU. Full benchmark results: https://github.com/Aeowulf/nativeppl-results
I'm not sharing implementation details yet, as I'm still figuring out what to make of this discovery. But I'd appreciate feedback on:
- Are these benchmarks meaningful/fair? (See the timing sketch after this list.)
- What other algorithms or problem sizes should I test?
- Is there a market for faster probabilistic inference?
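On the fairness question specifically, here's a generic sketch of what I understand "fair" to require: warm-up runs so JIT compilation isn't billed to either side, and a device sync before stopping the clock. `run_sampler` and `sync` are hypothetical stand-ins for whatever framework is being measured.

```python
import time

def time_sampler(run_sampler, sync, n_warmup=2, n_reps=5):
    """Time one inference pass fairly on a GPU.

    run_sampler -- hypothetical callable running one full sampling pass
    sync        -- blocks until device work is done, e.g.
                   lambda r: torch.cuda.synchronize() for PyTorch,
                   or jax.block_until_ready for JAX
    """
    for _ in range(n_warmup):
        sync(run_sampler())            # absorb JIT compilation / autotuning
    times = []
    for _ in range(n_reps):
        t0 = time.perf_counter()
        sync(run_sampler())            # GPU work isn't finished until sync
        times.append(time.perf_counter() - t0)
    return min(times)                  # min is robust to background jitter
```

ESS/second then comes from dividing the ESS of the returned draws by this wall time.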