I accidentally made probabilistic programming 30-200x faster
Aeowulf, Friday, January 23, 2026

I'm a web developer who stumbled onto GPU-native probabilistic programming while working on an unrelated hobby project.
By "GPU-native" I mean the entire inference algorithm runs inside GPU kernels with no CPU coordination - no Python overhead, no kernel launch latency between steps.
I benchmarked against NumPyro, JAX, and GPyTorch on 15 different inference algorithms. I don't have a statistics background, but I tried my best to track the metrics experts care about.
My R-hat values are 0.9999-1.0003 (should be ~1.0), and ESS/second is up to 600x better on HMC. Some quality metrics favor the baseline implementations - I'm not claiming this beats everything on every dimension, just that it's significantly faster with comparable quality.
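For anyone who wants to sanity-check what those numbers mean: R-hat here refers to the standard split-chain Gelman-Rubin statistic. A plain NumPy version of the textbook formula (reference math, not my benchmark code):

```python
import numpy as np

def split_rhat(chains):
    """Split-chain Gelman-Rubin R-hat for `chains` shaped (n_chains, n_draws)."""
    half = chains.shape[1] // 2
    # split each chain in two so within-chain drift also inflates R-hat
    splits = np.concatenate([chains[:, :half], chains[:, half:2 * half]])
    m, n = splits.shape
    chain_means = splits.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = splits.var(axis=1, ddof=1).mean()    # within-chain variance
    var_hat = (n - 1) / n * W + B / n        # pooled variance estimate
    return np.sqrt(var_hat / W)

# toy check: independent normal draws should give R-hat ~= 1.0
rng = np.random.default_rng(0)
print(split_rhat(rng.normal(size=(4, 5000))))
```

ESS/second is effective sample size divided by wall-clock sampling time; ArviZ's `az.ess` is a standard estimator for the numerator.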
Tested on an RTX 4060 Laptop GPU. Full benchmark results: https://github.com/Aeowulf/nativeppl-results
I'm not sharing implementation details yet, as I'm still figuring out what to make of this discovery. But I'd appreciate feedback on:
- Are these benchmarks meaningful/fair? (See the timing sketch after this list.)
- What other algorithms or problem sizes should I test?
- Is there a market for faster probabilistic inference?
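On the fairness question specifically, here's a generic sketch of what I understand "fair" to require: warm-up runs so JIT compilation isn't billed to either side, and a device sync before stopping the clock. `run_sampler` and `sync` are hypothetical stand-ins for whatever framework is being measured.

```python
import time

def time_sampler(run_sampler, sync, n_warmup=2, n_reps=5):
    """Time one inference pass fairly on a GPU.

    run_sampler -- hypothetical callable running one full sampling pass
    sync        -- blocks until device work is done, e.g.
                   lambda r: torch.cuda.synchronize() for PyTorch,
                   or jax.block_until_ready for JAX
    """
    for _ in range(n_warmup):
        sync(run_sampler())            # absorb JIT compilation / autotuning
    times = []
    for _ in range(n_reps):
        t0 = time.perf_counter()
        sync(run_sampler())            # GPU work isn't finished until sync
        times.append(time.perf_counter() - t0)
    return min(times)                  # min is robust to background jitter
```

ESS/second then comes from dividing the ESS of the returned draws by this wall time.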