
Launch HN: Cactus (YC S25) – AI inference on smartphones

HenryNdubuaku Thursday, September 18, 2025

Hey HN, Henry & Roman here. We're building Cactus (https://cactuscompute.com/), an AI inference engine designed specifically for phones.

We're seeing a major push towards on-device AI, and for good reason: on-device AI decreases latency from >1sec to <100ms, guarantees privacy by default, works offline, and doesn't rack up a massive API bill at scale.

Also, tool use and agentic designs make small models far more capable than their benchmark scores suggest. This has been corroborated by papers like https://arxiv.org/abs/2506.02153, and we see model companies like DeepMind pushing aggressively into smaller models with Gemma3 270m and 308m. We found Qwen3 600m to be great at tool calls, for instance.

Some frameworks already try to solve this, but at my previous job they struggled in production compared to research settings and playgrounds:

- They optimise for modern devices, but 70% of phones in use today are low- to mid-budget.

- Bloated app bundle sizes and battery drain are serious concerns for users.

- Phone GPUs drain the battery unacceptably; NPUs are preferred, but few phones have them for now.

- Some are platform-specific, requiring different models and workflows for different operating systems.

At Cactus, we’ve written the kernels and inference engine for running AI locally on any phone from the ground up.

Cactus is designed around mobile devices and their constraints. Every design choice, from energy efficiency, accelerator support, and quantization levels to supported models, weight format, and context management, was driven by those constraints. We also provide minimalist SDKs that let app developers build agentic workflows in 2-5 lines of code.
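To make the agentic-workflow claim concrete, here is a rough sketch of what a tool-calling loop around an on-device model can look like. This is an illustration only: LocalLm, completeText, Tool, and OnDeviceAgent are hypothetical names, not the actual Cactus SDK surface; see the GitHub repo for the real API.

```kotlin
// Hypothetical sketch of a tool-calling agent on top of an on-device LLM.
// LocalLm / completeText / Tool are placeholder names, not the real Cactus API.

interface LocalLm {
    fun completeText(prompt: String): String
}

data class Tool(
    val name: String,
    val description: String,
    val run: (args: String) -> String,
)

class OnDeviceAgent(private val lm: LocalLm, private val tools: List<Tool>) {
    fun ask(question: String): String {
        val toolList = tools.joinToString("\n") { "- ${it.name}: ${it.description}" }
        // Ask the small model to either answer directly or request a tool.
        val first = lm.completeText(
            "You can call tools by replying 'TOOL <name> <args>'.\n$toolList\nUser: $question"
        ).trim()
        if (!first.startsWith("TOOL ")) return first
        // Run the requested tool locally, then let the model finish the answer.
        val parts = first.removePrefix("TOOL ").split(" ", limit = 2)
        val tool = tools.firstOrNull { it.name == parts[0] } ?: return first
        val observation = tool.run(parts.getOrElse(1) { "" })
        return lm.completeText("User: $question\n${tool.name} returned: $observation\nFinal answer:")
    }
}

fun main() {
    // Stubbed model so the sketch runs anywhere; swap in the real SDK handle on device.
    val stub = object : LocalLm {
        override fun completeText(prompt: String) =
            if ("returned:" in prompt) "It is 14:05." else "TOOL clock now"
    }
    val clock = Tool("clock", "Returns the current time") { java.time.LocalTime.now().toString() }
    println(OnDeviceAgent(stub, listOf(clock)).ask("What time is it?"))
}
```

The point is that the model only has to pick a local tool and phrase the result, which is the regime where we found sub-1B models like Qwen3 600m to hold up well.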

We made a Show HN post when we started the project to get the community's thoughts (https://news.ycombinator.com/item?id=44524544). Based on your feedback, we built Cactus bottom-up to solve those problems, and are launching the Cactus Kernels, Cactus Graph and Cactus Engine, all designed for phones and tiny devices.

CPU benchmarks for Qwen3-600m-INT8:

- 16-20 toks/sec on Pixel 6a / Galaxy S21 / iPhone 11 Pro

- 50-70 toks/sec on Pixel 9 / Galaxy S25 / iPhone 16.

- Time-to-first-token is as low as 50ms depending on prompt size.

On NPUs, we see Qwen3-4B-INT4 run at 21 toks/sec.

We are open-source (https://github.com/cactus-compute/cactus). Cactus is free for hobbyists and personal projects, with a paid license required for commercial use.

We have a demo app on the App Store at https://apps.apple.com/gb/app/cactus-chat/id6744444212 and on Google Play at https://play.google.com/store/apps/details?id=com.rshemetsub....

In addition, numerous apps use Cactus in production, including AnythingLLM (https://anythingllm.com/mobile) and KinAI (https://mykin.ai/); collectively they run over 500k inference tasks per week.

While Cactus runs on all Apple devices, including MacBooks, thanks to their shared design, for desktops and AMD/Intel/Nvidia hardware generally, please use HuggingFace, Llama.cpp, Ollama, vLLM, or MLX. They're built for those platforms, support x86, and are all great!

Thanks again. Please share your thoughts; we’re keen to hear your views.
