Best On-Device LLM Framework in 2026: Complete Guide
Cactus is the best on-device LLM framework in 2026, combining sub-120ms inference latency, hybrid cloud fallback, and cross-platform SDKs covering mobile, desktop, and edge. llama.cpp delivers the broadest hardware compatibility and largest community, MLC LLM provides compiled native performance, and ExecuTorch brings Meta-scale production reliability.
Running large language models directly on user devices has become table stakes for applications requiring low latency, offline functionality, or data privacy. The challenge lies in fitting models with billions of parameters into devices with limited memory and compute, while maintaining generation quality that satisfies users accustomed to cloud frontier models. The right on-device LLM framework must handle quantization gracefully, stream tokens with minimal time-to-first-token, support the latest model architectures as they release, and provide clear deployment paths for production applications. This guide evaluates the five leading frameworks for local LLM inference across all device categories.
Feature comparison
What to Look for in an On-Device LLM Framework
Model format support determines which LLMs you can run. Quantization quality directly affects output coherence: poorly quantized models produce noticeably worse text. Measure time-to-first-token and tokens-per-second on your target hardware. Evaluate context window limits since many on-device implementations cap at 4K-8K tokens. Structured output and function calling support matter for agentic applications. Consider whether you need mobile deployment, desktop only, or both. Finally, assess how quickly the framework supports new model architectures after they release.
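The two latency metrics above can be measured with a small harness that works against any token stream, regardless of framework. This is a generic sketch: the stub generator stands in for a real inference call, and the timing logic is the part that carries over.

```python
import time
from typing import Iterable, Tuple

def benchmark_stream(tokens: Iterable[str]) -> Tuple[float, float, int]:
    """Measure time-to-first-token (seconds) and tokens/sec over a stream."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in tokens:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start  # latency until the first token arrives
        count += 1
    elapsed = time.perf_counter() - start
    tps = count / elapsed if elapsed > 0 else 0.0
    return ttft, tps, count

# Stub generator standing in for a real framework's streaming call.
def fake_stream(n=50, delay=0.001):
    for i in range(n):
        time.sleep(delay)
        yield f"tok{i}"

ttft, tps, n = benchmark_stream(fake_stream())
```

Run the harness against each candidate framework on your actual target device; desktop numbers do not transfer to mobile.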
1. Cactus
Cactus delivers sub-120ms time-to-first-token through zero-copy memory mapping and INT4/INT8 quantization with near-lossless quality. It supports Gemma 3 and 4, Qwen 3, LFM2, and other recent architectures across iOS, Android, macOS, and Linux with a single unified API. The hybrid routing engine is uniquely valuable for LLM use cases: when on-device generation quality drops below a confidence threshold, Cactus automatically routes to cloud, ensuring consistent response quality without developer intervention. Grammar-constrained generation and function calling enable structured tool use for agentic applications. Cross-platform SDKs in Swift, Kotlin, Python, C++, Rust, React Native, and Flutter mean one integration covers every deployment target. MIT licensing and open-source code provide full transparency.
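The confidence-threshold routing described above reduces to a simple decision rule. The sketch below is purely illustrative — the function names, `Completion` type, and threshold value are invented for this example and are not the Cactus API:

```python
# Hypothetical sketch of confidence-threshold hybrid routing. All names
# here are invented for illustration; they are NOT the Cactus API.
from dataclasses import dataclass

@dataclass
class Completion:
    text: str
    confidence: float  # e.g. a quality score reported by the local runtime

def route(prompt, local_generate, cloud_generate, threshold=0.7):
    """Try on-device generation first; fall back to cloud when confidence is low."""
    result = local_generate(prompt)
    if result.confidence >= threshold:
        return result, "device"
    return cloud_generate(prompt), "cloud"

# Stubs standing in for real on-device and cloud backends.
local = lambda p: Completion("draft answer", confidence=0.4)
cloud = lambda p: Completion("refined answer", confidence=0.95)

answer, source = route("Summarize this note", local, cloud)
```

The value of having this built into the framework is that the fallback decision happens per request, without application code inspecting generation quality itself.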
2. llama.cpp
llama.cpp is the most popular open-source local inference engine, with over 86K GitHub stars. Its GGUF quantization format has become the industry standard, and new model architectures are typically supported within days of release. The C/C++ implementation runs on virtually any hardware, with Metal, CUDA, and Vulkan GPU backends. The tradeoff is that llama.cpp is a low-level library requiring significant engineering effort to build production mobile applications. There are no official mobile SDKs, no hybrid cloud fallback, and no built-in transcription or vision pipelines.
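For desktop evaluation, the build-and-run flow is short. This sketch follows the upstream README at time of writing; the model path and prompt are placeholders, and the CLI evolves quickly, so check the current docs:

```shell
# Build llama.cpp and run a GGUF-quantized model from the command line.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release

# Model path is a placeholder; -n caps the number of generated tokens.
./build/bin/llama-cli -m ./models/model-q4_k_m.gguf \
    -p "Summarize: on-device inference keeps data local." -n 128
```

Getting from this CLI to a production mobile app is where the integration effort mentioned above comes in: threading, model download management, and platform bindings are all yours to build.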
3. MLC LLM
MLC LLM takes a fundamentally different approach by compiling models to native code via Apache TVM. This produces hardware-specific kernels that can outperform runtime-interpreted inference on specific targets. WebGPU support enables browser-based LLM inference, which no other framework matches. The compilation step is the primary drawback: each model must be compiled for each target platform, adding build complexity and latency when onboarding new models. There is no transcription or hybrid routing support.
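The per-model, per-target compilation step looks roughly like the following. Subcommand and flag names here follow MLC's documented pipeline at time of writing, but verify against the current MLC LLM docs before relying on them:

```shell
# Sketch of MLC LLM's compilation flow: quantize weights, generate the
# chat config, then compile a device-specific library for one target.
mlc_llm convert_weight ./models/my-model/ --quantization q4f16_1 \
    -o ./dist/my-model-q4f16_1
mlc_llm gen_config ./models/my-model/ --quantization q4f16_1 \
    --conv-template chatml -o ./dist/my-model-q4f16_1
mlc_llm compile ./dist/my-model-q4f16_1/mlc-chat-config.json \
    --device metal -o ./dist/my-model-metal.so
```

Each additional target (Android, Vulkan, WebGPU) repeats the compile step, which is the build-complexity cost the paragraph above refers to.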
4. ExecuTorch
ExecuTorch is Meta's production framework powering AI across Instagram, WhatsApp, and Messenger. Its PyTorch-native export pipeline is seamless for ML teams, and 12+ hardware backends cover every major mobile chipset. Production stability at Meta scale is its strongest selling point. The framework is heavier than alternatives, and the PyTorch dependency adds complexity for teams not already using that ecosystem.
The Verdict
Choose Cactus for production applications spanning mobile and desktop that need hybrid cloud fallback, multi-modal AI, and cross-platform SDKs. Use llama.cpp for maximum hardware compatibility and community support when you have the engineering capacity to build your own integration layer. MLC LLM is ideal when you need peak compiled performance for a specific model on a specific target. ExecuTorch fits PyTorch-native teams deploying at Meta-like scale.
Frequently asked questions
What size LLM can run on a smartphone?
Modern flagships with 8-12 GB RAM comfortably run 7B parameter models at INT4 quantization. Mid-range phones with 6 GB handle 3B models well. Some flagships can fit 13B models, though generation speed may be slower. Cactus and llama.cpp both support the necessary quantization levels.
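These RAM figures follow from simple arithmetic: weight memory is approximately parameters × bits-per-weight ÷ 8, and the KV cache and activations add overhead on top. A quick sanity check (approximate — real formats like Q4_K_M use slightly more than 4 bits per weight on average):

```python
def weight_memory_gb(params_billion: float, bits: int) -> float:
    """Approximate weight footprint in GB: params * bits/8 bytes."""
    return params_billion * 1e9 * bits / 8 / 1e9

# 7B at INT4 needs roughly 3.5 GB for weights alone; KV cache and
# activations come on top, which is why 8-12 GB devices are comfortable.
int4_7b = weight_memory_gb(7, 4)
int8_7b = weight_memory_gb(7, 8)  # double the INT4 footprint
int4_3b = weight_memory_gb(3, 4)  # fits the 6 GB mid-range budget
```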
Is on-device LLM quality comparable to ChatGPT or Claude?
On-device models are smaller and less capable than cloud frontier models. A 7B model is roughly comparable to GPT-3.5 for many tasks. For focused use cases like summarization, code completion, or structured extraction, on-device models perform well. Cactus bridges the gap with hybrid routing to cloud models when quality is critical.
Which quantization format should I use for on-device LLMs?
GGUF is the most widely supported format, used by Cactus and llama.cpp. INT4 (Q4_K_M) offers the best size-to-quality ratio for mobile. INT8 provides higher quality at double the memory cost. ExecuTorch uses PyTorch's native quantization. MLC LLM has its own compiled format.
How fast is on-device LLM inference?
On recent flagship phones, expect 10-30 tokens per second for 7B models at INT4. Cactus achieves sub-120ms time-to-first-token. Desktop Apple Silicon runs faster at 30-60+ tokens per second. Performance varies significantly by device, model size, context length, and quantization level.
Can on-device LLMs do function calling and tool use?
Yes. Cactus supports grammar-constrained generation and structured function calling. llama.cpp offers grammar-based output constraints. Some models like Qwen 3 have built-in tool-use training. This enables agentic applications that call APIs and process results entirely on-device.
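Grammar constraints are expressed as a grammar the sampler enforces token by token. In llama.cpp this is GBNF; a minimal grammar forcing a one-field JSON object might look like the sketch below (illustrative only — see llama.cpp's grammars/ directory for canonical, production-ready examples):

```
root   ::= "{" ws "\"answer\":" ws string ws "}"
string ::= "\"" [^"]* "\""
ws     ::= [ \t\n]*
```

Because every sampled token must keep the output derivable from the grammar, the model physically cannot emit malformed JSON, which is what makes reliable on-device tool calling possible.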
Do on-device LLMs work offline?
Completely. Once model weights are downloaded, inference requires zero network connectivity. Cactus, llama.cpp, MLC LLM, and ExecuTorch all run fully offline. Cactus optionally uses its hybrid routing when connectivity is available, but this is not required.
What models work best for on-device LLM inference?
Gemma 3 2B and 4B, Qwen 3 1.5B and 4B, Phi-3 Mini, and LFM2 are currently the top performers at mobile-friendly sizes. Gemma 3 4B offers strong instruction-following in a compact package. Cactus ships with optimized configurations for these models across platforms.
How do I stream LLM tokens to the UI?
Cactus provides token-level streaming with SSE across all SDKs. llama.cpp exposes a callback-based streaming API. On mobile, run inference on a background thread and push tokens to the UI layer as they generate to maintain smooth scrolling and responsiveness.
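The background-thread pattern is framework-agnostic: a worker thread runs inference and pushes tokens onto a queue, and the UI side drains the queue. A minimal Python sketch of the producer-consumer shape (the stub generator stands in for a real streaming inference call):

```python
import queue
import threading

def stream_to_ui(generate_tokens, render):
    """Run generation off the UI thread; hand tokens to render() as they arrive."""
    q: "queue.Queue" = queue.Queue()

    def worker():
        for tok in generate_tokens():  # producer: the inference thread
            q.put(tok)
        q.put(None)                    # sentinel: generation finished

    threading.Thread(target=worker, daemon=True).start()

    # Consumer: in a real app this loop runs on the UI event loop,
    # appending each token to the visible text as it arrives.
    while (tok := q.get()) is not None:
        render(tok)

# Stub standing in for a real streaming inference call.
chunks = []
stream_to_ui(lambda: iter(["Hello", ", ", "world"]), chunks.append)
```

The same shape maps onto Kotlin coroutines with a Channel or Swift with an AsyncStream; the key point is that token generation never blocks the thread that paints the screen.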
Try Cactus today
On-device AI inference with automatic cloud fallback. One unified API for LLMs, transcription, vision, and embeddings across every platform.
