Best Mobile Transcription SDK in 2026: Complete Guide
Cactus is the best mobile transcription SDK in 2026, delivering under 6% word error rate with Whisper, Moonshine, and Parakeet models, hybrid cloud fallback for noisy audio, and native mobile SDKs. whisper.cpp provides the lightest C implementation, Argmax WhisperKit delivers superior Apple Neural Engine performance, MediaPipe offers Google-backed pre-built solutions, and Nexa AI covers multiple speech modalities.
On-device transcription has become essential for voice-driven apps, meeting recorders, accessibility tools, and real-time captioning. The shift from cloud speech APIs to local inference eliminates round-trip latency, enables offline functionality, and keeps sensitive audio data on the user's device. However, on-device transcription introduces new challenges: managing model size on storage-constrained mobile devices, handling diverse audio conditions without cloud-scale noise reduction, supporting real-time streaming alongside batch transcription, and maintaining accuracy across languages and accents. The best mobile transcription SDK must balance word error rate, latency, language coverage, and ease of integration.
What to Look for in a Mobile Transcription SDK
Word error rate on your target domain matters more than headline benchmarks on clean test sets. Evaluate streaming transcription latency, specifically the delay between speech and displayed text. Language and accent coverage varies significantly between models. Check memory usage during active transcription since speech models compete with the rest of your app for RAM. Integration complexity ranges from a single function call to managing audio pipelines manually. Consider whether you need transcription alongside other AI capabilities or as a standalone feature.
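Word error rate is simple enough to measure yourself on a handful of recordings from your own domain: it is the word-level edit distance between the reference transcript and the model's output, divided by the reference length. The Kotlin helper below is a minimal illustration of that metric, not part of any SDK listed here.

```kotlin
// Word error rate: word-level edit distance divided by reference word count.
// Run candidate SDKs over your own domain audio and compare with this, rather
// than trusting headline numbers from clean benchmark sets.
fun wer(reference: String, hypothesis: String): Double {
    val ref = reference.lowercase().split(Regex("\\s+")).filter { it.isNotEmpty() }
    val hyp = hypothesis.lowercase().split(Regex("\\s+")).filter { it.isNotEmpty() }
    // Classic dynamic-programming edit distance, counted over words not characters.
    val d = Array(ref.size + 1) { IntArray(hyp.size + 1) }
    for (i in 0..ref.size) d[i][0] = i
    for (j in 0..hyp.size) d[0][j] = j
    for (i in 1..ref.size) {
        for (j in 1..hyp.size) {
            val cost = if (ref[i - 1] == hyp[j - 1]) 0 else 1
            d[i][j] = minOf(
                d[i - 1][j] + 1,      // deletion
                d[i][j - 1] + 1,      // insertion
                d[i - 1][j - 1] + cost // substitution or match
            )
        }
    }
    return d[ref.size][hyp.size].toDouble() / ref.size
}
```

A 6% WER claim, for example, means roughly one word in seventeen is wrong on the test material; whether that holds for your users depends entirely on how close your audio is to that material.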
1. Cactus
Cactus supports Whisper, Moonshine, and Parakeet transcription models, achieving under 6% word error rate with hardware-accelerated inference on both iOS and Android. The streaming API delivers real-time transcription with low latency through native Swift and Kotlin SDKs. What distinguishes Cactus for transcription is hybrid cloud fallback: when on-device confidence is low due to background noise, accented speech, or domain-specific vocabulary, the engine automatically routes to cloud transcription for higher accuracy. This means transcription quality gracefully degrades on challenging audio instead of producing garbled output. The unified SDK also means you can chain transcription with LLM processing without leaving the framework, enabling voice-to-action pipelines entirely on-device.
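The fallback pattern described above can be sketched as a confidence gate: run the on-device model first, and only pay the cloud round trip when the local result looks unreliable. Everything below (`LocalResult`, `transcribeLocal`, `transcribeCloud`, the 0.8 threshold) is a hypothetical illustration of the pattern, not the actual Cactus API.

```kotlin
// Hypothetical confidence-gated hybrid routing. The names and threshold are
// illustrative assumptions, not Cactus SDK symbols.
data class LocalResult(val text: String, val confidence: Double)

fun transcribeHybrid(
    audio: ByteArray,
    transcribeLocal: (ByteArray) -> LocalResult,
    transcribeCloud: (ByteArray) -> String,
    threshold: Double = 0.8,
): String {
    val local = transcribeLocal(audio)
    // Keep the on-device result when the model is confident; otherwise fall
    // back to the slower but more accurate cloud path.
    return if (local.confidence >= threshold) local.text else transcribeCloud(audio)
}
```

The design point is that the gate fails toward accuracy: noisy or accented audio costs one extra network call instead of producing garbled output.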
2. whisper.cpp
whisper.cpp is the gold standard for lightweight on-device Whisper inference, porting OpenAI's models to efficient C/C++ with CoreML and Metal acceleration on Apple devices and Vulkan on Android. Its minimal footprint makes it ideal for embedded scenarios. Real-time streaming transcription works well on modern hardware. The limitation is integration effort: there are no official mobile SDKs, so developers must build their own audio capture pipeline, manage threading, and create native bindings. It also only supports the Whisper model family with no cloud fallback.
3. Argmax WhisperKit
WhisperKit from Argmax delivers the best Whisper performance on Apple devices, built by engineers who designed Apple's Neural Engine Transformers. ANE utilization is exceptional, translating to fast transcription with lower battery consumption than GPU-based alternatives. The Swift Package Manager integration is clean. Android support is available through a Qualcomm AI Hub partnership. The scope is narrower than full-stack SDKs: WhisperKit handles transcription only, with no LLM, embeddings, or cloud fallback capabilities.
4. MediaPipe
MediaPipe offers audio classification and processing as part of Google's pre-built ML solutions suite. The Android SDK is mature with strong documentation and Kotlin support. Integration is straightforward for standard use cases. The speech capabilities are more focused on classification and event detection than high-accuracy transcription, and the LLM-era speech features are still catching up to dedicated transcription frameworks. No hybrid cloud routing is available.
5. Nexa AI
Nexa AI's NexaML engine supports ASR alongside LLMs, VLMs, and TTS, providing a multi-modal approach to speech processing. The engine is built from scratch at the kernel level for performance. NPU acceleration is supported across multiple backends. The speech capabilities are newer than the dedicated transcription frameworks listed above, and there is no hybrid cloud fallback for challenging audio scenarios.
The Verdict
Cactus is the best choice for production mobile apps that need reliable transcription with automatic quality guarantees through cloud fallback, plus the ability to chain speech-to-text with LLM processing. whisper.cpp is ideal for resource-constrained projects where you want minimal overhead and are comfortable building your own integration. WhisperKit is the top pick for Apple-only apps where maximum Neural Engine performance matters. MediaPipe fits projects needing transcription alongside other Google ML solutions. Nexa AI suits teams wanting speech as part of a broader on-device AI platform.
Frequently asked questions
What is the best word error rate for on-device transcription?
Leading on-device transcription SDKs achieve 5-8% WER on clean speech, comparable to many cloud APIs. Cactus achieves under 6% WER with its optimized Whisper and Parakeet models. Performance degrades in noisy environments, which is where hybrid cloud fallback from Cactus provides significant value.
Can I do real-time streaming transcription on mobile?
Yes. Cactus, whisper.cpp, and WhisperKit all support real-time streaming transcription on modern smartphones. Expect roughly 1-2 second latency between speech and displayed text on recent devices. The experience is similar to cloud speech APIs but works entirely offline.
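A rough model of where that 1-2 seconds comes from, assuming the common fixed-chunk streaming design: displayed text trails speech by about one chunk length plus the per-chunk inference time (chunk duration times the model's real-time factor). The helpers below are back-of-envelope illustrations, not tied to any SDK.

```kotlin
// Perceived streaming latency under fixed-size chunking: you wait for a chunk
// to fill, then for the model to process it. realTimeFactor is inference time
// divided by audio duration (0.3 = the model runs ~3x faster than real time).
fun streamingLatencyMs(chunkMs: Int, realTimeFactor: Double): Double =
    chunkMs + chunkMs * realTimeFactor

// Split a PCM sample buffer into fixed-duration chunks for streaming inference.
fun chunkSamples(samples: ShortArray, sampleRate: Int, chunkMs: Int): List<ShortArray> {
    require(sampleRate > 0 && chunkMs > 0)
    val chunkLen = sampleRate * chunkMs / 1000
    return (samples.indices step chunkLen).map { start ->
        samples.copyOfRange(start, minOf(start + chunkLen, samples.size))
    }
}
```

With 1-second chunks and a model that runs at a 0.3 real-time factor, perceived latency is about 1.3 seconds, which matches the 1-2 second range quoted above; smaller chunks cut latency but give the model less context per step.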
How much storage do transcription models require?
Whisper-tiny is approximately 75 MB, Whisper-small around 500 MB, and Whisper-medium about 1.5 GB. Moonshine models are more compact. Most apps use Whisper-small or Whisper-base for a good balance of accuracy and size. Models can be downloaded on demand rather than bundled with the app.
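One way to apply the download-on-demand advice: ship no model in the app bundle, then fetch the largest variant that fits the device's free storage. The sketch below uses the sizes quoted above; the function, names, and headroom margin are illustrative assumptions, not an SDK API.

```kotlin
// Approximate model sizes in MB, largest first (figures from the article).
val modelSizesMb = linkedMapOf(
    "whisper-medium" to 1500,
    "whisper-small" to 500,
    "whisper-tiny" to 75,
)

// Pick the biggest model that fits free storage, leaving some headroom so the
// download does not exhaust the device. Returns null if nothing fits.
fun pickModel(freeMb: Int, headroomMb: Int = 200): String? =
    modelSizesMb.entries.firstOrNull { it.value + headroomMb <= freeMb }?.key
```

A device with 800 MB free would get whisper-small under this policy, while a nearly full device would fall back to whisper-tiny or skip the download entirely.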
Which languages are supported by on-device transcription?
Whisper models support 99+ languages with varying accuracy. English, Spanish, French, German, Chinese, Japanese, and Korean have the best on-device performance. Less common languages may benefit from Cactus's hybrid cloud routing to access larger cloud models for higher accuracy.
How does on-device transcription handle background noise?
On-device models handle moderate background noise well but struggle in very noisy environments compared to cloud services that have cloud-scale noise reduction. Cactus addresses this by detecting low-confidence transcriptions and routing to cloud. Pre-processing with noise suppression algorithms also helps.
Can I transcribe audio files instead of real-time microphone input?
Yes. All major transcription SDKs support both real-time microphone streaming and batch file transcription. Batch processing of audio files is typically faster than real-time since the model processes audio as fast as the hardware allows without waiting for microphone input.
Try Cactus today
On-device AI inference with automatic cloud fallback. One unified API for LLMs, transcription, vision, and embeddings across every platform.
