Ask HN: Why is Apple's voice transcription hilariously bad?
keepamovin · Thursday, January 01, 2026
Even 2–3 years ago, OpenAI's Whisper models delivered better, near-instant voice transcription offline, and the model was only around 500 MB. With that context, it's hard to understand how Apple's transcription, which runs online on powerful servers, performs so poorly today.
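For reference, this is roughly what that offline setup looks like; a minimal sketch assuming the open-source openai-whisper Python package and its "small" checkpoint (about 460 MB), with the audio file name as a placeholder:

```python
# Minimal offline transcription sketch using the open-source
# openai-whisper package (pip install openai-whisper).
# "small" is the ~460 MB checkpoint; "audio.m4a" is a placeholder path.
import whisper

model = whisper.load_model("small")     # downloads the weights once, then runs locally
result = model.transcribe("audio.m4a")  # no network call involved
print(result["text"])
```

That's the whole pipeline, running on a laptop with no server in the loop.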
Here are real examples from using the native iOS app just now:
- “BigQuery update” → “bakery update”
- “GitHub” → “get her”
- “CI build” → “CI bill”
- “GitHub support” → “get her support”
These aren’t obscure terms — they’re extremely common words in software, spoken clearly in casual contexts. The accuracy gap feels especially stark compared to what was already possible years ago, even fully offline.
Is this primarily a model-quality issue, a streaming/segmentation problem, aggressive post-processing, or something architectural in Apple’s speech stack? What are the real technical limitations, and why hasn’t it improved despite modern hardware and cloud processing?
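For comparison, the open-source side at least exposes an obvious hook for domain vocabulary; here is a minimal sketch, again assuming the openai-whisper package, where the prompt text and file name are purely illustrative:

```python
# Sketch of biasing Whisper toward domain terms via initial_prompt,
# which is fed to the decoder as context before the first audio window.
# The prompt wording and file name below are illustrative only.
import whisper

model = whisper.load_model("small")
result = model.transcribe(
    "standup_notes.m4a",
    initial_prompt="Engineering standup: BigQuery, GitHub, CI build, deployment pipeline.",
)
print(result["text"])
```

Whether Apple's dictation pipeline has (or actually uses) an equivalent vocabulary-biasing hook is exactly the kind of architectural detail I'm curious about.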