Show HN: 17MB pronunciation scorer beats human experts at phoneme level
fabiosuizu Friday, February 20, 2026
I built an English pronunciation assessment engine that fits in 17MB and runs in under 300ms on CPU.
Architecture: CTC forced alignment + GOP scoring + ensemble heads (MLP + XGBoost). No wav2vec2 or large self-supervised models — the entire pipeline uses a quantized NeMo Citrinet-256 as the acoustic backbone.
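For readers unfamiliar with GOP (Goodness of Pronunciation), here is a minimal sketch of one common posterior-based variant. It is not the engine's actual code. It assumes forced alignment has already produced a frame range for each expected phone, and the names `gop_scores`, `log_posteriors`, and `segments` are hypothetical. Each phone's score is the mean log-posterior of the expected phone minus the mean log-posterior of the best-scoring phone over the same frames, so 0 means the expected phone was the top choice in every frame and more negative values suggest mispronunciation.

```python
import numpy as np

def gop_scores(log_posteriors, segments):
    """Posterior-based GOP sketch (illustrative, not the production scorer).

    log_posteriors: (T, V) array of frame-level log-probabilities over V phones.
    segments: list of (phone_id, start_frame, end_frame), end exclusive,
              assumed to come from a forced aligner.
    Returns one score per segment; each score is <= 0.
    """
    scores = []
    for phone_id, start, end in segments:
        frames = log_posteriors[start:end]
        target = frames[:, phone_id].mean()   # expected phone
        best = frames.max(axis=1).mean()      # best competing phone per frame
        scores.append(float(target - best))
    return scores

# Toy posteriors for 4 frames and 3 phones (made-up numbers).
toy = np.log(np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.2, 0.3, 0.5],
]))
print(gop_scores(toy, [(0, 0, 2), (2, 2, 4)]))
```

In a pipeline like the one described, per-phone values of this kind would be among the features passed to the MLP and XGBoost heads; the exact feature set is not stated in this post.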
Benchmarked on speechocean762 (standard academic benchmark, 2500 utterances):
- Phone accuracy (PCC): 0.580, exceeds human inter-annotator agreement (0.555)
- Sentence accuracy: 0.710, exceeds human agreement (0.675)
- Model is 70x smaller than wav2vec2-based SOTA
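PCC here is the Pearson correlation coefficient between model scores and human-annotated scores. As a rough sketch of how such a number is computed (the function name `pearson_cc` and the sample values are made up for illustration and are not benchmark data):

```python
import numpy as np

def pearson_cc(model_scores, human_scores):
    """Pearson correlation between two equal-length score lists."""
    x = np.asarray(model_scores, dtype=float)
    y = np.asarray(human_scores, dtype=float)
    x = x - x.mean()  # center both series
    y = y - y.mean()
    return float((x @ y) / np.sqrt((x @ x) * (y @ y)))

# Made-up scores on a 0-2 phone scale, for illustration only.
model = [1.8, 0.4, 1.1, 2.0, 0.9]
human = [2.0, 0.0, 1.0, 2.0, 1.0]
print(round(pearson_cc(model, human), 3))
```

The human-agreement figures are presumably the same statistic computed between annotators, which is what makes the comparison meaningful.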
Trade-off: we're ~10-15% below SOTA on raw accuracy. But for real-time feedback in language learning apps, the latency/size trade-off is worth it.
Available as REST API, MCP server (for AI agents), and on Azure Marketplace.
Demo: https://huggingface.co/spaces/fabiosuizu/pronunciation-asses...
Interested in feedback on the scoring approach and use cases people would find valuable.