Show HN: TokenDagger – A tokenizer faster than OpenAI's Tiktoken

matthewolfe Monday, June 30, 2025

TokenDagger is a drop-in replacement for OpenAI’s Tiktoken (the tokenizer behind Llama 3, Mistral, GPT-3.*, etc.). It’s written in C++17 with thin Python bindings, keeps exactly the same BPE vocab and special-token rules, and focuses on raw speed.

I’m teaching myself LLM internals by re-implementing the stack from first principles. Profiling Tiktoken’s Python/Rust implementation showed that a lot of time was spent on regex matching. Most of my perf gains come from (a) using a faster JIT-compiled regex engine, and (b) simplifying the algorithm to forgo regex matching of special tokens entirely.
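To illustrate the second idea (this is a minimal sketch of the technique, not TokenDagger's actual code; the function name and structure are hypothetical): instead of folding special tokens into the regex alternation, you can find them with plain substring search and only run the BPE/regex path on the text between them.

```python
def split_on_special(text, special_tokens):
    """Split `text` into (is_special, chunk) pairs using plain substring
    search instead of a regex alternation over the special tokens."""
    spans = []
    pos = 0
    while pos < len(text):
        # Find the earliest next occurrence of any special token.
        best = None  # (start_index, token)
        for tok in special_tokens:
            i = text.find(tok, pos)
            if i != -1 and (best is None or i < best[0]):
                best = (i, tok)
        if best is None:
            # No more special tokens: the rest is ordinary text.
            spans.append((False, text[pos:]))
            break
        start, tok = best
        if start > pos:
            spans.append((False, text[pos:start]))
        spans.append((True, tok))
        pos = start + len(tok)
    return spans

chunks = split_on_special("hello<|eot|>world", ["<|eot|>"])
# chunks == [(False, "hello"), (True, "<|eot|>"), (False, "world")]
```

Ordinary chunks then go through the regular BPE pipeline, while special chunks map directly to their token IDs, so the (typically small) set of special tokens never inflates the hot regex.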

Benchmarking code is included. Notable results:

- 4x faster tokenization of code samples on a single thread.
- 2-3x higher throughput on a 1 GB natural-language text file.
