Diffusion LLMs may make most of the AI engineering stack obsolete
victorpiles99 · Wednesday, March 11, 2026

I've been deep-diving into diffusion language models this week and I think this is the most underrated direction in AI right now.
The core issue with autoregressive LLMs:
Every major model today (GPT, Claude, Gemini) generates one token at a time, left to right. Each token depends on the previous one. This single architectural constraint has shaped the entire AI industry:
- Models can't revise what they already wrote → we build chain-of-thought, reflection, and multi-pass reasoning to force them to "think before committing"
- One forward pass per token → we invest heavily in speculative decoding, KV caches, and quantization to make generation tolerable
- Can't edit mid-output → we build agent frameworks with retry loops, tool calls, and planning layers to work around it
- Can't generate in parallel → we build orchestration systems that chain multiple slow calls together
Most of what we call "AI engineering" today is patching around one thing: the model can't look back.
Diffusion LMs flip the paradigm. Start with a canvas of masked tokens, then iteratively refine the entire output in parallel: every position is updated simultaneously, and the model sees and can edit all of its output at every step. It's the same principle as image diffusion (Stable Diffusion, DALL-E), applied to text.
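To make the "masked canvas" idea concrete, here's a toy sketch of confidence-based parallel decoding. The `toy_model` is a random stand-in (a real diffusion LM predicts every position from the full canvas); the loop structure — predict everywhere, commit the most confident positions, repeat — is the part that matters.

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat", "."]

def toy_model(canvas):
    """Stand-in for a diffusion LM: returns a (token, confidence)
    guess for every position, conditioned on the whole canvas."""
    return [(random.choice(VOCAB), random.random()) for _ in canvas]

def diffusion_decode(length, steps=4):
    canvas = [MASK] * length          # start from a fully masked canvas
    per_step = max(1, length // steps)
    for _ in range(steps):
        preds = toy_model(canvas)
        # rank the still-masked positions by model confidence
        masked = [i for i, tok in enumerate(canvas) if tok == MASK]
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        # commit only the most confident predictions; the rest stay
        # masked and get re-predicted next step, with more context
        for i in masked[:per_step]:
            canvas[i] = preds[i][0]
        if MASK not in canvas:
            break
    return canvas
```

Each step fills in several positions at once, which is where the parallelism speedup comes from: the number of model calls scales with the number of refinement steps, not the number of tokens.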
Why I think the theory actually holds:
1. Parallelism is real, not theoretical. Inception Labs' Mercury 2 (closed-source, diffusion-based) already hits ~1000 tok/s with quality competitive with GPT-4o mini on MMLU, HumanEval, MATH. That's not a benchmark trick — it's a direct consequence of not being bottlenecked by sequential generation.
2. The complexity reduction is massive. If a model can see and edit its entire output at once, you don't need half the scaffolding we've built: reflection prompting becomes native (the model already iterates on its own output), retry loops become unnecessary (edit in place), planning agents get simpler (the model can restructure, not just append). The whole stack flattens.
3. The conversion path exists. You can take an existing pretrained AR model and convert it to diffusion via fine-tuning alone — no pretraining from scratch. This means the billions already invested in AR pretraining aren't wasted. It's an upgrade path, not a restart.
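On point 3, the conversion fine-tuning objective is essentially masked denoising: corrupt a sequence by masking a random fraction of positions, then score the model only on recovering the masked tokens. A toy sketch of that loss, assuming a LLaDA-style 1/t reweighting; `model_logprob` is a hypothetical stand-in, not any library's API:

```python
import random

def masked_diffusion_loss(tokens, mask_id, model_logprob):
    """One training example for AR->diffusion conversion, sketched:
    sample a masking ratio t, mask roughly that fraction of positions,
    and penalize the model only on the masked tokens."""
    t = random.random()  # masking ratio ~ Uniform(0, 1)
    masked_idx = [i for i in range(len(tokens)) if random.random() < t]
    corrupted = [mask_id if i in masked_idx else tok
                 for i, tok in enumerate(tokens)]
    # cross-entropy on masked positions only, reweighted by 1/t
    loss = 0.0
    for i in masked_idx:
        loss -= model_logprob(corrupted, i, tokens[i])
    return loss / max(t, 1e-6) / len(tokens)
```

Because this is just a different loss on the same transformer, an AR checkpoint can be fine-tuned into a diffusion model without touching the pretraining recipe — which is why the upgrade path exists at all.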
The main limitation today: fixed output length. You must pre-allocate the canvas size before generation starts. Block Diffusion (generating in sequential chunks, diffusing within each chunk) is one workaround. Hierarchical generation — outline first, expand sections in parallel — is another. Ironically, orchestrating that requires an agent, so diffusion doesn't kill agents — it changes what they do.
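The Block Diffusion workaround reduces to a simple control loop: commit fixed-size blocks left to right, but fill each block's positions in parallel, conditioned on everything already committed. A minimal sketch, with `denoise_block` as a hypothetical stand-in for the within-block diffusion step:

```python
def block_diffusion_decode(total_len, block_size, denoise_block):
    """Sketch of Block Diffusion: sequential across blocks,
    parallel (diffusion) within each block."""
    output = []
    while len(output) < total_len:
        n = min(block_size, total_len - len(output))
        # denoise_block sees the committed prefix and returns n tokens,
        # filled in parallel via diffusion within the block
        output.extend(denoise_block(output, n))
    return output
```

Note the outer loop can stop whenever the model emits an end-of-sequence token, which is how block-wise generation sidesteps the fixed-canvas problem.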
Honest take: Open diffusion LMs still trail top AR models on knowledge and reasoning at comparable scale. But Mercury 2 shows the ceiling is high, the conversion results are surprisingly good, and the architecture eliminates entire categories of engineering complexity. I think within a year we'll see diffusion models competitive with frontier AR models, and when that happens, a lot of the current tooling (agent frameworks, prompt engineering techniques, inference optimization stacks) gets dramatically simpler or unnecessary.
While researching all this I found dLLM, an open-source library that unifies training, inference, and evaluation for diffusion LMs. It has recipes for LLaDA, Dream, Block Diffusion, and converting any AR model to diffusion. Good starting point if you want to experiment.
Paper: https://arxiv.org/abs/2602.22661
Code: https://github.com/ZHZisZZ/dllm
Models: https://huggingface.co/dllm-hub
What is your opinion?