DeepSeek made its models run faster, and then open-sourced the part that lets everyone else do it too, Zubnet AI News

DeepSeek has released DSpark, a speculative decoding framework that makes its DeepSeek-V4 Flash and Pro models generate text faster. It ships as enhanced checkpoints: the same underlying model with a small extra decoding module attached, not a new model with new capabilities. The point is not a smarter system but a cheaper and quicker one.

Speculative decoding is worth understanding because it is one of the quietest and most useful levers in AI economics. Normally a large model produces text one token at a time, each step waiting on the last, which is slow. With speculative decoding, a small fast draft model guesses several tokens ahead, and the large model checks all of those guesses at once. When the guesses are right, and they often are for ordinary text, you get the same output the big model would have produced, but in far fewer slow sequential steps. When a guess is wrong, it and everything drafted after it are thrown away and the large model's own token is used instead, so a bad draft costs speed, not accuracy. The result is identical quality at higher speed.
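
To make the draft-and-verify loop concrete, here is a minimal Python sketch of the greedy form of speculative decoding. It is not DSpark's code: `target_model` and `draft_model` are hypothetical toy functions standing in for the large and small models, and the "one pass" that checks every guess is simulated with a loop, where a real system would use a single batched forward pass.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # maps a token prefix to the next token


def target_model(prefix: List[int]) -> int:
    """Toy stand-in for the large model (hypothetical deterministic rule)."""
    return (prefix[-1] * 3 + len(prefix)) % 11


def draft_model(prefix: List[int]) -> int:
    """Toy stand-in for the draft model: agrees with the target except
    when the prefix length is a multiple of 4 (an arbitrary assumption)."""
    guess = target_model(prefix)
    return (guess + 1) % 11 if len(prefix) % 4 == 0 else guess


def plain_decode(target: Model, prompt: List[int], n_new: int) -> Tuple[List[int], int]:
    """Ordinary decoding: one sequential target pass per new token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target(tokens))
    return tokens, n_new


def speculative_decode(
    target: Model, draft: Model, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Greedy speculative decoding: draft k tokens, verify them together."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(tokens) < goal:
        # 1. The draft model guesses k tokens ahead, one after another.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # 2. One verification pass by the target (simulated position by position).
        passes += 1
        accepted: List[int] = []
        for guess in proposal:
            correct = target(tokens + accepted)
            accepted.append(correct)  # always keep the target's own token
            if guess != correct:
                break  # first wrong guess: discard the rest of the draft
        else:
            # Every guess matched, so the same pass yields one extra token.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], passes


plain_tokens, plain_passes = plain_decode(target_model, [1], 20)
spec_tokens, spec_passes = speculative_decode(target_model, draft_model, [1], 20)
print(spec_tokens == plain_tokens, plain_passes, spec_passes)
```

Because every token that is kept is the target model's own choice, the output matches plain decoding exactly; the only thing that changes is how many sequential passes of the large model it takes to get there.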

DSpark's specific contribution is in how it makes those guesses. It combines two existing approaches: a heavy parallel head, in the style of a method called DFlash, with a small sequential head that works more like the Eagle family, using a lightweight Markov step. The blend raises the acceptance rate, meaning more of the draft model's guessed tokens survive the big model's check, which is the number that actually determines how much speed you gain. By DeepSeek's own testing, DSpark beats both Eagle3 and DFlash, increasing accepted token length by roughly 16 to 31 percent and lifting throughput by anywhere from 51 percent to as much as 400 percent depending on the task, with lower latency.
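
A rough way to see why acceptance matters so much is a standard back-of-the-envelope model: if each drafted token is accepted independently with probability `alpha` and the draft proposes `k` tokens per step, the expected number of tokens produced per verification pass is (1 - alpha^(k+1)) / (1 - alpha). The Python sketch below evaluates that formula. The independence assumption is a simplification, and the `alpha` and `k` values are illustrative only, not figures from DeepSeek's testing.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification pass, assuming each of the
    k drafted tokens is accepted independently with probability alpha.
    The count includes the one token the large model always contributes."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)  # every guess accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


# Illustrative values only: compare a weaker and a stronger draft head.
for alpha in (0.6, 0.7, 0.8, 0.9):
    print(f"alpha={alpha:.1f}  tokens per pass={expected_tokens_per_pass(alpha, k=8):.2f}")
```

Under this model the payoff is nonlinear: each step up in acceptance rate adds more tokens per pass than the previous one, which is why a better draft head can translate into a disproportionate speedup. Real throughput also depends on how much the draft module itself costs to run, which this sketch ignores.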

The more consequential move is what DeepSeek did alongside the framework. It open-sourced DeepSpec, a full codebase for training and evaluating the small draft models that speculative decoding depends on, and crucially it is not limited to DeepSeek's own models. DeepSpec is built to work on other open models too, including Google's Gemma and Alibaba's Qwen. That turns a private speedup into a shared tool: anyone running those open models can train a draft model and capture similar gains, rather than waiting for each lab to ship its own proprietary version.

The honest caveats are the usual ones for performance claims. The numbers are DeepSeek's own and have not been independently verified, and speculative decoding gains swing widely with the workload, so the headline 400 percent is a best case for friendly tasks rather than a number anyone should expect across the board. But the through-line matters more than any single figure. Inference, the cost of actually running a model once it exists, is where most of the money in deployed AI is spent, and a steady stream of techniques like this one keeps pushing that cost down. Open-sourcing the toolkit, and making it work across other labs' models, spreads the benefit wider than DeepSeek's own balance sheet. The flashy releases get the headlines, but it is work like this that quietly decides how affordable AI actually becomes.
