A project page for our AISTATS 2026 paper on relaxing speculative decoding verification with a learned, context-dependent ensemble of draft and target distributions.
Speculative decoding speeds up LLM generation by having a small draft model propose multiple tokens that the target model then verifies in parallel. In practice, the bottleneck is often not drafting but rigid verification. DIVERSED learns when to stay close to the target model and when it is safe to lean more on the draft model, improving acceptance and end-to-end efficiency.
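For orientation, here is a minimal sketch of the strict verification rule that DIVERSED relaxes; the function and tensor names are illustrative, not the released code.

import torch

def verify_strict(p, q, draft_tokens):
    # p, q: (gamma, vocab) target- and draft-model probabilities at each drafted
    # position; draft_tokens: (gamma,) token ids proposed by the draft model.
    # Standard speculative decoding accepts draft token x with probability
    # min(1, p(x) / q(x)); on rejection it resamples from the residual
    # max(p - q, 0) and stops, which matches the target distribution exactly.
    accepted = []
    for t, x in enumerate(draft_tokens):
        if torch.rand(()) < p[t, x] / q[t, x]:
            accepted.append(int(x))
        else:
            residual = torch.clamp(p[t] - q[t], min=0.0)
            accepted.append(int(torch.multinomial(residual / residual.sum(), 1)))
            break
    return accepted

A single rejection discards every remaining drafted token, which is why rigid verification, rather than drafting, is often the bottleneck.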
You run a frontier model lab. You ship three models — call them Lite, Standard, and Pro — and you price them into three fixed tiers. Your customers get three choices. But their workloads don't come in three flavors. A coding agent needs Pro-level reasoning on the tricky function but Lite-level speed on the boilerplate. A summarization pipeline wants near-Pro quality on financial reports but is happy with Lite on internal meeting notes. Today, the only way to serve that spectrum is to maintain and route across multiple model sizes, each with its own deployment, each with its own cost profile. What if a single draft-target pair could serve the entire quality-latency continuum — not in three discrete steps, but continuously?
Every major provider — OpenAI (GPT-4.1 / GPT-4.1 mini / GPT-4.1 nano), Anthropic (Opus / Sonnet / Haiku), Google (Ultra / Pro / Flash) — ships a small family of models at fixed price-quality points. Customers pick a tier and live with its trade-off. Switching tiers means switching endpoints, re-evaluating prompts, and accepting a step-function change in quality. There is no dial between the steps.
Our theory proves that a static ensemble verifier — a simple weighted blend of the draft and target distributions — exactly traces the Pareto frontier between acceptance rate and distributional fidelity. That means a single deployed pair (e.g., Llama-3.1-8B + Llama-3.2-1B) can serve any point on the speed-quality curve just by turning one knob: the ensemble weight. No retraining, no new endpoints, no fleet management. Continuous pricing becomes a real option.
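Concretely, the static verifier keeps the standard accept rule but scores drafts against a blended distribution; a minimal sketch (the exact parameterization in the paper may differ):

def static_ensemble_dist(p, q, w):
    # w = 1.0 recovers exact target-matching verification; smaller w leans on
    # the draft and accepts more of its tokens. Sweeping the single scalar w
    # traces the speed-quality curve for a fixed draft-target pair.
    return w * p + (1.0 - w) * q

Verification then uses this blend in place of p in the accept rule, so one deployed pair serves different quality-latency points by changing a single number.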
A static weight is already powerful, but it treats every token the same. DIVERSED learns a context-dependent weight that is strict where correctness is fragile (the pivotal step in a math proof) and permissive where extra acceptance is safe (boilerplate tokens in a summary). The result: it pushes past the static Pareto frontier, achieving quality that a fixed blend cannot reach at the same speed. For a provider, this means offering users a finer-grained quality dial that is also smarter — automatically tightening where it matters.
At each decoding step, DIVERSED forms a verification distribution by mixing the target model distribution p and the draft model distribution q with a learned weight w_t. That weight is predicted from the hidden states of both models, so the verifier can react to the local generation context. Training uses a sequence-level reward together with an acceptance-oriented regularizer, which encourages the model to admit more useful draft tokens without collapsing into always trusting the draft or always defaulting to the target.
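A rough sketch of what such a context-dependent weight head could look like; the architecture, hidden sizes, and inputs here are assumptions for illustration, not the paper's exact design.

import torch
import torch.nn as nn

class EnsembleWeightHead(nn.Module):
    # Predicts a per-step mixing weight w_t in (0, 1) from the current hidden
    # states of the target and draft models (dimensions are illustrative).
    def __init__(self, d_target, d_draft, d_hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_target + d_draft, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, 1),
        )

    def forward(self, h_target, h_draft):
        z = torch.cat([h_target, h_draft], dim=-1)
        return torch.sigmoid(self.mlp(z))

def dynamic_ensemble_dist(p, q, w_t):
    # Context-dependent blend: w_t near 1 keeps verification strict on fragile
    # tokens; w_t near 0 leans on the draft where extra acceptance is safe.
    return w_t * p + (1.0 - w_t) * q

Because the head only consumes hidden states the two models already compute, the extra cost per step is a small MLP forward pass.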
The paper provides two clean takeaways. First, a static ensemble verifier exactly traverses the efficiency-quality Pareto frontier predicted by prior theory, which means one draft-target pair can serve many latency-quality points just by changing the mixture weight. Second, the paper derives an exact step-dependent expression for expected accepted length, removing simplifying assumptions that previous analyses relied on.
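To make the second point concrete, here is what a step-dependent expression can look like; this is a generic derivation under the convention below, not necessarily the paper's exact formula. Let \gamma be the draft length and \alpha_i the probability that the i-th drafted token is accepted given that all earlier drafted tokens were accepted. The accepted length L is at least k exactly when the first k drafts are all accepted, so

\mathbb{E}[L] \;=\; \sum_{k=1}^{\gamma} \Pr(L \ge k) \;=\; \sum_{k=1}^{\gamma} \prod_{i=1}^{k} \alpha_i .

Collapsing all \alpha_i to a single constant recovers the familiar geometric-sum estimate; keeping them step-dependent is what drops the simplifying assumption.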
Across GSM8K, CNNDM, XSum, and MBPP, DIVERSED consistently raises draft-token acceptance while keeping task quality close to the target model. Below are a few representative results from the paper.
ROUGE-2 also improves from 9.46 to 12.11, showing that relaxed verification can be both faster and better on summarization for this model pair.
Accuracy stays at 67%, indicating the extra accepted tokens do not hurt final answer quality on math reasoning in this setting.
Pass@1 remains at 53%, so the verifier becomes far more permissive without sacrificing benchmark-level programming performance.
ROUGE-2 moves from 9.06 to 10.86, again suggesting that better verification can improve both efficiency and output quality.
The most important practical message is simple: in speculative decoding, acceptance rate is not just a nice intermediate metric. It is tightly coupled to wall-clock performance. DIVERSED works because it targets that bottleneck directly.
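As a back-of-the-envelope illustration of that coupling, here is the classic estimate from prior speculative decoding analyses under an i.i.d.-acceptance assumption (the paper's exact, step-dependent analysis refines it): with per-token acceptance rate \alpha, draft length \gamma, and draft-to-target cost ratio c,

\text{speedup} \;\approx\; \frac{1 - \alpha^{\gamma+1}}{(1 - \alpha)(\gamma c + 1)} .

At \gamma = 5 and c = 0.05, for example, raising \alpha from 0.6 to 0.8 lifts the estimated speedup from roughly 1.9x to roughly 3.0x, which is why acceptance rate translates so directly into wall-clock gains.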
Another useful observation from the paper is that task-specific relaxation matters. When the dynamic verifier is trained on one task and tested on another, acceptance may still rise, but quality can drop. That suggests relaxed verification should be learned with the downstream task in mind rather than treated as a universal drop-in policy.
Finally, simply fine-tuning the draft model is not enough. The paper shows that stronger draft-task performance does not reliably translate into better acceptance. What matters most is distributional alignment between draft and target conditionals, and DIVERSED addresses that directly at verification time.
If your draft model is already reasonably good but rigid verification still rejects too many tokens, dynamic relaxation is an attractive next step.
The theory makes the static ensemble verifier more than a heuristic. It is a principled, training-free knob for moving along the speed-quality frontier.
Summarization, reasoning, and code generation reward different kinds of risk. The learned verifier should reflect that rather than using one universal relaxation policy.
If this project is useful in your work, please cite the paper.
@inproceedings{wang2026diversed,
title={DIVERSED: Relaxed Speculative Decoding via Dynamic Ensemble Verification},
author={Wang, Ziyi and Kasa, Siva Rajesh and M S, Ankith and Kasa, Santhosh Kumar and Zou, Jiaru and Negi, Sumit and Zhang, Ruqi and Jiang, Nan and Song, Qifan},
booktitle={Proceedings of The 29th International Conference on Artificial Intelligence and Statistics},
year={2026}
}
LLM inference is the single largest cost center for every model provider. Speculative decoding is already deployed in production at Google (AI Overviews) and is standard in open-source serving stacks like vLLM and SGLang. Yet every deployment today faces the same constraint: verification is all-or-nothing. You either match the target distribution exactly, or you don't.
DIVERSED removes that binary. It gives providers a principled, learned mechanism to offer continuous quality-latency trade-offs from a single model pair. For API providers, this opens the door to usage-based pricing that tracks actual quality delivered per token, not just which model was called. For on-device inference, it means one deployment that adapts its fidelity to the task at hand.
The code is open-source and works with standard HuggingFace model pairs. The ensemble head adds negligible overhead. If you serve LLMs at scale, this is a drop-in upgrade to your speculative decoding pipeline.