Accepted at AISTATS 2026 · Speculative Decoding · Efficient LLM Inference

DIVERSED: Relaxed Speculative Decoding via Dynamic Ensemble Verification

A project page for our AISTATS 2026 paper on relaxing speculative decoding verification with a learned, context-dependent ensemble of draft and target distributions.

Ziyi Wang, Siva Rajesh Kasa, Ankith M S, Santhosh Kumar Kasa, Jiaru Zou, Sumit Negi, Ruqi Zhang, Nan Jiang, Qifan Song
Purdue University · Amazon Inc. · University of Illinois Urbana-Champaign · University of Texas at El Paso
Teaser figure: comparison of rigid verification, static ensemble, and DIVERSED. Standard speculative decoding is often overly conservative: many useful draft tokens are rejected because verification must exactly match the target distribution. DIVERSED replaces that rigid rule with a learned, dynamic verifier that accepts more safe tokens while preserving final quality.

TL;DR

Speculative decoding speeds up LLM generation by drafting multiple tokens with a small model and verifying them in parallel with the target model. In practice, the bottleneck is often not drafting but rigid verification. DIVERSED learns when to stay close to the target model and when it is safe to lean more on the draft model, improving acceptance and end-to-end efficiency.

- Core idea: a dynamic ensemble verifier that blends the draft and target distributions with a context-dependent weight.
- Theory: a static (fixed-weight) ensemble exactly traces the acceptance-quality Pareto front.
- Training: the verifier is trained with sequence-level RL, using task reward plus an acceptance regularizer.
- Outcome: higher acceptance at similar quality across summarization, reasoning, and code generation.

The problem every frontier lab faces today

You run a frontier model lab. You ship three models — call them Lite, Standard, and Pro — and you price them into three fixed tiers. Your customers get three choices. But their workloads don't come in three flavors. A coding agent needs Pro-level reasoning on the tricky function but Lite-level speed on the boilerplate. A summarization pipeline wants near-Pro quality on financial reports but is happy with Lite on internal meeting notes. Today, the only way to serve that spectrum is to maintain and route across multiple model sizes, each with its own deployment, each with its own cost profile. What if a single draft-target pair could serve the entire quality-latency continuum — not in three discrete steps, but continuously?

The status quo: rigid tiers

Every major provider — OpenAI (GPT-4.1 / GPT-4.1 mini / GPT-4.1 nano), Anthropic (Opus / Sonnet / Haiku), Google (Ultra / Pro / Flash) — ships a small family of models at fixed price-quality points. Customers pick a tier and live with its trade-off. Switching tiers means switching endpoints, re-evaluating prompts, and accepting a step-function change in quality. There is no dial between the steps.

One pair, infinite trade-offs

Our theory proves that a static ensemble verifier — a simple weighted blend of a draft and target model — exactly traces the Pareto frontier between acceptance rate and distributional fidelity. That means a single deployed pair (e.g., Llama-3.1-8B + Llama-3.2-1B) can serve any point on the speed-quality curve just by turning one knob: the ensemble weight. No retraining, no new endpoints, no fleet management. Continuous pricing becomes a real option.
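The static-ensemble knob can be made concrete with a minimal sketch. The function below is illustrative, not the paper's implementation: it computes the acceptance probability of standard speculative sampling when the verifier targets the mixture m = (1 - alpha) * p + alpha * q instead of p itself.

```python
import numpy as np

def static_ensemble_accept_prob(p, q, alpha):
    """Expected acceptance probability for a drafted token under a static
    ensemble verifier targeting m = (1 - alpha) * p + alpha * q.

    p, q: target and draft next-token distributions (1-D arrays summing to 1).
    alpha: ensemble weight in [0, 1]; alpha = 0 recovers exact speculative
    decoding, alpha = 1 always trusts the draft.

    With speculative sampling, a token x ~ q is accepted with probability
    min(1, m(x) / q(x)), so the expected acceptance is sum_x min(q(x), m(x)).
    """
    m = (1.0 - alpha) * p + alpha * q
    return float(np.minimum(q, m).sum())
```

Sliding alpha from 0 to 1 moves acceptance monotonically from the exact-verification rate up to 1, which is the one-knob continuum described above.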

DIVERSED: beyond the frontier

A static weight is already powerful, but it treats every token the same. DIVERSED learns a context-dependent weight that is strict where correctness is fragile (the pivotal step in a math proof) and permissive where extra acceptance is safe (boilerplate tokens in a summary). The result: it pushes past the static Pareto frontier, achieving quality that a fixed blend cannot reach at the same speed. For a provider, this means offering users a finer-grained quality dial that is also smarter — automatically tightening where it matters.

Method in one paragraph

At each decoding step, DIVERSED forms a verification distribution by mixing the target model distribution p and the draft model distribution q with a learned weight w_t. That weight is predicted from the hidden states of both models, so the verifier can react to the local generation context. Training uses a sequence-level reward together with an acceptance-oriented regularizer, which encourages the model to admit more useful draft tokens without collapsing into always trusting the draft or always defaulting to the target.
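The verification step described above can be sketched as standard speculative sampling with the mixture in place of the target distribution. This is a simplified illustration under that reading of the method: the context-dependent weight w is passed in as a plain argument, whereas DIVERSED predicts it from the hidden states of both models.

```python
import numpy as np

def verify_draft_token(p, q, x, w, rng=np.random.default_rng(0)):
    """One relaxed verification step (sketch, not the paper's code).

    Accept the drafted token x ~ q with probability min(1, m(x) / q(x)),
    where m = (1 - w) * p + w * q is the ensemble verification distribution.
    On rejection, resample from the normalized residual max(m - q, 0),
    as in standard speculative sampling. Returns (token, accepted).
    """
    m = (1.0 - w) * p + w * q
    if rng.random() < min(1.0, m[x] / q[x]):
        return x, True
    # Rejection is only possible when m(x) < q(x), so the residual is nonzero.
    residual = np.maximum(m - q, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(m), p=residual)), False
```

At w = 0 this reduces to exact (lossless) speculative sampling against the target; at w = 1 every drafted token is accepted.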

Theory highlights

The paper provides two clean takeaways. First, a static ensemble verifier exactly traverses the efficiency-quality Pareto frontier predicted by prior theory, which means one draft-target pair can serve many latency-quality points just by changing the mixture weight. Second, the paper derives an exact step-dependent expression for expected accepted length, removing simplifying assumptions that previous analyses relied on.
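For context on the second point, the simplifying assumption that prior analyses relied on is usually the following (this is background from the original speculative decoding literature, not the paper's new result): if each of the $\gamma$ drafted tokens is accepted independently with a constant probability $\beta$, the expected number of accepted tokens per round is a truncated geometric sum,

```latex
\mathbb{E}[\#\text{accepted}] \;=\; \sum_{k=1}^{\gamma} \beta^{k}
\;=\; \frac{\beta\,(1-\beta^{\gamma})}{1-\beta}.
```

The paper's exact step-dependent expression dispenses with the constant-$\beta$ independence assumption.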

Figure: acceptance rate versus ensemble weight for the static ensemble. Static ensemble behaves predictably: as the verifier moves closer to the target model, acceptance decreases approximately linearly.

Main empirical takeaways

Across GSM8K, CNNDM, XSum, and MBPP, DIVERSED consistently raises draft-token acceptance while keeping task quality close to the target model. Below are a few representative results from the paper.

- Llama-3.1-8B / Llama-3.2-1B, CNNDM: acceptance 21.60% → 69.96%. ROUGE-2 also improves, from 9.46 to 12.11, showing that relaxed verification can be both faster and better on summarization for this model pair.
- Llama-3.1-8B / Llama-3.2-1B, GSM8K: acceptance 44.60% → 72.61%. Accuracy stays at 67%, indicating the extra accepted tokens do not hurt final answer quality on math reasoning in this setting.
- Llama-3.1-8B / Llama-3.2-1B, MBPP: acceptance 26.30% → 85.03%. Pass@1 remains at 53%, so the verifier becomes far more permissive without sacrificing benchmark-level programming performance.
- Gemma-3-12B / Gemma-3-4B, CNNDM: acceptance 40.39% → 66.90%. ROUGE-2 moves from 9.06 to 10.86, again suggesting that better verification can improve both efficiency and output quality.

Figure: ROUGE versus average time per sample. Beyond a static trade-off: on CNNDM, DIVERSED moves past the frontier traced by static ensembles, improving the time-quality balance rather than just choosing a different point on it.
Figure: normalized time versus acceptance rate across datasets and model pairs. Acceptance is strongly tied to latency: across datasets and model pairs, higher acceptance reliably corresponds to lower normalized decoding time.

Reading the results

The most important practical message is simple: in speculative decoding, acceptance rate is not just a nice intermediate metric. It is tightly coupled to wall-clock performance. DIVERSED works because it targets that bottleneck directly.
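The coupling between acceptance and wall-clock time is easy to see with a back-of-the-envelope latency model. The function and all timing numbers below are illustrative assumptions, not measurements from the paper.

```python
def time_per_token(expected_accepted, gamma, t_target, t_draft):
    """Illustrative latency model (not from the paper): one verification
    round runs gamma draft forward passes plus one target forward pass,
    and yields expected_accepted + 1 tokens (accepted drafts plus the
    token the verifier emits itself). Average time per generated token
    is round time divided by tokens per round.
    """
    round_time = gamma * t_draft + t_target
    return round_time / (expected_accepted + 1.0)
```

With hypothetical costs of 20 ms per target step and 2 ms per draft step at draft length gamma = 5, a round costs 30 ms: accepting 1 token on average gives 15 ms per token, accepting 4 gives 6 ms per token, versus 20 ms per token for plain target decoding. Raising acceptance directly lowers latency.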

Another useful observation from the paper is that task-specific relaxation matters. When the dynamic verifier is trained on one task and tested on another, acceptance may still rise, but quality can drop. That suggests relaxed verification should be learned with the downstream task in mind rather than treated as a universal drop-in policy.

Finally, simply fine-tuning the draft model is not enough. The paper shows that stronger draft-task performance does not reliably translate into better acceptance. What matters most is distributional alignment between draft and target conditionals, and DIVERSED addresses that directly at verification time.

Practical takeaways

Use DIVERSED when verification is your bottleneck

If your draft model is already reasonably good but rigid verification still rejects too many tokens, dynamic relaxation is an attractive next step.

Static ensemble is a strong baseline

The theory makes static ensemble more than a heuristic. It is a principled, training-free knob for moving along the speed-quality frontier.

Train the verifier for the task you care about

Summarization, reasoning, and code generation reward different kinds of risk. The learned verifier should reflect that rather than using one universal relaxation policy.

Citation

If this project is useful in your work, please cite the paper.

@inproceedings{wang2026diversed,
  title={DIVERSED: Relaxed Speculative Decoding via Dynamic Ensemble Verification},
  author={Wang, Ziyi and Kasa, Siva Rajesh and M S, Ankith and Kasa, Santhosh Kumar and Zou, Jiaru and Negi, Sumit and Zhang, Ruqi and Jiang, Nan and Song, Qifan},
  booktitle={Proceedings of The 29th International Conference on Artificial Intelligence and Statistics},
  year={2026}
}

Why this matters for the industry

LLM inference is the single largest cost center for every model provider. Speculative decoding is already deployed in production at Google (AI Overviews) and is standard in open-source serving stacks like vLLM and SGLang. Yet every deployment today faces the same constraint: verification is all-or-nothing. You either match the target distribution exactly, or you don't.

DIVERSED removes that binary. It gives providers a principled, learned mechanism to offer continuous quality-latency trade-offs from a single model pair. For API providers, this opens the door to usage-based pricing that tracks actual quality delivered per token, not just which model was called. For on-device inference, it means one deployment that adapts its fidelity to the task at hand.

The code is open-source and works with standard HuggingFace model pairs. The ensemble head adds negligible overhead. If you serve LLMs at scale, this is a drop-in upgrade to your speculative decoding pipeline.