A project page for our AISTATS 2026 paper on relaxing speculative decoding verification with a learned, context-dependent ensemble of draft and target distributions.
Speculative decoding speeds up LLM generation by having a small draft model propose multiple tokens that the target model then verifies in parallel. In practice, the bottleneck is often not drafting but rigid verification. DIVERSED learns when to stay close to the target model and when it is safe to lean more on the draft model, improving acceptance and end-to-end efficiency.
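For orientation, here is a minimal sketch of the strict verification rule that DIVERSED relaxes; the function and tensor names are illustrative, not the released code.

import torch

def verify_strict(p, q, draft_tokens):
    # p, q: (gamma, vocab) target- and draft-model probabilities at each drafted
    # position; draft_tokens: (gamma,) token ids proposed by the draft model.
    # Standard speculative decoding accepts draft token x with probability
    # min(1, p(x) / q(x)); on rejection it resamples from the residual
    # max(p - q, 0) and stops, which matches the target distribution exactly.
    accepted = []
    for t, x in enumerate(draft_tokens):
        if torch.rand(()) < p[t, x] / q[t, x]:
            accepted.append(int(x))
        else:
            residual = torch.clamp(p[t] - q[t], min=0.0)
            accepted.append(int(torch.multinomial(residual / residual.sum(), 1)))
            break
    return accepted

A single rejection discards every remaining drafted token, which is why rigid verification, rather than drafting, is often the bottleneck.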
You run a frontier model lab. You ship three models — call them Lite, Standard, and Pro — and you price them into three fixed tiers. Your customers get three choices. But their workloads don't come in three flavors. A coding agent needs Pro-level reasoning on the tricky function but Lite-level speed on the boilerplate. A summarization pipeline wants near-Pro quality on financial reports but is happy with Lite on internal meeting notes. Today, the only way to serve that spectrum is to maintain and route across multiple model sizes, each with its own deployment, each with its own cost profile. What if a single draft-target pair could serve the entire quality-latency continuum — not in three discrete steps, but continuously?
Every major provider — OpenAI (GPT-4.1 / GPT-4.1 mini / GPT-4.1 nano), Anthropic (Opus / Sonnet / Haiku), Google (Ultra / Pro / Flash) — ships a small family of models at fixed price-quality points. Customers pick a tier and live with its trade-off. Switching tiers means switching endpoints, re-evaluating prompts, and accepting a step-function change in quality. There is no dial between the steps.
Our theory proves that a static ensemble verifier — a simple weighted blend of the draft and target distributions — exactly traces the Pareto frontier between acceptance rate and distributional fidelity. That means a single deployed pair (e.g., Llama-3.1-8B + Llama-3.2-1B) can serve any point on the speed-quality curve just by turning one knob: the ensemble weight. No retraining, no new endpoints, no fleet management. Continuous pricing becomes a real option.
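Concretely, the static verifier keeps the standard accept rule but scores drafts against a blended distribution; a minimal sketch (the exact parameterization in the paper may differ):

def static_ensemble_dist(p, q, w):
    # w = 1.0 recovers exact target-matching verification; smaller w leans on
    # the draft and accepts more of its tokens. Sweeping the single scalar w
    # traces the speed-quality curve for a fixed draft-target pair.
    return w * p + (1.0 - w) * q

Verification then uses this blend in place of p in the accept rule, so one deployed pair serves different quality-latency points by changing a single number.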
A static weight is already powerful, but it treats every token the same. DIVERSED learns a context-dependent weight that is strict where correctness is fragile (the pivotal step in a math proof) and permissive where extra acceptance is safe (boilerplate tokens in a summary). The result: it pushes past the static Pareto frontier, achieving quality that a fixed blend cannot reach at the same speed. For a provider, this means offering users a finer-grained quality dial that is also smarter — automatically tightening where it matters.
At each decoding step, DIVERSED forms a verification distribution by mixing the target model distribution p and the draft model distribution q with a learned weight w_t. That weight is predicted from the hidden states of both models, so the verifier can react to the local generation context. Training uses a sequence-level reward together with an acceptance-oriented regularizer, which encourages the model to admit more useful draft tokens without collapsing into always trusting the draft or always defaulting to the target.
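A rough sketch of what such a context-dependent weight head could look like; the architecture, hidden sizes, and inputs here are assumptions for illustration, not the paper's exact design.

import torch
import torch.nn as nn

class EnsembleWeightHead(nn.Module):
    # Predicts a per-step mixing weight w_t in (0, 1) from the current hidden
    # states of the target and draft models (dimensions are illustrative).
    def __init__(self, d_target, d_draft, d_hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_target + d_draft, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, 1),
        )

    def forward(self, h_target, h_draft):
        z = torch.cat([h_target, h_draft], dim=-1)
        return torch.sigmoid(self.mlp(z))

def dynamic_ensemble_dist(p, q, w_t):
    # Context-dependent blend: w_t near 1 keeps verification strict on fragile
    # tokens; w_t near 0 leans on the draft where extra acceptance is safe.
    return w_t * p + (1.0 - w_t) * q

Because the head only consumes hidden states the two models already compute, the extra cost per step is a small MLP forward pass.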
The paper provides two clean takeaways. First, a static ensemble verifier exactly traverses the efficiency-quality Pareto frontier predicted by prior theory, which means one draft-target pair can serve many latency-quality points just by changing the mixture weight. Second, the paper derives an exact step-dependent expression for expected accepted length, removing simplifying assumptions that previous analyses relied on.
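To make the second point concrete, here is what a step-dependent expression can look like; this is a generic derivation under the convention below, not necessarily the paper's exact formula. Let \gamma be the draft length and \alpha_i the probability that the i-th drafted token is accepted given that all earlier drafted tokens were accepted. The accepted length L is at least k exactly when the first k drafts are all accepted, so

\mathbb{E}[L] \;=\; \sum_{k=1}^{\gamma} \Pr(L \ge k) \;=\; \sum_{k=1}^{\gamma} \prod_{i=1}^{k} \alpha_i .

Collapsing all \alpha_i to a single constant recovers the familiar geometric-sum estimate; keeping them step-dependent is what drops the simplifying assumption.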
Across GSM8K, CNNDM, XSum, and MBPP, DIVERSED consistently raises draft-token acceptance while keeping task quality close to the target model. Below are a few representative results from the paper.
ROUGE-2 also improves from 9.46 to 12.11, showing that relaxed verification can be both faster and better on summarization for this model pair.
Accuracy stays at 67%, indicating the extra accepted tokens do not hurt final answer quality on math reasoning in this setting.
Pass@1 remains at 53%, so the verifier becomes far more permissive without sacrificing benchmark-level programming performance.
ROUGE-2 moves from 9.06 to 10.86, again suggesting that better verification can improve both efficiency and output quality.
The most important practical message is simple: in speculative decoding, acceptance rate is not just a nice intermediate metric. It is tightly coupled to wall-clock performance. DIVERSED works because it targets that bottleneck directly.
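As a back-of-the-envelope illustration of that coupling, here is the classic estimate from prior speculative decoding analyses under an i.i.d.-acceptance assumption (the paper's exact, step-dependent analysis refines it): with per-token acceptance rate \alpha, draft length \gamma, and draft-to-target cost ratio c,

\text{speedup} \;\approx\; \frac{1 - \alpha^{\gamma+1}}{(1 - \alpha)(\gamma c + 1)} .

At \gamma = 5 and c = 0.05, for example, raising \alpha from 0.6 to 0.8 lifts the estimated speedup from roughly 1.9x to roughly 3.0x, which is why acceptance rate translates so directly into wall-clock gains.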
Another useful observation from the paper is that task-specific relaxation matters. When the dynamic verifier is trained on one task and tested on another, acceptance may still rise, but quality can drop. That suggests relaxed verification should be learned with the downstream task in mind rather than treated as a universal drop-in policy.
Finally, simply fine-tuning the draft model is not enough. The paper shows that stronger draft-task performance does not reliably translate into better acceptance. What matters most is distributional alignment between draft and target conditionals, and DIVERSED addresses that directly at verification time.
If your draft model is already reasonably good but rigid verification still rejects too many tokens, dynamic relaxation is an attractive next step.
The theory makes the static ensemble verifier more than a heuristic. It is a principled, training-free knob for moving along the speed-quality frontier.
Summarization, reasoning, and code generation reward different kinds of risk. The learned verifier should reflect that rather than using one universal relaxation policy.
If this project is useful in your work, please cite the paper.
@inproceedings{wang2026diversed,
title={DIVERSED: Relaxed Speculative Decoding via Dynamic Ensemble Verification},
author={Wang, Ziyi and Kasa, Siva Rajesh and M S, Ankith and Kasa, Santhosh Kumar and Zou, Jiaru and Negi, Sumit and Zhang, Ruqi and Jiang, Nan and Song, Qifan},
booktitle={Proceedings of The 29th International Conference on Artificial Intelligence and Statistics},
year={2026}
}
LLM inference is the single largest cost center for every model provider. Speculative decoding is already deployed in production at Google (AI Overviews) and is standard in open-source serving stacks like vLLM and SGLang. Yet every deployment today faces the same constraint: verification is all-or-nothing. You either match the target distribution exactly, or you don't.
DIVERSED removes that binary. It gives providers a principled, learned mechanism to offer continuous quality-latency trade-offs from a single model pair. For API providers, this opens the door to usage-based pricing that tracks actual quality delivered per token, not just which model was called. For on-device inference, it means one deployment that adapts its fidelity to the task at hand.
The code is open-source and works with standard HuggingFace model pairs. The ensemble head adds negligible overhead. If you serve LLMs at scale, this is a drop-in upgrade to your speculative decoding pipeline.