Research

Nov 2025

Speculative Decoding: 3x Faster Inference Without Losing Quality

Speculative decoding is one of the most impactful LLM inference optimizations in production today. We break down how it works, why it delivers 2–3x speedups, and how we reduced P95 latency by 65% in a live deployment without sacrificing quality.

Large language models are powerful, but they are slow.

Autoregressive decoding generates one token at a time. Each token requires a forward pass through the entire model. When you are serving a 70 billion parameter model in production, that cost is significant.

Speculative decoding changes the economics of generation.

In one of our production deployments, implementing speculative decoding reduced P95 latency by 65 percent and nearly tripled throughput, without measurable quality degradation. The technique is elegant, mathematically grounded, and surprisingly practical.

The Core Idea

Autoregressive models generate text sequentially. Given a prompt, the model predicts the next token. That token is appended to the sequence. The process repeats.

Speculative decoding introduces a second, smaller model known as the draft model.

Instead of asking the large target model to generate one token at a time, the draft model proposes multiple tokens in advance. The target model then verifies those tokens in parallel.

If the draft tokens match what the target model would have produced, they are accepted in bulk. If they diverge, the system rolls back to the point of disagreement and resumes normal decoding.

In effect, you generate multiple tokens for the computational cost of a single verification pass.
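One iteration of that propose-and-verify loop can be sketched in a few lines. Here `draft_next` and `target_next` are hypothetical stand-ins for real model forward passes, and verification uses greedy decoding for simplicity:

```python
def draft_next(seq):
    # Hypothetical cheap draft model: deterministic next-token function.
    return (seq[-1] + 1) % 100

def target_next(seq):
    # Hypothetical expensive target model that usually agrees with the
    # draft, but diverges whenever the last token is a multiple of 7.
    return (seq[-1] + 1) % 100 if seq[-1] % 7 else (seq[-1] + 2) % 100

def speculative_step(seq, k=4):
    """One propose-and-verify iteration under greedy decoding."""
    # 1. Draft model proposes k tokens autoregressively (cheap passes).
    proposal = list(seq)
    for _ in range(k):
        proposal.append(draft_next(proposal))
    # 2. Target model checks each position; in a real system all k
    #    positions are scored in a single batched forward pass.
    accepted = list(seq)
    for i in range(k):
        expected = target_next(accepted)
        accepted.append(expected)
        if proposal[len(seq) + i] != expected:
            # First disagreement: keep the target's token, drop the rest.
            break
    return accepted
```

When the stub models agree, one verification step yields all four proposed tokens; when they diverge, generation rolls back to the target's token at the point of disagreement.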

Why It Works

The key insight is that the expensive part of generation is the large model forward pass. If a smaller model can predict the next several tokens correctly with high probability, you amortize the cost of the large model across multiple tokens.

Let k be the number of draft tokens proposed. If the draft model is correct for most of those tokens, the large model verifies all k tokens in one pass instead of performing k sequential passes.

The expected speedup depends on the acceptance rate of draft tokens.

If the draft model predicts each token correctly about 70 percent of the time and proposes sequences of length four, the target model accepts close to three tokens per verification pass on average.

In practice, well-aligned draft models often yield a 2x to 3x speedup.
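Under the simplifying assumption that each draft token is accepted independently with probability alpha, the expected number of tokens produced per target-model pass has a closed form:

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    # Accepted draft tokens follow a truncated geometric distribution,
    # and the verification pass always contributes one extra token (the
    # target's own correction or bonus token). Closed form:
    # E = (1 - alpha^(k+1)) / (1 - alpha).
    return (1 - alpha ** (k + 1)) / (1 - alpha)

print(round(expected_tokens_per_pass(0.7, 4), 2))  # 2.77 tokens per pass
```

Real acceptance rates are correlated across positions rather than independent, so treat this as an estimate, not a guarantee.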

Mathematical Intuition

Speculative decoding produces exactly the same outputs as standard decoding under greedy search, and the same output distribution under sampling. The verification step guarantees this equivalence.

The target model computes logits for all proposed tokens in a single forward pass. Each draft token x is then accepted with probability min(1, p(x)/q(x)), where p is the target distribution and q is the draft distribution. On rejection, a replacement token is sampled from the normalized residual distribution max(0, p(x) - q(x)).

Because this acceptance criterion exactly preserves the target model's output distribution, quality remains unchanged.

You are not approximating the target model. You are accelerating it.
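A minimal sketch of the acceptance rule, using toy dictionary distributions in place of real logits:

```python
import random

def verify_token(x, p, q, rng=random):
    """Accept draft token x with probability min(1, p(x)/q(x));
    on rejection, resample from the residual max(0, p - q).
    p and q map tokens to probabilities over a toy vocabulary."""
    accept_prob = min(1.0, p.get(x, 0.0) / q[x])
    if rng.random() < accept_prob:
        return x, True
    # Residual: mass where the target exceeds the draft, renormalized.
    residual = {t: max(0.0, p.get(t, 0.0) - q.get(t, 0.0)) for t in p}
    total = sum(residual.values())
    r, acc = rng.random() * total, 0.0
    for t, w in residual.items():
        acc += w
        if r <= acc:
            return t, False
    return max(p, key=p.get), False  # numeric-edge-case fallback
```

Sampling from the residual on rejection is what makes the combined procedure equivalent to sampling directly from the target distribution.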

Implementation Requirements

The draft model must share the same tokenizer as the target model.

Vocabulary alignment is mandatory. Token boundary mismatches break verification logic.

The draft model should approximate the distribution of the target model. It does not need to be perfect. It needs to be right often enough.

Common pairings include using an 8B-parameter model as the draft for a 70B-parameter target. Distilled models also work well if trained on similar corpora.

In our deployment, we paired Llama 3 8B as the draft model with Llama 3 70B as the target model.

The two models share a tokenizer and have similar pretraining distributions, which improved acceptance rates.
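A pre-deployment sanity check for vocabulary alignment can compare the two vocabularies directly. This sketch assumes plain token-to-id dicts, such as those returned by a tokenizer's get_vocab() in typical libraries; `check_vocab_alignment` is illustrative, not part of any framework:

```python
def check_vocab_alignment(draft_vocab: dict, target_vocab: dict) -> list:
    """Return tokens whose ids differ between the two vocabularies,
    or that exist in only one of them."""
    mismatches = []
    for token in draft_vocab.keys() | target_vocab.keys():
        if draft_vocab.get(token) != target_vocab.get(token):
            mismatches.append(token)
    return sorted(mismatches)
```

An empty result means every token id lines up; any mismatch means the draft's proposals cannot be verified position by position.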

Where Speculative Decoding Excels

Speculative decoding performs best in predictable continuation tasks.

Code generation is a prime example. Programming languages are syntactically constrained and statistically regular. Draft models often predict continuations accurately for multiple tokens.

Structured output tasks such as JSON generation or templated responses also see strong gains.

Document question answering with retrieval augmentation performs well because answers are grounded in retrieved context, reducing randomness.

Creative writing tasks show smaller gains. When token distributions are more diverse, draft model predictions are less frequently accepted.

Our Production Deployment

We implemented speculative decoding in a production document question answering system serving enterprise users.

Baseline architecture used Llama 3 70B hosted on GPU instances with vLLM as the serving framework.

Baseline P95 latency for medium-length responses was approximately 3.4 seconds.

We introduced an 8B draft model deployed on the same inference cluster.

Using vLLM’s speculative decoding support, we configured the draft model to propose sequences of four tokens per iteration.
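The setup looks roughly like the following sketch. Exact argument names vary across vLLM versions (newer releases take a speculative_config dict; older ones used separate speculative_model and num_speculative_tokens arguments), so treat this as illustrative rather than copy-paste configuration:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # target model
    speculative_config={
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",  # draft model
        "num_speculative_tokens": 4,  # draft tokens proposed per iteration
    },
    tensor_parallel_size=4,  # illustrative; depends on available GPUs
)

outputs = llm.generate(
    ["Summarize the key obligations in the attached contract."],
    SamplingParams(temperature=0.0, max_tokens=256),
)
```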

Acceptance rate stabilized around 68 percent in steady-state traffic.

Results were immediate.

P95 latency dropped to 1.2 seconds.

Mean latency dropped by over 50 percent.

Throughput increased significantly because fewer large-model forward passes were required per request.

Crucially, we observed no statistically significant difference in answer quality across our evaluation set.

Evaluation Methodology

We ran a golden evaluation set consisting of 500 enterprise document queries.

Metrics included exact match, answer relevance scoring, and human-graded faithfulness.

Comparing baseline greedy decoding against speculative decoding yielded equivalent metric distributions across all three measures.
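The strictest of those comparisons, pairwise exact match, reduces to a few lines. This is a simplified sketch of such a harness, not our production code; under greedy decoding a correct implementation should score 1.0, while sampled outputs only match in distribution:

```python
def exact_match_rate(baseline: list, speculative: list) -> float:
    """Fraction of paired outputs that match after trimming whitespace."""
    assert len(baseline) == len(speculative)
    matches = sum(a.strip() == b.strip()
                  for a, b in zip(baseline, speculative))
    return matches / len(baseline)
```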

Because speculative decoding preserves distributional equivalence, quality regression is unlikely unless implementation is flawed.

Infrastructure Considerations

Speculative decoding requires additional GPU memory for the draft model.

However, draft models are significantly smaller. The memory overhead is often modest relative to the large model footprint.

GPU utilization patterns change. Instead of sequential heavy passes, the system performs fewer but more parallelized verification passes.

Careful scheduling improves gains. Co-locating draft and target models reduces inter-GPU communication latency.

Framework Support

Not all inference frameworks support speculative decoding natively.

vLLM provides production-ready support.

Other frameworks are integrating it, but maturity varies.

Custom implementations require careful handling of token acceptance logic and distribution preservation.

Common Pitfalls

Using an underpowered draft model reduces acceptance rates and negates gains.

Tokenizer mismatches introduce subtle bugs.

Improper sampling synchronization can alter output distributions.

Failure to benchmark across realistic traffic patterns leads to misleading conclusions.

Cost Implications

Speculative decoding reduces compute cost per generated token for the large model.

Even with the added cost of running a draft model, overall GPU consumption typically decreases for latency-sensitive workloads.

For high-throughput systems, cost savings can be substantial over time.
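A back-of-envelope model makes the point concrete. The numbers below are illustrative assumptions, not measurements: a draft pass is costed in proportion to parameter count, and roughly 2.77 tokens emerge per target pass at a 70 percent per-token acceptance rate:

```python
# Back-of-envelope GPU-cost comparison, per generated token.
TARGET_PASS = 1.0            # cost of one target forward pass (unit)
DRAFT_PASS = 8 / 70          # crude proportional-cost assumption (8B vs 70B)
K = 4                        # draft tokens proposed per iteration
TOKENS_PER_PASS = 2.77       # expected tokens per target pass at ~70% acceptance

baseline_cost_per_token = TARGET_PASS  # one target pass per token
spec_cost_per_token = (K * DRAFT_PASS + TARGET_PASS) / TOKENS_PER_PASS

print(round(spec_cost_per_token, 2))  # 0.53: roughly half the compute
```

Even charging for all four draft passes on every iteration, the amortized target-model cost dominates the savings.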

When It Is Worth the Investment

If you are serving large models in production and latency matters, speculative decoding is one of the highest-leverage optimizations available.

If you are running small models or batch workloads with relaxed latency constraints, gains may not justify engineering effort.

The Strategic Takeaway

Speculative decoding is not a model improvement. It is a systems improvement.

It exploits the fact that smaller models approximate larger ones surprisingly well for short continuations.

By verifying instead of regenerating, you trade redundant computation for parallel validation.

In production LLM systems, that trade can mean the difference between a sluggish assistant and a real-time experience.

The optimization does not change your model’s intelligence.

It changes how efficiently you use it.