Latency-Optimized Serving Architectures
All Labs
Infrastructure
August 2025

A blueprint for sub-100ms inference in production LLM deployments using speculative decoding and intelligent caching.

Production LLM applications (chatbots, copilots, search) demand sub-second response times. But even an optimized 7B model often exceeds 200ms for typical outputs. How do we get to sub-100ms without sacrificing quality?

We built a serving stack that combines speculative decoding (2–3x speedup), prompt caching for repeated prefixes (a 50% latency reduction on multi-turn conversations), and aggressive KV-cache reuse across similar queries, deployed on vLLM with custom extensions.
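A minimal sketch of this kind of serving configuration using vLLM's offline API. The model and draft-model names are illustrative (not our production choices), and the exact keyword arguments for speculative decoding vary across vLLM versions; treat this as a starting point, not a drop-in config.

```python
# Illustrative vLLM setup combining speculative decoding with prefix caching.
# Model names and argument spellings are assumptions; check your vLLM version.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",   # target model (illustrative)
    speculative_model="JackFram/llama-68m",  # small draft model (illustrative)
    num_speculative_tokens=5,                # draft tokens proposed per step
    enable_prefix_caching=True,              # reuse KV cache for shared prefixes
    gpu_memory_utilization=0.9,              # leave headroom for the draft model
)

params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["How do I reset my password?"], params)
print(outputs[0].outputs[0].text)
```

Greedy sampling (temperature 0) keeps the speculative verification step deterministic, which makes latency measurements easier to compare across runs.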

We benchmarked on real production workloads: customer support (short responses), code generation (longer outputs), and document Q&A (medium length with RAG). Sub-100ms at P95 was achievable for 60% of support queries and 40% of Q&A; code gen remained slower but improved 2x.
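The P95 figures above use a standard nearest-rank percentile over per-request latencies. A self-contained sketch of that computation (the sample values here are made up for illustration):

```python
# Nearest-rank P95 over a batch of per-request latencies, in milliseconds.
import math

def p95(samples):
    """Return the 95th-percentile latency using the nearest-rank definition."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank of the P95 sample
    return ordered[rank - 1]

# Illustrative latencies (ms) for ten requests; one slow outlier.
latencies = [42, 55, 61, 70, 73, 80, 88, 91, 95, 140]
print(p95(latencies))  # → 140
```

With only ten samples, P95 lands on the slowest request; production measurements should use thousands of requests per workload so tail percentiles are stable.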

We document the architecture, tuning parameters, and tradeoffs. Speculative decoding requires a draft model; we provide guidance on model selection and memory budget. Caching strategies depend on the workload; we characterize when each pays off.
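The memory-budget question for a draft model is mostly back-of-envelope arithmetic: weights at fp16 cost 2 bytes per parameter, and whatever the draft model consumes comes out of the KV-cache budget for the target model. A hedged sketch (the 160M draft size is an illustrative choice, not a recommendation):

```python
# Back-of-envelope weight memory for target and draft models, assuming
# fp16 weights (2 bytes/parameter). KV-cache memory is extra and depends
# on batch size and context length.

def weight_memory_gib(n_params_billion, bytes_per_param=2):
    """Weight memory in GiB for a model of the given parameter count."""
    return n_params_billion * 1e9 * bytes_per_param / 2**30

target = weight_memory_gib(7.0)   # 7B target model
draft = weight_memory_gib(0.16)   # 160M draft model (illustrative)
print(f"target: {target:.1f} GiB, draft: {draft:.1f} GiB")
```

On a 24 GiB A10G, a 7B fp16 target leaves roughly 10 GiB for the draft model plus KV cache, which is why draft models in the tens-to-hundreds of millions of parameters are the practical range.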

Key Findings

  • Speculative decoding + prompt caching achieves a 2.5x P95 latency reduction on typical chat workloads.
  • Sub-100ms P95 is achievable for ~60% of short-response workloads with 7B models on A10G GPUs.
  • KV-cache reuse across similar queries reduces redundant computation by 30% in multi-turn conversations.
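The KV-cache reuse idea behind the third finding can be illustrated with a toy prefix cache: multi-turn requests that share a conversation prefix skip recomputing it and only prefill the new suffix. This stores a placeholder string where a real serving stack stores KV tensors; the class and its interface are hypothetical.

```python
# Toy prompt-prefix cache. A hit returns the cached "KV state" for the
# longest cached prefix plus the uncached suffix still needing prefill.

class PrefixCache:
    def __init__(self):
        self.store = {}   # prompt prefix -> placeholder for KV state
        self.hits = 0

    def get_or_compute(self, prompt):
        # Search for the longest cached prefix of this prompt.
        for cut in range(len(prompt), 0, -1):
            if prompt[:cut] in self.store:
                self.hits += 1
                return self.store[prompt[:cut]], prompt[cut:]
        state = f"kv({prompt})"   # stand-in for a full prefill computation
        self.store[prompt] = state
        return state, ""

cache = PrefixCache()
cache.get_or_compute("system: be helpful\nuser: hi")
_, suffix = cache.get_or_compute("system: be helpful\nuser: hi\nuser: thanks")
print(cache.hits, repr(suffix))  # → 1 '\nuser: thanks'
```

In a real engine the linear prefix scan is replaced by block-level hashing (as in vLLM's prefix caching), but the accounting is the same: only the suffix costs new prefill compute.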