Latency-Optimized Serving Architectures
All Labs
Infrastructure
August 2025

A blueprint for sub-100ms inference in production LLM deployments using speculative decoding and intelligent caching.

Production LLM applications (chatbots, copilots, search) demand sub-second response times. But even an optimized 7B model often exceeds 200ms for typical outputs. How do we get to sub-100ms without sacrificing quality?

We built a serving stack that combines speculative decoding (2–3x speedup), prompt caching for repeated prefixes (a 50% latency reduction on multi-turn conversations), and aggressive KV-cache reuse across similar queries, deployed on vLLM with custom extensions.
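A minimal sketch of this kind of serving configuration using vLLM's offline API. The model and draft-model names are illustrative (not our production choices), and the exact keyword arguments for speculative decoding vary across vLLM versions; treat this as a starting point, not a drop-in config.

```python
# Illustrative vLLM setup combining speculative decoding with prefix caching.
# Model names and argument spellings are assumptions; check your vLLM version.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",   # target model (illustrative)
    speculative_model="JackFram/llama-68m",  # small draft model (illustrative)
    num_speculative_tokens=5,                # draft tokens proposed per step
    enable_prefix_caching=True,              # reuse KV cache for shared prefixes
    gpu_memory_utilization=0.9,              # leave headroom for the draft model
)

params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["How do I reset my password?"], params)
print(outputs[0].outputs[0].text)
```

Greedy sampling (temperature 0) keeps the speculative verification step deterministic, which makes latency measurements easier to compare across runs.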

We benchmarked on real production workloads: customer support (short responses), code generation (longer outputs), and document Q&A (medium length with RAG). Sub-100ms at P95 was achievable for 60% of support queries and 40% of Q&A; code gen remained slower but improved 2x.
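The P95 figures above use a standard nearest-rank percentile over per-request latencies. A self-contained sketch of that computation (the sample values here are made up for illustration):

```python
# Nearest-rank P95 over a batch of per-request latencies, in milliseconds.
import math

def p95(samples):
    """Return the 95th-percentile latency using the nearest-rank definition."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank of the P95 sample
    return ordered[rank - 1]

# Illustrative latencies (ms) for ten requests; one slow outlier.
latencies = [42, 55, 61, 70, 73, 80, 88, 91, 95, 140]
print(p95(latencies))  # → 140
```

With only ten samples, P95 lands on the slowest request; production measurements should use thousands of requests per workload so tail percentiles are stable.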

We document the architecture, tuning parameters, and tradeoffs. Speculative decoding requires a draft model; we provide guidance on model selection and memory budget. Caching strategies depend on the workload; we characterize when each pays off.
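The memory-budget question for a draft model is mostly back-of-envelope arithmetic: weights at fp16 cost 2 bytes per parameter, and whatever the draft model consumes comes out of the KV-cache budget for the target model. A hedged sketch (the 160M draft size is an illustrative choice, not a recommendation):

```python
# Back-of-envelope weight memory for target and draft models, assuming
# fp16 weights (2 bytes/parameter). KV-cache memory is extra and depends
# on batch size and context length.

def weight_memory_gib(n_params_billion, bytes_per_param=2):
    """Weight memory in GiB for a model of the given parameter count."""
    return n_params_billion * 1e9 * bytes_per_param / 2**30

target = weight_memory_gib(7.0)   # 7B target model
draft = weight_memory_gib(0.16)   # 160M draft model (illustrative)
print(f"target: {target:.1f} GiB, draft: {draft:.1f} GiB")
```

On a 24 GiB A10G, a 7B fp16 target leaves roughly 10 GiB for the draft model plus KV cache, which is why draft models in the tens-to-hundreds of millions of parameters are the practical range.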

Key Findings

  • Speculative decoding + prompt caching achieves a 2.5x P95 latency reduction on typical chat workloads.
  • Sub-100ms P95 is achievable for ~60% of short-response workloads with 7B models on A10G GPUs.
  • KV-cache reuse across similar queries reduces redundant computation by 30% in multi-turn conversations.
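The KV-cache reuse idea behind the third finding can be illustrated with a toy prefix cache: multi-turn requests that share a conversation prefix skip recomputing it and only prefill the new suffix. This stores a placeholder string where a real serving stack stores KV tensors; the class and its interface are hypothetical.

```python
# Toy prompt-prefix cache. A hit returns the cached "KV state" for the
# longest cached prefix plus the uncached suffix still needing prefill.

class PrefixCache:
    def __init__(self):
        self.store = {}   # prompt prefix -> placeholder for KV state
        self.hits = 0

    def get_or_compute(self, prompt):
        # Search for the longest cached prefix of this prompt.
        for cut in range(len(prompt), 0, -1):
            if prompt[:cut] in self.store:
                self.hits += 1
                return self.store[prompt[:cut]], prompt[cut:]
        state = f"kv({prompt})"   # stand-in for a full prefill computation
        self.store[prompt] = state
        return state, ""

cache = PrefixCache()
cache.get_or_compute("system: be helpful\nuser: hi")
_, suffix = cache.get_or_compute("system: be helpful\nuser: hi\nuser: thanks")
print(cache.hits, repr(suffix))  # → 1 '\nuser: thanks'
```

In a real engine the linear prefix scan is replaced by block-level hashing (as in vLLM's prefix caching), but the accounting is the same: only the suffix costs new prefill compute.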