Engineering · January 28, 2026 · 12 min read

Sub-50ms or Bust: Why Inference Latency Is the New Uptime

A 200ms response is a broken product for a trading desk. This is how we engineered our way to 48ms P50 — and why every millisecond was a deliberate architectural choice.


Marcus Thorn

Head of Infrastructure, AIRMY

Before I joined AIRMY, I spent four years at Google DeepMind building the distributed serving infrastructure behind large-scale model deployments. The latency targets there were, to put it diplomatically, "aspirational." Research teams were happy with 800ms P50 responses. Production systems wanted 200ms. Everyone considered anything under 100ms a rounding error.

When Priya hired me, she gave me a constraint I initially thought was absurd: 50ms P50 latency for complex multi-step agent calls. Not for a single-turn prompt. For an agent that reads context, uses tools, and produces a structured response.

Eighteen months later, our production P50 sits at 48ms. Here's how we got there — and more importantly, why it matters enough to be worth the engineering investment.

Why latency is load-bearing, not nice-to-have

The naive framing is that latency affects user experience — slow responses feel bad. That's true, but it understates the problem. For the kinds of workflows AIRMY agents run, latency determines whether the integration is architecturally feasible at all.

Consider an algorithmic trading system where a portfolio risk agent runs on every proposed order. If that agent takes 400ms to respond, you've added 400ms to your order execution path. At that point, you don't have an AI agent — you have a bottleneck. The agent can't be in the hot path; it gets pushed to a batch job that runs every 15 minutes. That changes the entire value proposition.

Or consider a backend engineering agent that's integrated into a developer's coding environment. If every suggestion takes 2 seconds, developers stop waiting. They context-switch. The integration becomes a party trick rather than a productivity tool. The threshold for "fast enough to stay in the loop" is well under 100ms for interactive applications.

  • 48ms: P50 latency, production (Jan 2026)
  • 112ms: P99 latency, production
  • 0.01%: error rate across all agent calls

This is why we treat latency as a hard constraint during agent design — not an optimization pass we run later. Every architectural decision starts with: "Does this blow the budget?"

The latency budget breakdown

To engineer toward a target, you need to know where your time is going. A naive agent call has roughly five phases:

  • Network transit (client → edge node)
  • Context assembly (loading memory, tools, system prompt)
  • Prefill (processing the input tokens)
  • Decode (generating the output tokens)
  • Post-processing and response serialization
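The five phases above can be instrumented with a simple per-phase timer. This sketch shows the shape; the `PhaseTimer` helper and its phase names are illustrative, not our production telemetry:

```python
import time
from contextlib import contextmanager

class PhaseTimer:
    """Accumulates wall-clock time per named phase of a request."""

    def __init__(self):
        self.timings_ms = {}

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings_ms[name] = (time.perf_counter() - start) * 1000.0

    def report(self):
        total = sum(self.timings_ms.values())
        lines = [f"{name}: {ms:.1f}ms" for name, ms in self.timings_ms.items()]
        lines.append(f"total: {total:.1f}ms")
        return "\n".join(lines)

# Wrap each phase of a request; sleeps stand in for real work
timer = PhaseTimer()
with timer.phase("context_assembly"):
    time.sleep(0.002)
with timer.phase("decode"):
    time.sleep(0.002)
print(timer.report())
```

Once every request emits a per-phase breakdown like this, regressions show up as a single phase drifting rather than an opaque total.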

For a typical 8k-token context with a 200-token output, the breakdown at baseline looks roughly like this:

Latency budget breakdown (optimized)

  • Network: 3ms
  • Context assembly: 9ms
  • Prefill (KV hit): 14ms
  • Decode: 18ms
  • Post-processing: 4ms
  • P50 total: 48ms

Those numbers represent the current optimized state. At launch, our baseline P50 was 340ms. The 7× improvement came from attacking each phase deliberately. Let me walk through the biggest wins.

KV cache sharing: the biggest lever

The largest single optimization was aggressive KV cache sharing across requests. For readers unfamiliar with transformer inference: the "key-value cache" stores intermediate computations from the attention layers. If two requests share a common prefix (say, a system prompt and a standard tool definition schema), you can reuse those computations rather than re-running them.

The insight that unlocked our P50 target: every AIRMY agent of the same type shares an identical system prompt, tool schema, and initial context structure. That prefix can be enormous — for our Backend Engineer agent, the system prompt and tool definitions alone represent about 4,000 tokens. At 8k total context, that's 50% of the input already cached.

With a warm prefix cache, prefill time for those 4,000 tokens drops to near zero. We're only computing attention for the novel tokens in the request. For repetitive production workloads — where the same agent is called thousands of times per day with similar context structures — cache hit rates exceed 80%.

```python
# Simplified view of our KV cache key structure
# Prefix components that are sharable across requests
cache_key = {
    "agent_id": "@airmy/backend-engineer",
    "agent_version": "2.4.1",
    "system_hash": sha256(system_prompt + tool_schema),
    "prefix_tokens": 4096,   # cached
    "novel_tokens": varies,  # computed fresh per request
}
```
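To see why a warm prefix cache moves the needle this much, a toy expected-cost model helps. The per-1k-token prefill cost below is an assumed figure for illustration, not a measured one:

```python
def expected_prefill_ms(total_tokens, prefix_tokens, hit_rate, ms_per_1k_tokens):
    """Expected prefill latency when the shared prefix is cached with
    probability hit_rate. On a cache hit only the novel tokens are
    computed; on a miss, everything is. Illustrative model only:
    ms_per_1k_tokens is a hardware-dependent assumption."""
    novel = total_tokens - prefix_tokens
    hit_cost = novel / 1000 * ms_per_1k_tokens
    miss_cost = total_tokens / 1000 * ms_per_1k_tokens
    return hit_rate * hit_cost + (1 - hit_rate) * miss_cost

# 8k context, 4k cached prefix, 80% hit rate, assumed 7ms per 1k prefill tokens
print(round(expected_prefill_ms(8000, 4000, 0.8, 7.0), 1))  # prints 33.6
```

Pushing either the hit rate or the cached prefix fraction higher drives the expected cost toward the novel-tokens-only floor, which is why the cache key design above matters so much.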

The engineering challenge is cache invalidation and memory pressure. Maintaining hot KV caches for 240 agents at multiple context lengths is non-trivial GPU memory management. We built a custom eviction policy that weights by recency, access frequency, and prefix length — longer shared prefixes get protected eviction because the cost of recomputing them is higher.
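A sketch of what such a weighted eviction policy might look like; the weight values, normalization constants, and entry fields are illustrative assumptions, not the production policy:

```python
import math
import time

def eviction_score(entry, now=None):
    """Lower score = evicted first. Recency, access frequency, and prefix
    length are weighted; longer shared prefixes are protected because
    recomputing them is more expensive. Weights are illustrative."""
    now = time.time() if now is None else now
    recency = 1.0 / (1.0 + (now - entry["last_access_ts"]))  # 1.0 = just touched
    frequency = math.log1p(entry["hits"])                    # damp heavy hitters
    recompute_cost = entry["prefix_tokens"] / 4096.0         # normalize to a long prefix
    return 0.4 * recency + 0.3 * frequency + 0.3 * recompute_cost

def pick_victim(entries, now=None):
    """Choose the lowest-scoring cache entry to evict under memory pressure."""
    return min(entries, key=lambda e: eviction_score(e, now))
```

The key design choice is that recompute cost enters the score directly: a stale but very long prefix can outrank a fresh but cheap one, which is exactly the protection long shared prefixes need.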

Speculative decoding at the decode layer

The decode phase — where the model generates output tokens one at a time — is inherently serial and hard to parallelize. Each token depends on the previous one. For 200 output tokens at typical decode speed, this takes around 60-80ms on its own — already over budget.

We deploy speculative decoding across the full production fleet. The approach: a small "draft" model predicts the next 4-8 tokens ahead of the main model. The main model then verifies those predictions in a single forward pass (which is parallelizable). When the draft model is right — which happens around 70% of the time for structured, predictable outputs — you effectively get 4-8 tokens for the latency cost of one.

For AIRMY agents, which produce highly structured outputs (JSON, code, markdown with predictable schemas), draft acceptance rates are significantly higher than for open-ended generation. Our agents know what they're supposed to output. That predictability is an asset.
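The accept/verify loop at the heart of speculative decoding can be sketched like this; `verify_fn` stands in for the main model's verification pass (a single parallel forward pass in production, simulated serially here), and the string tokens are toy stand-ins for token IDs:

```python
def speculative_decode_step(context, draft_tokens, verify_fn):
    """One speculative step. The draft model proposed draft_tokens; the
    main model checks them position-by-position. Returns the accepted
    prefix plus one token the main model contributes itself."""
    accepted = []
    for tok in draft_tokens:
        main_tok = verify_fn(context + accepted)
        if main_tok == tok:
            accepted.append(tok)       # draft was right: this token is "free"
        else:
            accepted.append(main_tok)  # first miss: take the main model's token, stop
            return accepted
    # Every draft token accepted: the verify pass also yields one bonus token
    accepted.append(verify_fn(context + accepted))
    return accepted
```

With a structured output like a JSON schema, the draft rarely misses, so most steps return draft length plus one tokens for the latency cost of a single verification pass.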

"Speculative decoding is almost free when your outputs are structured. The draft model can predict the next five tokens of a JSON schema with near certainty. You're essentially getting parallelism where the math says you can't have it."

Edge deployment: eliminating the speed of light

The 3ms network figure in our budget breakdown assumes the client is hitting a nearby edge node. Achieving that requires a deployment topology that puts inference capacity close to where requests originate.

We operate inference nodes in 14 regions. Model weights are fully replicated; KV caches are region-local (with cross-region fallback). For teams in, say, Singapore, requests route to our Singapore inference cluster — not to US-East. At 3ms network, you're not fighting the speed of light.

The tradeoff is cost and operational complexity. Running 14 full inference deployments is expensive. We offset it through request batching at each edge node (micro-batches of 4-8 requests grouped within a 2ms window), which improves GPU utilization enough to make the economics work.
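The windowed micro-batching logic can be sketched with a plain queue. The batch size of 8 and the 2ms window match the figures above, but this queue-based shape is an illustrative sketch rather than the production scheduler:

```python
import queue
import time

def collect_microbatch(requests, max_size=8, window_ms=2.0):
    """Flush a micro-batch when either max_size requests have arrived or
    window_ms has elapsed, whichever comes first. Requests already waiting
    in the queue are drained without blocking."""
    batch = []
    deadline = time.perf_counter() + window_ms / 1000.0
    while len(batch) < max_size:
        remaining = deadline - time.perf_counter()
        if remaining <= 0:
            break  # window expired: flush whatever we have
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # nothing more arrived before the deadline
    return batch
```

The window bounds the added latency at 2ms in the worst case, while bursts fill the batch immediately and flush without waiting, which is what makes the GPU-utilization gain nearly free at P50.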

Context assembly: the silent killer

The 9ms context assembly figure is a number I'm proud of, because it was 80ms six months ago. Context assembly — loading the agent's relevant memory, injecting tool results, constructing the final prompt — used to be synchronous and serialized. We'd load from a vector store, wait for results, format, assemble, then pass to the inference engine.

We restructured this entirely. Context components that don't depend on each other (memory retrieval, tool schema hydration, system prompt injection) now load in parallel. We precompute the static components (system prompt, tool definitions) at agent instantiation time rather than per-request. Hot memory entries are cached in a fast sidecar store co-located with the inference node.

The result is that by the time a request arrives at the inference node, ~75% of the context is already assembled. We're only doing dynamic work for the parts that depend on the specific request.
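The parallel-assembly shape is straightforward with structured concurrency; the loader names and their latencies below are illustrative stand-ins for the real retrieval calls:

```python
import asyncio

async def assemble_context(request):
    """Load independent context components concurrently rather than
    serially. In production the static components are precomputed at
    agent instantiation, so only the dynamic lookups cost anything
    per request."""
    async def load_memory(req):
        await asyncio.sleep(0.003)  # stand-in: vector-store / sidecar lookup
        return ["memory entry"]

    async def hydrate_tools(req):
        await asyncio.sleep(0.002)  # stand-in: tool schema hydration
        return ["tool schema"]

    async def static_components(req):
        return "system prompt"      # precomputed; effectively free

    memory, tools, system = await asyncio.gather(
        load_memory(request), hydrate_tools(request), static_components(request)
    )
    return {"memory": memory, "tools": tools, "system": system}

context = asyncio.run(assemble_context({"agent": "@airmy/backend-engineer"}))
```

Because the components are independent, total assembly time collapses to the slowest single lookup instead of the sum of all of them.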

What we haven't done

It's worth being honest about what we've chosen not to do in pursuit of lower latency. We haven't quantized our models below INT8 — the precision tradeoff isn't worth it for the task accuracy our agents need. We haven't used extremely small models ("nanomodels") that can respond in 10ms — the capability gap is too large. And we haven't cached full responses, because agent outputs need to be dynamic and context-sensitive.

Every optimization we've shipped has preserved full model capability. The 48ms P50 is not a dumbed-down product. It's the same quality agent, faster.

The next target

We're working toward 30ms P50 for our most-used agent types. The primary lever is further improving draft model acceptance rates through task-specific draft models trained on agent output distributions. If you're curious about the details, we'll publish the technical report in Q2.

The broader point: latency isn't a problem you solve once. It's an engineering discipline you maintain. As context windows grow, as tool schemas become more complex, as agents chain into multi-step workflows — the pressure on the budget never goes away. You have to keep engineering.

48ms took 18 months. 30ms will take another year. That's the job.


Marcus Thorn

Head of Infrastructure, AIRMY. Previously at Google DeepMind where he designed the distributed serving stack behind Gemini's inference layer.
