The standard way to serve a transformer model treats inference as a single problem. A request comes in, the model produces tokens, and somewhere along the way the GPU stays busy. That framing is convenient. It is also wrong in ways that get more expensive every year you run it in production.
Inference has two halves: prefill and decode. They share weights and activations but almost nothing else. They demand different things from the GPU, they bottleneck on different resources, and when you treat them as one workload, the cost shows up as latency variance, throughput left on the table, and engineering hours spent debugging the symptoms instead of the cause.
This is the first of two notes on the prefill–decode disaggregation problem. The second covers what to do about it: how RDMA and NVIDIA's NIXL transfer library move the KV cache between specialized workers without paying the latency tax that naive transfer would impose. This first note is about why you would want to do that at all.
What prefill actually does
Prefill is the phase where the model reads the prompt. For a request with N input tokens, prefill runs all N tokens through the network in a single forward pass — every layer, every attention head, every feedforward block. The output is a key-value (KV) cache: per-layer tensors that record the attention keys and values for every input token, ready to be referenced when generation begins.
Two properties define prefill on modern hardware. First, it is highly parallelizable. The N input tokens can be processed concurrently within a single forward pass, so prefill keeps the GPU's compute units saturated. Second, it is compute-bound. The arithmetic intensity is high — many floating-point operations per byte of memory traffic — and on a current-generation accelerator like an H100 or H200, that means the bottleneck is FLOPs, not memory bandwidth.
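A back-of-envelope roofline check makes this concrete. The sketch below assumes a dense model where one forward pass costs roughly two FLOPs per parameter per token, fp16 weights streamed from HBM once per pass, and approximate published H100 peaks; the specific model size cancels out of the ratio, so treat the numbers as illustrative, not measured.

```python
# Rough roofline check: is a forward pass compute-bound or bandwidth-bound?
# Assumes a dense model (~2 FLOPs per parameter per token), fp16 weights read
# once per pass, and approximate published H100 SXM peaks. Illustrative only.

PARAMS = 8e9             # assumed 8B-class model; cancels out of the ratio below
BYTES_PER_PARAM = 2      # fp16 / bf16
PEAK_FLOPS = 990e12      # dense bf16 tensor-core peak, FLOP/s
HBM_BANDWIDTH = 3.35e12  # bytes/s

def arithmetic_intensity(n_tokens: int) -> float:
    """FLOPs per byte of weight traffic for one pass over n_tokens tokens."""
    flops = 2 * PARAMS * n_tokens             # ignores attention's quadratic term
    weight_bytes = PARAMS * BYTES_PER_PARAM   # every weight streamed once per pass
    return flops / weight_bytes

ridge = PEAK_FLOPS / HBM_BANDWIDTH            # ~295 FLOPs/byte on these numbers

print(f"ridge point          ≈ {ridge:.0f} FLOPs/byte")
print(f"prefill, 4K prompt   ≈ {arithmetic_intensity(4096):.0f} FLOPs/byte (compute-bound)")
print(f"decode, 1 token/step ≈ {arithmetic_intensity(1):.0f} FLOPs/byte (bandwidth-bound)")
```

Anything far above the ridge point is limited by FLOPs; anything far below it is limited by HBM bandwidth. A prefill over a few thousand tokens sits an order of magnitude above the line.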
The practical consequence is that prefill loves long sequences and big batches. Throughput in tokens-per-second scales close to linearly with batch size until you exhaust GPU memory or violate a latency SLO. For a serving system that cared about prefill efficiency in isolation, the obvious move would be to fill every batch as wide and as long as the hardware allows.
What decode actually does
Decode is the phase where the model produces output tokens. For a generation of M tokens, decode runs M sequential forward passes, each producing exactly one token. Each pass reads the weights, references the KV cache built during prefill (and extended by previous decode steps), computes one new token, and appends to the cache.
The properties are inverted. First, decode is sequential by definition: you cannot generate token t+1 until you have produced token t. Second, decode is memory-bandwidth-bound. Each forward pass touches every weight in the model — billions of parameters — to produce a single token. Arithmetic intensity collapses. The GPU's compute units sit underutilized while HBM bandwidth becomes the limiter.
The consequence is that decode efficiency depends on the number of sequences you are generating in parallel, not the number of tokens in any one sequence. A single sequence in decode wastes most of the GPU. Sixty-four concurrent sequences, batched together, share the cost of each weight read across all of them.
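To put the ceiling in numbers: ignoring KV-cache traffic and kernel overheads, each decode step has to stream the full weight set from HBM at least once, so bandwidth divided by model size bounds the step rate. The figures below assume an 8B-class model in fp16 and a rough H100 bandwidth number.

```python
# Bandwidth ceiling on decode throughput. Every decode step streams the full
# weight set from HBM at least once, no matter how many sequences share the
# step. Ignores KV-cache reads and kernel overheads; numbers are illustrative.

MODEL_BYTES = 8e9 * 2     # assumed 8B-class model, fp16 weights
HBM_BANDWIDTH = 3.35e12   # bytes/s, rough H100 figure

step_rate = HBM_BANDWIDTH / MODEL_BYTES    # decode steps per second, upper bound

for batch_size in (1, 8, 64):
    # One token per sequence per step: aggregate throughput scales with the
    # batch while the per-sequence rate stays pinned at the same ceiling.
    print(f"batch {batch_size:>2}: ≈ {batch_size * step_rate:6.0f} tok/s aggregate, "
          f"≈ {step_rate:.0f} tok/s per sequence")
```

The per-sequence rate never moves; only the aggregate does. That is the entire case for keeping decode batches wide.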
This is why continuous batching matters. It is the technique that lets a serving system insert and remove sequences from a decode batch on every step — keeping the batch size as large as possible, even as individual requests start and finish at unpredictable times. Continuous batching solves the decode efficiency problem. But it only solves it if decode runs on hardware that is not also being asked to do prefill.
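In code, the idea is small. The sketch below is a deliberately simplified decode loop, not any particular engine's scheduler; `Request`, `model.decode_step`, and the finish check are hypothetical stand-ins, and real systems layer paged KV memory, preemption, and chunked prefill on top of this shape.

```python
import collections

# Minimal sketch of a continuous-batching decode loop. Names such as
# `model.decode_step` and `Request` are hypothetical; Orca-style engines,
# vLLM, and SGLang all elaborate heavily on this basic structure.

class Request:
    def __init__(self, prompt_kv, max_new_tokens):
        self.kv = prompt_kv               # KV cache produced by prefill
        self.generated = []
        self.max_new_tokens = max_new_tokens

    def finished(self) -> bool:
        return len(self.generated) >= self.max_new_tokens  # or EOS, in practice

def decode_loop(model, waiting: collections.deque, max_batch: int):
    running = []
    while running or waiting:
        # Admit new sequences every step, not only when the batch drains:
        # this is what keeps the decode batch wide as requests come and go.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())

        # One forward pass produces exactly one new token per running sequence.
        next_tokens = model.decode_step([r.kv for r in running])
        for req, tok in zip(running, next_tokens):
            req.generated.append(tok)

        # Retire finished sequences immediately so their slots are reusable
        # on the very next step.
        running = [r for r in running if not r.finished()]
```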
The hidden cost of running them together
Most LLM serving stacks today co-locate prefill and decode on the same GPU. A new request comes in, the system runs prefill to build the KV cache, then transitions to decode. The same physical hardware does both jobs, scheduled by the runtime.
The cost of this approach is not visible on a throughput chart. It shows up in two places: time-to-first-token (TTFT) variance under load, and aggregate hardware utilization that never reaches what either workload could deliver in isolation.
Consider the TTFT failure mode. A user submits a short request. The server is currently in the middle of a long prefill from another tenant — say, a 32K-token document being ingested by an agent. Until that prefill completes, the new request cannot start its own prefill, and the user's first token is delayed by hundreds of milliseconds, sometimes seconds. This is the head-of-line blocking problem in inference, and it gets worse the more concurrent traffic the system handles.
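A rough estimate of how long that blocking window lasts, assuming an 8B-class model, about two FLOPs per parameter per token, and an optimistic 50% of H100 peak actually achieved:

```python
# How long does a long prefill occupy the GPU? Back-of-envelope estimate,
# assuming an 8B-class model, ~2 FLOPs per parameter per token, and 50% of
# H100 dense bf16 peak actually achieved. Illustrative, not measured.

PARAMS = 8e9
PEAK_FLOPS = 990e12
MFU = 0.5                 # assumed model FLOPs utilization

def prefill_seconds(prompt_tokens: int) -> float:
    return (2 * PARAMS * prompt_tokens) / (PEAK_FLOPS * MFU)

print(f"32K-token prefill ≈ {prefill_seconds(32_768):.1f} s")       # roughly a second of blocking
print(f"512-token prefill ≈ {prefill_seconds(512) * 1e3:.0f} ms")   # what the short request needed
```

A request that needed tens of milliseconds of prefill waits behind a second of someone else's.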
The utilization story is the symmetric problem. When prefill and decode interleave on the same hardware, neither runs at the batch sizes that would saturate it. Prefill batches stay small because the system needs to flip back to decode to serve outstanding requests. Decode batches stay smaller and slower than they should be because prefill keeps stealing the same compute and memory bandwidth. Both phases run inefficiently, and the total throughput is meaningfully below what either could deliver alone.
The deeper issue is that the two workloads have incompatible scaling preferences. Prefill wants bursts of large, compute-dense batches over long sequences. Decode wants a continuous, wide batch with as little interruption as possible. A single GPU cannot do both well at the same time.
Disaggregated inference
The alternative is to split prefill and decode onto separate workers. A prefill cluster takes incoming requests, runs the prompt through the model, and produces a KV cache. A decode cluster receives that cache and generates output tokens. Each cluster runs at the batch size that suits its workload, on hardware sized for its actual bottleneck.
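The request path, reduced to its shape, looks like the sketch below. All names here are hypothetical, and the transfer step in the middle is precisely the part the next note is about.

```python
# Shape of a disaggregated request path. Worker and transport names are
# hypothetical; the actual transfer mechanism (RDMA, NIXL, topology-aware
# placement) is covered in the follow-up note.

def serve_request(prompt_tokens, prefill_worker, decode_worker, transport):
    # 1. Prefill cluster: one compute-dense pass over the whole prompt,
    #    producing the KV cache and, as a byproduct, the first output token.
    kv_cache, first_token = prefill_worker.prefill(prompt_tokens)

    # 2. Move the KV cache to a decode worker. This is the new cost that
    #    co-located serving never pays; it must finish before decode step one.
    handle = transport.send(kv_cache, dst=decode_worker)

    # 3. Decode cluster: join an already-running continuous batch and
    #    generate at whatever batch size keeps HBM bandwidth saturated.
    return decode_worker.generate(handle, first=first_token)
```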
The result is dramatic on paper and substantial in practice. Recent work on disaggregated serving — DistServe, the SGLang and TensorRT-LLM efforts, NVIDIA's Dynamo platform — reports throughput-per-GPU improvements of two to four times over co-located serving, with TTFT and inter-token-latency tails that are significantly tighter under concurrent load.
But disaggregation introduces a new problem: the KV cache, which used to live on the same GPU that needed it, now has to move. For a typical serving configuration, that cache is hundreds of megabytes per request, and it has to arrive at the decode worker before the first decode step can begin. If the transfer is slow, you have just traded one latency problem for another.
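Where those hundreds of megabytes come from, for an assumed 8B-class configuration with grouped-query attention (32 layers, 8 KV heads, head dimension 128, fp16); swap in your own model's numbers:

```python
# Per-token KV cache size for an assumed Llama-3-8B-like configuration:
# 32 layers, 8 KV heads (grouped-query attention), head dim 128, fp16.
# Illustrative arithmetic only; substitute your model's actual shape.

LAYERS = 32
KV_HEADS = 8          # KV heads under GQA, not query heads
HEAD_DIM = 128
BYTES_PER_VALUE = 2   # fp16 / bf16

bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE  # K and V
print(f"per token : {bytes_per_token / 1024:.0f} KiB")                # ~128 KiB

for prompt_len in (2_048, 8_192, 32_768):
    total = prompt_len * bytes_per_token
    print(f"{prompt_len:>6} tokens: {total / 2**20:.0f} MiB to move before decode starts")
```

At chat-length prompts this is hundreds of megabytes; at the 32K document lengths agents routinely ingest, it runs into gigabytes.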
This is where most production deployments stall. The disaggregation idea is well understood. The mechanism for moving the cache without paying an unacceptable latency tax is not. Solving it requires reaching past the standard networking stack — past TCP, past the kernel — into RDMA and into hardware-aware placement of the workers themselves. That is the subject of the next note.
What this means for agentic workloads
Everything above gets worse when you serve agents instead of chat. An agent makes many model calls per user-facing turn — tool selection, observation summarization, planning, output generation — and each of those calls has its own prefill and decode. The prefill component grows fast: long system prompts, accumulated tool outputs, reference materials retrieved from a knowledge base, reasoning traces from prior steps. It is not unusual for agentic workloads to spend more aggregate compute on prefill than on decode.
At the same time, the latency budget for an agent is brutal. A user expects an agent to respond in a few seconds. If that response involves five model calls, each call has at most a few hundred milliseconds of TTFT to spend. A single tail-latency event during a co-located prefill blows the budget for the entire turn.
The conclusion follows. Production agent serving cannot be done well on a co-located stack at scale. The economic argument — better throughput per GPU — is the headline. The latency argument — predictable tail behavior under realistic load — is the one that matters when an agent has to ship.
Where this goes
This note has argued for splitting prefill and decode. The next note in this series covers the harder problem: moving the KV cache between the two halves at line rate, with topology-aware placement that keeps tail latency in check. The mechanics involve RDMA, NVIDIA's NIXL transfer library, and a set of deployment decisions that most production teams leave on the table.
In less than two years, disaggregated inference has moved from research papers to production systems at the largest LLM serving operators. The teams that get it right will have a structural cost advantage, and a latency profile their competitors cannot match. The teams that do not will keep paying for hardware they are not using.
References & further reading
- Zhong et al., DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized LLM Serving, OSDI 2024.
- Yu et al., Orca: A Distributed Serving System for Transformer-Based Generative Models, OSDI 2022 — the foundational continuous-batching paper.
- Patel et al., Splitwise: Efficient Generative LLM Inference Using Phase Splitting, ISCA 2024.
- NVIDIA Dynamo — distributed inference framework, prefill–decode disaggregation chapter.
- SGLang — scheduler design notes and disaggregation modes.