The previous note argued that prefill and decode are different jobs and should run on different workers. Once you accept that argument, you have replaced one problem with another. The KV cache that used to live on the GPU that needed it now has to move. Hundreds of megabytes per request, generated by a prefill worker, consumed by a decode worker — and the user is waiting for the first token while it crosses the wire.
This is where most disaggregated inference deployments stop being interesting in benchmarks and start being painful in production. The disaggregation idea is well understood. The transfer problem is not. Solving it requires reaching past the standard networking stack, into RDMA and into the topology of the cluster itself.
This is the second of two notes on prefill–decode disaggregation. The first covered the why. This one covers the how.
What you are actually transferring
Before talking about the wire, it is worth being precise about the payload.
For a request with N input tokens running on a model with L transformer layers, H key-value heads, and head dimension D, the KV cache size scales as 2 × L × H × D × N × bytes-per-element. The factor of two accounts for both the keys and the values; N is the prompt length; the bytes-per-element depends on the precision used. Note that H is the number of KV heads, which under grouped-query attention is far smaller than the number of query heads.
Concretely, for a 70-billion-parameter model with 80 layers, 8 KV heads under grouped-query attention, head dimension 128, and fp16 KV: roughly 320KB per token. A 4K-token prompt produces a 1.3GB cache. An 8K-token prompt produces roughly 2.7GB. Even with quantized KV in int8 or fp8, you are moving hundreds of megabytes per request.
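As a sanity check, the formula fits in a few lines of code. The example numbers are the ones above (80 layers, 8 KV heads, head dimension 128, fp16 cache); swap in your own model's shape and precision.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    # 2 accounts for keys and values; bytes_per_elem = 2 for fp16.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem

# 70B-class example: 80 layers, 8 KV heads (GQA), head dimension 128, fp16.
print(kv_cache_bytes(1, 80, 8, 128))      # 327,680 bytes, roughly 320KB per token
print(kv_cache_bytes(4096, 80, 8, 128))   # ~1.3GB for a 4K-token prompt
print(kv_cache_bytes(8192, 80, 8, 128))   # ~2.7GB for an 8K-token prompt
```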
That is the data that has to land on the decode worker's GPU before generation can start. It is large, it is per-request, and it is on the critical path for time-to-first-token.
Why TCP is not the answer
The default networking stack is built around assumptions that do not survive contact with this workload.
A standard TCP send copies the data from user space into kernel buffers, the kernel sends segments and handles acknowledgments, the receiver kernel reassembles the stream into kernel buffers, and the user-space process reads it back out. Every one of those steps involves CPU work and memory copies. Per-message overhead is on the order of tens of microseconds even on a tuned stack. Under any kind of load — concurrent senders, head-of-line blocking, queue buildup — that overhead grows fast.
For a one-gigabyte cache between nodes connected by a 100Gbps link, the wire itself is not the bottleneck: at line rate the transfer takes about 80 milliseconds. In practice, with TCP and standard kernel networking, much of the latency budget goes to protocol overhead and copies rather than to moving bytes. The latency floor matters more than the bandwidth ceiling for this workload.
The other half of the problem is the destination. Even once the cache arrives at the receiving host, it still has to be copied from host memory to GPU memory before decode can start, a hop bounded by PCIe bandwidth and managed by the CPU. A naive transfer pipeline easily spends hundreds of milliseconds moving a single cache between nodes, several times the entire TTFT budget for an agent call.
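A back-of-envelope budget shows where that time goes. Every figure in the sketch below is an assumption (100Gbps link, TCP running at a fraction of line rate under load, roughly 25 GB/s of usable PCIe bandwidth for the host hop); plug in your own cluster's numbers, because the shape of the result is what matters.

```python
def naive_transfer_ms(cache_gb: float = 1.3,
                      link_gbps: float = 100.0,
                      tcp_efficiency: float = 0.6,
                      pcie_gbytes_per_s: float = 25.0) -> float:
    """Rough budget for a TCP + host-bounce pipeline. All inputs are assumptions."""
    wire_ms = cache_gb * 8 / (link_gbps * tcp_efficiency) * 1000  # serialization on the link
    pcie_ms = cache_gb / pcie_gbytes_per_s * 1000                 # one GPU<->host copy over PCIe
    # GPU->host copy on the sender, the wire, then host->GPU copy on the receiver.
    return pcie_ms + wire_ms + pcie_ms

print(f"~{naive_transfer_ms():.0f} ms for a 1.3GB cache")  # on the order of 280 ms
```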
RDMA, briefly
Remote Direct Memory Access is the standard answer. The idea is simple: one side writes directly into the other side's memory, with no kernel involvement on either side. The NIC is told what region of memory to read from and what remote region to write to, and it handles the transfer end to end. No context switches, no kernel buffers, no protocol-level acknowledgment of every segment.
There are two flavors that matter in modern data centers. InfiniBand is native RDMA — separate fabric, dedicated hardware, lowest latency, but expensive to deploy. RoCE, RDMA over Converged Ethernet, runs RDMA semantics over standard Ethernet hardware and is the more common choice in commodity data centers today. Both deliver per-message latency in the single-digit microsecond range and aggregate throughput close to the hardware bandwidth limit.
For our purposes, what matters is that RDMA solves the host-side overhead problem. The CPU is essentially out of the picture. The NIC reads from registered memory regions and writes them to the destination, full stop.
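The flow is worth seeing once, even as a sketch. The Python below is a hypothetical wrapper, not a real binding; at the verbs level the corresponding C calls are ibv_reg_mr to register the buffer, ibv_post_send with an RDMA-write opcode, and a poll of the completion queue. The shape is what matters: register memory, exchange the remote address and key out of band, post a one-sided write, wait for completion.

```python
def push_kv_cache(conn, local_buf):
    """Hypothetical one-sided RDMA write flow; names are illustrative only."""
    # 1. Register (pin) the local buffer so the NIC can DMA from it directly.
    mr = conn.register_memory(local_buf)

    # 2. Learn where to write on the remote side. The remote address and rkey
    #    travel over an ordinary control channel (TCP, gRPC), once per transfer;
    #    this handshake is the only part the CPUs handle.
    remote_addr, rkey = conn.exchange_destination(len(local_buf))

    # 3. Post a one-sided write. The remote CPU is never involved: its NIC
    #    places the bytes straight into the registered remote region.
    wr = conn.post_rdma_write(mr, remote_addr, rkey)

    # 4. Poll the completion queue. When the work request completes, the data
    #    is in remote memory and decode can be told to start.
    conn.wait_for_completion(wr)
```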
GPUDirect RDMA
The next step is to remove the host RAM hop entirely. GPUDirect RDMA lets the NIC read and write directly to GPU memory without going through host RAM. The combination — RDMA over the network plus GPUDirect on the host — delivers something close to the theoretical minimum overhead for moving GPU-resident data between machines.
This matters more than it might seem. Without GPUDirect, every cache transfer involves two extra copies — GPU to host on send, host to GPU on receive — each bounded by PCIe bandwidth and incurring CPU work to orchestrate. With GPUDirect, the only PCIe traffic is the one-way DMA from GPU to NIC on send, and from NIC to GPU on receive. The host CPU does not touch the data.
For production agentic workloads, GPUDirect RDMA is not optional. The latency floor without it is too high to make disaggregated serving worth doing.
NIXL: the abstraction layer
Knowing about RDMA does not tell you how to use it well. The transports are heterogeneous. The same cluster might have NVLink between GPUs in the same node (fastest, on the order of a microsecond), PCIe between GPUs across NUMA domains, RoCE between nodes in the same rack, and standard TCP across availability zones. A serving system that wants to move KV caches efficiently has to use the best transport available between any given pair of workers.
Writing that logic by hand is the sort of thing that turns into months of work and a maintenance liability. NIXL — the NVIDIA Inference Xfer Library — is the abstraction that handles it. It exposes a uniform API for moving tensors between memory regions and underneath picks the best available transport: NVLink if the source and destination are GPUs in the same node, RDMA over InfiniBand or RoCE if they are in the same fabric, TCP fallback if the topology requires it. NIXL is now the transfer layer used by NVIDIA Dynamo, and it is the right starting point for any team that does not want to build a transport abstraction from scratch.
The point of NIXL is not that it does anything magic. RDMA was already available. NVLink was already available. NIXL's value is that it lets you write transport-agnostic scheduling logic above it, and pick the best transport per pair at runtime. That separation of concerns — scheduling above, transport below — is what makes disaggregated serving tractable to operate.
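To make the scheduling-above, transport-below split concrete, here is a stand-in for the decision NIXL makes for you. This is not NIXL's API; the worker metadata and the three-way classification are assumptions, but the structure is the point: classify the pair by where it sits in the topology, then use the fastest link the pair shares.

```python
from enum import Enum

class Transport(Enum):
    NVLINK = "nvlink"  # source and destination GPUs in the same node
    RDMA = "rdma"      # same fabric (InfiniBand or RoCE), GPUDirect where available
    TCP = "tcp"        # fallback across fabrics or availability zones

def pick_transport(src, dst) -> Transport:
    """Choose the fastest link two workers share. `src` and `dst` are assumed
    to carry node, fabric, and zone identifiers from service discovery."""
    if src.node == dst.node:
        return Transport.NVLINK
    if src.fabric == dst.fabric:
        return Transport.RDMA
    return Transport.TCP
```

In a real deployment this decision lives inside the transfer library; the scheduler's job, covered next, is to place workers so that the answer is NVLINK or RDMA as often as possible.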
The topology problem
Having NIXL is not enough. The serving system also needs to make placement decisions that minimize how far KV caches actually travel.
The cluster has a hierarchy. GPUs in the same node share NVLink. Nodes in the same rack share a top-of-rack switch. Racks in the same row share an aggregation switch. Different availability zones share nothing fast. Each step up the hierarchy adds meaningful latency — and the total range, from intra-node NVLink to cross-fabric TCP, spans roughly two orders of magnitude.
A naive disaggregated scheduler routes each request to whichever prefill worker has capacity, and whichever decode worker has capacity, with no awareness of where they are. Caches end up moving across the fabric on every request. Tail latency suffers.
A topology-aware scheduler treats locality as a first-class scheduling constraint. Prefill and decode workers that frequently exchange caches get placed in the same node or rack. Requests with shared prefixes — common in agentic workloads, where many calls share a long system prompt — get routed to prefill/decode pairs that already have warm cache for that prefix. The transfer becomes a delta, not a fresh send.
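Here is a minimal sketch of what treating locality as a scheduling constraint looks like in code. The worker fields, the capacity check, and the scoring weights are all assumptions; the structure is what carries over: score candidate prefill/decode pairs by topology distance and prefix warmth, then pick the cheapest.

```python
def topology_distance(prefill, decode) -> int:
    # Smaller is better: same node < same rack < same fabric < cross-fabric.
    if prefill.node == decode.node:
        return 0
    if prefill.rack == decode.rack:
        return 1
    if prefill.fabric == decode.fabric:
        return 2
    return 3

def choose_pair(request, prefill_workers, decode_workers):
    """Pick the prefill/decode pair that keeps the KV transfer short and
    reuses any warm prefix cache. Weights are illustrative."""
    best, best_score = None, float("inf")
    for p in prefill_workers:
        for d in decode_workers:
            if not (p.has_capacity() and d.has_capacity()):
                continue
            score = 10.0 * topology_distance(p, d)
            # A warm prefix on either side turns a full transfer into a delta.
            if request.prefix_hash in p.cached_prefixes:
                score -= 5.0
            if request.prefix_hash in d.cached_prefixes:
                score -= 5.0
            if score < best_score:
                best, best_score = (p, d), score
    return best
```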
Three failure modes show up repeatedly in production deployments that have moved to disaggregation but stopped short of getting placement right.
The first is treating the cache transfer purely as a network problem. The team optimizes RDMA throughput, tunes congestion control, validates wire-rate transfers in microbenchmarks — and then watches tail latency stay high in production because the scheduler is sending caches across the fabric when they could have stayed in-rack.
The second is ignoring prefix locality. In agentic serving, the same long system prompt is shared across many concurrent sessions. A prefill worker that has just produced a cache for that prompt should hand off to a decode worker that already has affinity for that prefix — not to a worker on the other side of the cluster that has never seen it. Without this routing, the system loses most of the benefit of prefix caching.
The third is not measuring. Topology-driven tail latency is a per-request property: a single transfer that crosses one extra switch hop adds ten or more microseconds. When the SLO is tight and the request rate is high, those hops accumulate. Teams that rely on aggregate throughput dashboards miss this entirely.
What good looks like
A production-grade disaggregated stack has roughly four properties.
GPUDirect RDMA on every transfer that crosses a node boundary. No host bounces, no CPU in the path. Without this, the floor on transfer latency is too high to make the architecture worth deploying.
A transport abstraction that uses the best link available. NVLink intra-node, RDMA across nodes, graceful TCP fallback when the topology demands it. NIXL is the obvious choice; building this layer in-house is rarely the right use of engineering time.
Topology-aware scheduling. Placement decisions made with knowledge of the physical hierarchy. Prefill/decode pairs co-located when they exchange caches frequently. Prefix-aware routing so that warm-cache benefits cross the disaggregation boundary.
Per-request latency measurement at the transfer layer. Broken down by topology distance, with explicit alerts when the distribution shifts. Aggregate throughput dashboards will not catch the failure modes above; a per-request histogram of transfer latency will.
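For the measurement piece specifically, a sketch of the instrumentation. The metric name and labels are assumptions, and the Prometheus client is just one way to do it; what matters is that every transfer is observed individually, tagged with its transport and topology distance, so a shift in the cross-rack bucket shows up as soon as placement regresses.

```python
from prometheus_client import Histogram

# Buckets in seconds, spanning in-rack RDMA transfers to badly routed ones.
KV_TRANSFER_SECONDS = Histogram(
    "kv_transfer_seconds",
    "Wall-clock time to move one KV cache from prefill to decode",
    ["transport", "topology"],  # e.g. ("rdma", "same_rack")
    buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0],
)

def record_transfer(transport: str, topology: str, seconds: float) -> None:
    # One observation per request, never an average: tail latency is a
    # per-request property, and aggregates hide exactly the hops that hurt.
    KV_TRANSFER_SECONDS.labels(transport=transport, topology=topology).observe(seconds)
```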
None of these are exotic. Every component is available off the shelf or in open-source serving stacks. What is rare is the integration: making them work together under realistic concurrent load, measuring the right things, and tuning the placement policy as workloads evolve.
The economics
The reason this matters operationally is that tail latency on the cache transfer translates directly to GPU utilization. Slow transfers serialize the pipeline. A decode worker that is waiting for a cache is a decode worker that is not generating tokens — even though it is fully booked from a scheduling perspective.
The throughput numbers from the disaggregation literature — 2x to 4x improvement over co-located serving — assume the transfer problem is solved. In practice, the difference between a topology-aware deployment and a topology-unaware one is often another 1.5x to 2x on top of that. The compounded effect is the difference between disaggregated serving being a clear win and being a wash.
Closing
Disaggregated inference is moving from research papers to production at the largest LLM serving operators. The pattern is settled: prefill and decode run on different workers, the KV cache moves between them, and RDMA-class transports plus topology-aware scheduling make the math work.
The teams that get this right will run the same workload on meaningfully less hardware than the teams that do not. They will have tighter tail latency. They will be able to serve agents that the co-located stacks cannot serve at all under load.
There is no getting this right by accident. The mechanics are specific. The topology decisions matter. The measurement has to be built in from the start. But it is a tractable engineering problem — and the teams that treat it as such will pull ahead of the teams that treat disaggregation as a deployment-flag flip.
References & further reading
- NIXL — NVIDIA Inference Xfer Library, the transport abstraction underneath Dynamo.
- NVIDIA Dynamo — distributed inference framework with native disaggregation support.
- GPUDirect RDMA Documentation, NVIDIA developer docs.
- Zhong et al., DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized LLM Serving, OSDI 2024 — Section 5 covers transfer mechanics.
- Patel et al., Splitwise: Efficient Generative LLM Inference Using Phase Splitting, ISCA 2024 — discusses transfer across heterogeneous machines.