Research

Notes from production.

Engineering writing on the systems behind agentic AI — inference, retrieval, evaluation, and the topology of real deployments.

2 posts · Updated April 2026

Inference Systems · Part 1 of 2

February 2026 · 8 min read

Prefill and Decode Are Not the Same Job

Prefill and decode are the two halves of every transformer inference call, but they place radically different demands on the GPU. We trace why running them on the same hardware leaves throughput on the table — and what production-scale agent serving looks like when you stop pretending they are the same job.

Read the post

Inference Systems · Part 2 of 2

April 2026 · 11 min read

Moving the Cache: RDMA, NIXL, and the Topology Problem

Once prefill and decode are split onto separate workers, moving the KV cache between them becomes the bottleneck. We trace how RDMA and the NIXL transfer library move that state across racks at line rate, and why most production deployments still get the topology wrong.

Read the post

Working on something similar?

If your team is moving agentic AI from pilot to production and running into the systems problems we write about here, we would like to hear what you are working on.

Start a conversation