Prefill vs Decode: Understanding LLM Inference Latency

A clear explanation of the two distinct latency phases in LLM inference, Prefill and Decode, to help developers diagnose and optimize AI response times.

When users complain about slow AI responses, they are usually describing one of two delays: the initial wait before the model starts replying, or slow token-by-token generation once it has begun. These correspond to the two phases of LLM inference, Prefill and Decode. Prefill processes the entire input prompt in parallel and computes the key-value (KV) cache used by the attention mechanism, so its cost grows with prompt length and it largely determines the time to first token. Decode generates output tokens one at a time; each step is a forward pass for the newest token that reuses the cached keys and values, so its cost grows with the number of tokens generated.

Understanding this distinction is crucial for optimizing inference pipelines, because techniques like continuous batching, speculative decoding, and KV-cache management target the two phases differently. For developers building AI applications, knowing whether latency is dominated by Prefill or Decode helps in choosing the right optimization strategy. Prefill-heavy workloads (long prompts, short answers) benefit from prompt compression, which reduces the number of input tokens to process. Decode-heavy workloads (short prompts, long answers) benefit from model quantization, which reduces the volume of weights that must be read from memory on every step, typically the bottleneck during Decode. This foundational knowledge is essential for anyone deploying LLMs in production.
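One way to tell which phase dominates is to timestamp a streamed response. The following is a minimal sketch, not a definitive tool: it assumes you have recorded when the request was sent and when each token arrived, and it treats time to first token as a proxy for Prefill and the span from first to last token as a proxy for Decode. The names `LatencyBreakdown` and `breakdown` are hypothetical, and in a real deployment the first-token time also includes queuing and network delay.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LatencyBreakdown:
    ttft_s: float      # time to first token: proxy for Prefill (plus queuing/network)
    decode_s: float    # time from first token to last token: proxy for Decode
    mean_itl_s: float  # mean inter-token latency during Decode

    @property
    def dominant_phase(self) -> str:
        # Ties are reported as "prefill"; this is an arbitrary choice for the sketch.
        return "prefill" if self.ttft_s >= self.decode_s else "decode"


def breakdown(request_start_s: float, token_times_s: Sequence[float]) -> LatencyBreakdown:
    """Split end-to-end latency into Prefill and Decode proxies.

    request_start_s: when the request was sent (seconds, any monotonic clock).
    token_times_s:   arrival time of each streamed token, in order.
    """
    if not token_times_s:
        raise ValueError("need at least one token timestamp")
    ttft = token_times_s[0] - request_start_s
    decode = token_times_s[-1] - token_times_s[0]
    gaps = len(token_times_s) - 1  # number of intervals between tokens
    mean_itl = decode / gaps if gaps else 0.0
    return LatencyBreakdown(ttft, decode, mean_itl)


if __name__ == "__main__":
    # Made-up timestamps for illustration: first token at 0.5 s, then one every 0.25 s.
    result = breakdown(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
    print(result, result.dominant_phase)
```

With the made-up timestamps above, the first token takes 0.5 s and the remaining four tokens take 1.0 s in total, so this request would be classified as Decode-heavy. Timestamps gathered with `time.perf_counter()` around a streaming client would feed into the same function.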