Beam Search is a classic decoding strategy for multi-candidate generation, but in large language model inference engines it amplifies the complexity of KV cache management, scheduling, pruning, and architectural decoupling. This article compares the implementation trade-offs across vLLM, TensorRT-LLM, Transformers, and SGLang, and extracts practical high-performance engineering patterns. Keywords: Beam Search, KV Cache, SGLang.
Technical Specifications Snapshot
| Parameter | Details |
|---|---|
| Topic | Beam Search in LLM inference engines |
| Primary Languages | Python, C++, CUDA |
| Related Protocols | OpenAI-Compatible API, ZMQ |
| Evaluation Model | Qwen3-1.7B |
| Hardware Environment | Single NVIDIA L20 GPU |
| Reference Projects | SGLang, vLLM, TensorRT-LLM, Transformers |
| Core Dependencies | PyTorch, CUDA, TensorRT, Prefix Caching |
The Real Value of Beam Search Lies in Optimal Multi-Result Retention, Not Just Search
The fundamental difference between Beam Search and greedy decoding is that Beam Search does not commit to a single token at each step. Instead, it retains the beam_width paths with the highest cumulative scores, which counters single-path decoding's tendency to get trapped in local optima.
In conversational LLMs, Beam Search is not the default mainstream strategy. But in search, recommendation, retrieval, and recall-oriented tasks, multi-candidate outputs align well with business requirements. For that reason, inference engines still need to treat Beam Search as a first-class capability.
AI Visual Insight: This image serves as the article header and introduces the engineering theme of Beam Search in inference engines. It emphasizes that Beam Search is not just a standalone algorithmic issue, but a system-level optimization problem spanning performance, scheduling, caching, and architectural boundaries.
The Core Beam Search Mechanism Can Be Reduced to Parallel Retention and Continuous Pruning
AI Visual Insight: This diagram shows how Beam Search expands multiple candidate paths at each decoding step, then keeps only the top paths based on cumulative log probabilities. It clearly illustrates the core decision loop of local expansion plus global retention.
beam_width = 4
candidates = expand_all_beams(beams) # Expand candidate tokens for all parent beams
scored = score_with_cum_logprob(candidates) # Add current logprob to the historical cumulative score
beams = topk(scored, k=beam_width) # Keep only the highest-scoring beams
This code captures the core Beam Search loop: expansion, score accumulation, and pruning.
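For concreteness, here is a minimal runnable sketch of one decoding step in PyTorch; the tensors are placeholders rather than engine code, and the shapes are assumptions made for illustration.
import torch

beam_width, vocab_size = 4, 32000
cum_scores = torch.zeros(beam_width)  # Historical cumulative log-probabilities, one per beam
step_logprobs = torch.randn(beam_width, vocab_size).log_softmax(dim=-1)  # Placeholder current-step log-probs
scored = step_logprobs + cum_scores[:, None]  # Every beam expands over the full vocabulary
cum_scores, flat_idx = scored.flatten().topk(beam_width)  # Keep only the highest-scoring paths
parent_beam = flat_idx // vocab_size  # Which parent beam each surviving path extends
next_token = flat_idx % vocab_size  # Which token each surviving path appends
Real engines repeat exactly these three moves every step; the engineering question is how cheaply they can do so inside a serving system.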
The Differences Across Mainstream Inference Engines First Appear in Their Performance Scaling Curves
The original evaluation used Qwen3-1.7B, a single L20 GPU, and a ShareGPT subset. The conclusion is clear: at small beam widths, SGLang and TensorRT-LLM perform comparably; at large beam widths, SGLang's throughput degrades more gracefully and clearly outperforms vLLM.
AI Visual Insight: This chart compares QPS across engines as beam width increases from 10 to 400. SGLang shows a smoother decline, indicating that it controls request lifecycle overhead, KV reuse, and CPU-side object costs more effectively at large beam widths.
The Difficulty of Beam Search Boils Down to Four Engineering Tensions
- It must not pollute the main path for standard inference.
- It must not let the KV cache spiral out of control during fork and prune operations.
- It must not allow a single request to destabilize batch scheduling.
- It must not let scoring and pruning degrade into CPU-side object loops.
AI Visual Insight: This diagram breaks the engineering problem of Beam Search into four modules: main-path decoupling, KV cache management, request scheduling, and expansion/scoring/pruning. It shows that Beam Search is a cross-cutting systems concern rather than a single-operator optimization problem.
vLLM’s Older Design Treated Beam Search as a First-Class Kernel Capability
In early versions of vLLM, Beam Search was deeply embedded in the main execution path, with SequenceGroup as the core abstraction. The benefit was that fork, pruning, and early stopping could be handled directly inside the engine. The downside was that even ordinary requests with only one sequence still had to pay the extra cost of the group abstraction.
Its KV cache used reference counting plus Copy-on-Write. When prefixes were shared, it did not copy the physical KV data, but only shared block references. When a child sequence kept writing into a shared block, it triggered a real data copy.
if block.ref_count > 1: # Trigger CoW when writing to a shared block
new_block = alloc_block() # Allocate a new physical block
copy_kv(block, new_block) # Copy existing KV data
child.block_table[-1] = new_block # Redirect the child beam to the new block
This logic shows how the older vLLM architecture used CoW to balance sharing against write conflicts.
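The reference-counting half of that scheme can be sketched in a few lines; the structures below are illustrative stand-ins, not vLLM source.
from dataclasses import dataclass

@dataclass
class Block:
    ref_count: int = 1  # Number of sequences currently sharing this physical KV block

free_pool = []  # Blocks available for reallocation

def release_beam(block_table):
    # Called when a beam is pruned: drop one reference per block it used
    for block in block_table:
        block.ref_count -= 1
        if block.ref_count == 0:
            free_pool.append(block)  # Nobody shares it anymore, so reclaim it

shared, private = Block(ref_count=2), Block(ref_count=1)
release_beam([shared, private])  # Only the private block lands in free_pool
Sharing stays cheap as long as nobody writes; the CoW branch above handles the moment a shared block has to diverge.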
The Object-Oriented Execution Path Produces Noticeable Waste at Large Beam Widths
The issue is not algorithmic correctness, but execution shape: Python loops, .tolist() synchronization, deepcopy of sequence objects, and per-sequence block table copies all work against GPU-friendly vectorization.
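The contrast in execution shape can be made concrete with a hedged sketch; the per-beam loop below mimics the problematic pattern with invented names, it is not actual vLLM code.
import copy
import torch

beam_width, vocab_size = 4, 32000
logprobs = torch.randn(beam_width, vocab_size).log_softmax(dim=-1)  # Placeholder per-beam log-probs
cum_scores = torch.zeros(beam_width)
beams = [{"tokens": [], "block_table": [0]} for _ in range(beam_width)]

# Object-shaped path: per-beam Python work, host syncs, and deep copies
for i, beam in enumerate(beams):
    top_values = logprobs[i].topk(2 * beam_width).values.tolist()  # On a GPU tensor this forces a device sync per beam
    child = copy.deepcopy(beam)  # Per-beam object churn on the CPU
    child["block_table"] = list(beam["block_table"])  # Per-beam block table copy

# Tensor-shaped path: a single batched top-k over every beam at once
scores, flat_idx = (logprobs + cum_scores[:, None]).flatten().topk(beam_width)
Both shapes produce the same survivors; the difference is how much of the work stays on the GPU and how much turns into Python object traffic.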
AI Visual Insight: This figure compares the old and new Beam Search architectures in vLLM. The older version embeds Beam Search into the core engine, while the newer version lifts it to the entry layer. This reflects a community-wide shift in trade-offs between performance-first design and architectural simplicity.
vLLM’s Newer Design Achieves Decoupling but Sacrifices Significant Runtime Efficiency
After v0.6.2, vLLM moved Beam Search into entry-layer orchestration. Each beam is treated as an independent ordinary request, generates only one token per step, and then relies on the CPU side for cumulative scoring and selection.
This makes the kernel completely unaware of Beam Search and lowers maintenance cost. But the trade-off is that each step must repeat scheduling, reconstruct requests, rely on prefix caching to locate shared prefixes, and process candidates in CPU-side loops.
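The orchestration pattern can be sketched against a made-up one-token generation call; generate_one_token, Beam, and every other name below are assumptions standing in for the real vLLM entry-layer API.
import random
from dataclasses import dataclass

@dataclass
class Beam:
    token_ids: list
    cum_logprob: float = 0.0

def generate_one_token(token_ids, num_logprobs):
    # Stand-in for one engine call; returns {token_id: logprob} for the top candidates
    return {random.randrange(32000): -random.random() for _ in range(num_logprobs)}

beam_width = 4
beams = [Beam([1])]
for _ in range(16):  # Fixed number of decode steps for the sketch
    candidates = []
    for beam in beams:  # Each beam is re-submitted as an independent one-token request
        for tok, lp in generate_one_token(beam.token_ids, 2 * beam_width).items():
            candidates.append((beam.cum_logprob + lp, beam.token_ids + [tok]))
    candidates.sort(key=lambda c: c[0], reverse=True)  # Cumulative scoring happens on the CPU
    beams = [Beam(toks, score) for score, toks in candidates[:beam_width]]  # CPU-side pruning
Every iteration re-enters scheduling once per beam and rebuilds Python objects for every candidate, which is exactly the overhead the decoupled design trades for simplicity.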
Transformers Represents a Pure Tensorized Implementation Pattern
Transformers’ _beam_search() relies very little on object-oriented sequence management; instead, it performs score broadcasting, index reconstruction, and sequence updates directly on tensors. Its advantage is strong vectorization; its drawback is a lower level of abstraction and reduced readability.
log_probs = log_probs + running_beam_scores[:, :, None]  # Broadcast and add historical scores
topk_scores, topk_indices = log_probs.view(batch_size, -1).topk(2 * num_beams, dim=-1)  # Flatten beams × vocab and keep the top candidates
next_indices = topk_indices // vocab_size  # Recover the parent beam index
next_tokens = topk_indices % vocab_size  # Recover the token index
This code shows how a pure tensor implementation compresses Beam Search into broadcast and indexing operations.
TensorRT-LLM Uses a Two-Level Mapping Scheme for KV Cache Inheritance
TensorRT-LLM treats Beam Search as a formal capability inside the main execution path. Its key mechanism is not CoW, but cache_indirection. Each beam owns its own index slot, while the inheritance relationship for historical-step KV is described by a mapping table, and reads use two-level addressing.
AI Visual Insight: This diagram shows the three-layer coordination among physical blocks, the block offset table, and the cache indirection table. Writes locate blocks directly, expansion only updates beam mappings, and reads fetch KV through a two-step beam → source beam → block lookup.
This design avoids CoW when blocks are not yet full, but it increases mapping maintenance cost and can prolong GPU memory residency.
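A minimal sketch of that two-level read path, with shapes and names invented for illustration rather than taken from the TensorRT-LLM source:
import torch

num_beams, max_steps, block_size, head_dim = 4, 32, 16, 8
num_blocks = num_beams * (max_steps // block_size)
kv_blocks = torch.randn(num_blocks, block_size, head_dim)  # Physical KV storage
block_offsets = torch.arange(num_blocks).view(num_beams, -1)  # Per-beam block table: (beam, block slot) -> physical block
cache_indirection = torch.zeros(num_beams, max_steps, dtype=torch.long)  # (beam, step) -> source beam that owns that step's KV

def read_kv(beam, step):
    src_beam = cache_indirection[beam, step]  # First hop: beam -> source beam
    block = block_offsets[src_beam, step // block_size]  # Second hop: source beam -> physical block
    return kv_blocks[block, step % block_size]

# After pruning, a surviving child only rewrites its mapping row; no KV data moves
cache_indirection[1, :10] = cache_indirection[0, :10]  # Beam 1 now inherits beam 0's first 10 steps
kv_vec = read_kv(beam=1, step=3)  # Resolves through beam 1 -> source beam -> physical block
Expansion only touches the indirection row; the price is that every read pays the extra hop and the mapping tables must be kept current.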
SGLang’s Core Breakthrough Is Preserving Kernel-Level Capability While Switching to Data-Oriented Design
SGLang does not replicate the older vLLM object model. Instead, it splits Beam Search state into a combination of tensors and lightweight dataclasses, then isolates the logic into separate files through mixins.
Its strategy can be summarized like this: the main path adds only a small number of if is_beam_search entry checks, while the real complexity of expansion, pruning, KV copying, and response assembly is handled in mixins. This preserves main-path performance without deeply intruding into the standard inference path.
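Structurally, the split looks roughly like the sketch below; the class and method names are illustrative, not SGLang's actual symbols.
from dataclasses import dataclass

@dataclass
class Batch:
    is_beam_search: bool = False

class BeamSearchMixin:
    # Lives in its own file: expansion, pruning, KV index copy, and response assembly stay here
    def beam_step(self, batch, out):
        return out  # Placeholder for the beam-specific work

class Scheduler(BeamSearchMixin):
    def forward(self, batch):
        return batch  # Placeholder for the unchanged main decode path

    def step(self, batch):
        out = self.forward(batch)
        if batch.is_beam_search:  # One of the few entry checks added to the main path
            out = self.beam_step(batch, out)
        return out

Scheduler().step(Batch(is_beam_search=True))  # Beam requests take the mixin branch; normal requests never do
Ordinary requests never enter the mixin, so the main path keeps its original shape.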
SGLang Shares Prefix KV by Copying Indices and Frees Obsolete Blocks Through Set Difference
When execution transitions from prefill to decode, a request expands from 1 slot to beam_width slots. However, the expanded beams only copy index rows in req_to_token_pool; the physical KV data is never duplicated. After each pruning step, SGLang computes the set difference between the KV blocks used in the previous round and the KV blocks retained in the current round, which directly identifies the blocks that can be released.
AI Visual Insight: This figure shows how SGLang shares prefix KV during beam expansion by copying index rows, then identifies orphaned blocks after pruning through set difference. It highlights an implementation style that avoids reference counting and per-object reclamation.
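A minimal sketch of the expansion side, assuming req_to_token_pool is a slot-to-KV-index table (the slot variables and sizes are invented for the example):
import torch

max_tokens = 64
req_to_token_pool = torch.zeros(8, max_tokens, dtype=torch.long)  # Slot -> KV block index for each token position
parent_slot, prefix_len = 0, 10
req_to_token_pool[parent_slot, :prefix_len] = torch.arange(prefix_len)  # The prefill slot already holds its KV indices

beam_slots = torch.tensor([1, 2, 3, 4])  # The request expands from 1 slot to beam_width slots
req_to_token_pool[beam_slots, :prefix_len] = req_to_token_pool[parent_slot, :prefix_len]  # Copy indices, not KV data
The pruning side then reclaims orphaned blocks with the set difference shown below.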
uniques, counts = torch.unique(
torch.cat([last_beam_kv_indices, keep_kv_indices]),
return_counts=True,
)
free_kv_indices = uniques[counts == 1] # Present only in the previous round, so it was pruned
allocator.free(free_kv_indices)
This code reflects SGLang’s key optimization: replacing reference counting with set operations.
SGLang Also Raises Top-k Granularity from Per-Request to Once Per Batch
It first runs a single topk(max_k) over the logits for the entire batch of beams, then slices the result by request. After that, it runs a second top-k only on the reduced beam_width × 2*beam_width matrix. This avoids launching repeated kernels over a huge vocabulary for every request.
max_k = max(req.beam_candidates for req in batch.reqs)
top_logprobs_all, top_tokens_all = logprobs.topk(max_k, dim=1, sorted=True)  # topk returns values first, then indices
# Slice by request afterward, then run a smaller flat top-k
This logic directly explains why SGLang can sustain better throughput even at large beam widths.
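A hedged sketch of the per-request second stage; Req, beam_slice, cum_scores, and the placeholder tensors are all invented for the example.
import torch
from dataclasses import dataclass

@dataclass
class Req:
    beam_width: int
    beam_slice: slice  # Rows of the batched top-k result that belong to this request
    cum_scores: torch.Tensor  # Cumulative log-probs of the request's live beams

beam_width, vocab_size = 4, 32000
logprobs = torch.randn(beam_width, vocab_size).log_softmax(dim=-1)  # Placeholder: one request's beams
req = Req(beam_width, slice(0, beam_width), torch.zeros(beam_width))

max_k = 2 * beam_width
top_logprobs_all, top_tokens_all = logprobs.topk(max_k, dim=1, sorted=True)  # Stage one: a single batch-wide top-k
cand = top_logprobs_all[req.beam_slice] + req.cum_scores[:, None]  # This request's beam_width × 2*beam_width candidates
scores, flat_idx = cand.flatten().topk(req.beam_width)  # Stage two: a tiny top-k, nowhere near vocabulary size
The expensive kernel runs once for the whole batch; everything per-request happens on a matrix of a few dozen entries.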
This Design Ultimately Comes Down to Data Orientation Instead of Beam-Oriented Objects
SGLang offers a highly reusable lesson: place high-frequency numerical state into flat tensors, keep low-frequency semantic state in lightweight objects, and express relationships through index tables and set operations rather than pointer chains.
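As a hedged illustration of that split (field names invented for the example):
import torch
from dataclasses import dataclass

@dataclass
class BeamRequestMeta:
    rid: str  # Low-frequency semantic state stays in a lightweight object
    beam_width: int
    finished: bool = False

beam_width, max_tokens = 4, 64
cum_logprobs = torch.zeros(beam_width)  # High-frequency numeric state lives in flat tensors
token_table = torch.zeros(beam_width, max_tokens, dtype=torch.long)  # Relationships are index tables, not pointer chains
parent_index = torch.zeros(beam_width, dtype=torch.long)  # Parent links are plain indices into the previous step
Batch top-k, batch copy, and batch free then become single tensor operations over these flat structures.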
AI Visual Insight: This figure emphasizes the difference between data-oriented and object-oriented design. The former organizes state and relationships around access patterns, which makes batch top-k, batch copy, and batch free operations more efficient. The latter is easier to read, but introduces more CPU management overhead at large beam widths.
FAQ
1. Why is Beam Search not commonly used as the default decoding strategy in chat applications?
Because chat scenarios prioritize low-latency single-result output, while Beam Search introduces multi-path state management, KV inheritance, and extra pruning overhead. That increases both throughput cost and implementation complexity.
2. Why does SGLang tend to lead more easily at large beam widths?
Because it reduces object creation, lowers kernel launch counts, and handles KV cache through index copying and set-difference operations, which avoids heavy CPU-side orchestration and per-beam management.
3. What is the most important optimization principle when designing Beam Search?
Prioritize turning expansion, scoring, pruning, and KV management into batched tensor or indexing operations. The less you depend on Python loops and deep object hierarchies, the easier it is to achieve stable performance at large beam widths.
Core Summary: This article systematically breaks down the engineering challenges of Beam Search in LLM inference engines and compares the implementation strategies used by vLLM, TensorRT-LLM, Transformers, and SGLang. It focuses on how KV cache reuse, batch scheduling, pruning and scoring, and data-oriented design affect scalability and runtime performance.