LLM Training Communication Overhead Explained: All-Reduce, Tensor Parallelism, Pipeline Parallelism, and ZeRO Formula Cheat Sheet

This article focuses on communication overhead in large model training. It systematically explains the formulas, bottlenecks, and selection strategy for collective communication primitives, data parallelism, tensor parallelism, pipeline parallelism, and ZeRO, addressing a common engineering pain point: memory is sufficient, but training is still slow, and parallelism strategies are hard to quantify. Keywords: All-Reduce, ZeRO, Tensor Parallelism.

Technical Specification Snapshot

| Parameter | Details |
| --- | --- |
| Topic | Communication overhead in distributed LLM training |
| Language | Python / Markdown / Mathematical formulas |
| Core protocols/mechanisms | NCCL, All-Reduce, P2P, All-Gather, Reduce-Scatter |
| Star count | Not provided in the original AI-Compass repository text |
| Core dependencies | PyTorch, Megatron-LM, DeepSpeed, GPipe |
| Intended audience | LLM training engineers, distributed systems developers, interview candidates |

This article provides a complete framework for calculating communication overhead

The core value of the original content is not a list of concepts. It establishes a unified calculation framework: use Φ for parameter-related communication, use b×s×h for activation-related communication, and stack strategy combinations across the DP, TP, PP, and ZeRO dimensions.

In real-world engineering, communication issues usually appear later than memory issues. A model that runs successfully does not necessarily deliver reasonable throughput. Low GPU utilization and poor cross-node scaling usually point to a communication wall rather than insufficient compute.

Start by memorizing these four high-frequency formulas

# Data parallelism: synchronize gradients after backpropagation
C_dp = 2 * Phi  # Per-GPU communication volume

# Tensor parallelism: per-layer communication volume
C_tp_layer = 8 * b * s * h  # In elements; multiply by bytes per element for byte volume

# Pipeline parallelism: adjacent stages exchange activations and gradients
C_pp = 2 * (N - 1) * b * s * h

# ZeRO-3: parameter sharding introduces extra parameter gathering
C_zero3 = 3 * Phi

These four formulas cover most interview questions and training-plan estimation scenarios.
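
As a quick sanity check, here is the cheat sheet applied to assumed values (a 7B-parameter model with b=8, s=2048, h=4096 and 4 pipeline stages; none of these numbers come from the original), just to get the orders of magnitude:

# Hypothetical example: 7B-parameter model, b=8, s=2048, h=4096, 4 pipeline stages
Phi = 7e9
b, s, h = 8, 2048, 4096
N_pp = 4

c_dp = 2 * Phi                        # ~14e9 elements of gradient sync per GPU per step
c_tp_layer = 8 * b * s * h            # ~0.54e9 elements per Transformer layer
c_pp = 2 * (N_pp - 1) * b * s * h     # ~0.40e9 elements across all stage boundaries
c_zero3 = 3 * Phi                     # ~21e9 elements per GPU per step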

Collective communication primitives are the building blocks of all parallel strategies

Distributed training is not mysterious at the systems level. At its core are eight collective communication primitives: Broadcast, Scatter, Gather, Reduce, All-Gather, Reduce-Scatter, All-Reduce, and All-to-All.

The most important one is All-Reduce because gradient synchronization in data parallelism depends on it. It can also be decomposed into Reduce-Scatter + All-Gather, which directly explains why ZeRO-1 and ZeRO-2 have the same communication volume as standard data parallelism.
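
To make that decomposition tangible, here is a single-process NumPy simulation (purely illustrative; real frameworks perform this across GPUs via NCCL) showing that Reduce-Scatter followed by All-Gather produces the same result as one All-Reduce:

import numpy as np

def simulated_all_reduce(grads):
    # Reference result: every GPU ends up with the elementwise sum of all gradients
    total = np.sum(grads, axis=0)
    return [total for _ in grads]

def simulated_reduce_scatter_all_gather(grads):
    N = len(grads)
    # Reduce-Scatter: rank i ends up owning the reduced i-th chunk
    chunks = [np.array_split(g, N) for g in grads]
    reduced = [np.sum([chunks[r][i] for r in range(N)], axis=0) for i in range(N)]
    # All-Gather: every rank reassembles the full reduced vector from the owned chunks
    full = np.concatenate(reduced)
    return [full for _ in range(N)]

grads = [np.random.rand(12) for _ in range(4)]
a = simulated_all_reduce(grads)
b = simulated_reduce_scatter_all_gather(grads)
assert all(np.allclose(x, y) for x, y in zip(a, b))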

You can remember the eight primitives like this

| Primitive | Pattern | Reduction | Typical scenario |
| --- | --- | --- | --- |
| Broadcast | 1→N | No | Parameter initialization |
| Scatter | 1→N | No | Data shard distribution |
| Gather | N→1 | No | Result collection |
| Reduce | N→1 | Yes | Aggregation on the primary node |
| All-Gather | N→N | No | Parameter reconstruction in ZeRO-3 |
| Reduce-Scatter | N→N | Yes | Sharded gradient aggregation in ZeRO |
| All-Reduce | N→N | Yes | Gradient synchronization in data parallelism |
| All-to-All | N→N | No | Expert exchange in MoE |

The communication bottleneck in data parallelism comes from gradient synchronization, not the forward pass

In data parallelism, each GPU keeps a full replica of the model. Data is partitioned, but parameters are not. The forward pass and backward pass run locally, and communication happens only during gradient synchronization. As a result, communication frequency is low, but each synchronization transfers a large amount of data.
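
A minimal sketch of where that single synchronization happens, assuming a torch.distributed process group has already been initialized (the helper name and setup are mine, not from the original):

import torch.distributed as dist

def sync_gradients(model, world_size):
    # One All-Reduce per parameter tensor after backward(); ~2*Phi total per GPU per step
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size  # SUM followed by divide = mean of the replicas' gradients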

A naive All-Reduce uses one node to collect data and then broadcast the result. The total volume is roughly 2×N×Φ, and all of it flows through that single coordinating node, which becomes an obvious bottleneck. In practice, engineers almost always replace it with Ring-All-Reduce.

Ring-All-Reduce matters because it balances the communication load

def ring_all_reduce_comm(Phi, N):
    # Reduce-Scatter phase: N-1 rounds, Phi/N per round
    reduce_scatter = (N - 1) * Phi / N
    # All-Gather phase: also N-1 rounds, Phi/N per round
    all_gather = (N - 1) * Phi / N
    return reduce_scatter + all_gather  # Total per-GPU communication volume: 2*(N-1)*Phi/N

This formula shows that the per-GPU communication volume is 2×(N-1)×Φ/N, which approaches 2Φ when N is large enough.
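
Plugging in a few world sizes with Φ normalized to 1 shows how quickly the per-GPU volume converges:

for N in (2, 8, 64):
    print(N, ring_all_reduce_comm(Phi=1.0, N=N))  # 1.0, 1.75, ~1.97: approaches 2*Phi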

Tensor parallelism communicates more frequently and scales with activation size

Tensor parallelism does not synchronize gradient replicas. Instead, it splits the parameter matrix inside a layer across multiple GPUs for joint computation. The trade-off is that communication no longer occurs only at the end of a training step. It becomes embedded in every forward and backward pass of each layer.

A typical Megatron-LM design splits the first FFN matrix by columns and the second by rows. As a result, the forward pass requires one All-Reduce to aggregate outputs, and the backward pass requires one All-Reduce to aggregate input gradients.
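
Megatron-LM implements this inside fused parallel linear layers; the NumPy sketch below only simulates the math on one process (shapes and the ReLU-for-GeLU substitution are assumptions) to show why a single aggregation in the forward pass suffices:

import numpy as np

def column_row_parallel_ffn(x, W1, W2, tp=2):
    # W1 split by columns, W2 split by rows; rank r computes a partial output on its shard pair
    W1_shards = np.split(W1, tp, axis=1)
    W2_shards = np.split(W2, tp, axis=0)
    partials = []
    for rank in range(tp):
        y = np.maximum(x @ W1_shards[rank], 0.0)   # local GEMM + elementwise activation (ReLU stands in for GeLU)
        partials.append(y @ W2_shards[rank])       # local GEMM yields a partial sum of the layer output
    return np.sum(partials, axis=0)                # the single forward All-Reduce: sum the partial outputs

h = 8
x = np.random.rand(4, h)
W1, W2 = np.random.rand(h, 4 * h), np.random.rand(4 * h, h)
assert np.allclose(column_row_parallel_ffn(x, W1, W2), np.maximum(x @ W1, 0.0) @ W2)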

You can directly memorize the communication result for a single Transformer layer

def transformer_layer_comm(b, s, h, bytes_per_elem=2):
    # FFN: 1 All-Reduce in forward + 1 in backward, each moving ~2*b*s*h elements (ring)
    c_ffn = 4 * b * s * h * bytes_per_elem
    # Attention: same structure, so the same order of magnitude
    c_attn = 4 * b * s * h * bytes_per_elem
    return c_ffn + c_attn  # Total communication volume for one layer

The result is a per-layer communication volume of 8×b×s×h. Functionally, this code estimates the activation communication cost introduced by TP at each layer.
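
For a feel of the magnitude, plugging in assumed values (b=8, s=2048, h=4096, fp16 activations):

# Assumed values: b=8, s=2048, h=4096, fp16 activations (2 bytes per element)
per_layer_bytes = transformer_layer_comm(b=8, s=2048, h=4096)
print(per_layer_bytes / 2**30, "GiB of TP traffic per layer per training step")  # ~1 GiB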

Pipeline parallelism reduces per-device memory pressure by splitting layers

Pipeline parallelism splits the model by layers across different GPUs. After the upstream stage completes its forward pass, it sends activations to the downstream stage. During backpropagation, gradients are sent back in the opposite direction. This communication pattern is not collective communication; it is point-to-point communication between adjacent stages.

Because of that, pipeline parallelism usually has more controllable communication overhead than tensor parallelism, and its formula is more direct: the total communication volume is 2×(N-1)×b×s×h, where N-1 is the number of partition boundaries.
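
A minimal sketch of the forward-direction boundary exchange with blocking point-to-point calls, assuming every stage produces activations of the same shape and a torch.distributed group is already set up (the helper is illustrative, not a framework API):

import torch
import torch.distributed as dist

def forward_boundary_exchange(activations, rank, world_size):
    # Each stage sends its output downstream, then (except stage 0) receives from upstream.
    # Assumes all stages produce activations of the same [b, s, h] shape.
    if rank < world_size - 1:
        dist.send(activations, dst=rank + 1)
    if rank > 0:
        recv_buf = torch.empty_like(activations)
        dist.recv(recv_buf, src=rank - 1)
        return recv_buf
    return activations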

GPipe addresses bubbles but does not change the total communication formula

GPipe splits a large batch into multiple micro-batches so different stages can work concurrently, reducing pipeline bubble time. It improves device utilization, not the total number of bytes transferred.
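
The benefit is easiest to see through the commonly cited bubble-fraction estimate (p−1)/(m+p−1), where p is the number of stages and m the number of micro-batches; the values below are illustrative:

def bubble_fraction(p, m):
    # Idle fraction of an ideal GPipe schedule with p stages and m micro-batches
    return (p - 1) / (m + p - 1)

print(bubble_fraction(p=4, m=4))   # ~0.43: too few micro-batches, large bubble
print(bubble_fraction(p=4, m=32))  # ~0.09: more micro-batches shrink the bubble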

ZeRO shards redundant states instead of providing free memory savings

ZeRO-1 shards only optimizer states. ZeRO-2 also shards gradients. ZeRO-3 further shards parameters. The key difference is whether the full parameters must be temporarily reconstructed during the forward and backward passes.

That is why the per-GPU communication volume for ZeRO-1 and ZeRO-2 remains 2Φ: Reduce-Scatter + All-Gather is equivalent to one All-Reduce. Only ZeRO-3 increases communication to 3Φ because of the extra parameter gathering step.

The trade-offs across the three stages fit into one table

| Stage | Sharded objects | Per-GPU communication volume | Engineering characteristics |
| --- | --- | --- | --- |
| ZeRO-1 | Optimizer states | 2Φ | Low implementation overhead |
| ZeRO-2 | Optimizer states + gradients | 2Φ | Better memory savings |
| ZeRO-3 | Parameters + gradients + optimizer states | 3Φ | Lowest memory usage, but heavier communication |
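
To connect the memory side with these communication numbers, here is a per-GPU memory estimate following the standard mixed-precision Adam breakdown from the ZeRO paper (2 bytes of fp16 parameters, 2 bytes of fp16 gradients, 12 bytes of fp32 optimizer states per parameter); the 7B/64-GPU figures are assumptions:

def zero_memory_per_gpu(Phi, N, stage):
    # Per-parameter bytes with mixed-precision Adam: 2 (fp16 params) + 2 (fp16 grads) + 12 (fp32 states)
    params, grads, optim = 2 * Phi, 2 * Phi, 12 * Phi
    if stage == 1:
        return params + grads + optim / N          # only optimizer states are sharded
    if stage == 2:
        return params + (grads + optim) / N        # gradients are sharded too
    if stage == 3:
        return (params + grads + optim) / N        # everything is sharded
    return params + grads + optim                  # stage 0: plain data parallelism

for stage in (0, 1, 2, 3):
    print(f"ZeRO-{stage}: {zero_memory_per_gpu(Phi=7e9, N=64, stage=stage) / 1e9:.1f} GB per GPU")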

3D parallelism is the mainstream answer for training very large models in production

A single strategy rarely satisfies memory, throughput, and scalability requirements at the same time. Production systems usually adopt 3D parallelism: TP handles intra-layer partitioning on high-bandwidth intra-node links, PP handles inter-layer partitioning, and DP scales throughput through replicated workers.

At that point, communication must also be understood by dimension: TP communicates most frequently and should sit on NVLink whenever possible; PP transfers activations only between neighboring stages; DP communicates least frequently and therefore tolerates slower cross-node links best.

A 3D parallelism estimation template

def comm_3d(Phi, b, s, h, d_tp, d_pp, bytes_per_elem=2):
    c_tp = 8 * b * s * h * bytes_per_elem            # TP communication per layer
    c_pp = 2 * (d_pp - 1) * b * s * h * bytes_per_elem
    c_dp = 2 * Phi / (d_tp * d_pp) * bytes_per_elem  # Gradient synchronization within the DP group
    return c_tp, c_pp, c_dp

This code separates communication estimates for TP, PP, and DP, which makes topology mapping and bottleneck analysis easier.
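
With assumed values (7B parameters, b=8, s=2048, h=4096, TP=8, PP=4, fp16), the split looks like this:

# Assumed configuration: 7B parameters, b=8, s=2048, h=4096, TP=8, PP=4, fp16 (2 bytes/element)
c_tp, c_pp, c_dp = comm_3d(Phi=7e9, b=8, s=2048, h=4096, d_tp=8, d_pp=4)
print(f"TP: {c_tp / 1e9:.2f} GB per layer, best kept on NVLink")
print(f"PP: {c_pp / 1e9:.2f} GB across stage boundaries")
print(f"DP: {c_dp / 1e9:.2f} GB of gradient sync per step, tolerable on the cross-node network")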

Strategy selection should begin by locating whether the bottleneck is in parameters, activations, or interconnect bandwidth

If the model fits on a single GPU, start with data parallelism. If optimizer states and gradients dominate memory pressure, prioritize ZeRO-1 or ZeRO-2. If even a single layer does not fit, introduce tensor parallelism. Consider pipeline parallelism when you need layer-wise partitioning across nodes.

One practical rule is: TP within a node, PP across neighboring stages, DP across nodes. The more frequently a strategy communicates, the more it should occupy the highest-bandwidth links.
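
The selection order above can be compressed into a rough helper; the inputs and returned labels are illustrative, not a prescriptive API:

def pick_strategy(model_fits_on_gpu, optim_states_dominate, layer_fits_on_gpu):
    # Mirrors the selection order described above; purely illustrative labels
    if model_fits_on_gpu:
        return "ZeRO-1/2 + DP" if optim_states_dominate else "plain DP"
    if not layer_fits_on_gpu:
        return "TP within a node, then add PP/DP as the model and cluster grow"
    return "PP across nodes + DP (consider ZeRO on top for optimizer states)"

print(pick_strategy(model_fits_on_gpu=False, optim_states_dominate=True, layer_fits_on_gpu=True))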

FAQ

Q1: Why is tensor parallelism often considered more bandwidth-hungry than data parallelism?

A: Because data parallelism usually synchronizes gradients only once after backpropagation in each step, while tensor parallelism communicates during both the forward and backward pass of every layer. Its communication frequency is much higher, and the payload is activations, which grow quickly with b, s, and h.

Q2: Is ZeRO-3 always better than ZeRO-2?

A: Not necessarily. ZeRO-3 delivers the largest memory savings, but it increases communication volume from 2Φ to 3Φ. If network bandwidth is limited or parameter gathering cannot be effectively hidden, actual throughput may be worse than ZeRO-2.

Q3: How would you explain Ring-All-Reduce in one sentence in an interview?

A: It splits All-Reduce into two phases, Reduce-Scatter and All-Gather, and uses ring-based transfers so every GPU shares the communication load evenly, eliminating the primary-node bottleneck.

References and project entry points

The original content comes from the AI-Compass series and serves as a unified knowledge base for LLM training, distributed systems, and interview preparation.

AI Readability Summary: This article systematically reconstructs the knowledge framework for communication overhead in large model training. It covers the eight major collective communication primitives, as well as the core formulas, use cases, and engineering trade-offs for data parallelism, tensor parallelism, pipeline parallelism, ZeRO, and 3D parallelism, helping developers quickly answer interview questions and design training architectures.

AI Visual Insight: The article contains no training-topology, communication-path, or data-flow diagrams; its real visual value comes from the formula-based decomposition of communication into parameter traffic and activation traffic, which acts as a conceptual map for analyzing distributed training bottlenecks.