This article focuses on communication overhead in large model training. It systematically explains the formulas, bottlenecks, and selection strategy for collective communication primitives, data parallelism, tensor parallelism, pipeline parallelism, and ZeRO, addressing a common engineering pain point: memory is sufficient, but training is still slow, and parallelism strategies are hard to quantify. Keywords: All-Reduce, ZeRO, Tensor Parallelism.
Technical Specification Snapshot
| Parameter | Details |
|---|---|
| Topic | Communication overhead in distributed LLM training |
| Language | Python / Markdown / Mathematical formulas |
| Core protocols/mechanisms | NCCL, All-Reduce, P2P, All-Gather, Reduce-Scatter |
| Core dependencies | PyTorch, Megatron-LM, DeepSpeed, GPipe |
| Intended audience | LLM training engineers, distributed systems developers, interview candidates |
This article provides a complete framework for calculating communication overhead
The core value of the original content is not a list of concepts. It establishes a unified calculation framework: express parameter-related communication in terms of Φ (the total parameter count), express activation-related communication in terms of b×s×h (batch size × sequence length × hidden size), and stack strategy combinations across the DP, TP, PP, and ZeRO dimensions.
In real-world engineering, communication issues usually appear later than memory issues. A model that runs successfully does not necessarily deliver reasonable throughput. Low GPU utilization and poor cross-node scaling usually point to a communication wall rather than insufficient compute.
Start by memorizing these four high-frequency formulas
# Data parallelism: synchronize gradients after backpropagation
C_dp = 2 * Phi # Per-GPU communication volume
# Tensor parallelism: per-layer communication volume
C_tp_layer = 8 * b * s * h # Per-layer volume in elements; multiply by bytes per element for bytes
# Pipeline parallelism: adjacent stages exchange activations and gradients
C_pp = 2 * (N - 1) * b * s * h
# ZeRO-3: parameter sharding introduces extra parameter gathering
C_zero3 = 3 * Phi
These four formulas cover most interview questions and training-plan estimation scenarios.
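As a quick sanity check, here is a hypothetical estimate for a 7B-parameter model; every number below is an illustrative assumption, not a measurement.
# Hypothetical example: 7B parameters, b=8, s=4096, h=4096, N=8 pipeline stages,
# fp16 payloads (2 bytes per element). All values are illustrative assumptions.
Phi = 7e9
b, s, h, N = 8, 4096, 4096, 8
bytes_per_elem = 2

c_dp    = 2 * Phi * bytes_per_elem                  # ~28 GB of gradient traffic per GPU per step
c_tp    = 8 * b * s * h * bytes_per_elem            # ~2.1 GB of activation traffic per layer
c_pp    = 2 * (N - 1) * b * s * h * bytes_per_elem  # ~3.8 GB across all stage boundaries
c_zero3 = 3 * Phi * bytes_per_elem                  # ~42 GB per GPU per step
print(f"DP {c_dp/1e9:.1f} GB, TP/layer {c_tp/1e9:.2f} GB, PP {c_pp/1e9:.2f} GB, ZeRO-3 {c_zero3/1e9:.1f} GB")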
Collective communication primitives are the building blocks of all parallel strategies
Distributed training is not mysterious at the systems level. At its core are eight collective communication primitives: Broadcast, Scatter, Gather, Reduce, All-Gather, Reduce-Scatter, All-Reduce, and All-to-All.
The most important one is All-Reduce because gradient synchronization in data parallelism depends on it. It can also be decomposed into Reduce-Scatter + All-Gather, which directly explains why ZeRO-1 and ZeRO-2 have the same communication volume as standard data parallelism.
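A minimal NumPy sketch (simulated workers, no real communication) makes the decomposition concrete: running Reduce-Scatter and then All-Gather leaves every worker with the same result as a single All-Reduce.
# Minimal sketch: simulate N "GPUs" as a list of gradient vectors and check that
# Reduce-Scatter followed by All-Gather reproduces the All-Reduce result.
import numpy as np

def simulated_all_reduce(grads):
    # Every worker ends up with the element-wise sum of all gradients
    return np.sum(grads, axis=0)

def simulated_reduce_scatter_all_gather(grads):
    n = len(grads)
    # Reduce-Scatter: worker i ends up owning the reduced i-th shard
    shards = [np.array_split(g, n) for g in grads]
    reduced = [np.sum([shards[w][i] for w in range(n)], axis=0) for i in range(n)]
    # All-Gather: every worker collects all reduced shards
    return np.concatenate(reduced)

grads = [np.random.randn(12) for _ in range(4)]  # 4 simulated workers
assert np.allclose(simulated_all_reduce(grads), simulated_reduce_scatter_all_gather(grads))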
You can remember the eight primitives like this
| Primitive | Pattern | Reduction | Typical scenario |
|---|---|---|---|
| Broadcast | 1→N | No | Parameter initialization |
| Scatter | 1→N | No | Data shard distribution |
| Gather | N→1 | No | Result collection |
| Reduce | N→1 | Yes | Aggregation on the primary node |
| All-Gather | N→N | No | Parameter reconstruction in ZeRO-3 |
| Reduce-Scatter | N→N | Yes | Sharded gradient aggregation in ZeRO |
| All-Reduce | N→N | Yes | Gradient synchronization in data parallelism |
| All-to-All | N→N | No | Expert exchange in MoE |
The communication bottleneck in data parallelism comes from gradient synchronization, not the forward pass
In data parallelism, each GPU keeps a full replica of the model. Data is partitioned, but parameters are not. The forward pass and backward pass run locally, and communication happens only during gradient synchronization. As a result, communication frequency is low, but each synchronization transfers a large amount of data.
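The following is a hedged sketch of that synchronization step using torch.distributed (it assumes a process group has already been initialized; real DDP additionally buckets gradients and overlaps these calls with the backward pass).
# Sketch of manual gradient synchronization in data parallelism.
# Assumes torch.distributed.init_process_group(...) has already been called.
import torch
import torch.distributed as dist

def sync_gradients(model):
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # All-Reduce sums the gradient across all replicas ...
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            # ... then divide to obtain the average gradient
            param.grad /= world_size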
A naive All-Reduce uses one primary node to collect every worker's data, reduce it, and broadcast the result. The total communication volume is roughly 2×N×Φ, and almost all of it flows through the primary node's link, which creates an obvious single-node bottleneck. In practice, engineers almost always replace it with Ring-All-Reduce.
Ring-All-Reduce matters because it balances the communication load
def ring_all_reduce_comm(Phi, N):
    # Reduce-Scatter phase: N-1 rounds, each sending Phi/N per GPU
    reduce_scatter = (N - 1) * Phi / N
    # All-Gather phase: also N-1 rounds of Phi/N per GPU
    all_gather = (N - 1) * Phi / N
    # Total per-GPU communication volume: 2 * (N - 1) * Phi / N
    return reduce_scatter + all_gather
This formula shows that the per-GPU communication volume is 2×(N-1)×Φ/N, which approaches 2Φ when N is large enough.
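Plugging in a few world sizes shows how quickly the per-GPU volume saturates at 2Φ.
# Per-GPU Ring-All-Reduce volume as a multiple of Phi for a few world sizes
for N in (2, 4, 8, 64, 512):
    print(N, round(2 * (N - 1) / N, 3))  # 1.0, 1.5, 1.75, 1.969, 1.996 (times Phi)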
Tensor parallelism communicates more frequently and scales with activation size
Tensor parallelism does not synchronize gradient replicas. Instead, it splits the parameter matrix inside a layer across multiple GPUs for joint computation. The trade-off is that communication no longer occurs only at the end of a training step. It becomes embedded in every forward and backward pass of each layer.
A typical Megatron-LM design splits the first FFN matrix by columns and the second by rows. As a result, the forward pass requires one All-Reduce to aggregate outputs, and the backward pass requires one All-Reduce to aggregate input gradients.
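A small NumPy sketch with toy shapes (and no activation function, purely to illustrate the arithmetic) shows why one sum, i.e. the forward All-Reduce, appears after the row-parallel matrix.
# Toy 2-way tensor-parallel FFN: Y = (X @ A) @ B
# A is split by columns, B by rows; each simulated "GPU" holds one shard of each.
import numpy as np

tokens, h, ffn = 4, 8, 16                   # flattened tokens, hidden size, FFN size
X = np.random.randn(tokens, h)
A = np.random.randn(h, ffn)
B = np.random.randn(ffn, h)

A0, A1 = np.split(A, 2, axis=1)             # column split of the first matrix
B0, B1 = np.split(B, 2, axis=0)             # row split of the second matrix

partial0 = (X @ A0) @ B0                    # GPU 0 computes its partial output
partial1 = (X @ A1) @ B1                    # GPU 1 computes its partial output
Y = partial0 + partial1                     # the forward All-Reduce: sum the partial outputs

assert np.allclose(Y, (X @ A) @ B)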
You can directly memorize the communication result for a single Transformer layer
def transformer_layer_comm(b, s, h, bytes_per_elem=2):
    # FFN block: 1 All-Reduce in forward + 1 in backward, each on a b*s*h activation,
    # and each All-Reduce moves roughly 2*b*s*h per GPU -> ~4*b*s*h elements
    c_ffn = 4 * b * s * h * bytes_per_elem
    # Attention block: structurally the same, another ~4*b*s*h elements
    c_attn = 4 * b * s * h * bytes_per_elem
    return c_ffn + c_attn  # Total per-layer TP communication, in bytes
The result is a per-layer communication volume of 8×b×s×h elements; the function converts this to bytes via bytes_per_elem. Functionally, this code estimates the activation communication cost that TP adds to every layer.
Pipeline parallelism reduces per-device memory pressure by splitting layers
Pipeline parallelism splits the model by layers across different GPUs. After the upstream stage completes its forward pass, it sends activations to the downstream stage. During backpropagation, gradients are sent back in the opposite direction. This communication pattern is not collective communication; it is point-to-point communication between adjacent stages.
Because of that, pipeline parallelism usually has more controllable communication overhead than tensor parallelism, and its formula is more direct: the total communication volume is 2×(N-1)×b×s×h, where N-1 is the number of partition boundaries.
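A hedged sketch of the stage boundary using torch.distributed point-to-point calls (it assumes the process group is initialized, that ranks map one-to-one to stages, and `get_microbatch` is a hypothetical data-loading helper):
# Sketch of activation hand-off between adjacent pipeline stages (point-to-point, not collective).
# Assumes torch.distributed.init_process_group(...) has been called.
import torch
import torch.distributed as dist

def forward_stage(stage_module, rank, num_stages, act_shape):
    if rank == 0:
        x = get_microbatch()               # hypothetical data-loading helper
    else:
        x = torch.empty(act_shape)
        dist.recv(x, src=rank - 1)         # receive activations from the upstream stage
    y = stage_module(x)
    if rank < num_stages - 1:
        dist.send(y, dst=rank + 1)         # send activations to the downstream stage
    return y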
GPipe addresses bubbles but does not change the total communication formula
GPipe splits a large batch into multiple micro-batches so different stages can work concurrently, reducing pipeline bubble time. It improves device utilization, not the total number of bytes transferred.
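The standard bubble estimate makes this concrete: with p stages and m micro-batches, the idle fraction is roughly (p-1)/(m+p-1), so increasing m shrinks the bubble while the bytes sent per sample stay the same.
# Pipeline bubble fraction: (p - 1) / (m + p - 1) for p stages and m micro-batches
def bubble_fraction(p, m):
    return (p - 1) / (m + p - 1)

print(bubble_fraction(p=4, m=1))   # 0.75  -> no micro-batching, the pipeline is mostly idle
print(bubble_fraction(p=4, m=16))  # ~0.16 -> micro-batches hide most of the bubble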
ZeRO shards redundant states instead of providing free memory savings
ZeRO-1 shards only optimizer states. ZeRO-2 also shards gradients. ZeRO-3 further shards parameters. The key difference is whether the full parameters must be temporarily reconstructed during the forward and backward passes.
That is why the per-GPU communication volume for ZeRO-1 and ZeRO-2 remains 2Φ: Reduce-Scatter + All-Gather is equivalent to one All-Reduce. Only ZeRO-3 increases communication to 3Φ because of the extra parameter gathering step.
The trade-offs across the three stages fit into one table
| Stage | Sharded objects | Per-GPU communication volume | Engineering characteristics |
|---|---|---|---|
| ZeRO-1 | Optimizer states | 2Φ | Low implementation overhead |
| ZeRO-2 | Optimizer states + gradients | 2Φ | Better memory savings |
| ZeRO-3 | Parameters + gradients + optimizer states | 3Φ | Lowest memory usage, but heavier communication |
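Those per-GPU volumes can be folded into a small helper for plan estimation; this is a sketch of the table above, not a DeepSpeed API.
# Per-GPU communication volume per step for each ZeRO stage, following the table above.
def zero_comm_volume(stage, Phi):
    # ZeRO-1/2: Reduce-Scatter + All-Gather is equivalent to one All-Reduce = 2*Phi
    # ZeRO-3: adds an extra All-Gather of sharded parameters = 3*Phi
    return {1: 2 * Phi, 2: 2 * Phi, 3: 3 * Phi}[stage]

print(zero_comm_volume(3, 7e9) / 1e9, "G elements per GPU per step")  # 21.0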
3D parallelism is the mainstream answer for training very large models in production
A single strategy rarely satisfies memory, throughput, and scalability requirements at the same time. Production systems usually adopt 3D parallelism: TP handles intra-layer partitioning on high-bandwidth intra-node links, PP handles inter-layer partitioning, and DP scales throughput through replicated workers.
At that point, communication must also be understood by dimension: TP communicates most frequently and should use NVLink whenever possible; PP transfers activations only between neighboring layers; DP has the lowest frequency and is best placed on cross-node networks.
A 3D parallelism estimation template
def comm_3d(Phi, b, s, h, d_tp, d_pp, bytes_per_elem=2):
    c_tp = 8 * b * s * h * bytes_per_elem               # TP: activation traffic per layer
    c_pp = 2 * (d_pp - 1) * b * s * h * bytes_per_elem  # PP: activations + gradients across stage boundaries
    c_dp = 2 * Phi / (d_tp * d_pp) * bytes_per_elem     # DP: gradient sync on each replica's parameter shard
    return c_tp, c_pp, c_dp
This code separates communication estimates for TP, PP, and DP, which makes topology mapping and bottleneck analysis easier.
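A hypothetical call for illustration (the values below are assumptions, and the TP figure is per layer, so multiply it by the number of layers each stage holds):
# Illustrative call: 7B parameters, b=8, s=4096, h=4096, TP=8 within a node, PP=4 across nodes
c_tp, c_pp, c_dp = comm_3d(Phi=7e9, b=8, s=4096, h=4096, d_tp=8, d_pp=4)
print(f"TP/layer {c_tp/1e9:.2f} GB, PP {c_pp/1e9:.2f} GB, DP {c_dp/1e9:.2f} GB")
# -> TP/layer ~2.15 GB (keep on NVLink), PP ~1.61 GB, DP ~0.88 GB (cross-node)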
Strategy selection should begin by locating whether the bottleneck is in parameters, activations, or interconnect bandwidth
If the model fits on a single GPU, start with data parallelism. If optimizer states and gradients dominate memory pressure, prioritize ZeRO-1 or ZeRO-2. If even a single layer does not fit, introduce tensor parallelism. Consider pipeline parallelism when you need layer-wise partitioning across nodes.
One practical rule is: TP within a node, PP across neighboring stages, DP across nodes. The more frequently a strategy communicates, the more it should occupy the highest-bandwidth links.
FAQ
Q1: Why is tensor parallelism often considered more bandwidth-hungry than data parallelism?
A: Because data parallelism usually synchronizes gradients only once after backpropagation in each step, while tensor parallelism communicates during both the forward and backward pass of every layer. Its communication frequency is much higher, and the payload is activations, which grow quickly with b, s, and h.
Q2: Is ZeRO-3 always better than ZeRO-2?
A: Not necessarily. ZeRO-3 delivers the largest memory savings, but it increases communication volume from 2Φ to 3Φ. If network bandwidth is limited or parameter gathering cannot be effectively hidden, actual throughput may be worse than ZeRO-2.
Q3: How would you explain Ring-All-Reduce in one sentence in an interview?
A: It splits All-Reduce into two phases, Reduce-Scatter and All-Gather, and uses ring-based transfers so every GPU shares the communication load evenly, eliminating the primary-node bottleneck.
References and project entry points
The original content comes from the AI-Compass series and serves as a unified knowledge base for LLM training, distributed systems, and interview preparation.
- GitHub: https://github.com/tingaicompass/AI-Compass
- Gitee: https://gitee.com/tingaicompass/ai-compass