This article focuses on the hardware behavior of C++ atomic store, load, and RMW operations. It explains store buffers, memory ordering, and cross-architecture visibility to help developers correctly understand relaxed, acquire/release, and seq_cst. Keywords: concurrent programming, memory ordering, x86, ARM.
Technical Specification Snapshot
| Parameter | Description |
|---|---|
| Primary Languages | C++ / Assembly |
| Target Architectures | x86_64, ARM64 |
| Core Topics | store, load, RMW, memory order |
| Core Protocols / Models | TSO, ARMv8 Weak Memory Model, MESI |
| Core Dependencies | std::atomic, GCC 15.2, Compiler Explorer |
| Source Format | Principle-based article on concurrent programming |
This article shows that visibility matters more than execution for atomic operations
The easiest mistake in concurrent programming is to assume that once the CPU has executed a store, other cores must already be able to observe the result. In practice, many surprising behaviors happen not because an instruction failed to run, but because the write first entered the store buffer and has not yet reached the cache coherence system.
That is the starting point for understanding store, load, and RMW. A store may be delayed before it becomes globally visible, while an RMW operation usually completes the modification only after it has obtained exclusive ownership of the cache line, which gives it stronger semantics.
The store buffer explains why Store-Load reordering can happen
The store buffer is a CPU microarchitectural component, not part of the L1/L2/L3 cache hierarchy. To avoid stalling while waiting for cache-line ownership, a regular write often enters the buffer first and is flushed to the cache asynchronously. This is exactly why a write may have happened locally while remaining invisible to other cores.
```cpp
std::atomic<int> x{0}, y{0};
int a = 0, b = 0;

// Thread 1
x.store(1, std::memory_order_relaxed); // Write x first; it may only enter the store buffer
b = y.load(std::memory_order_relaxed); // Then read y; this may complete before the write is globally visible

// Thread 2
y.store(1, std::memory_order_relaxed); // Write y first; it may only enter the store buffer
a = x.load(std::memory_order_relaxed); // Then read x; it may still observe the old value
```
This code shows that both threads may end up with a == 0 and b == 0: each thread's store can still be sitting in its core's store buffer when the other thread's load executes. The root cause is not a mistake in source-level ordering, but the fact that the hardware allows Store-Load reordering.
RMW operations are more than read and write-back because they are transactions on an exclusive cache line
For RMW operations such as fetch_add, the key difference from a regular load is not whether the operation can read the latest value. The real difference is whether it prevents other cores from concurrently modifying the same cache line.
A regular load only observes state. An RMW operation issues an RFO request, asks for exclusive ownership, and holds the relevant cache line until the full read-modify-write sequence completes. That is why its synchronization impact is far greater than that of a normal read.
The difference between RMW and store appears in whether delayed visibility is allowed
If a regular store has not yet obtained ownership of the cache line, the data may remain temporarily in the store buffer. An RMW operation, by contrast, performs the modification during the exclusive-ownership phase, so the write lands directly on the cache line and becomes globally visible when the operation completes.
```cpp
std::atomic<long long> data{0};
auto old = data.fetch_add(1, std::memory_order_acq_rel); // Atomic read-modify-write; the whole operation is indivisible
```
The value of this line is not just that it increments by one. It guarantees that reading the old value, computing the new value, and publishing the written result take effect externally as one atomic transaction.
The x86_64 architecture naturally provides relatively strong ordering guarantees
x86_64 uses TSO. It forbids Load-Load, Load-Store, and Store-Store reordering, but it allows Store-Load reordering. As a result, many release and acquire semantics can be implemented in hardware with an ordinary mov.
This explains why store(relaxed) and store(release) often both compile to movq, while load(relaxed) and load(acquire) also often compile to movq. The main difference is usually not in the CPU, but in whether the compiler may reorder source statements.
Identical assembly on x86 does not imply identical semantics
```cpp
int data = 0;
std::atomic<int> flag{0};

void producer_release() {
    data = 42;                                 // Regular write; it must stay before the release store
    flag.store(1, std::memory_order_release);  // Publish the data
}

void producer_relaxed() {
    data = 42;                                 // Regular write
    flag.store(1, std::memory_order_relaxed);  // The compiler may apply weaker ordering constraints
}
```
This example shows that even if both cases eventually lower to mov, release still constrains the compiler, while relaxed may allow independent statements to move earlier or later.
seq_cst on x86 requires an extra mechanism to close the Store-Load gap
seq_cst requires more than acquire or release semantics. It also requires that all seq_cst atomic operations participate in one single global order. The native ordering of x86 is not always sufficient to guarantee this automatically, because Store-Load reordering may still break that total order.
For that reason, store(seq_cst) is often compiled into xchgq with implicit lock semantics. This acts as a full memory barrier, drains the store buffer, and prevents both forward and backward crossings. By contrast, load(seq_cst) often remains a mov, because placing more of the cost on the write side is usually more efficient.
A full memory barrier is a bidirectional barrier
```asm
lock xadd qword ptr [rdi], rax ; Lock and atomically add, while also forming a full barrier
```
This class of instruction guarantees that earlier reads and writes cannot move past it, and later reads and writes cannot move ahead of it. That makes it suitable for establishing one single total order.
The ARM64 architecture expresses memory ordering through dedicated acquire/release instructions
ARM64 uses a weak memory model and does not provide strong default ordering in the same way as x86. store(relaxed) and load(relaxed) usually map to str and ldr; store(release) maps to stlr, and load(acquire) maps to ldar.
More importantly, the stlr + ldar combination in ARMv8, together with multi-copy atomicity, is sufficient in many scenarios to support the single total order required by seq_cst. This is a key capability in the ARMv8 memory model design.
RMW on ARM relies on LL/SC loops instead of a lock prefix
```asm
.Lretry:
    ldxr x2, [x1]      // Exclusively load the old value
    add  x3, x2, #1    // Compute the new value
    stxr w4, x3, [x1]  // Attempt the exclusive store; w4 is non-zero on failure
    cbnz w4, .Lretry   // Retry if a concurrent conflict occurred
```
This assembly sequence shows how ARM implements atomic addition: it completes the RMW through an exclusive load/store loop rather than depending on a single locked instruction.
Choosing a memory order is fundamentally about choosing the boundary of constraints
relaxed gives you atomicity without ordering. release/acquire is for publish-subscribe patterns. seq_cst is usually the easiest model for cross-thread reasoning, but it typically costs more.
If you only need to increment a counter and do not depend on ordering relationships, relaxed is often enough. If you need to publish that object initialization has completed, use release together with acquire. If the system contains complex cross-thread observation relationships, seq_cst is the safer choice.
FAQ
1. Why do release store and relaxed store on x86 often generate the same mov instruction?
Because x86 TSO already forbids Store-Store and Load-Store reordering, the hardware naturally satisfies the ordering required by release semantics. However, the compiler-level behavior still differs: release prevents source-level reordering.
2. Why is RMW “heavier” than a normal load/store?
Because RMW is not just a read followed by a write-back. It must obtain exclusive ownership of the cache line and guarantee that no other core can interleave modifications into the full read-modify-write sequence. It therefore usually carries stronger synchronization semantics.
3. When should you prefer seq_cst?
When you cannot confidently prove cross-thread observation order, or when the system contains a complex reasoning chain across multiple atomic variables, seq_cst can significantly reduce cognitive overhead, at the cost of higher synchronization expense.
Core Summary
This article reconstructs the behavior of store, load, and RMW in C++ concurrency by focusing on store buffers, Store-Load reordering, the difference between x86 TSO and ARM weak memory models, and the actual assembly-level realization of seq_cst, acquire, and release.