This article focuses on the hardware behavior of C++ atomic store, load, and RMW operations. It explains store buffers, memory ordering, and cross-architecture visibility to help developers correctly understand relaxed, acquire/release, and seq_cst. Keywords: concurrent programming, memory ordering, x86, ARM.
Technical Specification Snapshot
| Parameter | Description |
|---|---|
| Primary Languages | C++ / Assembly |
| Target Architectures | x86_64, ARM64 |
| Core Topics | store, load, RMW, memory order |
| Core Protocols / Models | TSO, ARMv8 Weak Memory Model, MESI |
| Core Dependencies | std::atomic, GCC 15.2, Compiler Explorer |
| Source Format | Principle-based article on concurrent programming |
This article shows that visibility matters more than execution for atomic operations
The easiest mistake in concurrent programming is to assume that once the CPU has executed a store, other cores must already be able to observe the result. In practice, many surprising behaviors happen not because an instruction failed to run, but because the write first entered the store buffer and has not yet reached the cache coherence system.
That is the starting point for understanding store, load, and RMW. A store may be delayed before it becomes globally visible, while an RMW operation usually completes the modification only after it has obtained exclusive ownership of the cache line, which gives it stronger semantics.
The store buffer explains why Store-Load reordering can happen
The store buffer is a CPU microarchitectural component, not part of the L1/L2/L3 cache hierarchy. To avoid stalling while waiting for cache-line ownership, a regular write often enters the buffer first and is flushed to the cache asynchronously. This is exactly why a write may have happened locally while remaining invisible to other cores.
```cpp
std::atomic<int> x{0}, y{0};
int a = 0, b = 0;

// Thread 1
x.store(1, std::memory_order_relaxed); // Write x first; it may only enter the store buffer
b = y.load(std::memory_order_relaxed); // Then read y; this may complete before the write is globally visible

// Thread 2
y.store(1, std::memory_order_relaxed); // Write y first; it may only enter the store buffer
a = x.load(std::memory_order_relaxed); // Then read x; it may still observe the old value
```
This code shows that both threads may end up with a == 0 and b == 0: each thread's store can still be sitting in its core's store buffer when the other thread's load executes. The root cause is not a mistake in source-level ordering, but the fact that the hardware allows Store-Load reordering.
RMW operations are more than read and write-back because they are transactions on an exclusive cache line
For RMW operations such as fetch_add, the key difference from a regular load is not whether the operation can read the latest value. The real difference is whether it prevents other cores from concurrently modifying the same cache line.
A regular load only observes state. An RMW operation issues an RFO request, asks for exclusive ownership, and holds the relevant cache line until the full read-modify-write sequence completes. That is why its synchronization impact is far greater than that of a normal read.
The difference between RMW and store appears in whether delayed visibility is allowed
If a regular store has not yet obtained ownership of the cache line, the data may remain temporarily in the store buffer. An RMW operation, by contrast, performs the modification during the exclusive-ownership phase, so the write lands directly on the cache line and becomes globally visible when the operation completes.
```cpp
std::atomic<long long> data{0};
auto old = data.fetch_add(1, std::memory_order_acq_rel); // Atomic read-modify-write; the whole operation is indivisible
```
The value of this line is not just that it increments by one. It guarantees that reading the old value, computing the new value, and publishing the written result take effect externally as one atomic transaction.
The x86_64 architecture naturally provides relatively strong ordering guarantees
x86_64 uses TSO. It forbids Load-Load, Load-Store, and Store-Store reordering, but it allows Store-Load reordering. As a result, many release and acquire semantics can be implemented in hardware with an ordinary mov.
This explains why store(relaxed) and store(release) often both compile to movq, while load(relaxed) and load(acquire) also often compile to movq. The main difference is usually not in the CPU, but in whether the compiler may reorder source statements.
Identical assembly on x86 does not imply identical semantics
```cpp
int data = 0;
std::atomic<int> flag{0};

void producer_release() {
    data = 42;                                 // Regular write; it must stay before the release store
    flag.store(1, std::memory_order_release);  // Publish the data
}

void producer_relaxed() {
    data = 42;                                 // Regular write
    flag.store(1, std::memory_order_relaxed);  // The compiler may apply weaker ordering constraints
}
```
This example shows that even if both cases eventually lower to mov, release still constrains the compiler, while relaxed may allow independent statements to move earlier or later.
seq_cst on x86 requires an extra mechanism to close the Store-Load gap
seq_cst requires more than acquire or release semantics. It also requires that all seq_cst atomic operations participate in one single global order. The native ordering of x86 is not always sufficient to guarantee this automatically, because Store-Load reordering may still break that total order.
For that reason, store(seq_cst) is often compiled into xchgq with implicit lock semantics. This acts as a full memory barrier, drains the store buffer, and prevents both forward and backward crossings. By contrast, load(seq_cst) often remains a mov, because placing more of the cost on the write side is usually more efficient.
A full memory barrier is a bidirectional barrier
```asm
lock xadd qword ptr [rdi], rax ; Lock and atomically add, while also forming a full barrier
```
This class of instruction guarantees that earlier reads and writes cannot move past it, and later reads and writes cannot move ahead of it. That makes it suitable for establishing one single total order.
The ARM64 architecture expresses memory ordering through dedicated acquire/release instructions
ARM64 uses a weak memory model and does not provide strong default ordering in the same way as x86. store(relaxed) and load(relaxed) usually map to str and ldr; store(release) maps to stlr, and load(acquire) maps to ldar.
More importantly, the stlr + ldar combination in ARMv8, together with multi-copy atomicity, is sufficient in many scenarios to support the single total order required by seq_cst. This is a key capability in the ARMv8 memory model design.
RMW on ARM relies on LL/SC loops instead of a lock prefix
```asm
.Lretry:
    ldxr x2, [x1]      // Exclusively load the old value
    add  x3, x2, #1    // Compute the new value
    stxr w4, x3, [x1]  // Attempt the exclusive store; w4 is non-zero on failure
    cbnz w4, .Lretry   // Retry if a concurrent conflict occurred
```
This assembly sequence shows how ARM implements atomic addition: it completes the RMW through an exclusive load/store loop rather than depending on a single locked instruction.
Choosing a memory order is fundamentally about choosing the boundary of constraints
relaxed gives you atomicity without ordering. release/acquire is for publish-subscribe patterns. seq_cst is usually the easiest model for cross-thread reasoning, but it typically costs more.
If you only need to increment a counter and do not depend on ordering relationships, relaxed is often enough. If you need to publish that object initialization has completed, use release together with acquire. If the system contains complex cross-thread observation relationships, seq_cst is the safer choice.
FAQ
1. Why do release store and relaxed store on x86 often generate the same mov instruction?
Because x86 TSO already forbids Store-Store and Load-Store reordering, the hardware naturally satisfies the ordering required by release semantics. However, the compiler-level behavior still differs: release prevents source-level reordering.
2. Why is RMW “heavier” than a normal load/store?
Because RMW is not just a read followed by a write-back. It must obtain exclusive ownership of the cache line and guarantee that no other core can interleave modifications into the full read-modify-write sequence. It therefore usually carries stronger synchronization semantics.
3. When should you prefer seq_cst?
When you cannot confidently prove cross-thread observation order, or when the system contains a complex reasoning chain across multiple atomic variables, seq_cst can significantly reduce cognitive overhead, at the cost of higher synchronization expense.
Core Summary
This article reconstructs the behavior of store, load, and RMW in C++ concurrency by focusing on store buffers, Store-Load reordering, the difference between x86 TSO and ARM weak memory models, and the actual assembly-level realization of seq_cst, acquire, and release.