High-Throughput In-Memory Reconciliation System Design: State Machines, Snapshots, and WAL-Style Replay

For high-volume reconciliation workloads, an in-memory architecture replaces heavy SQL and heavyweight state frameworks with a model built on “memory as the state machine,” single-threaded sequential progression, and replayable file-based recovery. This approach significantly improves throughput while reducing engineering complexity. It addresses three core pain points: slow JOINs, fragmented state, and false positives across reconciliation cycles. Keywords: in-memory reconciliation, state machine, WAL.

Technical Specifications Snapshot

Domain: Reconciliation system architecture design
Implementation Model: In-memory state machine + periodic snapshots + file replay
Primary Languages: Java (in this article), SQL
Key Protocols / Inputs: FTP, HTTP, file import, database export
State Storage: In-process memory
Persistence Model: Snapshots, plus database persistence for exception-only unmatched records
Recovery Mechanism: Latest snapshot + reconciliation file replay
Alternative Approaches: MySQL, Flink
Throughput Class: Million-scale per minute (single machine)
Core Dependencies: Ordered input, bounded lifecycle, replayable files

Reconciliation Is Fundamentally a Continuously Advancing State Machine

Traditional reconciliation is often reduced to “two tables, one JOIN, and one scheduled job.” That model works under low transaction volume, but its weaknesses quickly surface in high-concurrency and high-accumulation environments.

Reconciliation is not CRUD. It is a continuously advancing state machine. Each transaction should be processed exactly once, state must advance over time, and ordering and determinism matter more than one-off query power.
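The "continuously advancing state machine" idea can be pictured with a minimal sketch. The state names (PENDING, MATCHED, EXCEPTION), the three-cycle threshold, and the method names below are illustrative assumptions, not taken from the source:

```java
// Illustrative sketch: one reconciliation record as a tiny state machine.
// State names and the cycle threshold are assumptions, not from the source.
class TxnStateMachine {
    enum TxnState { PENDING, MATCHED, EXCEPTION }

    // Advance one record's state deterministically from the current input.
    static TxnState advance(TxnState current, boolean counterpartArrived, int unmatchedCycles) {
        if (current != TxnState.PENDING) {
            return current; // terminal states never move backwards
        }
        if (counterpartArrived) {
            return TxnState.MATCHED; // exactly-once elimination on match
        }
        if (unmatchedCycles >= 3) {
            return TxnState.EXCEPTION; // escalate only after several unmatched cycles
        }
        return TxnState.PENDING; // keep waiting for the counterpart
    }
}
```

Because each transition depends only on the current state and the next input, replaying the same inputs in the same order always yields the same final state, which is exactly the determinism the article argues matters more than query power.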

SQL-Based Approaches Become Increasingly Heavy Over Time

MySQL is good at storing final results, but it is not ideal for carrying complex process-oriented computation. Once reconciliation starts spreading intermediate state across multiple tables, the system slows down under I/O pressure, index maintenance, lock contention, and transactional compensation.

SELECT a.biz_id, a.amount, b.amount
FROM pay_a a
LEFT JOIN pay_b b ON a.biz_id = b.biz_id
WHERE a.stat_day = '2026-04-28'; -- Suitable only for static result comparison, not for continuously advancing state

This kind of SQL works for one-time checks. It does not fit a reconciliation pipeline that must span multiple cycles, recover after failure, and continuously consume incoming data.

Flink Can Handle Reconciliation, but General-Purpose Streaming Power Is Not Always Cost-Effective

Flink is extremely capable in streaming scenarios with multiple sources, out-of-order events, and severe lateness. Its keyBy + state + watermark model is also elegant. But the core challenge in reconciliation is often not “can it compute this,” but rather “how much can it process per unit of infrastructure.”

When unmatched transactions keep accumulating, state remains resident in RocksDB, and checkpoints trigger frequently, disk I/O, state growth, and backpressure become long-term operational costs.

Reconciliation Prioritizes Resource Focus Over Feature Completeness

An in-memory approach does one thing well: it advances reconciliation state along the shortest possible path. It does not aim to provide general-purpose distributed semantics. As a result, under the same machine resources, computation stays more concentrated and latency distribution remains more stable.

while (running) {
    Txn left = leftQueue.pollFirst(); // Fetch the next transaction to reconcile from the head of the left queue
    if (left == null) {
        Thread.onSpinWait(); // Queue is empty; yield briefly instead of burning CPU in a hot loop
        continue;
    }

    Txn match = rightIndex.remove(left.bizId()); // Look up the counterpart by unique business ID
    if (match != null) {
        reconcileSuccess(left, match); // Eliminate state from both sides immediately after a successful match
    } else {
        leftQueue.addLast(left); // No counterpart yet: move it to the tail and wait for later-cycle data
    }
}

This logic demonstrates the core of queue-driven reconciliation: ordered consumption, immediate elimination on match, and deferred retry for unmatched records.

The Core Value of an In-Memory Approach Is Not Raw Speed but a Different State Management Model

The most important idea behind in-memory reconciliation is not “move the database into memory.” It is “memory is the only state machine.” Whatever unmatched transactions currently exist are kept in memory and nowhere else. The system no longer maintains redundant dual writes between the database and memory.

The second key idea is a single writer. One reconciliation task uses exactly one thread to write state. All state changes happen in order, so the design does not need locks or transactional coordination.
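The single-writer idea can be sketched as follows: any thread may enqueue events, but exactly one thread drains the queue and mutates state, so the state map itself needs no locks. The class and method names here are illustrative assumptions, not from the source:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative single-writer sketch: producers enqueue, one thread applies.
// Names (submit, drain) are assumptions, not from the source.
class SingleWriter {
    private final BlockingQueue<String> events = new LinkedBlockingQueue<>();
    private final Map<String, Integer> state = new HashMap<>(); // touched only by the writer thread

    void submit(String bizId) { events.add(bizId); } // safe to call from any thread

    // Drain pending events in arrival order; called only from the single writer thread.
    int drain() {
        String bizId;
        int applied = 0;
        while ((bizId = events.poll()) != null) {
            state.merge(bizId, 1, Integer::sum); // sequential, deterministic state change
            applied++;
        }
        return applied;
    }

    int count(String bizId) { return state.getOrDefault(bizId, 0); }
}
```

The queue is the only synchronized structure; everything behind it is plain single-threaded code, which is what makes locks and transactional coordination unnecessary.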

Rebuildability Matters More Than Never Losing State

An in-memory architecture does not guarantee that state is never lost. It guarantees that state can be rebuilt after loss. As long as the input data is replayable, the state machine can advance again from a snapshot boundary.

def recover(snapshot, files):
    state = load_snapshot(snapshot)  # Load unmatched transactions from the most recent snapshot
    for file in files:
        for event in replay(file):  # Replay events from reconciliation files in order
            state.apply(event)      # Advance the state machine in the original sequence
    return state

The recovery path is straightforward: load the snapshot, replay the files, and restore the system state.

Reconciliation Files Are Naturally a WAL

In many systems, engineers must design a WAL explicitly. Reconciliation is different. The input files are already factual records of events that happened, with stable ordering and natural replayability. In practice, they function as a business-level WAL.

That means snapshots do not need to store the full transaction history. They only need to store the remaining unmatched state after one or more cycles complete. During recovery, the system loads the snapshot and continues progressing from the file offset after that checkpoint.
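A snapshot of this kind can be sketched as "remaining unmatched state plus a replay position." The field and method names (fileOffset, unmatched, recover) are illustrative assumptions, not from the source:

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative snapshot sketch: persist only unmatched records plus the
// replay position. Names are assumptions, not from the source.
class Snapshot {
    final long fileOffset;             // replay resumes after this offset
    final Map<String, Long> unmatched; // bizId -> amount still waiting for a counterpart

    Snapshot(long fileOffset, Map<String, Long> unmatched) {
        this.fileOffset = fileOffset;
        this.unmatched = new TreeMap<>(unmatched);
    }

    // Recovery = restore unmatched state, then replay only events recorded after the offset.
    static Map<String, Long> recover(Snapshot snap, Map<Long, String> eventsByOffset) {
        Map<String, Long> state = new TreeMap<>(snap.unmatched);
        eventsByOffset.forEach((offset, bizId) -> {
            if (offset > snap.fileOffset) {
                // a replayed counterpart eliminates the waiting record; otherwise it starts waiting
                if (state.remove(bizId) == null) state.put(bizId, 0L);
            }
        });
        return state;
    }
}
```

Events at or before the snapshot offset are skipped, which is why the snapshot never needs to carry the full transaction history.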

This Design Significantly Reduces Persistence Cost

If a snapshot stores only unfinished state, its size is far smaller than the full transaction stream. The database only needs to persist exception results, task configuration, and cycle metadata. It no longer carries the full burden of state progression.

The Queue Model Determines Whether the System Remains Stable and Controllable

The reader thread converts reconciliation files into in-memory events and appends them to the tail of the queue in order. The reconciliation thread continuously consumes from the head and tries to match records. A successful match removes state from both sides. An unmatched record goes back to the tail and waits.

This model assumes that transactions from both sides are roughly ordered. If events are severely out of order, a large amount of data remains unmatched for too long, the queue keeps expanding, and memory usage eventually grows enough to overwhelm the task.
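One way to keep that growth bounded is to give each pass a fixed budget: every queued record is examined at most once per pass, and whatever remains unmatched is carried over rather than spun on indefinitely. This is a sketch under that assumption; the names (matchOnce, carryOver) are not from the source:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;

// Illustrative bounded pass: each queued record is visited at most once,
// so out-of-order data defers to the next pass instead of spinning the loop.
class OnePass {
    // Drain the left queue once; matched ids are removed from rightKeys,
    // unmatched ids are returned as the carry-over for a later pass or cycle.
    static Deque<String> matchOnce(Deque<String> left, Set<String> rightKeys) {
        Deque<String> carryOver = new ArrayDeque<>();
        int budget = left.size(); // fixed budget: one visit per queued record
        for (int i = 0; i < budget; i++) {
            String id = left.pollFirst();
            if (!rightKeys.remove(id)) carryOver.addLast(id); // no counterpart yet
        }
        return carryOver;
    }
}
```

Monitoring the size of the carry-over set then gives a direct signal of how disordered the input actually is.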

Cross-Cycle Unmatched Records Are Not Automatically Exceptions

Different systems may emit the same business transaction with different timestamps, so it is common for corresponding records to fall into different reconciliation cycles. If a transaction is unmatched within one cycle, that only means “the counterpart has not arrived yet.” It should not immediately be classified as an exception.

A safer strategy is to keep unmatched records from the current cycle and let them continue into the next cycle. Only after they remain unmatched for multiple consecutive cycles should the system escalate them into exception handling.

boolean isException(Txn txn, int unmatchedCycles) {
    return unmatchedCycles >= 3; // Mark as an exception only after multiple unmatched cycles to avoid boundary false positives
}

The value of this rule lies in reducing false positives rather than minimizing time to judgment.

Throughput Comparisons Reveal the Real Boundaries of Three Approaches

Approach — Single-Machine Throughput Class — Primary Bottleneck
MySQL — Tens of thousands per minute — I/O, indexes, lock contention
Flink — Hundreds of thousands per minute — State, checkpoints, backpressure
In-memory — Millions per minute — Memory capacity, bandwidth, ordering

The in-memory model concentrates resources on sequential computation and state elimination. In scenarios with a bounded lifecycle, replayable input, and roughly ordered data, it can often outperform general-purpose approaches by an order of magnitude.

But It Is Not a Silver Bullet Without Trade-Offs

When all state is concentrated in memory, risk becomes concentrated as well. Partitioning strategy, snapshot frequency, input ordering constraints, and memory reclamation policy directly determine the upper limit of the system.

If the business has too many upstream sources, severe disorder, or strong requirements for distributed fault tolerance and cross-region elasticity, then a general-purpose streaming framework such as Flink may still be the better choice.

Production Implementation Should Satisfy Four Conditions First

First, input files must be replayable. Second, the transaction lifecycle must be bounded, or the state cannot be reclaimed naturally. Third, both sides must share a matchable unique key. Fourth, source tasks and reconciliation cycles must remain strictly synchronized.

Once cycle synchronization breaks, the reconciliation task may still be processing September 15 data while one source has already advanced to September 16. That mixed data can enter the state machine and ultimately contaminate state.
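A simple defense is a guard at the ingestion boundary that rejects any event whose cycle date differs from the cycle currently being reconciled. This is a minimal sketch; the names (CycleGuard, accept) are assumptions, not from the source:

```java
import java.time.LocalDate;

// Illustrative guard: only events belonging to the current reconciliation
// cycle may enter the state machine; anything else is deferred upstream.
class CycleGuard {
    static boolean accept(LocalDate cycleDay, LocalDate eventDay) {
        return cycleDay.equals(eventDay); // defer next-cycle data instead of mixing it in
    }
}
```

Rejected events stay buffered at the source until the task advances to their cycle, which is cheaper than trying to purge contaminated state afterwards.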

Minimal Implementation Skeleton

class ReconcileTask {
    void runCycle() {
        ingestFiles();      // Read files for the current cycle and write them into the in-memory queue
        reconcileLoop();    // Advance reconciliation state sequentially
        dumpSnapshot();     // Save a snapshot of unmatched transactions
        persistExceptions(); // Persist one-sided records and exception results
    }
}

This skeleton captures the minimal closed loop of in-memory reconciliation: ingest, advance, snapshot, and persist.

In-Memory Reconciliation Is an Architecture Choice That Optimizes for Real-World Constraints

In-memory reconciliation is not mysterious. It simply acknowledges one fact: the core problem in reconciliation is state progression, not table querying. As long as input is replayable, state is reclaimable, and ordering is controllable, there is no need to make the database or a heavyweight streaming framework carry all the complexity.

For reconciliation systems suffering from throughput pressure, uncontrolled cost, and ever-expanding logic, in-memory reconciliation is not a radical replacement. It is an engineering solution that aligns more closely with the nature of the problem.


FAQ

1. What scenarios are best suited for in-memory reconciliation?

It works best for two-sided or multi-sided reconciliation workloads with high throughput, bounded state lifecycles, replayable inputs, and roughly ordered data—especially when a single machine can hold the core working state.

2. Why can reconciliation files serve as a WAL?

Because the files are records of facts that have already occurred, with natural ordering and replayability. Recovery only requires “latest snapshot + file replay,” without building a separate complex logging pipeline.

3. What is the biggest risk of an in-memory approach?

The biggest risk is that all state is concentrated in memory. If partitioning, ordering control, or snapshot strategy is poorly designed, the system can suffer from memory bloat, long recovery times, or state contamination.

Core Summary

This article reconstructs an in-memory reconciliation architecture around the idea that “memory is the state machine.” By using ordered execution with a single writer, treating reconciliation files as a WAL, and storing only unmatched transactions in snapshots, the design achieves high throughput and lower complexity. It also compares the practical boundaries of MySQL, Flink, and in-memory approaches for production reconciliation systems.