How Distributed Workflows Achieve Exactly-Once: A Three-Layer Concurrency Control Model with Optimistic Locking, Job Claiming, and Redis Locks

The core goal of a distributed workflow engine is to ensure that each process state transition commits exactly once across the cluster. It primarily addresses three pain points: duplicate user submissions, repeated execution of scheduled jobs, and uncontrolled external side effects. Keywords: workflow engine, Exactly-Once, concurrency control.

Technical Specification Snapshot

Parameter                     Value
Core topic                    Distributed workflow concurrency control
Typical engines               Flowable, Camunda
Runtime language              Java
Core protocols/interactions   HTTP, database transactions, Redis atomic operations
Core dependencies             MyBatis, relational database, Redis, JobExecutor
Key fields                    REV_, LOCK_OWNER_, LOCK_EXP_TIME_

Distributed workflows in clustered environments require multiple layers of concurrency defense

In a microservices cluster, a workflow engine does not run as a single execution point. Multiple application nodes can access it at the same time. As long as load balancing, automatic retries, and background scheduling exist, concurrency contention is unavoidable.

The two most common conflicts are a user submitting the same approval task twice in quick succession, and multiple nodes scanning and executing the same scheduled job simultaneously. The former can corrupt the process pointer; the latter can trigger duplicate messages, emails, or compensation logic.

User-side concurrency and background concurrency are fundamentally different

User request contention is usually instantaneous, relatively infrequent, and directly visible. Background job contention is proactive and high-frequency because every node continuously polls for runnable jobs.

For that reason, workflow systems usually adopt a three-layer defense model: the business entry layer intercepts duplicate requests first, the engine backend acquires tasks exclusively through a claim mechanism, and the database provides a final consistency safeguard through optimistic locking.

Business entry layer -> Redis distributed lock for deduplication
Engine scheduling layer -> Job Lock and Claim for exclusive ownership
Database persistence layer -> `REV_` optimistic locking as the final safeguard

This structure summarizes the order of defense in a distributed workflow system: outer layers prioritize user experience, while inner layers prioritize consistency.

Optimistic locking is the consistency foundation of workflow runtime tables

Workflow engine runtime tables usually include a `REV_` column that stores the current version of a record. Whenever the engine advances the process by updating a task, an execution, or an instance state, it uses the old version number as a condition of the update.

When two nodes read the same task record at the same time, they may see the same version number. However, only the node that commits its update first can succeed. The later writer fails because the update condition no longer matches.

REV_-based update statements naturally arbitrate write contention

UPDATE ACT_RU_TASK
SET ASSIGNEE_ = 'userA', REV_ = 2
WHERE ID_ = 'task_123'
  AND REV_ = 1; -- Allow the update only when the old version is still 1

This SQL statement ensures that only one transaction can successfully modify the same task record by comparing version numbers.

If node A has already changed `REV_` from 1 to 2, node B's later update with `REV_ = 1` matches zero rows. The engine's persistence layer (MyBatis-based in Flowable) detects the zero-row result, throws an optimistic locking exception, and the transaction rolls back.
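
A minimal sketch of how that zero-row check becomes an exception, written against Spring's JdbcTemplate rather than the engine's actual MyBatis mappers; the method name and exception type are illustrative:

import org.springframework.jdbc.core.JdbcTemplate;

// Sketch only: real engines perform this check inside their MyBatis-based
// persistence layer; claimTask and the exception type are illustrative.
public void claimTask(JdbcTemplate jdbc, String taskId, String assignee, int expectedRev) {
    int rows = jdbc.update(
        "UPDATE ACT_RU_TASK SET ASSIGNEE_ = ?, REV_ = ? WHERE ID_ = ? AND REV_ = ?",
        assignee, expectedRev + 1, taskId, expectedRev);
    if (rows == 0) {
        // Another transaction already advanced REV_; signal an optimistic locking failure
        throw new IllegalStateException("Optimistic locking failed for task " + taskId);
    }
}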

Optimistic locking guarantees a uniquely successful outcome

Optimistic locking does not prevent concurrency from happening; it decides at commit time which write succeeds. Because it takes no locks up front and touches only the contested row at commit, it is cheap and fine-grained, which makes it well suited to contention that is infrequent but real, such as duplicate approval clicks and API retries.

However, it also has limits. By the time the exception occurs, the request may already be deep inside the business logic. If the system has already called an external service, rolling back the database cannot undo the external side effect. That is why optimistic locking works best as a consistency fallback rather than the first line of defense for user experience.
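
To illustrate that fallback role, a service layer can catch the engine's optimistic locking exception and surface it as a conflict rather than retrying blindly. This sketch assumes Flowable 6's FlowableOptimisticLockingException and reuses the BizException from the entry-layer snippet later in this article:

import org.flowable.common.engine.api.FlowableOptimisticLockingException;
import org.flowable.engine.TaskService;

// Treat the optimistic locking exception as a conflict signal, not a retry trigger:
// a blind retry here could re-fire external side effects that already ran.
public void completeWithConflictHandling(TaskService taskService, String taskId) {
    try {
        taskService.complete(taskId); // Advances the process; may lose the REV_ race
    } catch (FlowableOptimisticLockingException e) {
        throw new BizException("Task was updated concurrently. Please refresh and try again.");
    }
}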

The task claiming mechanism resolves proactive contention in JobExecutor

Unlike user requests, asynchronous jobs and scheduled jobs are actively pulled by backend engine threads. The JobExecutor on multiple nodes can scan the same due job at the same time, so the system must enforce mutual exclusion during the task acquisition phase.

To do this, the engine usually maintains two fields in the job table: LOCK_OWNER_ and LOCK_EXP_TIME_. One records which node owns the lock, and the other records when the lock expires.

Lock and Claim completes exclusive ownership before execution begins

UPDATE ACT_RU_TIMER_JOB
SET LOCK_OWNER_ = 'node-A',
    LOCK_EXP_TIME_ = '2026-04-23 10:05:00'
WHERE ID_ = 'job_456'
  AND LOCK_OWNER_ IS NULL; -- Allow only unclaimed jobs to be acquired by the current node

This SQL statement lets a node claim a task first and then place it into its local thread pool for execution, which prevents multiple nodes from consuming the same job repeatedly.

The mechanism relies on the exclusivity of the database update. Even if multiple nodes attempt to claim the job at the same time, the database allows only one node to update the row successfully. The winner gets execution rights, while the losers continue scanning for other jobs.
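
A simplified sketch of this scan-claim-execute cycle, in JdbcTemplate style; the method names, the five-minute lock window, and executeJob are illustrative stand-ins, not the engine's actual acquisition classes:

import java.sql.Timestamp;
import java.util.List;
import java.util.concurrent.ExecutorService;
import org.springframework.jdbc.core.JdbcTemplate;

// Sketch of a JobExecutor acquisition cycle; real engines batch, order, and back off.
public void acquireAndRun(JdbcTemplate jdbc, ExecutorService pool, String nodeId) {
    List<String> dueJobIds = jdbc.queryForList(
        "SELECT ID_ FROM ACT_RU_TIMER_JOB WHERE DUEDATE_ <= CURRENT_TIMESTAMP AND LOCK_OWNER_ IS NULL",
        String.class);
    Timestamp lockExpiry = new Timestamp(System.currentTimeMillis() + 5 * 60 * 1000);
    for (String jobId : dueJobIds) {
        int claimed = jdbc.update(
            "UPDATE ACT_RU_TIMER_JOB SET LOCK_OWNER_ = ?, LOCK_EXP_TIME_ = ? "
                + "WHERE ID_ = ? AND LOCK_OWNER_ IS NULL",
            nodeId, lockExpiry, jobId);
        if (claimed == 1) {
            pool.submit(() -> executeJob(jobId)); // Winner executes in its local thread pool
        }
        // claimed == 0 means another node won the race; simply move on
    }
}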

Lock expiration ensures task recovery after a node failure

If a node crashes after claiming a task, the task must not remain suspended forever. For that reason, the engine periodically scans records whose LOCK_EXP_TIME_ has passed and allows other nodes to reclaim them.
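
A minimal sketch of that recovery scan, again in JdbcTemplate style with an illustrative method name; clearing the expired lock makes the job claimable again by the normal acquisition query:

import org.springframework.jdbc.core.JdbcTemplate;

// Recovery scan: clear locks whose expiration has passed so jobs claimed by a
// crashed node become claimable again instead of staying stranded.
public int releaseExpiredLocks(JdbcTemplate jdbc) {
    return jdbc.update(
        "UPDATE ACT_RU_TIMER_JOB SET LOCK_OWNER_ = NULL, LOCK_EXP_TIME_ = NULL "
            + "WHERE LOCK_OWNER_ IS NOT NULL AND LOCK_EXP_TIME_ < CURRENT_TIMESTAMP");
}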

This design gives the system both mutually exclusive execution and failure recovery. The first prevents duplicate execution, while the second handles node failure. Together, they are essential to stable backend scheduling in distributed workflow systems.

Adding a Redis distributed lock at the business entry layer significantly improves safety and user experience

Relying only on internal engine mechanisms is not enough. Before a duplicate request even reaches the workflow API, it may already have triggered business logic with weak idempotency, such as SMS delivery, email sending, payment processing, or third-party approval APIs.

A more practical architecture applies lightweight deduplication by task ID in the Controller, Facade, or application service layer. Redis SET NX EX is well suited for this responsibility.

Entry-layer locks should intercept duplicate clicks and gateway retries first

import java.util.concurrent.TimeUnit;

Boolean locked = redisTemplate.opsForValue()
    .setIfAbsent("workflow:task:lock:" + taskId,
        "locked",
        5,
        TimeUnit.SECONDS); // SET NX EX: a 5-second short lock that intercepts instantaneous duplicate submissions of the same task

if (!Boolean.TRUE.equals(locked)) { // setIfAbsent can return null inside a pipeline or transaction, so avoid unboxing
    throw new BizException("Task is being processed. Please do not submit it again."); // Friendly rejection; the request never reaches the engine internals
}

This code quickly intercepts duplicate requests at the business entry point, reducing unnecessary computation and the risk of external side effects.

The safest implementation is to define a clear execution order for workflow advancement, business validation, and external calls, and then release the lock in a finally block. You should also combine this with business-level idempotency design to avoid a second execution window after the lock expires.
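
A sketch of that shape, continuing the entry-layer snippet above; validateBusinessRules, notifyExternalSystems, and requestId are illustrative names, and a production release should use a Lua compare-and-delete on requestId so one request cannot delete a lock that expired and was re-acquired by another:

String lockKey = "workflow:task:lock:" + taskId;
Boolean locked = redisTemplate.opsForValue()
    .setIfAbsent(lockKey, requestId, 5, TimeUnit.SECONDS);
if (!Boolean.TRUE.equals(locked)) {
    throw new BizException("Task is being processed. Please do not submit it again.");
}
try {
    validateBusinessRules(taskId);  // 1. Cheap business validation first
    taskService.complete(taskId);   // 2. Advance the workflow inside its transaction
    notifyExternalSystems(taskId);  // 3. External calls last, guarded by idempotency keys
} finally {
    redisTemplate.delete(lockKey);  // Simplified unconditional release; prefer compare-and-delete
}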

Exactly-Once becomes more realistic only when all three mechanisms work together

The Redis lock handles early interception, task claiming handles backend mutual exclusion, and optimistic locking performs final consistency arbitration. These are not substitutes for one another. Together, they form a closed-loop design across user experience, scheduling, and storage.

For engines such as Flowable and Camunda, this means that even under frontend jitter, gateway retries, node scaling, or single-node failure, workflow state can remain predictable, recoverable, and auditable.

The correct production approach for distributed workflows is layered design, not betting on a single mechanism

If you rely only on database optimistic locking, you may get the correct final result but not a stable user experience. If you rely only on a Redis lock, you may block some duplicate requests but still fail to prevent contention among backend jobs. If you rely only on task claiming, you still cannot cover user-side concurrency.

A production-ready solution must address entry-point deduplication, exclusive engine scheduling, and database version arbitration at the same time. That is how you turn “execute only once” from a slogan into an engineering capability.

FAQ

FAQ 1: If the workflow engine already uses optimistic locking, why add a Redis lock at the business layer?

Because optimistic locking can guarantee only that exactly one transaction writes successfully in the end. It cannot prevent duplicate requests from triggering external APIs early or causing 500-level errors. The value of a Redis lock is early interception and improved user experience.

FAQ 2: What problem does the combination of LOCK_OWNER_ and LOCK_EXP_TIME_ solve?

It prevents the same backend job from being executed repeatedly by multiple nodes. LOCK_OWNER_ marks task ownership, and LOCK_EXP_TIME_ handles lock recovery after a node crash, ensuring that tasks are not lost permanently.

FAQ 3: Can distributed workflows truly achieve strict Exactly-Once semantics?

In the strict sense, systems can usually only approximate it within bounded limits. Database state can often guarantee a single successful write, but once external systems are involved, you still need idempotency keys, retry compensation, and deduplication logs to achieve engineering-grade Exactly-Once semantics.
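
As a sketch of the idempotency-key half of that answer, a Redis-backed guard before the side effect might look like this; sendNotification and the key layout are hypothetical:

import java.util.concurrent.TimeUnit;
import org.springframework.data.redis.core.StringRedisTemplate;

// Only the first caller for a given businessKey performs the side effect;
// replays within the key's lifetime are skipped. If sendNotification fails
// after the key is set, a deduplication log plus a compensation job is still
// needed to retry safely.
public void notifyOnce(StringRedisTemplate redis, String businessKey) {
    Boolean first = redis.opsForValue()
        .setIfAbsent("workflow:notify:done:" + businessKey, "1", 7, TimeUnit.DAYS);
    if (Boolean.TRUE.equals(first)) {
        sendNotification(businessKey); // Hypothetical external call
    }
}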

AI Readability Summary

This article reconstructs the concurrency control model of distributed workflow engines in microservices clusters. It focuses on three problems: duplicate user submissions, scheduled job contention, and request deduplication at the business entry layer. It then explains how REV_ optimistic locking, the Job Lock and Claim mechanism, and Redis distributed locks work together to approximate Exactly-Once execution semantics.